Home > Blog > The Joel Archives > Capturing and Using Text from Old Documents in New Book Production

Capturing and Using Text from Old Documents in New Book Production

by | Sep 2, 2009

This post is the second in a series about the creation of a new book. To see all the articles in the series, click “The Journey of a Book” tab at the top of the page.

Well, I had a “manuscript” for my next book (see previous post here). But the old word processing files were long gone. I had to find a way to make all this potentially valuable text usable. When documents like this are scanned, the text is actually treated like a “picture” so you cannot edit the text, just place it in a document in exactly the form it was in when it was scanned. This wouldn’t work for me, so I started looking for another solution.

Heading to Kinkos

At one time there was a major market for OCR scanning software (Optical Character Recognition, “intelligent” software that interprets the letterforms in the “pictures” and makes estimations of what the word is), and I thought they must have kept developing that software. But it turns out that the increasing proportion of documents that are created on computers has lessened the need for this software. In the Macintosh space, there were only two programs available, and both were quite a bit more than I was willing to spend, since I only had the one project to convert to text.

Next I got on the phone to our local Fedex Kinko’s to see if they offer this service, but no luck. They only do scanning to PDF, which is where I had already arrived. The next two days I spent trying to find shareware or freeware OCR software and thought I would have to boot up my old Wintel box just to get this done, but it was an unappetizing prospect. Finally, I stumbed on VelOCRapter, an inexpensive program that put an “inbox” on the desktop. Just drop the files in the inbox and, after a few minutes of churning, out popped a file with the image “translated” into real text! It looked like I was in business. Only one problem. Here is what the original file and the OCR’s text looked like:

A page as it looked after being scanned to PDF
A page as it looked after being scanned to PDF
The same page after OCR scanning
The same page after OCR scanning

In some places it was hard to actually see the text because there was so much “junk” in the file from the scanning process. You can also see how the OCR software interpreted each line as a separate paragraph, so there would be thousands of “returns” to take out and tough to automate the process.

It looked like we would need some word processing “heavy lifting” before this project could really get off the ground. But, considering that I hadn’t had to write a word, I was still ahead of the game.

tbd advanced publishing starter kit