Capturing and Using Text from Old Documents in New Book Production

by Joel Friedlander on September 2, 2009 · 2 comments

This post is the second in a series about the creation of a new book. To see the other articles in the series, click on the following links:

Well, I had a “manuscript” for my next book (see previous post here). But the old word processing files were long gone. I had to find a way to make all this potentially valuable text usable. When documents like this are scanned, the text is actually treated like a “picture” so you cannot edit the text, just place it in a document in exactly the form it was in when it was scanned. This wouldn’t work for me, so I started looking for another solution.

Heading to Kinkos

At one time there was a major market for OCR scanning software (Optical Character Recognition, “intelligent” software that interprets the letterforms in the “pictures” and makes estimations of what the word is), and I thought they must have kept developing that software. But it turns out that the increasing proportion of documents that are created on computers has lessened the need for this software. In the Macintosh space, there were only two programs available, and both were quite a bit more than I was willing to spend, since I only had the one project to convert to text.

Next I got on the phone to our local Fedex Kinko’s to see if they offer this service, but no luck. They only do scanning to PDF, which is where I had already arrived. The next two days I spent trying to find shareware or freeware OCR software and thought I would have to boot up my old Wintel box just to get this done, but it was an unappetizing prospect. Finally, I stumbed on VelOCRapter, an inexpensive program that put an “inbox” on the desktop. Just drop the files in the inbox and, after a few minutes of churning, out popped a file with the image “translated” into real text! It looked like I was in business. Only one problem. Here is what the original file and the OCR’s text looked like:

A page as it looked after being scanned to PDF

A page as it looked after being scanned to PDF

The same page after OCR scanning

The same page after OCR scanning

In some places it was hard to actually see the text because there was so much “junk” in the file from the scanning process. You can also see how the OCR software interpreted each line as a separate paragraph, so there would be thousands of “returns” to take out and tough to automate the process.

It looked like we would need some word processing “heavy lifting” before this project could really get off the ground. But, considering that I hadn’t had to write a word, I was still ahead of the game.

Be Sociable, Share!

    { 2 comments… read them below or add one }

    Joseph Gregory April 26, 2011 at 9:26 pm

    Ten years ago while taking the B train back to Brooklyn after a long day at work my muse kicked in and a short, but important exchange between two characters in my story started to flow through me. Without any paper (shameful) I ripped off a piece of a paper bag I was carrying and began to write. Would you believe after all this time I came across this ripped bag with my scribbled convo on it.
    Moral of the story: Save everything you write!


    Joel Friedlander April 26, 2011 at 10:55 pm

    Nice story, Joseph, and a valuable lesson, thanks for sharing.


    Leave a Comment