A Long Slog through the OCR Swamp

by Joel Friedlander on September 5, 2009


This post is the third in a series about the creation of a new book. To see all the articles in the series, click “The Journey of a Book” tab at the top of the page.

While contemplating the hours it was going to take to correct the files produced by VelOCRaptor OCR (the optical character recognition software I had used to turn the original PDF files into Word files), I started to mentally calculate how long it would take to get this sudden “new” book I had discovered into print.

This type of correction is frustratingly slow. If, like me, you are constantly interrupted during your work day by ringing phones, pinging emails, kids who think they ought to have lunch, and dogs who apparently believe that their home comes with a private doorman, it can become so frustrating that it slides toward the bottom of the “to-do” pile and just stays there.

Luckily, I had a friend, Arisha Wenneson of Wenneson Services, who is meticulous and had time to do the job. We soon agreed on a fee, and I happily zipped the files and sent them off. With the PDF showing what should be in the file and the OCR text in a Microsoft Word document placed side by side on the screen, the correction work would at least be convenient. As the email progress bar ticked down I breathed a sigh of relief. Now I knew there was one big obstacle that wouldn’t be holding me back.

Sure enough, in a few days the new Word files, clean and shiny, started to show up in my inbox. Here’s the result:

Before correction and after

Arisha had done an outstanding job. From the mess of OCR mayhem she had produced beautiful, accurate, junk-free Word files. I began to think this book would become a reality after all.

Of course, while looking over the files (which were now much easier to read) I realized that it would not be possible to publish the book without editing. The words, phrases, interjections, hemming and hawing that fill up our spoken communication had all been preserved by the transcript. But who wants to read a lot of filler? The things you say when you’re standing in front of a room of people trying to remember the point you were making?

There was no avoiding it. I would have to sit and edit every paragraph to get rid of the remaining “junk” that made the lectures tough reading. If I was going to be kind to my prospective readers, a goal all authors should aspire to, I would have to get out the “blue pencil” and get to work.

Next up: Editing, editing, editing

    { 1 comment… read it below or add one }

    Joseph Gregory April 26, 2011 at 9:32 pm

    “kids who think they ought to have lunch”
    LMAO
