A Long Slog through the OCR Swamp

by | Sep 5, 2009


This post is the third in a series about the creation of a new book. To see all the articles in the series, click the “The Journey of a Book” tab at the top of the page.

While contemplating the hours it was going to take to correct the files produced by VelOCRaptor OCR (the optical character recognition software I had used to turn the original PDF files into Word files), I started to mentally calculate how long this sudden “new” book I had discovered was going to take to get into print.

This type of correction is frustratingly slow. If, like me, you are constantly interrupted during your work day by ringing phones, pinging emails, kids who think they ought to have lunch, and dogs who apparently believe that their home comes with a private doorman, it can become so frustrating that it slides toward the bottom of the “to-do” pile and just stays there.

Luckily, I had a friend, Arisha Wenneson of Wenneson Services, who is meticulous and had time to do the job. We soon agreed on a fee, and I happily zipped the files and sent them off. With the PDF showing what should be in the file and the OCR text in a Microsoft Word document placed side by side on the screen, the correction work would at least be convenient. As the email progress bar clicked down I breathed a sigh of relief. Now I knew there was one big obstacle that wouldn’t be holding me back.

Sure enough, in a few days the new Word files, clean and shiny, started to show up in my inbox. Here’s the result:

Before correction and after

Arisha had done an outstanding job. From the mess of OCR mayhem she had produced beautiful, accurate, junk-free Word files. I began to think this book would become a reality after all.

Of course, while looking over the files (which were now much easier to read) I realized that it would not be possible to publish the book without editing. The words, phrases, interjections, hemming and hawing that fill up our spoken communication had all been preserved in the transcript. But who wants to read a lot of filler, the things you say when you’re standing in front of a room of people trying to remember the point you were making?

There was no avoiding it. I would have to sit and edit every paragraph to get rid of the remaining “junk” that made the lectures tough reading. If I was going to be kind to my prospective readers, a goal all authors should aspire to, I would have to get out the “blue pencil” and get to work.

Next up: Editing, editing, editing


3 Comments

  1. jedidiah manowitz

    there is no tab at top of page for journey of a book

    • Joel Friedlander

      jedidiah,

      Well, there was one in 2009 when this article was published. It’s moved to the right sidebar in the list of “Topics.”

  2. Joseph Gregory

    “kids who think they ought to have lunch”
    LMAO

