Should You Hire a Computer to Narrate Your Audiobook?

by Joel Friedlander on November 27, 2013 · 21 comments

by Becky Parker Geist

I’ve been hearing recently about companies that specialize in automatic translation of printed text into audio, for use in audio books and other purposes. Many authors would like to do audiobook versions of their books, but have been put off by the cost of the process. I asked my colleague and fellow BAIPA board member Becky Parker Geist to look into this idea. Becky is a “book audiologist” who produces audio books and coaches actors, and who always impresses me with her short, memorable, and beautiful-sounding announcements at our meetings. Here’s her report.



As authors, we choose our words carefully and craft the way we put words together to convey as clearly as possible just exactly what we mean and what we want our readers to receive. There is no automation in the process. Yet we live in a fast-paced technological world that calls out to us to find ways to work faster, more cheaply, and with more advanced technology.

Those of us in the self-publishing world face questions involving “tree books”, ebooks, audiobooks, apps, enhanced ebooks, and whatever is coming next. Joel invited me to look into text-to-speech companies like iSpeech.org and their ability to produce computer-generated audiobooks. As a voiceover professional, a large part of my work is creating audiobooks and soundtracks for enhanced ebooks, so I wondered what iSpeech might be bringing to the table.

iSpeech has been around since 2007 and provides a cloud-based speech technology that creates “human quality” text to speech (TTS). They also have a patent pending on a voice cloning technology.

Voice Cloning

The idea of voice cloning is to be able to clone a voice that can then be applied to any text. So if you want to hear Grandma read to little Billy, this technology would make that possible. Interesting concept, but let’s dig deeper.

Aside from the Grandma example, the benefits of voice cloning are said to include being able to revise an audiobook at a later time without having to find the narrator who created the original, or someone whose voice matches closely enough to work.

iSpeech says that there is demand for this technology in e-learning, and with instructional designers who are looking for a quicker and cheaper alternative to hiring voice actors.

TTS in E-Learning

I found evidence online to support this idea in this blog showing that e-learning programs may benefit from TTS.

An example is Pearson, which experimented with the iSpeech technology and received a ‘good enough’ response to the computerized voices from younger listeners to warrant going ahead with a project.

This technological approach allows them to have an audiobook completed in just hours—instead of weeks—at about one tenth the cost of hiring a narrator. iSpeech provides voices of both genders in most of the 27 languages included in their system.

What about non-fiction, where there may be greater application than in fiction? My experience, however, suggests that non-fiction can be even more dependent on the clarity and emphasis an actor brings to make sense of complex content for listeners.

While applications such as TTS conversion might work well as a fast and inexpensive way to deal with materials an individual needs for study, broader audiences will expect and demand, I believe, the clarity only a human can deliver.

I return to my opening: if we take such care with the crafting of our text, is it in the best interest of the author or the listener to then deliver an audiobook through the voice of a computer?

Checking It Out

Listening to samples of TTS technology is not reassuring. Human voice conveys so much more than just words—words form the skeleton that carries the flesh and blood of feeling and of meaning.

And isn’t it really that—feelings and meanings—that the words are there for in the first place? The flesh requires the skeleton as the feelings require the words. An actor brings life and understanding to the text through means as mysterious as the life force itself.

It may be that iSpeech “opens many opportunities in application development,” and they are working on trying to get more emotion into the computer-generated voices.

I’m uncertain, however, how the computer will know which emotion is called for, when to switch from one to another, how to mix emotions and in what quantities, or even how the emotion will be implanted in the first place.

With the growing audiobook market and hundreds of thousands of new titles being published each year in print, is “good enough” good enough to be effective in selling the finished audiobook?

Consider going to the theatre: we will go see Romeo and Juliet more than once, even though we know the story. It is in the telling that the magic happens.

From a quick search, I found other companies offering TTS programs, including IVONA.com and naturalreaders.com, the latter being a free downloadable program.

Agnieszka Szarkowska’s study with the blind and visually impaired, reported in The Journal of Specialised Translation (Issue 15, Jan 2011), says that their results agreed with a report stating, “‘listeners prefer natural sounding speech, both in comparing natural speech to synthetic speech and in comparing different synthetic voices’ (Cryer and Home 2008: 7). It is worth noting, however, that while the visually impaired viewers in this study find natural speech preferable, many of them would find synthetic speech acceptable.”

My Conclusions

I’m certain there are some great applications for TTS programs, especially in light of the fact that about 11 million people have visual impairments and 1.5 million are totally blind in the U.S. alone.

I, personally, remain unconvinced that audiobooks, in general, would profit from this technology. You’ll have to decide for yourself whether your books would.

Are there ways you would use an automatic text-to-speech converter? I’d love to hear them; please let me know in the comments.

Becky Parker Geist is a professional actor, voiceover talent, director, producer, writer, solo-performer, and acting coach. She has toured stages internationally with Chaucer Theatre and served as its Executive Director 1997-2013. After receiving her MFA in Acting from University of Illinois in 1981, she began narrating Talking Books for the Blind for Library of Congress, narrating over 70 titles in two years before moving to the San Francisco Bay Area. She is the founder of Pro Audio Voices, narrating and producing audiobooks full time. Becky is also a self-published author (Game Plan for Educators) and a produced playwright (Joy with Wings: A Daughter’s Tale) and is currently working on a series of children’s books and her first novel. For more info see Pro Audio Voices.

Photo: bigstockphoto.com. Amazon links contain my affiliate code.

    { 17 comments }

    Greg Strandberg November 27, 2013 at 2:23 am

    Wow, lots of stuff to think about here, Becky.

    From what I’ve heard it’s quite expensive to have someone sit down in a studio and read your novel for an audio book. If you can cut that down from days to hours I think a lot of people would be interested.

    I’m going the other route right now, myself. I’m thinking more of things like Dragon that will let me go speech-to-text, hopefully getting rid of that problem of my fingers not keeping up with my thoughts.

    Becky Parker Geist December 2, 2013 at 10:42 am

    Expensive is, of course, a relative term. If you consider the time it takes for an actor to prepare, narrate, and then to do the post-production (editing, mastering), realistically you’re looking at about 8-10 times the finished length of the book for an experienced narrator. But even if you are not looking at the time cost for the actor, let’s just consider the cost of the finished product. If you have a great finished audio product that really conveys your story (or content) and moves your audience emotionally or intellectually, then you will have succeeded as a writer in the fundamental intent, it seems to me. If you have succeeded, then your chances of financial success are exponentially higher: people will spread the word about it, sales will increase, positive reviews will boost sales. If you have a poor or mediocre finished audio that says the words but may not effectively convey the story or message (and mind you, this can happen with a poor narrator as well as a computer), then people may not even listen to the whole thing, and are likely to warn others away from it or give it lousy reviews. Down go sales. So while it costs more to hire a narrator, you’re almost certain to earn considerably more. There are lots of questions to ask before jumping into audio, just as with publishing – questions about your intention and expectations, your author platform, how it will be most effectively distributed. And again, “expensive” is a relative term. If and when you’re ready to look at entering the audiobook market, feel free to get in touch and we can look at those questions together. It’s not the perfect direction for every book or every author, but it can be great in many ways, not just for the actual audiobook sales alone.

    Diane Tibert November 27, 2013 at 6:28 am

    I’ve been thinking about turning my books into audio books, but I haven’t taken the leap. I don’t see a great demand for it, but I could be wrong. I have the equipment to do it myself, but I don’t know if I want to venture into it.

    Thanks for the information in the blog. It gives me more to think about. I don’t think automated voices are great for reading either. There’s no emotion, and when I hear them in text messages or wherever, I find them really annoying. I can’t imagine listening to an entire book like that.

    Becky Parker Geist December 2, 2013 at 11:01 am

    The demand for audiobooks is growing fast. According to an Audio Publishers Assn report, dollar sales increased about 14.5% from 2011 to 2012, and the audiobook market was estimated at about $1.2 billion in 2012. The trends are not surprising. With so many people multi-tasking and listening while commuting, there is a huge appeal to audiobooks. In addition, in the U.S. alone, there are over 11 million people with visual impairments and 1.5 million totally blind (2011).
    Recording yourself is an option, and can be a good option if you are skilled with audio narration recording and understand how to do the post-production. If that’s a direction you decide to consider, I’d encourage you to do a sample recording and get some feedback. Listen to it yourself critically, and have others listen who will be honest with you. I’m not saying you won’t be good at it – just that you want to know in advance before putting in all the time it takes to get the books done. You can also request samples from professional narrators to compare with your own, to give you a better feel of which way you want to go.

    Jason Matthews November 27, 2013 at 7:33 am

    Since the computers still lose a lot in just translating a book to another language, it seems like a real stretch to get all the nuances and emotions correct in any book.
    However, I love my navigation app when driving even though the voice is a computer. If it helps me accomplish something important–no problem here. TTS leaps and bounds will probably start with non-fiction and evolve from there.

    Becky Parker Geist December 2, 2013 at 11:04 am

    I also appreciate my navigation app voice and have no need for it to convey any emotion. It does the job perfectly. I definitely think TTS has its place. I just don’t think doing an audiobook is it.

    Denise Gaskins November 27, 2013 at 9:19 am

    Text-to-speech may not be much use as an audio book, but it’s great for editing. I send my file to my old Kindle and let it read to me. (I don’t think the newer ones do this.) It’s amazing how the mistakes jump out at me!

    Joel Friedlander November 27, 2013 at 4:24 pm

    That’s a terrific tip, Denise, I’m going to try that one myself, thanks.

    Becky Parker Geist December 2, 2013 at 11:05 am

    I agree!

    August Gardiner November 27, 2013 at 11:57 pm

    This is the first I’ve thought much about this type of program. I know people are working on text-to-speech (and have been for some time) but this is the first time I’ve actually read up on them. Looking at the sample languages, something amazing just clicked into place.
    This would be a brilliant tool for working on language learning. Don’t know how to pronounce a word? Type it in. Don’t know how to spell a word? Try to spell it phonetically, improving your understanding of sounds and character combinations. Just one more tool for learning Swedish – love it!
    Okay, that’s not exactly book-related, but it’s definitely an area where I think there’s a lot of potential for that kind of software.
    And no – please don’t use this kind of stuff for making audiobooks…unless you’re writing a sci-fi book that specifically calls for an early generation android narrator…hmmm, maybe I will use this software for an audiobook after all…or maybe even a whole series!

    Becky Parker Geist December 2, 2013 at 11:12 am

    I just recorded a sci-fi (AsterIce by B.L. Bates) and there are lines by the computers. But narration without character – I just can’t see it being of interest for very long. Think about a teacher or anyone really who drones on and how quickly we zone out. I expect you’d still need an actor to do the computer voice, just so it sounds interesting. And of course, that’s how they get the voices to start with. They start with actors and have the computer generate the sounds they copy. But they are just sounds at that point, disconnected from meaning.

    James November 28, 2013 at 10:37 pm

    Honestly, most professional audiobook publications already sound robotic to me. And those are real people.

    –The narrator talks too slowly
    –They also tend to be devoid of inflection.
    –And don’t talk like real people.

    The problem is that the quickest way to produce the audio versions of a book is for the actor to flub as few lines as possible. This results in the slower boring as hell narrator style, because the focus isn’t on performance, rather on speech.

    One of the few narrators that I think does an excellent job is James Marsters. Check out his work on the Dresden Files. He strives for a conversational feel, and keeps a faster pace than 99% of what is on the market. He also mentioned that he tends to flub lines quite frequently, to the dismay of producers, but feels the conversational nature is of utmost importance.

    I agree.

    Very few audiobooks have this. And I’m talking, the latest James Patterson and Stephen King releases are lacking this conversational feel and instead have the slowest grandpa sounding guy perfectly enunciating every word with a sleep inducing rhythm.

    Becky Parker Geist December 2, 2013 at 11:19 am

    You’re absolutely right that being a human doing the narration is not enough! And I agree that there are a LOT of narrators out there doing work that is mediocre at best. There are also a lot of narrators who are not actors, and I think having trained, experienced, and talented actors at the mic is essential to the high quality we want in our listening experience. That’s the quality authors and publishers should be looking for, even if it costs a little bit more at the front end (it might not, but even if it does).

    David Hooper December 7, 2013 at 2:36 pm

    Believe me, if you had an audiobook narrated by a “real person,” you’d know it. Most people mumble, talk way too fast, and can’t read aloud. :)

    Becky Parker Geist December 7, 2013 at 3:30 pm

    That’s true about how most people talk. That is one of the many factors that set the great professional narrators apart. If you’re listening to an audiobook and it’s sloppy, as most people are in regular conversation when someone can respond with a “what was that you said?”, then the audiobook was poorly produced, in addition to being poorly narrated. The goal, in my opinion, is to transport the listener into the story so effectively that you forget about the narrator because you are IN the story. Same thing with stage performances. If you’re sitting in the audience thinking about how well (or poorly) a line was delivered, for example, then you’re not engaged in the story, you’re focused on the craft. I recently listened to End of the Affair, narrated by Colin Firth, which won an Audie Award. He’s great and most of the time I was IN the story. But there were times when I had to replay sections because I got confused about who was speaking (he had not fully differentiated the voices) or because I couldn’t make out what he’d said. Those experiences pulled me out of the story and I was bummed because it was otherwise great. I have also watched too many painful videos of authors reading their own work when they are not good at it. I can hear the words and realize how good it COULD sound, and it is sad because I know they are hurting their potential readership. It’s easy to compare to singing. We expect people who sing in performance to be rehearsed/trained/on pitch/enunciating, etc. Narration/storytelling in performance (live or recorded) is just as much an art, but some assume that if you can say words into a mic then you have what it takes to create an audiobook.

    David Hooper December 7, 2013 at 2:39 pm

    I’ve purchased some ebooks directly from their authors that have come with text-to-speech audio books as bonuses. They’re ok, and I imagine would be appreciated by somebody who was unable to read text, but it’s doubtful anybody with a major release would use this option.

    With that said, I know some of the major voice actors (like Don LaFontaine and Rod Roddy) all recorded various words and phrases before they died, so we’d be able to use their voices when technology catches up to the idea…

    Becky Parker Geist December 7, 2013 at 3:42 pm

    Ah, but the great voice actors were great not because of the quality of the voice they happened to be born with or grew into, but because they knew how to draw us into the stories. I just can’t see computers ever catching up with humans in our ability to subtly navigate the mix of emotions, knowing how far to go in their expression without going over the top and losing the audience, knowing how to get the audience to feel the tension of withheld emotion – I mean, scientists haven’t figured out how our emotions work, how are computers going to be able to replicate the life force? I get that TTS has applications, but I remain unconvinced that audiobooks are among them, no matter whose voice is brought back from the dead to be cloned by a computer.
