Perception sometimes lags behind reality. Case in point: text to speech (TTS).
Perception: Clunky, soulless, robotic readings that take anyone listening out of the conversation. Think Benny Benassi’s song, “Satisfaction.”
Reality: Modern text to speech has improved by leaps and bounds. In fact, it has reached a point, in both quality and everyday use, where viewers no longer seem to care whether a video is narrated by a human or by TTS.
In short, text to speech has gotten so realistic that SaaS companies can use it in their how-to and support videos without sacrificing listener engagement.
The concern about text to speech voices
The main concern about TTS narrators is that they don’t perform as well as a living, breathing human being delivering the dialogue. They’re not as warm. They don’t understand intonation. Or pacing. Or, in some cases, proper enunciation. In short, people are worried that it just doesn’t sound natural.
And these concerns are absolutely valid. For SaaS training videos, the point is to keep viewers engaged. And using sonic branding to find the right voice to represent your brand is extremely important when delivering audio-visual content to your audience.
After many years of awkward, sometimes frustrating text to speech (think any time you are dealing with a robot on a customer service call…), it’s natural to be mistrustful of text to speech for your own projects.
But here’s the thing – there is now research showing that TTS voices perform just as well as human voices when it comes to engagement.
Text to speech voices and learning
In their study, The Relevancy of the Voice Effect for Learning, Scotty D. Craig and Noah L. Schroeder asked participants to watch a 2-minute video narrated by either a human or a TTS voice. Participants then took a learning assessment in the form of a multiple choice test.
“There were no significant differences found in the participant’s perceptions of how well the voice facilitated learning… or its credibility.”
As Craig and Schroeder write in their abstract:
In most respects, those who learned from the modern text-to-speech engine were not statistically different in regard to their perceptions, learning outcomes, or cognitive efficiency measures compared with those who learned from the recorded human voice.
Our results imply that software technologies may have reached a point where they can credibly and effectively deliver the narration for multimedia learning environments.
This study was conducted in 2018, and TTS technology has only improved since then.
To be clear, nobody is making the argument that TTS is totally indistinguishable from a human voice.
But if TTS and human voices deliver the same information with the same level of engagement, without a drop-off in viewer learning? That's a huge deal for SaaS companies.
Benefits of an AI voice generator
So, if TTS and human voices perform the same in terms of engagement, why should a SaaS company use TTS in its software training videos?
Money, time, and bandwidth.
Hiring professional voice actors can be expensive, which is why most companies don’t use them. Without an external hire, the voiceover falls to internal employees – but this task is almost always outside their job description, and it keeps them from doing what they are being paid to do.
Both options cost the company money.
Unlike a voice actor, an AI voice generator produces the narration immediately, in a single take, without the need for breaks, and without vocal flubs. This saves a tremendous amount of time.
Yet another time saver: video producers don’t have to scramble to find recording slots on the already busy calendars of your subject matter experts. And what if your voice talent has to take an emergency meeting? Or gets sick? That costs time as well.
In a recent Videate survey, 73% of respondents said that keeping videos up to date is the most challenging aspect of video for their team. And two out of three companies said their videos are out of date.
Simply put, companies don’t have the bandwidth to create a video for every update of every feature of every product. This becomes nearly impossible if companies want to globalize with multiple languages.
The only way to keep up is to automate as much as possible, and realistic text to speech is a critical component.
Additionally, TTS technology can be customized to suit the needs of different audiences. For example, many text to speech programs offer a choice of voices, such as male and female voices, or voices with different regional accents. This allows video producers to tailor the voice to the specific needs of their audience, making the content more engaging and accessible.
Many text to speech programs also have settings that let users adjust the tone and inflection of the voice, making it sound more natural and expressive. Some go further and use artificial intelligence and machine learning algorithms to analyze the text and determine the appropriate tone and emotion for each sentence or phrase.
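For readers curious what those voice and tone settings can look like in practice, here is a minimal sketch in Python. It assumes a TTS engine that accepts SSML (Speech Synthesis Markup Language, a W3C standard that many engines support to varying degrees). The function name build_ssml and the voice name en-US-ExampleVoice are hypothetical and chosen for illustration only; real voice names and required attributes differ from vendor to vendor, and this is not a description of how Videate or any specific product works.

```python
import xml.etree.ElementTree as ET


def build_ssml(text, voice_name, rate="medium", pitch="medium"):
    """Wrap narration text in SSML that selects a voice and adjusts delivery.

    voice_name is whatever identifier your TTS engine uses (hypothetical here).
    rate and pitch use standard SSML prosody values such as "slow" or "+5%".
    """
    speak = ET.Element("speak")
    voice = ET.SubElement(speak, "voice", {"name": voice_name})
    prosody = ET.SubElement(voice, "prosody", {"rate": rate, "pitch": pitch})
    prosody.text = text  # ElementTree escapes characters like "&" on output
    return ET.tostring(speak, encoding="unicode")


# Example: a slower, slightly higher-pitched read for a how-to step.
ssml = build_ssml(
    "Click Save & Continue.",
    voice_name="en-US-ExampleVoice",  # hypothetical voice identifier
    rate="slow",
    pitch="+5%",
)
print(ssml)
```

The resulting string would then be sent to whichever TTS service you use in place of plain text. Some engines expect additional attributes on the speak element, so check your vendor’s documentation before relying on this shape.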
Videate automates how-to videos using realistic text to speech
Videate is the only video automation platform that keeps how-to and training videos up to date with every software release.
We incorporate text to speech into our automated video production, making it fast and cost-effective.
Request a demo and we’ll show you how the platform works with your software to generate professional software how-to videos, complete with voiceover, in just minutes.