Generated Speech Has Come of Age

Our big bet in launching Videate was that text-to-speech engines would improve rapidly. In the short time since we started the company, Amazon, Microsoft, and Google have dramatically improved the quality of generated speech. We expect this trend to continue: many companies are investing billions of dollars in this technology, and we’re now watching a fast-moving “Space Race” for generated speech.

If you read our initial white paper, you saw the reference to the Turing Test, a scenario proposed by Alan Turing in 1950 in which a human evaluator holds text-only, natural language conversations with both a human and a computer. In the popular version of the test, if the evaluator could not reliably tell the two apart after five minutes, the computer passed. Seventy years later, the text-only version of the Turing Test is giving way to one based on computer-generated speech.

Videate uses text-to-speech (TTS) technology as one of the fundamental pieces of its patent-pending platform. The text comes from your documents, the scripts that drive great videos. We can start with your existing product documentation written in DITA, AsciiDoc, Google Docs, or Word. You don’t need to write down every detail of how your software works. You just need to follow a consistent format and write as if you were speaking the words aloud. As Ridley Scott said, “Once you crack the script, everything else follows.”

We then use AI and automation to learn about your products. When your script says “click on this icon” or “go to this menu,” we know where that icon or menu is in every part of your application. We use your scripts to navigate your software, synchronizing the on-screen movement as if you were moving the mouse and speaking the words yourself. The result doesn’t have to feel mechanical. You can add context and animation instructions to enrich the experience, and Videate will use natural language processing to make further improvements.

At the same time, we use TTS technology to generate the voice, which is synchronized with the movement as we record. The result is a video produced through automation rather than by hand. The benefits are clear. There’s no post-production work to edit out pauses, stammers, breathing, noise, or errors. And since the video is based on your script, you can quickly make changes, fix typos, and generate new videos in minutes. You can easily deliver up-to-date videos whenever you release software, even with last-minute UI changes.
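To make the idea of a script driving both the narration and the on-screen movement more concrete, here is a minimal Python sketch. It is not Videate’s implementation. The instruction pattern, the assumed speaking rate, and the names `ACTION_PATTERN`, `WORDS_PER_SECOND`, `Step`, and `build_timeline` are all hypothetical, chosen only for illustration. The sketch reads a script line by line, picks out simple instructions such as “click on the Save icon,” and lays each line on a timeline so that an action starts at the same moment as the sentence that describes it.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical pattern for instructions such as "Click on the Save icon"
# or "Go to the File menu". A real product would need far richer parsing.
ACTION_PATTERN = re.compile(
    r"\b(click|go to|open|select)\s+(?:on\s+)?(?:the\s+)?(.+?)\s+(icon|menu|button|tab)\b",
    re.IGNORECASE,
)

# Assumed speaking rate used to estimate narration length.
# A real system would use the duration reported by the TTS engine.
WORDS_PER_SECOND = 2.5


@dataclass
class Step:
    narration: str          # sentence to be spoken
    action: Optional[str]   # e.g. "click", or None for narration only
    target: Optional[str]   # e.g. "Save icon"
    start: float            # seconds from the start of the video
    duration: float         # estimated seconds of speech


def build_timeline(script: str) -> List[Step]:
    """Turn a plain-text script into timed steps, one per non-empty line."""
    steps: List[Step] = []
    clock = 0.0
    for line in script.splitlines():
        line = line.strip()
        if not line:
            continue
        match = ACTION_PATTERN.search(line)
        action = match.group(1).lower() if match else None
        target = f"{match.group(2)} {match.group(3)}" if match else None
        duration = len(line.split()) / WORDS_PER_SECOND
        # The action shares its start time with the narration that describes it.
        steps.append(Step(line, action, target, clock, duration))
        clock += duration
    return steps


if __name__ == "__main__":
    demo = (
        "Welcome to the product tour.\n"
        "Click on the Save icon to store your changes.\n"
        "Go to the File menu and choose Export.\n"
    )
    for step in build_timeline(demo):
        print(f"{step.start:5.1f}s  {step.action or 'narrate':8}  {step.target or ''}")
```

Because the timeline is rebuilt from the text each time, fixing a typo or changing a step in the script changes the timing of everything after it automatically, which is the property that makes regenerating a video quick.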

We’re used to hearing Alexa, Siri, and Google Assistant in our daily lives. And yet, when it comes to using computer-generated voices in software videos, there is still skepticism about whether enterprise software users will find them acceptable.

We surveyed a wide range of B2B software end-users and asked this question:

Given the choice of having up-to-date software videos with computer-generated voices or out-of-date software videos with human voices, which would you prefer?

The preference has clearly shifted to videos that are always up to date, even if the voices are generated.

The ability to create your own personalized brand voice is just around the corner. Amazon announced this capability in February 2020, and Google has recently launched a similar one, which is now in beta. For now the cost puts it within reach of larger companies only, but we expect it to be affordable for everyone soon. Generated speech has come of age.

“To make a great film you need three things – the script, the script and the script.” - Alfred Hitchcock

