
As we have said many times, the “space race” for text to speech (TTS) technology keeps accelerating. One of the primary advancements is the introduction of neural voices, which have replaced the original generation of voices that many people felt were too robotic for sophisticated voice applications.
The original voice technology was based on concatenative synthesis (where pre-recorded speech units, such as the phonemes of each word, are strung together), which was limited in handling variations in speech based on context. A new generation of neural text to speech (NTTS) voices uses deep learning to produce more natural, human-like speech.
The old text to speech method
Before neural voices, the only way to adjust speech attributes was through the use of tags in the Speech Synthesis Markup Language (SSML), the W3C standard supported by the major providers of voice services. Much like formatting a document, you could add tags for emphasizing words <emphasis>, adding pauses <break>, and changing the rate, pitch, or volume <prosody>.
For example, you can instruct a TTS engine to speak louder or softer by injecting tags that specify a relative increase or decrease in volume, or by using preset volume labels such as “soft,” “loud,” or “x-loud” (extra loud).
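To make the tagging approach concrete, here is a minimal sketch in Python that wraps a sentence in the three tags mentioned above. The helper name build_ssml and its parameters are our own illustration, not part of any provider's SDK, and the bare <speak> root is an assumption: some engines also expect version and namespace attributes on it.

```python
from xml.sax.saxutils import escape


def build_ssml(sentence, stressed_word, pause_ms=300, volume="loud"):
    """Return an SSML string that stresses one word, sets the volume,
    and adds a pause after the sentence.

    Hypothetical helper for illustration only; not a provider API.
    """
    # Split the sentence around the first occurrence of the word to stress.
    before, found, after = sentence.partition(stressed_word)
    if not found:
        raise ValueError(f"{stressed_word!r} does not appear in the sentence")

    # Escape the plain text so characters like & and < stay valid XML.
    body = (
        f"{escape(before)}"
        f'<emphasis level="strong">{escape(stressed_word)}</emphasis>'
        f"{escape(after)}"
    )

    # <prosody> sets the volume; <break> inserts the pause.
    return (
        "<speak>"
        f'<prosody volume="{volume}">{body}</prosody>'
        f'<break time="{pause_ms}ms"/>'
        "</speak>"
    )


ssml = build_ssml("Neural voices sound natural.", "natural")
print(ssml)
```

Every sentence that needs special treatment has to be marked up this way, by hand or by a script like this one, which is why tagging a long narration quickly becomes tedious.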
Neural voices do this for you. They are trained on conversational speech so that the voice sounds natural. We have studied a wide range of text to speech technology, and the need for tagging text is clearly declining. Instead, the latest NTTS engines provide “auto-inflection.”
By analyzing the structure of a sentence and identifying its main point(s), these engines inflect the voice appropriately for attributes such as emphasis and prosody. As the engines get more intelligent, tagging text to improve the outcome will become a very rare activity.
So what changed so suddenly?
In December 2020, IBM (Watson) replaced all of its standard voices with neural voices. Microsoft (Azure Cognitive Services) had earlier launched their new neural voices in September 2020. Both are now firmly atop the leaderboard alongside Amazon (Polly), giving you an amazing selection of beautiful voices you can use with Videate.
Neural voices are also available in dozens of languages, which makes voice translation better as well. You can check out a few examples of neural voices at this link.
Just last week, Microsoft announced that their Custom Neural Voice feature has now reached General Availability, joining Amazon and Google in the custom voice race.
Custom voices allow you to create your own brand voice. Microsoft’s custom neural voice capability is being used by companies such as Disney, Progressive, and Duolingo, a language learning company. With more competition (thanks, Microsoft), it is available at a much lower cost than ever before.
We are going to test out the new custom voice feature from Microsoft in the next couple of weeks.
And with the expanding list of neural voices from the various providers, the voice quality of your Videate-generated videos is improving by an order of magnitude every few months.
We believe 2021 will be the year we pass the Video Turing Test.