Can you tell the difference between a human voice actor and a synthetic voice using text-to-speech technology? Before you answer, you’ll want to listen to some of the samples we’ve included and see what you think.
Hint: It’s harder than you think.
We’ve seen a massive rise in demand for text to speech (TTS) technology in recent years. It started with accessibility, such as services for vision-impaired people. Then, with the increase of voice-enabled devices, TTS became more common. Now, TTS is widely used in ads and video content.
At Videate, we use TTS voices from IBM Watson, Amazon Polly, and Microsoft Cognitive Services. You can listen to samples here to see how realistic some of these voices have become. It's pretty incredible.
The technology has evolved to the point where many AI-generated voices are nearly indistinguishable from human speech. Listen to these two samples from Google’s tests provided by Futurism. Which one is human?
In this case, Google didn't disclose which was the human voice. It might also surprise you that these two samples are from nearly five years ago. Think how far technology has advanced since then. Today, TTS leverages AI, machine learning, and neural networks to mimic speech patterns.
The Speech Synthesis Special Interest Group held a competition in 2021 known as the Blizzard Challenge, where they tested AI voices using panels of independent human judges. The judges detected no significant differences between the natural speech and AI voices.
You be the judge: which one among these submissions by Microsoft’s Azure TTS service is human?
Sample 3 is the human voice, but many participants chose sample 4.
Want to try again? Here’s another set.
Sample 5 is the human voice in this case.
As research into text to speech continues to evolve, it will become even harder to tell the difference between synthetic voices and natural speech.
Here are some of the companies leading the TTS industry, from major players to emerging companies.
Google Cloud Text-to-Speech features more than 220 voices across 40 languages, including language-specific variants and dialects. Users can create unique voices for projects and tune them by adjusting pitch, cadence, and speaking rate. Speech Synthesis Markup Language (SSML) allows for further customization of its TTS.
Microsoft's Text to Speech product also produces life-like synthetic speech that can capture intonation and emotion. Azure has more than 330 neural voices and variants in 129 languages. It offers custom voice creation, and the TTS supports different speaking styles, such as newscast reader, customer service rep, shouting, whispering, cheerful, and sad. Azure also has SSML for voice tuning.
IBM Watson Text to Speech is another leading cloud API service that creates custom voices for brands, offering 35 synthetic voices in 16 languages and dialects. Speech attributes are controlled with SSML, and users can further personalize voices by specifying attributes such as strength, pitch, timbre, rate, and breathiness. There are also multiple tones, such as GoodNews, Apology, and Uncertainty.
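Several of the services above accept SSML for voice tuning. As a rough illustration of what that looks like, here is a minimal Python sketch that wraps a line of script in the standard SSML `speak` and `prosody` elements to set rate and pitch. The function name `build_ssml` and its defaults are our own, not part of any vendor SDK, and each provider supports a slightly different subset of SSML values, so check the provider's documentation before sending markup like this to a real service.

```python
import xml.etree.ElementTree as ET


def build_ssml(text, rate="medium", pitch="medium", lang="en-US"):
    """Wrap text in a minimal SSML document with prosody settings.

    build_ssml is a hypothetical helper for illustration only.
    rate and pitch are passed through unchanged, so the caller must
    supply values the target TTS provider actually accepts.
    """
    # Root element defined by the W3C SSML specification.
    speak = ET.Element(
        "speak",
        {
            "version": "1.0",
            "xmlns": "http://www.w3.org/2001/10/synthesis",
            "xml:lang": lang,
        },
    )
    # <prosody> controls speaking rate and pitch for the enclosed text.
    prosody = ET.SubElement(speak, "prosody", {"rate": rate, "pitch": pitch})
    # ElementTree escapes characters such as & and < when serializing.
    prosody.text = text
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(build_ssml("Welcome to the product tour.", rate="slow", pitch="+2st"))
```

The resulting string is what you would pass as the SSML input to a TTS request. Building it with an XML library rather than string concatenation keeps special characters in the script from breaking the markup.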
Amazon Polly offers 68 voices across 34 languages and variants. Some voices are bilingual, offering the same voice quality in multiple languages, such as English and Hindi. Voices exhibit natural speech patterns, and some can also speak in what Amazon calls the Newscaster Speaking Style. You can also build custom brand voices exclusive to your company.
Number of voices available: 50 (depending on subscription plan)
WellSaid Labs is an emerging player in the text to speech industry, allowing organizations to quickly add TTS voices to productions. Users can choose from a lineup of voices or create their own voice avatars for branded products. Like other solutions, WellSaid Labs allows you to store company and industry jargon, terms, and names in a phonetic library to help train the synthetic voice AI module.
Murf.ai is another TTS company for enterprises. It creates text to speech in 20 languages with more than 130 voices. Users can generate synthetic voices from scripts and also convert recordings into professional voiceovers. For example, you can take subject matter experts' discussions, say from a webinar you produced, and turn them into more polished presentations. Users can also tweak pitch, emphasis, and punctuation to customize voices further.
These are just a few of the text to speech companies to watch. As technology continues to evolve, we’re excited about the possibilities.
Our powerful platform automatically records video tutorials within your SaaS platform to create training and demo videos at scale.
At Videate, we incorporate text to speech and neural voices into our automated video production, making it fast and cost-effective to produce huge volumes of training and support videos across multiple languages in just minutes.
Did a software update make your videos outdated? Simply update a few words in the video's script and re-render the video. Videate generates the new version in real time, so there is no need to block out your calendar to re-record or re-edit.
Want to produce videos in multiple languages? Videate generates text to speech voiceover in multiple languages. Never worry about hiring multiple voice actors to support different languages for your global company again.
To learn more about how you can automate video production, reach out to our team for a demo!