Unmasking the controversial reality of text to speech voices

Videate May 31, 2022

The Netflix documentary profiling Andy Warhol features an AI voice reading his diary. It’s almost spooky how lifelike the synthetic voice sounds.

Another documentary film, Roadrunner, faced ethics complaints over its telling of the life of Anthony Bourdain. Without full disclosure, the filmmakers used an AI text to speech voice to emulate Bourdain’s voice and intermixed it with real clips during the film. Viewers were unable to tell the two apart.

In their defense, the filmmakers say they had permission from the estate and literary agents to use the TTS voice to narrate words Bourdain had actually written. The same goes for the Warhol series, but it raises the question: What’s real, what’s not, and what are the ethics involved?

The (text to speech) voice of the company

Many companies employ a voice actor or spokesperson to embody their brand. Once they make this investment, it’s in their best interest to maintain that consistency. However, with text to speech (TTS), many readily available voices can be used by multiple brands. If you want to lock up a voice that’s unique, or create a custom voice for TTS, there are ethical considerations.

Text to speech voices are rapidly improving

It’s becoming increasingly difficult to tell whether companies are using a voice actor or a text to speech solution these days. At some point in the not-too-distant future, it will be nearly impossible to hear the difference.

We’ve already seen significant advancements in both audio and video, ushering in concerns about deepfakes, in which someone can essentially make anyone appear to say whatever they choose. With advances in TTS, the potential for misuse grows even more.

Who owns a voice?

When companies hire voice actors, they pay a fee to use that voice for specific uses and periods. But what if they could use TTS to mimic that voice actor's voice and speech patterns so they can pay once and use that voice forever?

The technology already exists.

While voices cannot be protected by copyright, the law does provide some legal protection for public figures and recognizable voices. So, it’s unlikely someone could use a custom text to speech voice that sounds like Morgan Freeman or Barack Obama.

However, a custom brand voice can be created to drive your TTS engine. To reduce liability, companies will likely need to establish a legal and ethical framework for compensating the voice actor, or whoever else provides the recordings used to train the model, because the resulting voice becomes part of the brand’s assets moving forward.

Affordability: text to speech voices win over human actors

Another factor is the affordability of creating custom voices. As technology has advanced over just the past two years, the price has decreased at a staggering rate.

When Amazon first went to market with its custom brand voice product in 2020, creating a custom text to speech voice was expensive enough to put it beyond the reach of all but a handful of companies. Today, a similar program offered through Microsoft costs around $5,000. In many cases, that makes creating a custom text to speech voice more affordable than paying voice actors repeatedly.
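To make the comparison concrete, here is a rough break-even sketch. The $5,000 figure comes from the paragraph above; the per-session voice actor fee is a purely hypothetical number chosen for illustration, and the function name is our own. Real costs will also include usage fees, licensing terms, and re-recording, which this ignores.

```python
import math


def sessions_to_break_even(custom_voice_cost: float, fee_per_session: float) -> int:
    """Return how many paid voice actor sessions cost at least as much
    as a one-time custom TTS voice.

    Both inputs are assumptions supplied by the caller; this ignores
    ongoing TTS usage charges and licensing renewals.
    """
    if fee_per_session <= 0:
        raise ValueError("fee_per_session must be positive")
    return math.ceil(custom_voice_cost / fee_per_session)


# $5,000 is the custom voice price cited above.
# $400 per session is a hypothetical voice actor fee, not a quoted rate.
break_even = sessions_to_break_even(5000, 400)
print(f"Custom voice pays for itself after {break_even} sessions")
```

With these illustrative numbers, the custom voice costs less than hiring talent once a brand needs 13 or more recording sessions. A higher or lower session fee shifts that point accordingly.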

Ethical guidelines for text to speech voices

Companies such as Google and Microsoft have put strict guidelines in place for the ethical use of custom text to speech voices. Customers must apply using lengthy intake forms and agree to terms, including:

Approval of voice talent
Approved use cases
Disclosure of synthetic voice use
Permission from voice talent for use to create and train AI voice models

Voice talent is required to read several predefined statements acknowledging consent to the terms. For its neural voice creation service, Microsoft even reserves the right to run biometric tests on the recorded statements to verify their authenticity.

Governmental regulations for AI voices

Most governmental regulations haven't caught up to evolving technologies like AI and machine learning. The use of text to speech is largely unregulated unless it runs afoul of current laws. However, governments may require the disclosure of TTS voices in the future. For example, they may require proactive audio watermarking (inaudible identifiers embedded in the audio) so that detection tools can identify it as text to speech.

Some custom TTS creators already require audio watermarking and disclosure as part of their user agreements.
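For readers curious about how audio watermarking can work in principle, below is a minimal sketch of one classic approach, a spread-spectrum watermark: a low-level pseudo-random pattern derived from a secret key is added to the audio, and a detector holding the same key checks for it by correlation. This is a toy illustration, not the scheme any TTS vendor is known to use. The function names, the key values, the strength, and the threshold are all assumptions, and the strength here is chosen so the demo is reliable rather than tuned to be truly inaudible. Production systems shape the mark to stay below the threshold of hearing and to survive compression.

```python
import math
import random


def _pattern(key: int, length: int) -> list[float]:
    """Pseudo-random sequence of +1/-1 values derived from a secret key."""
    rng = random.Random(key)
    return [1.0 if rng.random() < 0.5 else -1.0 for _ in range(length)]


def embed_watermark(samples: list[float], key: int, strength: float = 0.01) -> list[float]:
    """Add a faint keyed pattern to every audio sample."""
    pattern = _pattern(key, len(samples))
    return [s + strength * p for s, p in zip(samples, pattern)]


def detect_watermark(samples: list[float], key: int, threshold: float = 0.005) -> bool:
    """Correlate the audio with the keyed pattern.

    Marked audio correlates at roughly `strength`; unmarked audio, or
    audio marked with a different key, correlates near zero.
    """
    pattern = _pattern(key, len(samples))
    score = sum(s * p for s, p in zip(samples, pattern)) / len(samples)
    return score > threshold


# Stand-in for synthesized speech: five seconds of a 440 Hz tone at 16 kHz.
SAMPLE_RATE = 16_000
audio = [0.3 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
         for n in range(5 * SAMPLE_RATE)]

marked = embed_watermark(audio, key=1234)

print(detect_watermark(marked, key=1234))  # watermarked audio, correct key
print(detect_watermark(audio, key=1234))   # original audio, no watermark
print(detect_watermark(marked, key=9999))  # watermarked audio, wrong key
```

Only the first check should report a watermark. The design relies on the key staying secret: without it, the pattern looks like faint noise, which is what makes the identifier hard to spot or strip by casual inspection.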

Creating an ethical framework

The need for ethical standards has already led to the creation of several industry trade groups, such as the Open Voice Network, a non-profit association with the mission of developing standards and ethical use guidelines.

As neural and custom voice creation becomes even more sophisticated, it will be important for organizations to create an ethical framework for managing and controlling how custom voices are used. Custom voices aid in the creation of audio and video content, but companies need to be proactive to avoid misuse.