Google develops Tacotron 2, a human-like text-to-speech AI system

Taking a giant leap towards its "AI first" dream, Google has come up with a text-to-speech AI system whose articulation is hard to tell apart from a human voice.

The new text-to-speech artificial intelligence system developed by Google is called Tacotron 2, and it delivers AI-generated speech that closely matches a human voice, according to a report by Inc.com. At Google I/O 2017, the company's CEO Sundar Pichai announced that it would focus on "AI first" and launch several products and features such as Smart Reply for Gmail, Google Lens, and Google Assistant for iPhone.

According to a paper published on arXiv.org, Tacotron 2 first converts the text into a spectrogram, a visual representation of how the speech should sound. This spectrogram is then fed to Google's existing WaveNet algorithm, which turns it into audio that comes close to mimicking human speech. WaveNet can also learn different voices and generate artificial breaths.
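To make the idea of a spectrogram concrete, here is a minimal sketch in Python. It runs in the opposite direction to Tacotron 2, computing a spectrogram from existing audio instead of predicting one from text, and it uses a plain linear-frequency scale. The function names, frame sizes and the 1,000 Hz test tone are all assumptions chosen for illustration and are not taken from the paper.

```python
import cmath
import math

def hann(n):
    # Hann window: tapers each frame towards zero at both ends.
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def spectrogram(samples, frame_len=64, hop=32):
    # Slice the signal into overlapping frames and take the magnitude
    # of a naive discrete Fourier transform of each frame.
    window = hann(frame_len)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + i] * window[i] for i in range(frame_len)]
        mags = []
        for k in range(frame_len // 2 + 1):
            acc = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)
            )
            mags.append(abs(acc))
        frames.append(mags)
    return frames  # rows are time steps, columns are frequency bins

# Illustrative input: a pure 1,000 Hz tone sampled at 8,000 Hz.
sample_rate = 8000
tone_hz = 1000
signal = [math.sin(2 * math.pi * tone_hz * t / sample_rate) for t in range(256)]

spec = spectrogram(signal)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / 64  # 64 is the frame length used above
print(len(spec), "frames, strongest frequency:", peak_hz, "Hz")
```

Each row of the result describes which frequencies are present at one moment in time, which is the kind of "image" a vocoder such as WaveNet is then asked to turn into sound.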

The Tacotron 2 researchers report that their model achieves a mean opinion score (MOS) of 4.53, while professionally recorded speech achieves a MOS of 4.58. Based on the audio samples, Google claims that Tacotron 2 can tell from context whether a word such as "desert" or "present" is being used as a noun or a verb, and alter the pronunciation accordingly. The company also claims that the AI system can stress words written in capital letters and apply the proper inflection when a sentence is a question rather than a statement.
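A mean opinion score is simply the average of the ratings that human listeners give to audio clips. The short Python sketch below shows the arithmetic, assuming the commonly used 1 to 5 scale. The ratings are invented for illustration and are not data from the Tacotron 2 study, and the function name is likewise an assumption.

```python
from statistics import mean

def mean_opinion_score(ratings):
    # Average a list of listener ratings, assumed to lie on a 1-5 scale.
    for rating in ratings:
        if not 1 <= rating <= 5:
            raise ValueError(f"rating out of range: {rating}")
    return mean(ratings)

# Hypothetical ratings, made up for this example only.
synthetic_ratings = [5, 4, 5, 4, 5, 4, 5, 4]
human_ratings = [5, 5, 4, 5, 4, 5, 5, 4]

synthetic_mos = mean_opinion_score(synthetic_ratings)
human_mos = mean_opinion_score(human_ratings)
print("synthetic:", synthetic_mos, "human:", human_mos,
      "gap:", human_mos - synthetic_mos)
```

The smaller the gap between the two averages, the harder listeners found it to tell the generated speech from the recorded one.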

However, Google's engineers have not revealed many details about the Tacotron 2 text-to-speech artificial intelligence system. They have, though, left a clue that lets developers gauge how far the system has progressed.

As per the report, the filename of each '.wav' sample contains either 'gen' or 'gt'. The published paper indicates that 'gen' marks speech generated by Tacotron 2 and 'gt' marks real human speech. To be specific, 'gt' stands for 'ground truth', which is a machine learning term meaning 'the real deal'.
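Anyone who wants to check their own guesses against the labels could sort the samples with a few lines of Python. The sketch below assumes that the label appears as a separate, underscore-delimited part of the filename. That naming pattern and the example filenames are hypothetical, since the report only says that each name contains 'gen' or 'gt'.

```python
from pathlib import Path

def classify_sample(filename):
    # Split the name (without extension) on underscores and look for a label.
    # The underscore-delimited pattern is an assumption for this sketch.
    tokens = Path(filename).stem.lower().split("_")
    if "gt" in tokens:
        return "human"       # 'gt' = ground truth, a real recording
    if "gen" in tokens:
        return "tacotron2"   # 'gen' = generated speech
    return "unknown"

# Hypothetical filenames, not the actual sample names.
for name in ["sample_01_gen.wav", "sample_01_gt.wav", "intro.wav"]:
    print(name, "->", classify_sample(name))
```

Matching whole tokens instead of searching for the letters anywhere in the name avoids mislabelling a file whose name merely happens to contain "gen" or "gt" inside another word.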