Several tools created with artificial intelligence and machine learning techniques have managed to surprise us in recent years.

DALL-E, Midjourney, and Stable Diffusion, for example, can create images from text descriptions.

ChatGPT can converse like a human, explain any concept and produce coherent summaries.

Now Microsoft has applied these techniques to create a speech synthesis engine that can imitate any person's voice after listening to that person speak for just three seconds.

The tool is called VALL-E, and it is capable of imitating pitch and inflection with amazing accuracy.

Speech synthesis models that use machine learning techniques to achieve a realistic result are not new.

Companies like Google and Meta have been perfecting them for years.

Some can imitate voices, but they need extensive training on recordings that cover most phonemes, which often means having the speaker read predefined texts aloud for several minutes.

VALL-E, on the other hand, can capture the essence of a voice from any three-second fragment, even if what is said in that fragment has nothing to do with the text it is asked to synthesize.

Microsoft has achieved this by training the language model on more than 60,000 hours of recordings from more than 7,000 different voices, all part of the LibriLight catalog.

The more closely a person's voice resembles one of those 7,000 reference voices, the easier it is for VALL-E to produce a convincing result, although for now it only works in English.

In addition to preserving intonation and timbre, VALL-E also mimics the acoustic conditions of the recording.

If the three-second sample comes from a phone call, for example, the results will sound like a phone call too.

Microsoft hopes that these types of tools will be used in the future to correct errors in audio recordings, generate more realistic virtual assistants or recover the voice of someone who has died.

The company has also created a tool to detect the use of VALL-E in a recording and thus prevent it from being used to impersonate a person or circumvent a biometric identification system.
