The ambisonic dome in Ircam's studio 1. (Illustration: G. ARESTEANU / IRCAM AMPLIFY)

  • The Ircam Amplify team, a subsidiary of the Institute for Research and Coordination in Acoustics/Music (Ircam), unveiled its most advanced work on artificial voice.

  • Amplify, created a few months ago, relies on the work of around a hundred IRCAM researchers to reflect on the uses of tomorrow.

  • In a few years, we can imagine artificial voices capable of adapting their responses to our intonations, in real time.

“When you talk to a human, you adapt to the way they speak. The same message, the same content, pronounced differently, triggers a different reaction in the listener.”

Nathalie Birocheau, managing director of Ircam Amplify, could debate human-machine interaction for hours.

With Marion Laporte, brand and communities director, and Vincent Meurisse, project manager at Amplify, she welcomed 20 Minutes this Wednesday near the famous Place Igor-Stravinsky in Paris, where the future uses of synthetic voices are taking shape.

A few steps from Jean Tinguely and Niki de Saint Phalle's fountain, near the Centre Pompidou, sits the Institute for Research and Coordination in Acoustics/Music (IRCAM), founded in 1977 by the composer Pierre Boulez.

It is in this completely soundproofed, largely underground building that the musical future comes to life.

The Amplify subsidiary, created a few months ago, draws on the research of around a hundred IRCAM scientists to reflect on the uses of tomorrow.

The team presented its most advanced work on artificial voice, and what it could mean for intelligent voice assistants.

The subtlety of intonation

Between two demos, the conversation quickly turns to prosody [the characteristics of the voice that make emotions and intentions intelligible].

A key research axis for the next generations of voice assistants.

To understand one another, words delivered in a monotone are not enough.

The number of misunderstandings in our written exchanges is proof of this.

Intonation, volume and timbre give as much information about the interlocutor's emotional state as the semantics.

Maybe even more.

That's why voice assistants still have their work cut out for them before they sound like the sweet voice of Scarlett Johansson's artificial intelligence in Spike Jonze's film Her.

Vincent Meurisse types lines of code on his computer to open the doors to the future.

He addresses the machine in a cheerful, lively voice.

The machine answers him, without forming any intelligible words, reproducing exactly the same tone.

“You can imagine a little robot with an animal-like voice made of onomatopoeias,” he explains. “It will adapt to the way I speak and to my intonations, then reproduce them.”

While the demonstration is limited to copy-pasting prosody, it suggests a new form of interaction with artificial voices.
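The demo as described amounts to extracting the melody of one voice and replaying it through another timbre. Below is a minimal sketch of that idea in Python: it estimates the pitch contour of a recorded phrase with librosa and drives a plain oscillator with it, producing a wordless, robot-like reply with the same intonation. The file names and parameter values are illustrative assumptions, not Ircam's actual code.

```python
import numpy as np
import librosa
import soundfile as sf

# Load a spoken phrase (hypothetical file name).
y, sr = librosa.load("spoken_phrase.wav", sr=16000)

# Estimate the fundamental frequency (F0) frame by frame.
f0, _, _ = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
)

# Interpolate over unvoiced gaps (NaN frames) so the synthetic reply
# stays continuous; assumes the recording has some voiced frames.
idx = np.arange(len(f0))
mask = ~np.isnan(f0)
f0 = np.interp(idx, idx[mask], f0[mask])

# Upsample the frame-rate contour to audio rate.
hop = 512  # librosa.pyin's default hop length
t_frames = idx * hop / sr
t_audio = np.arange(int(t_frames[-1] * sr)) / sr
contour = np.interp(t_audio, t_frames, f0)

# Drive a simple oscillator with the contour: a "robot" timbre,
# no words, but the same melody as the original speech.
phase = 2 * np.pi * np.cumsum(contour) / sr
sf.write("onomatopoeia_reply.wav", 0.3 * np.sin(phase), sr)
```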

“With machine learning and a fairly large dataset of responses, we can imagine an interaction that is built according to how the intonation of the conversation evolves,” he anticipates.

The machine will be able to adapt to the emotion of the human addressing it.

If the person is angry or exhausted, it will not answer them the same way.
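To make that anticipated behavior concrete, here is a hedged sketch of one plausible approach: summarize each utterance by a few prosodic statistics, train an off-the-shelf classifier to map them to an emotional state, and let that state select a response style. The features, labels, file names and classifier choice are assumptions for illustration, not Ircam's method.

```python
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier

def prosodic_features(path: str) -> np.ndarray:
    """Summarize a clip: pitch statistics, energy, and a speaking-rate proxy."""
    y, sr = librosa.load(path, sr=16000)
    f0, _, _ = librosa.pyin(y, fmin=65, fmax=1047, sr=sr)
    f0 = f0[~np.isnan(f0)] if np.any(~np.isnan(f0)) else np.zeros(1)
    rms = librosa.feature.rms(y=y)[0]               # loudness per frame
    zcr = librosa.feature.zero_crossing_rate(y)[0]  # crude articulation proxy
    return np.array([f0.mean(), f0.std(), rms.mean(), rms.std(), zcr.mean()])

# Hypothetical labeled corpus; in practice, the "fairly large dataset"
# mentioned above.
train = [("calm_01.wav", "calm"), ("angry_01.wav", "angry"),
         ("tired_01.wav", "tired")]
X = np.stack([prosodic_features(path) for path, _ in train])
labels = [label for _, label in train]

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, labels)

# Map the detected state to a response style for the synthetic voice.
RESPONSE_STYLE = {
    "angry": "slow, low, reassuring",
    "tired": "brief, soft",
    "calm": "neutral, informative",
}
state = clf.predict(prosodic_features("incoming_turn.wav").reshape(1, -1))[0]
print(f"Detected state: {state} -> reply style: {RESPONSE_STYLE[state]}")
```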

Fighting certain cognitive biases

Without prosody, it is difficult to go beyond the somewhat basic interactions we see with today's intelligent personal assistants.

"Many laboratories of the American tech giants have started using purely mathematical methods with 100% technical engineers," observes Nathalie Birocheau.

The emotional dimension was not always present.

The result: voice assistants that fool no one, whether in their grasp of common sense or in their way of reacting to emotion.

However, to understand others, there is a whole range of external elements to take into account.

When we interact, we adapt to a context.

We raise or lower our volume and modify our tone according to a multitude of factors: the interlocutor's stress level, their age, their gender.

Is there noise around?

Are they alone or accompanied?

What is their level of concentration?

The way we say "hello" puts the interlocutor in a certain emotional state.

"Tests have been carried out on emergency medical call centers", underlines Nathalie Birocheau.

They have shown that the same message does not trigger the same reaction.

"If I am a man, if I speak softly with a certain tone, a certain prosody, I have a nine out of ten chances of having help, whereas if I have a shrill, quavering voice, not sure of herself. , I have a one in 10 chance of having help, ”she points out.

The voice carries information beyond words.

Hyper-personalization and real-time generation

Amplify mainly works on human-machine companionship.

An intelligent assistant capable of detecting levels of anxiety, anger or fatigue could warn a call-center operator and help them adopt the best possible behavior in a given situation.

And, above all, it could keep them from falling into the trap of cognitive biases like those in the medical emergency call example.
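As a rough illustration of such a companion, the sketch below tracks per-window pitch and energy statistics against the caller's own baseline and raises an alert when a crude stress score crosses a threshold. The scoring rule, baselines and threshold are deliberately simplified assumptions, standing in for a trained model.

```python
import numpy as np

def stress_score(f0_mean: float, f0_baseline: float,
                 rms_mean: float, rms_baseline: float) -> float:
    """Relative rise above the caller's own baseline; 0.0 means calm."""
    pitch_rise = max(0.0, (f0_mean - f0_baseline) / f0_baseline)
    energy_rise = max(0.0, (rms_mean - rms_baseline) / rms_baseline)
    return pitch_rise + energy_rise

def monitor(windows, f0_baseline=180.0, rms_baseline=0.05, threshold=0.5):
    """`windows` yields (mean F0 in Hz, mean RMS) per analysis window (e.g. 1 s)."""
    for i, (f0, rms) in enumerate(windows):
        score = stress_score(f0, f0_baseline, rms, rms_baseline)
        if score > threshold:
            # In a real system this would surface in the operator's UI.
            yield f"window {i}: stress {score:.2f} -- suggest slowing down"

# Simulated feed: pitch and energy climbing over the course of a call.
feed = [(175, 0.04), (190, 0.05), (240, 0.09), (260, 0.12)]
for alert in monitor(feed):
    print(alert)
```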

How can the analysis of prosody lead to a particular treatment of the voice?

IRCAM is not far from the goal.

The institute has expertise in signal processing, signal analysis and voice synthesis.

It combines skills in psychoacoustics, sound design, perception and voice construction, so as to evoke a given emotion.

Because the point is to improve the user experience using artificial intelligence.

"These technological bricks are available to Amplify to establish them in the market and find real uses that have meaning for the greatest number," continues Nathalie Birocheau.

It is being invented.

Artificial intelligence has the capabilities ”.

It is a question of time, of computing power, and of the quality and quantity of the initial data.

The director of Amplify expects this technology to arrive within two to three years.

And, eventually, we could even imagine hyper-personalized technologies that generate the right voice and intonation adapted to the situation in real time.

It remains to be seen whether humans will want to talk to an artificial voice that resembles that of a relative.
