From Virtual Singing to Deepfakes: How is Speech Used Today

2022-04-27 · 3 min read

IRCAM’s first steps into research on human speech stemmed from the music world, but today the applications of audio research and vocal analysis extend far beyond the music industry. Here is a look at the new use cases for voice.

Playing with Voice: A New Instrument

From the production of synthetic voices, to the re-creation of voices that have disappeared, to the transformation or analysis of voices in real time, researchers are developing solutions that are ever more realistic and suited to the uses made possible by technological innovation.

However, the first projects giving rise to this work came from the cultural world. Many artists regularly collaborate with audio specialists to rethink composition, using the voice as a new instrument. More recently, the artist DeLaurentis, in collaboration with Ircam Amplify, designed an entirely new sound experience: a virtual choir that creates a choir effect in real time and harmonizes voices on different musical scales, controlled via connected gloves. The sketch below gives a rough idea of the underlying effect.
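
To make the idea concrete, here is a minimal offline sketch of a choir effect: it layers pitch-shifted, slightly detuned copies of a single voice at intervals from a chosen chord. This is only an illustration built on librosa; Ircam Amplify’s real-time technology is not public and certainly works differently.

```python
# Minimal offline "choir effect" sketch: layer pitch-shifted copies of
# one voice. Illustration only, not Ircam Amplify's actual technology.
import numpy as np
import librosa
import soundfile as sf

def choir_effect(path, intervals=(0, 4, 7), detune_cents=12, out="choir.wav"):
    """Layer copies of the voice shifted by `intervals` semitones
    (default: root, major third, fifth), each randomly detuned by up to
    `detune_cents` so the layers read as separate singers."""
    y, sr = librosa.load(path, sr=None, mono=True)
    layers = []
    for semitones in intervals:
        detune = np.random.uniform(-detune_cents, detune_cents) / 100.0
        layers.append(
            librosa.effects.pitch_shift(y, sr=sr, n_steps=semitones + detune)
        )
    mix = np.sum(layers, axis=0) / len(layers)  # average to avoid clipping
    sf.write(out, mix, sr)
    return out

# choir_effect("solo_voice.wav", intervals=(0, 3, 7))  # hypothetical file
```

A real-time version, like the one used on stage, would process short audio buffers with low-latency pitch shifters instead of whole files.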

Analyzing the Voice: New Everyday Applications

By putting the best of IRCAM’s audio research and sound creation at the service of markets, new applications and organizations, Ircam Amplify is at the forefront of technological trends related to voice.

For the MGEN Vocal’iz application, Ircam Amplify developed an algorithm that analyzes the user’s voice and then suggests exercises adapted to their profile and state of health: a real pocket vocal coach, used preventively to take better care of one’s voice. The toy sketch below illustrates the general idea.
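
As a rough illustration of what such an analysis might look like, this sketch estimates the pitch range of a recording and maps it to a suggested exercise. The features and thresholds are invented for the example; the actual Vocal’iz algorithm is not public.

```python
# Toy "pocket vocal coach": estimate pitch range, suggest an exercise.
# Features and thresholds are invented; Vocal'iz's algorithm is not public.
import numpy as np
import librosa

def suggest_exercise(path: str) -> str:
    y, sr = librosa.load(path, sr=None, mono=True)
    # pYIN fundamental-frequency tracking over a typical vocal range.
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                            fmax=librosa.note_to_hz("C6"), sr=sr)
    f0 = f0[~np.isnan(f0)]  # keep voiced frames only
    if f0.size == 0:
        return "No voiced sound detected; try recording again."
    span = 12 * np.log2(f0.max() / f0.min())  # pitch range in semitones
    if span < 4:
        return "Narrow range: try gentle pitch glides (sirens)."
    if span > 24:
        return "Very wide range: work on breath support and stability."
    return "Comfortable range: warm up with humming scales."

# print(suggest_exercise("my_voice.wav"))  # hypothetical file name
```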

Connected speakers are also developing into daily support systems, along with all the devices that are rapidly becoming more ingrained in our lives and consumer habits.

At this stage, most technology is building towards more natural human-machine interactions, but there are often still hurdles at the level of user requests. As Nathalie Birocheau, CEO of Ircam Amplify, explains: “First a machine must handle the issue of voice analysis (intelligibility, comprehension of the content, analysis of the sound environment, etc.) before even thinking about establishing a more qualitative and emotional human-machine dialogue. Next will come the improvement of the interaction, to go towards more contextualisation and personalisation (if this is useful and relevant for the expected application!). This is where Ircam Amplify is positioning itself and developing technologies for the coming years.”
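
The staged ordering Birocheau describes can be pictured as a simple two-step pipeline: analyze the signal first, then build dialogue on top of that analysis. Everything below is a hypothetical stub, not a real assistant architecture.

```python
# Schematic two-stage pipeline: voice analysis first, dialogue second.
# All stages are hypothetical stubs standing in for real models.
from dataclasses import dataclass

@dataclass
class VoiceAnalysis:
    transcript: str   # intelligibility / content of the request
    environment: str  # acoustic context: "quiet room", "street", ...

def analyze_voice(audio: bytes) -> VoiceAnalysis:
    """Stage 1: a real system would run ASR and acoustic-scene models
    here; this stub returns canned values."""
    return VoiceAnalysis(transcript="turn on the lights",
                         environment="quiet room")

def respond(analysis: VoiceAnalysis, profile: dict) -> str:
    """Stage 2: contextualised, personalised dialogue built on stage 1."""
    if analysis.environment != "quiet room":
        return "Sorry, I couldn't hear you clearly. Could you repeat?"
    name = profile.get("name", "")
    return f"Okay {name}, handling: {analysis.transcript}."

print(respond(analyze_voice(b""), {"name": "Alex"}))
```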

Listening to the Voice: The Societal Challenge of Deepfakes

The digital revolution is now moving towards a post-digital period in which sound is reclaiming an essential place in everyday use, for sociological, economic, and technological reasons.

Voice-based human-machine interfaces are developing at breakneck speed, with usage driven by the younger generation, who are ditching keyboards and starting to converse with their computers and smartphones. Analysts predicted that by 2020 nearly 30% of web browsing would be done without a screen, and that voice assistants would be present on 8 billion devices by 2023.

These figures are dizzying, but they suggest that the coming century will be the century of sound and voice (and, above all, the century of multi-sensoriality). It also means that, over the course of a few centuries, we will have gone from a society of oral tradition to a world of visual dominance, and now back towards sound.

On the model of image-based fakes, audio deepfakes are the logical continuation of the work initiated in speech synthesis. The question is not if audio deepfakes will become ubiquitous, but when. Work on these audio filters, designed with the help of artificial intelligence and deep learning, is already well advanced, but the obvious risks of manipulation and misuse remain.

This underscores the emotive power of human speech. Last year, in a survey conducted with Opinion Way, Ircam Amplify found that nearly 70% of people believe that how easily they are convinced depends on the speaker’s voice.

The trend is clear: the 21st century will be one of speech, not just in conversations between humans but also with machines. For that, machines have to hear us and understand what we are saying; that is a big part of the challenge for Ircam Amplify.

Check out Voice Cloning, the solution that allowed Thierry Ardisson to reconstruct the voices of personalities from the past.

Think we're on the same wavelength?