The increasing realism of speech synthesis
Since its inception in 1977, IRCAM has created important and exciting connections between scientific research, artistic creation, and technological development. As the bridge between IRCAM's state-of-the-art audio research and the wider world of industry, IRCAM Amplify is at the heart of the 21st-century sound revolution and brings IRCAM's work to the general public. In the age of AI, deep learning, and voice assistants, IRCAM is a pioneer in the field of synthetic voices. Its research now makes it possible to humanize voices by infusing them with emotion, individuality, and subtlety. Whether intended for artistic composition or for multisensory interfaces between humans and machines, the potential uses of this technology are virtually limitless.
The Risk of Deep Fakes
Face Swap, FaceApp, Reface, and other deep fake applications are multiplying by the day, letting users superimpose any face onto a different video or image with disconcerting precision. These artificial-intelligence-based digital effects are already challenging the reliability of digital images, and they are now reaching speech and sound with growing realism. For Frederic Amadu, CTO of IRCAM Amplify, "Speech takes on a much greater importance than before. Internet videos, podcasts, and music all bear witness to this. A fake can happen quickly. At IRCAM Amplify, we think as much about these risks as we do about the positive possibilities."
IRCAM at the Forefront of Speech Synthesis
Nicolas Obin has been working on speech synthesis for a decade as a sound analysis and synthesis researcher at the Sciences and Technologies of Music and Sound (STMS) laboratory, a collaboration between IRCAM, CNRS, and Sorbonne Université. His work stands out because it focuses on our ability to express ourselves through sound characteristics that go beyond language. Unlike artificial speech with neutral, standardized voices, IRCAM is creating cutting-edge speech synthesis infused with emotion and prosody.
Unique Technological Innovations
IRCAM is developing algorithms and software for voice transformation that can sculpt the voice and modify attributes such as age, gender, or emotional tone. For example, Nicolas Obin and his team recreated the voice of the castrato Farinelli, of which no recordings exist (Corbiau, Farinelli, 1994). More recently, the voices of Marilyn Monroe (Parreno, Marilyn, 2012), Marshal Pétain (Saada, Juger Pétain, 2012), and Louis de Funès (Debbouze, Pourquoi j'ai pas mangé mon père, 2015) have been recreated.
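IRCAM's own tools are proprietary, but the general idea behind this kind of transformation can be illustrated with the open-source WORLD vocoder: decompose a recording into pitch contour, spectral envelope, and aperiodicity, modify the first two, and resynthesize. The sketch below is a minimal illustration, not IRCAM's software; the file names and the `pitch_ratio` and `formant_ratio` values are purely illustrative.

```python
import numpy as np
import soundfile as sf
import pyworld as pw  # open-source WORLD vocoder, not IRCAM's tooling

def transform_voice(path_in, path_out, pitch_ratio=1.4, formant_ratio=1.15):
    """Crude age/gender-style transformation: shift pitch and formants."""
    x, fs = sf.read(path_in)
    if x.ndim > 1:
        x = x.mean(axis=1)            # mix down to mono
    x = np.ascontiguousarray(x, dtype=np.float64)
    f0, t = pw.harvest(x, fs)         # fundamental frequency contour
    sp = pw.cheaptrick(x, f0, t, fs)  # smoothed spectral envelope
    ap = pw.d4c(x, f0, t, fs)         # aperiodicity (breathiness)
    # Raising f0 alone sounds unnatural; warping the spectral envelope
    # along the frequency axis also shifts the formants, which changes
    # the perceived vocal-tract size (and hence apparent age/gender).
    bins = np.arange(sp.shape[1])
    sp_warped = np.ascontiguousarray(
        [np.interp(bins / formant_ratio, bins, frame) for frame in sp])
    y = pw.synthesize(f0 * pitch_ratio, sp_warped, ap, fs)
    sf.write(path_out, y, fs)

transform_voice("speaker.wav", "speaker_transformed.wav")  # hypothetical files
```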
IRCAM Amplify: Gateway to New Uses of Speech Synthesis
IRCAM Amplify extends IRCAM's work around speech in numerous ways. One example is the Vocal'iz application, developed by MGEN. It builds on IRCAM research that extracts attributes of speech such as expressiveness, monotony, and syllabic rate. The app analyzes this frequency information and then offers the user daily tips and practical exercises to improve their vocal delivery. According to Frederic Amadu, "This technology makes it possible to characterize a voice and identify emotions such as stress, fear, and pleasure. We are able to show directly on the waveform where there is hesitation in the voice."
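To give a feel for the kind of measurement involved (this is not the Vocal'iz algorithm, just a toy illustration with generic open-source tools), a crude monotony indicator can be computed from the spread of the pitch contour in semitones:

```python
import numpy as np
import librosa  # generic open-source analysis, not the Vocal'iz engine

def monotony_score(path):
    """Spread of the pitch contour in semitones: low values suggest a
    flat, monotonous delivery; high values a more expressive one."""
    y, sr = librosa.load(path, sr=16000)
    f0, voiced, _ = librosa.pyin(y, sr=sr,
                                 fmin=librosa.note_to_hz("C2"),
                                 fmax=librosa.note_to_hz("C6"))
    f0 = f0[voiced]                              # keep voiced frames only
    semitones = 12 * np.log2(f0 / np.median(f0))  # deviation from median pitch
    return float(np.std(semitones))

print(monotony_score("speech_sample.wav"))  # hypothetical file
```

A production tool would combine many such descriptors (rate, energy, spectral balance) rather than rely on pitch spread alone.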
Working on a Deep Fake Antivirus
But can these solutions make it possible to identify deep fakes in order to better guard against them? For Frederic Amadu, the answer is yes: "Our technologies identify and use emotion, not just simple text transcription, so a fake produced with our solutions would be much more difficult to detect. To remedy this, we are working on the development of an application similar to Shazam, which will be able to spot whether voices are real or fake." In this way, IRCAM seeks to provide positive and relevant solutions for speech synthesis, while also creating safeguards against potential malicious use.
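IRCAM Amplify has not published how its detector works. As a point of reference only, a bare-bones baseline for the real-versus-fake task would be a classifier trained on spectral statistics of labelled clips; everything in the sketch below (file lists, features, model choice) is an assumption, and a serious detector would use learned embeddings and far larger datasets.

```python
import numpy as np
import librosa
from sklearn.linear_model import LogisticRegression

def embed(path):
    """Summarize a clip as mean/std of its MFCCs: a deliberately
    simple stand-in for a learned audio embedding."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Hypothetical labelled training clips: 0 = real voice, 1 = synthetic fake.
real_files = ["real_01.wav", "real_02.wav"]
fake_files = ["fake_01.wav", "fake_02.wav"]
X = np.stack([embed(p) for p in real_files + fake_files])
labels = np.array([0] * len(real_files) + [1] * len(fake_files))
detector = LogisticRegression(max_iter=1000).fit(X, labels)

# Probability that an unseen clip is fake.
print(detector.predict_proba(embed("suspect.wav")[None])[:, 1])
```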
Want to learn more about the power of the voice?
Watch the 2022 Forum on the Power of Sound in Industry to better understand the new uses of audio for a shared world.