Microsoft’s new speech cloning AI can accurately replicate a speaker’s voice; all it needs to get started is a three-second clip of them talking. The new text-to-speech model, called VALL-E, was developed by Microsoft researchers and can imitate a person’s voice from just a three-second audio sample. Once familiar with a particular speaker, VALL-E can generate speech in that person’s voice for arbitrary text while aiming to preserve their emotional tone. Similar to the AI Story Generators for Writers or AI Art Generators we explored previously, this voice-mimicking tool excited us enough to share some intriguing facts in this article.
The developers of VALL-E suggest it could be used for high-quality text-to-speech applications and for voice editing, in which a recording of a person’s voice is edited and altered from a text transcript. Paired with other generative AI models such as GPT-3, it could also be used for audio content creation.
Neural Codec Language Model
Microsoft describes VALL-E as a “neural codec language model” built on EnCodec, the neural audio codec Meta unveiled in October 2022. Unlike conventional text-to-speech systems, which typically synthesize waveforms directly, VALL-E generates discrete audio codec codes from text and acoustic prompts. It analyzes how a person sounds, uses EnCodec to break that information down into discrete components (referred to as “tokens”), and then uses its training data to match what it “knows” about how that voice would sound speaking sentences beyond the three-second sample.
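The pipeline described above can be sketched at a very high level. The toy Python below is not Microsoft’s code and uses no real models; every function here is a stand-in invented for illustration. It only shows the shape of the data flow: a short acoustic prompt is quantized into discrete codec tokens, a language model generates further tokens conditioned on the text and the prompt, and a codec decoder turns tokens back into audio samples.

```python
# Toy sketch of a VALL-E-style flow (illustrative stand-ins only, not
# Microsoft's implementation): prompt audio -> codec tokens -> token LM
# conditioned on text -> codec decoder -> waveform samples.

def encode_prompt(audio_samples, codebook_size=1024):
    """Stand-in for a neural codec encoder (e.g. EnCodec):
    quantize float audio samples into discrete token ids."""
    return [int(abs(s) * (codebook_size - 1)) % codebook_size
            for s in audio_samples]

def generate_tokens(text, prompt_tokens, length, codebook_size=1024):
    """Stand-in for the autoregressive codec language model:
    each new token depends on the text and the previous token,
    so the prompt tokens influence everything that follows."""
    tokens = list(prompt_tokens)
    for i in range(length):
        prev = tokens[-1] if tokens else 0
        tokens.append((prev + ord(text[i % len(text)])) % codebook_size)
    return tokens[len(prompt_tokens):]  # keep only newly generated tokens

def decode_tokens(tokens, codebook_size=1024):
    """Stand-in for the codec decoder: token ids back to samples."""
    return [t / (codebook_size - 1) for t in tokens]

prompt = encode_prompt([0.1, -0.4, 0.9])        # "3-second" speaker prompt
tokens = generate_tokens("hello", prompt, 8)    # text conditioning
audio = decode_tokens(tokens)                   # synthesized samples
```

In the real system each of these stand-ins is a trained neural network, and the tokens come from multiple codec codebooks rather than a single sequence, but the overall structure, generating discrete audio tokens the way a language model generates words, is the key idea.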
To obtain their results, researchers gave VALL-E a three-second “Speaker Prompt” sample and a text string (what they wanted the voice to say). Although some VALL-E outputs still sound computer-generated, the model’s goal is to produce speech that can be mistaken for a human speaker. Microsoft has not made the VALL-E code available for others to examine, since it could potentially enable misconduct and deception; the researchers appear well aware of the significant harm this technology could cause if misused.
It should be highlighted that the AI model can simulate not only the pitch, timbre, and texture of a voice but also the speaker’s emotional tone and the room’s acoustics: if the sample contains background noise or reverberation, VALL-E reproduces those characteristics just as it would a clean recording. The research team claims that “experiment results show that VALL-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find VALL-E could preserve the speaker’s emotion and acoustic environment of the acoustic prompt in synthesis.”
Training in Imitation
The researchers claim to have trained VALL-E on 60,000 hours of English speech from more than 7,000 speakers in Meta’s LibriLight audio library, hundreds of times more data than existing systems use.
To be mimicked well, the target speaker’s voice must closely resemble a voice in the training data. When it does, VALL-E can read a given text aloud in the target voice, using its training to speculate on how that speaker would sound delivering the requested input.
It produces a range of outcomes, some of which sound artificial while others are shockingly natural. The convincing ones succeed in part because the model preserves the emotional tone of the original sample. VALL-E also accurately replicates the acoustic environment: if the speaker recorded their voice in an echo-filled auditorium, the synthesized speech sounds as though it were spoken in that auditorium too. Microsoft intends to scale up its training data to improve the model’s performance in prosody, speaking style, and speaker similarity.
Additionally, it is exploring ways to reduce words that are missed or unclear during synthesis.
Microsoft decided against making the technology open source, likely because of the dangers of AI that can convincingly fake speech, and said that any future development would adhere to its “Microsoft AI Principles.” The model could be applied to robotics, media production, or customized text-to-speech applications, but it could also pose a threat if abused. The company stated that while VALL-E can synthesize speech that maintains speaker identity, the model could be misused, for example for impersonation or for spoofing voice identification.
“Since VALL-E could synthesize speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker,” the researchers wrote. For instance, VALL-E could make spam calls appear legitimate enough to scam people. As we have seen with deepfakes, politicians and other public figures are likewise susceptible to impersonation, and applications that rely on voice passwords or voice commands could be at risk. VALL-E could also threaten the jobs of voice actors.
The company has also issued an ethics statement saying that “the trials in this work were carried out under the assumption that the user of the model is the target speaker and has been approved by the speaker.” When the model is applied to unseen speakers, however, it said that essential components “should be accompanied by speech editing models, including the protocol to confirm that the speaker agrees to execute the alteration and the system to detect the altered speech.”