While the world is still struggling to take in the staggering generative capabilities of text-based AIs like ChatGPT and Bard, and image generators like Midjourney and DALL-E, Meta has arrived with a first-of-its-kind counterpart for audio: Voicebox.
Meta AI’s Voicebox is a state-of-the-art generative AI for speech that can be used for speech generation-related tasks like audio synthesis, editing, sampling, and stylizing.
Apart from being efficient and, as Meta claims, as much as twenty times faster than comparable models, Voicebox has some enticing features that make it a big leap forward in the AI race, the first of which is diverse text-to-speech.
From a text input, this audio AI model can currently produce output in six different natural-sounding voices. That may sound like every other text-to-speech software on the market, but Voicebox goes a step further in making computer-generated audio natural and customized to the task at hand.
By feeding the AI an audio sample along with the text input, Voicebox can generate natural-sounding output that matches the provided voice, a capability Meta is calling style transfer. Theoretically, by giving the model a recording of your voice, you can generate audio that sounds surprisingly like you. And as with any other generative AI model, the more you feed it, the more accurate the imitation becomes.
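Voicebox's actual interface is unreleased, but the style-transfer flow described above can be sketched as a function of two inputs, text plus a reference clip, producing a new clip. Every name below is an assumption for illustration; the stub returns placeholder silence so the interface can be exercised end to end.

```python
from dataclasses import dataclass

@dataclass
class AudioClip:
    samples: list[float]       # raw waveform samples
    sample_rate: int = 16_000  # samples per second

def synthesize_with_style(text: str, style_sample: AudioClip) -> AudioClip:
    """Generate speech for `text` in the voice of `style_sample`.

    A real model would condition generation on both inputs; this stub
    returns a silent clip sized to roughly one second per twelve
    characters of text, purely to make the data flow concrete.
    """
    duration_s = max(1, len(text) // 12)
    n = style_sample.sample_rate * duration_s
    return AudioClip(samples=[0.0] * n, sample_rate=style_sample.sample_rate)

# Usage: a two-second reference clip standing in for "your" voice.
reference = AudioClip(samples=[0.0] * 32_000)
speech = synthesize_with_style("Hello from a cloned voice!", reference)
print(len(speech.samples) / speech.sample_rate)  # 2.0 (seconds)
```

The key point is that the reference clip carries only style, while the text carries content, which is what lets a short sample of your voice speak words you never recorded.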
Besides personal use, the audio model could realistically serve commercial purposes, especially in the entertainment industry. Studios like Disney already have the technology to reanimate late actors on screen or de-age older performers. Now, with AI models like Voicebox and the abundance of audio samples spanning actors' entire careers, having them say any dialogue in any context wouldn't be a problem.
Think of how many Stan Lee cameos Marvel could produce with a tool like Voicebox. Whether that will be ethically acceptable is up for debate. Nevertheless, with Voicebox, it is something that can be done, and done well, in a way that wasn't possible before, at least not with such minimal effort.
But the most interesting feature of this audio AI model is cross-lingual style transfer. Not only can the model copy voice tone or style from an audio sample, it can also apply that extracted style in other languages.
As of right now, Voicebox has been trained on six languages: English, Spanish, French, German, Portuguese, and Polish. With cross-lingual style transfer, Voicebox can take a sample from, say, an English speaker, copy the style, and produce output in a different language, say Spanish. The Spanish output would sound like something that English speaker would say, even if they had no knowledge of Spanish vocabulary or pronunciation at all.
For content creators, it is a great tool for reaching a broader audience. YouTubers like MrBeast already put out their videos in multiple languages for a global audience, but for those region-specific versions they have to hire other voice artists to dub over the English audio. With AI tools like Voicebox, they will be able to take the original English audio and generate audio in Spanish, French, German, or any other language Voicebox supports now or in the future. And the generated audio would sound very much like the creator would have sounded had they been fluent in that language.
If Voicebox extends its training dataset with Bangla audio, Bangladeshi creators would be able to produce versions of their audio in other languages for their overseas audiences without needing to learn those languages themselves, growing the reach of their content with minimal effort.
An even more practical use of this feature would be a universal translator of the sort sci-fi movies have glorified for decades. If the conversion can be sped up to work in real time, speakers of two languages could converse through in-ear translations that sound very much like the person sitting across from them, making the conversation far more natural than the robotic-sounding, prototype-esque two-way translators often used today.
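Such a translator would chain three stages: speech recognition, text translation, and style-preserving synthesis. The sketch below makes that pipeline concrete with stubs; none of these functions reflect a real Voicebox API (which is unreleased), and the recognized and translated strings are hard-coded placeholders.

```python
def transcribe(audio: list[float]) -> str:
    # Stub: a real system would run speech recognition here.
    return "hello, how are you?"

def translate(text: str, target_lang: str) -> str:
    # Stub: a real system would run machine translation here.
    lookup = {"es": "hola, como estas?"}
    return lookup.get(target_lang, text)

def synthesize_in_style(text: str, style_sample: list[float]) -> list[float]:
    # Stub: a Voicebox-like model would speak `text` in the sampled voice.
    return [0.0] * 16_000  # one second of placeholder audio

def live_translate(chunk: list[float], target_lang: str) -> list[float]:
    """One pass of the real-time loop: hear, translate, speak in-style."""
    text = transcribe(chunk)
    translated = translate(text, target_lang)
    # The incoming chunk doubles as the style sample, so the translated
    # speech sounds like the original speaker rather than a generic voice.
    return synthesize_in_style(translated, style_sample=chunk)

out = live_translate([0.0] * 16_000, "es")
print(len(out))  # 16000 samples of (placeholder) translated speech
```

The interesting design choice is the last line of `live_translate`: reusing the listener's own audio chunk as the style reference is what would make the in-ear translation sound like the person speaking.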
Since this audio model can extract style and tone from samples, it can also isolate specific noises or even words from a long, uninterrupted recording. Thanks to those same extraction and style-learning capabilities, it can even swap out noise or words within the audio.
For content creators, whether in video or podcast form, fixing a mispronunciation or moderating audio will be as easy as pointing to the part of the recording that needs to change and typing the word or phrase that should sit in its place. Implemented properly, this kind of content correction could be used in broadcast media as well as personal correspondence.
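The "point at the segment, type the fix" workflow can be sketched as a splice operation. Again, Voicebox's real editing interface is not public, so the function names here are hypothetical; the stub simply cuts out the marked span and inserts a regenerated segment in its place.

```python
SAMPLE_RATE = 16_000  # samples per second, assumed for this sketch

def resynthesize(text: str) -> list[float]:
    # Stub: a real model would regenerate this phrase in the original
    # speaker's voice, matched to the surrounding audio.
    return [0.0] * SAMPLE_RATE  # one second of placeholder audio

def replace_segment(audio: list[float], start_s: float, end_s: float,
                    new_text: str) -> list[float]:
    """Cut audio between `start_s` and `end_s` (seconds) and splice in a
    freshly synthesized reading of `new_text`."""
    start = int(start_s * SAMPLE_RATE)
    end = int(end_s * SAMPLE_RATE)
    return audio[:start] + resynthesize(new_text) + audio[end:]

# Usage: fix a mispronounced word between seconds 2.0 and 2.5 of a
# five-second recording.
recording = [0.0] * (5 * SAMPLE_RATE)
fixed = replace_segment(recording, 2.0, 2.5, "pronunciation")
print(len(fixed) / SAMPLE_RATE)  # 5.5 seconds after the splice
```

Note that the replacement need not be the same length as the removed span; a real infilling model would also smooth the joins so the edit is inaudible, which the placeholder splice does not attempt.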
But Voicebox is still in development, and these are just some of the features Meta disclosed in its announcement on June 16th. Even though Voicebox's voice synthesis, cross-lingual style transfer, and content editing are some of the most interesting audio tools to arrive in a long time, it is easy to foresee the model being misused to threaten, harm, or even wrongfully implicate someone.
As of right now, Meta intends Voicebox to lend natural-sounding voices to virtual assistants and non-player characters in the metaverse. Citing the risks of misuse, Meta has stated that it will not release the model publicly for the moment. But it is safe to assume that an AI this interesting and useful, one that could save millions of work hours, will eventually reach the public, even if only in a restricted or curated form.
My thoughts on Meta’s new Voicebox AI…https://t.co/PUFWKMnimV
— Rifat Ahmed (@Rifat5670) August 11, 2023