Earlier this year, pioneering machine learning experts at Dessa shocked the world with an artificial intelligence (AI) audio program that sounded exactly like podcaster and MMA commentator Joe Rogan.
Now the Toronto-based outfit has created the “deepest deepfake yet,” and claims it is the world’s first to combine audio and video. The result is technologically amazing and freaky as hell.
The video, featured in The New York Times’ online series The Weekly, begins with audio only. “Welcome to another episode of I’m Not Joe Rogan. This is Joe Fauxgan speaking, an artificial intelligence created by Dessa.”
“Fauxgan” then introduces four-time Pulitzer Prize winner David Barstow as the Times reporter who first suggested that Dessa attempt to create the “world’s first combination of realistic AI; voice and video.”
At around 27 seconds, the deepfake video begins with Rogan’s digital doppelganger saying that he’ll conclude his podcast on its ten-year anniversary.
“It’s been almost 10 years since the first episode of the podcast. That’s fucking crazy. This show has become my life. That’s why I decided to go out with a bang.”
The real person whose face is swapped with Rogan’s is briefly revealed, then he says, “On December 24, we’ll be doing our last episode ever. That’s exactly ten years since the first episode came out. It’s been a good run and we’re all going out on top.”
Rogan’s clone approaches a perfect impersonation, but some viewers may experience the reaction known as the “uncanny valley.” This hypothesis, first proposed by robotics professor Masahiro Mori in 1970, holds that extremely human-like figures, such as computer animations that fall just short of lifelike, can cause feelings of unease and even revulsion.
In addition to the video’s jaw-dropping visuals, the audio even accounts for the room’s echoing acoustics.
In a Medium article, Dessa co-founder and Chief of Machine Learning Ragavan Thurairatnam said that the next-level deepfake was created to raise public awareness of the power and danger of artificial intelligence.
The unfortunate reality is that at sometime in the near future, deepfake audio and video will be weaponized. Before that happens, it’s crucial that machine learning practitioners like us who can spread the word do so proactively, helping as many people as possible become aware of deepfakes, and the potential ways they can be misused to distort the truth or harm the integrity of others. Ultimately that is why we decided we had to be the first to reach the finish line.
Thurairatnam added that Dessa will not be releasing the model, code or data used in the video. They want to “mitigate the risk that sharing this deepfake example presents.” He did give a technical overview of how Dessa created this illusion:
Audio: To synthesize the audio used in our final version of the video, we used the same RealTalk model we developed to recreate Joe Rogan’s voice with AI back in May. RealTalk uses an attention-based sequence-to-sequence architecture for a text-to-spectrogram model, also employing a modified version of WaveNet that functions as a neural vocoder.
The final dataset our machine learning engineers used to replicate Rogan’s voice consists of eight hours of clean audio and transcript data, optimized for the task of text-to-speech synthesis. The eight-hour dataset contains approximately 4000 clips, each consisting of Joe Rogan saying a single sentence. The clips range from 7–14 seconds long respectively. You can find a more in-depth blog post about the technical underpinnings of RealTalk we released earlier this summer here.
Video: For the video portion of the work, we used a FaceSwap deepfake technique that many in the machine learning community are already familiar with. The main reason we chose the FaceSwap instead of other techniques is that it looked realistic and was pretty reliable, without requiring many hours of training data to employ properly. This was an important factor for us due to our expedited timelines. To use the FaceSwap technique, we would only need one high-quality and authentic video of Joe Rogan to use as source data, and an actor who resembled him for the video we would use to swap Joe Rogan’s face onto. We ultimately picked Paul, shown below, because his physique closely resembled Joe Rogan’s.
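For readers who want to sanity-check the audio dataset figures quoted above (eight hours, roughly 4,000 clips, 7 to 14 seconds each), here is a minimal back-of-the-envelope sketch in Python. It is not Dessa's code, which the company says it will not release; the variable and function names are made up for illustration, and the only inputs are the three numbers from the quote.

```python
# Rough consistency check of the dataset figures quoted in the article.
# All names here are illustrative; only the three numbers come from the quote.
TOTAL_HOURS = 8                    # "eight hours of clean audio"
NUM_CLIPS = 4000                   # "approximately 4000 clips"
MIN_CLIP_S, MAX_CLIP_S = 7, 14     # "7-14 seconds long"

total_seconds = TOTAL_HOURS * 3600
mean_clip_s = total_seconds / NUM_CLIPS


def clip_count_bounds(total_s, min_s, max_s):
    """Return (fewest, most) clips that could fill total_s seconds of audio."""
    return total_s / max_s, total_s / min_s


fewest, most = clip_count_bounds(total_seconds, MIN_CLIP_S, MAX_CLIP_S)

print(f"Total audio: {total_seconds} s")
print(f"Implied mean clip length: {mean_clip_s:.1f} s")
print(f"Clip count consistent with the stated range: {fewest:.0f} to {most:.0f}")
```

Under these figures the implied average clip is a little over seven seconds, which suggests most clips sit near the short end of the stated range. The quoted numbers are approximate, so this is only a plausibility check.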
Rogan hasn’t reacted publicly to Dessa’s latest deepfake. However, his response to the company’s original voice program is even more prescient now.
“This could become a real problem,” Rogan tweeted on May 17, 2019. “I’m flattered and honored that they chose my voice as an example to let us know that we’re fucked.”