How the company that traced the fake Biden robocall identifies a synthetic voice

There is an hour-long video, still available on YouTube, of an artificial intelligence model’s attempt to create a new special in the voice of the late comedy legend George Carlin. This, despite the fact that Carlin’s estate filed suit against the human podcasters who used the model. 

The podcast took it down, but the video is still out there. 

For days in January, AI-generated, explicit images of Taylor Swift circulated on X before the social media platform was able to better moderate the content. The images, created using one of Microsoft’s models, are emblematic of the older, more widespread problem of deepfake porn, something that last year affected New Jersey high school students. 

Related: How one tech company is tackling the recent proliferation of deepfake fraud

And even as AI-generated text and deepfaked images and videos continue to spread across the internet, instances of AI-powered deepfake vocal fraud are steadily expanding.

Last year, scammers attempted to convince Jennifer DeStefano that they had kidnapped her 15-year-old daughter. DeStefano’s daughter was safe at home with her dad, but her AI-generated screams on the other end of the line were more than convincing. 

Similar scams involving voice cloning have escalated to the point that the Federal Communications Commission (FCC) this month adopted a new ruling making AI-generated robocalls illegal. 

Such scams can go beyond individual fraud and into election interference; an AI-generated robocall imitating President Joe Biden circulated at the end of January, urging voters not to participate in the New Hampshire primary. 

Microsoft’s Vall-E voice cloning tool can clone a voice using only a three-second audio recording; Meta’s Voicebox can do it with a two-second recording. 

As Lisa Plaggemier, executive director at the National Cybersecurity Alliance, told TheStreet earlier in February, AI models were not built with security in mind. 

“If you don’t think about how they can be abused, they’re going to be abused,” she said. “It’s human nature.”

And even as AI is fueling the threat, the same technology is strengthening the shield. 

Related: Deepfake porn: It’s not just about Taylor Swift

Pindrop Pulse and vocal authentication

Cybersecurity firm Pindrop has been employing AI tools to identify and defend against this rising tide of deepfake voice fraud for years. 

The company on Tuesday launched Pindrop Pulse, a deepfake detection tool that is specifically trained to identify AI-generated audio over the phone. 

The enterprise system, which according to Pindrop has achieved a detection rate above 90%, identified the source of the Biden robocall in January: ElevenLabs. 

The model works in real time to determine the probability that a given voice is synthetic. With as little as two seconds of audio, the system immediately offers a “liveness score” between 0 and 100: if a piece of audio scores above 60, it is likely a genuine human, and the closer the number is to 0, the more likely it is that the voice was generated by AI. 
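Pindrop's scoring model itself is proprietary, but the decision rule described above is simple to illustrate. The sketch below is a hypothetical interpretation of a 0–100 liveness score using the thresholds the article mentions; the 20-point cutoff for a strong deepfake signal is an invented assumption, not Pindrop's.

```python
def interpret_liveness(score: float) -> str:
    """Map a 0-100 liveness score to a verdict.

    Thresholds follow the article's description: above 60 suggests
    a genuine human; scores near 0 indicate a likely AI clone.
    The 20-point cutoff below is a hypothetical choice for this sketch.
    """
    if not 0 <= score <= 100:
        raise ValueError("liveness score must be between 0 and 100")
    if score > 60:
        return "likely human"
    if score < 20:
        return "likely AI-generated"
    return "inconclusive"

# The demo described in this article: the clone scored 0,
# the real voice scored 99.97.
print(interpret_liveness(0))      # likely AI-generated
print(interpret_liveness(99.97))  # likely human
```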

Related: Cybersecurity expert says the next generation of identity theft is here: ‘Identity hijacking’

How it works

Pindrop co-founder and CEO Vijay Balasubramaniyan demonstrated the software to me over a video call; an AI-generated cloned recording of his voice made with ElevenLabs — which I found to be entirely indistinguishable from his voice — immediately received a liveness score of 0. 

His own voice received a score of 99.97. 

The software, which is the result of close to a decade of research in the space, is trained to look for spatial and temporal anomalies that, while inaudible to the human ear, serve as red flags to the software. 

Humans, owing to the mouth and other vocal organs, speak in a way that machines can synthesize but have trouble replicating exactly, Balasubramaniyan said. Certain letters, such as “S” and “F,” are often mistaken by machines for noise.

The synthesis of that noise in words like “San Francisco” might be close to impossible for a human to pick out, but it acts as one of several red flags Pindrop’s system is trained to identify. With 8,000 samples of a given voice available over the phone every second, AI is well suited to scanning for and flagging those anomalies.  

More deep dives on AI:

Think tank director warns of the danger around ‘non-democratic tech leaders deciding the future’

George Carlin resurrected – without permission – by self-described ‘comedy AI’

Artificial Intelligence is a sustainability nightmare — but it doesn’t have to be

Beyond authenticating voices, the model can also compare audio against known text-to-speech signatures; Pindrop has 122 different text-to-speech engines within its dataset, and each has a specific signature Pindrop calls its “fake print.” 

The fake print technology is what enabled Pindrop to trace the Biden robocall back to ElevenLabs. 

“These tools each have their own signature on how they’re inhuman,” Balasubramaniyan said. “At the same time, we’re also looking at these temporal characteristics where the way your voice moves has a certain pattern to it because you’re human and these systems don’t get it right.”

It is the combination of its fake prints with its temporal and spatial analysis that allows Pindrop to determine the likelihood that a voice is an AI-generated clone. 
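Pindrop has not published how fake prints are represented or matched, but the idea of attributing audio to a known engine by its signature can be sketched. Everything below is invented for illustration: the feature vectors, the engine entries, and the cosine-similarity matching with a 0.99 threshold are assumptions, not Pindrop's method.

```python
import math

# Toy "fake print" signatures for a few named engines (invented values).
FAKE_PRINTS = {
    "ElevenLabs": [0.9, 0.1, 0.4],
    "Vall-E": [0.2, 0.8, 0.5],
    "Voicebox": [0.5, 0.5, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def match_engine(features, threshold=0.99):
    """Return the engine whose fake print best matches, or None
    if no signature is similar enough to attribute the audio."""
    best_name, best_sim = None, 0.0
    for name, print_vec in FAKE_PRINTS.items():
        sim = cosine(features, print_vec)
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name if best_sim >= threshold else None

# A feature vector very close to the ElevenLabs signature attributes cleanly.
print(match_engine([0.88, 0.12, 0.41]))  # ElevenLabs
```

In a real system the features would be learned spectral and temporal characteristics rather than a three-number vector, but the attribution step, comparing an unknown sample against a library of engine signatures, follows this shape.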

Related: Marc Benioff and Sam Altman at odds over core values of tech companies

The path toward consumer protections 

As they exist today, Pindrop’s solutions are enterprise solutions. 

The company works with banks, financial institutions, insurance companies and call centers at large to protect against the threat of deepfake audio fraud. 

To Balasubramaniyan, the path toward more widespread individual-level protections is threefold. 

First, the problem needs to be addressed at its source: the text-to-speech generators themselves. 

“Every AI app needs to responsibly allow for the creation of these voice clones, because it wasn’t President Biden who gave consent to create a clone of his voice,” he said. 

Balasubramaniyan added that safeguards must also be introduced within transmission channels so that phone carriers can identify the legitimacy of a call and “flash it on grandma’s screen, saying ‘this is not your grandkid talking, this is a deepfake of his voice that they got from a three-second TikTok video.'” 

The third path involves shielding the destination: phone manufacturers baking similar software into their devices. 

Balasubramaniyan said Pindrop is currently exploring partnerships on all three fronts, though timing remains unclear. 

“I would love for them to take the technology like this because it just works,” he said. “But you need partnerships like that, and those take a long time. But now that we have something that we think is really meaningful for the consumer, we are working aggressively to try and make those happen.”

Contact Ian with AI stories via email or Signal at 732-804-1223.

Related: Veteran fund manager picks favorite stocks for 2024
