13 Videos From the Cerebral Valley Voice Summit: Sierra's Bret Taylor, Wispr Flow's Tanay Kothari, MiniMax's Linda Sheng & More
Watch our panels from the first Cerebral Valley Voice Summit
We’re fresh off the inaugural Cerebral Valley Voice Summit.
Over 200 founders, investors, and operators gathered to hear from voice AI leaders like Bret Taylor, the co-founder of Sierra; Emergence Capital’s Jake Saper; a16z’s Olivia Moore; Lux Capital’s Grace Isford; OpenAI’s head of realtime AI Justin Uberti; Runway co-founder Anastasis Germanidis; and the founders of Abridge, Wispr Flow, Deepgram, Cartesia, and more.
We’re sharing all 13 talks from the summit in the newsletter as well as on our Newcomer Summits YouTube channel.
The Cerebral Valley Voice Summit is co-hosted by Eric Newcomer, Max Child, and James Wilsterman.
Many thanks to our sponsors Baseten, Felicis, Nebius, AssemblyAI, and Weekend.
Bret Taylor (Sierra)
Sierra founder Bret Taylor spoke on-stage at the Cerebral Valley Voice Summit just two days after raising a fresh $950 million in funding. The company has surpassed $165 million in revenue, and 40% of the Fortune 50 now use its voice agents.
Despite all this progress, he still thinks it’s early innings for voice AI’s entrance into the global economy. “It feels like we’re in this stage of the internet before broadband,” he said.
Justin Uberti (OpenAI)
Panelists were divided on whether voice models should sound as close to “human” as possible, or if a more accurate but less conversational agent would be a better fit for certain use cases.
Justin Uberti, head of realtime AI at OpenAI, made a case for the former. “The agent has to be a great talker, a real conversationalist,” he noted. “There’s always a little bit of a tradeoff… how do you add intelligence without sacrificing the experience?”
The following day, OpenAI released its newest batch of realtime voice models, which can reason mid-conversation to act more effectively on a prompt.
Anastasis Germanidis (Runway)
There was a hearty debate at the Summit on whether voice will be the primary way humans interact with computers going forward.
Anastasis Germanidis of Runway, primarily a video model maker, expressed excitement about the promise of world models, which use voice in conjunction with motion and interactive video.
Linda Sheng (MiniMax)
Linda Sheng of MiniMax declared that advancements in voice models were coming in combination with video and other multimodal channels.
Jake Saper (Emergence Capital), Olivia Moore (Andreessen Horowitz), and Grace Isford (Lux Capital)
Venture capitalists on-stage for the Cerebral Valley Voice Summit acknowledged it was still early days for the technology, but were excited about rapid improvements in the AI models.
Grace Isford of Lux Capital said we were still in “the Microsoft Copilot era” of voice, with much still to be done in building end-to-end products.
While consumer voice applications haven’t broken out quite like enterprise ones, Olivia Moore of Andreessen Horowitz said a breakout was on the horizon: “Consumer just takes longer to mature,” she said. Jake Saper of Emergence Capital pointed to the music app Suno and predicted that “an increasing percent of music will be created or co-piloted with AI.”
Shiv Rao (Abridge)
Shiv Rao, founder of Abridge, noted that voice was a natural fit for a human-centric industry like healthcare.
The regulatory and privacy issues in healthcare, which have often scared away VCs, can be a competitive moat.
Tanay Kothari (Wispr Flow)
Tanay Kothari’s startup Wispr Flow was already widely used by the crowd at the Cerebral Valley Voice Summit, with people dictating everything from Slack messages to coding workflows.
Unsurprisingly, Kothari was bullish on voice being the next paradigm shift in how we interact with computers. On-stage, he recalled how strange people looked talking on the phone through AirPods a few years back; now we don’t question people’s public conversations in the street.
Eugenia Kuyda (Wabi)
Wabi and Replika founder Eugenia Kuyda predicted that we’ll see the rise of two kinds of voice agents: some for personal development, such as AI therapists or companions, and less personality-driven agents for work-related tasks.
Everyone in attendance at the Cerebral Valley Voice Summit may already be voice-pilled, but Wabi’s Kuyda offered a reality check on how far agent capabilities have actually penetrated the mainstream, noting a sharp divide between early adopters and everyone else.
Jeffrey Liu (Assort Health)
Jeffrey Liu of Assort Health, which builds AI voice agents for health systems to handle administrative calls like scheduling, billing, and medication refills, said his company had already seen 150 million patient interactions across 5,000 different providers, all handled with voice AI.
Dylan Fox (AssemblyAI)
Agents that sound human are important for some applications, but not for others — and there’s no consensus on whether people will embrace human-like agents as they proliferate.
Dylan Fox, founder of AssemblyAI, argued that people don’t necessarily want all agents to pretend to be alive.
Scott Stephenson (Deepgram)
Scott Stephenson of Deepgram said that, as of today, voice AI models haven’t passed his personal “voice Turing Test,” in which a person has a five-minute conversation with a voice agent without realizing it isn’t human.
However, he predicted that this barrier will be broken by the end of the year with advancements in context memory.
Russ d’Sa (LiveKit)
Russ d’Sa, founder of LiveKit, which helps labs build, test, and deploy voice agents, pointed out that speeding up the communications layer of the pipeline makes the agents smarter.
“Every single millisecond you can shave off between a round trip, or turn from a human to an AI, gives you more time for inference,” said d’Sa.
Brandon Yang (Cartesia)
The startup Cartesia has built a reputation for having some of the lowest-latency text-to-speech models, but co-founder Brandon Yang noted that evals for voice models are difficult to run, given that what constitutes “natural speech” is quite subjective.
Many models are still constrained by language, with performance varying much more widely in non-English conversations.