The world of voice AI, with Mati Staniszewski of ElevenLabs

John hosts Mati Staniszewski, co-founder of ElevenLabs, a leading research company dedicated to making audio universally accessible across languages and voices. They explore the fascinating and complex world of voice AI, examining its current capabilities and future trajectory.

The discussion centers on the elusive 'voice Turing Test' and the reasons behind AI's success with text versus its ongoing challenges with natural conversational speech. Mati and John explore the future of human-computer interaction, addressing common frustrations like phones struggling to read PDFs, and outlining the immense potential for voice agents in sectors ranging from farming to healthcare.

This conversation underscores the critical importance of advanced voice AI in shaping future interactions and services. Mati also provides an inside look at ElevenLabs' rapid ascent to an $11 billion valuation and details how their technology is being applied to power digital government services in Ukraine.

Key takeaways

ElevenLabs' innovation involves abstracting traditional audio processing steps and allowing models to deduce voice parameters like accent and emotion as emergent properties, rather than relying on hardcoded settings.
Achieving human-like audio quality requires high-quality, meticulously annotated datasets, prompting ElevenLabs to build its own data labeling team internally due to the limitations of existing public data.
ElevenLabs' platform enables businesses to deploy AI agents for horizontal functions like customer support, sales, and marketing, integrating these agents with existing business tools and safeguards.
Voice AI in consumer products like phones and cars currently operates a decade behind the advanced capabilities of text-based LLMs due to a significant product overhang.
The primary reason for this lag is a "deployment gap," as big companies like automotive manufacturers are slow to adopt and integrate recent breakthroughs in voice model technology.
Real-time, context-aware voice interaction, which enables understanding of specific user intent and content, has only become technologically viable in the last year.
Eleven Reader was developed to provide a distribution channel for AI-generated audiobooks when traditional platforms would not accept them.
ElevenLabs is developing person-specific transcription that achieves 'superhuman accuracy' by fine-tuning models to individual voices, expected to roll out within months.
Cascaded AI voice models, which transcribe speech to text, pass that text through an LLM, and then convert the response back to speech, offer superior reliability, accuracy, and visibility, making them the preferred choice for enterprise solutions.
Direct speech-to-speech models prioritize low latency for applications like AI companions, accepting reduced reliability and accuracy in exchange for faster responses.
The technology has a profound personal impact, enabling individuals who have lost their voices due to illness or injury to regain their ability to speak, fostering emotional connections and restoring personal milestones.
Voice agents function as persistent entities capable of both proactive (e.g., making outgoing calls) and reactive (e.g., customer service) interactions in real-world scenarios.
Voice models are considerably smaller (low tens of billions of parameters) and cheaper to train than LLMs (hundreds of billions), but still require substantial investment in CapEx and expert researchers.
ElevenLabs charges by usage (per text token or per minute) and provides new models at cost to accelerate customer adoption, gather feedback, and demonstrate new possibilities.
ElevenLabs achieved over $450 million ARR, with a record $100 million in net new ARR in one quarter, fueled by reliable and high-quality AI agent technology.
Small, agile teams (under ten people) are critical to ElevenLabs' hypergrowth, enabling deep market understanding and rapid execution across product, research, and go-to-market initiatives.
ElevenLabs is introducing a pay-as-you-go billing option to provide flexible usage beyond fixed subscriptions, allowing users to pay for additional usage without hitting limits.
ElevenLabs maintains an exceptionally flat organizational structure, with co-founders having over fifteen direct reports, challenging traditional management spans of control.
AI is leveraged to "LLMify" data, making it interactive and explorable for business insights, and to automate manual tasks, acting as a force multiplier for human effort.
In a critical application, ElevenLabs' voice AI powers Ukraine's "Diia" digital government platform, ensuring citizens can access essential services and information during wartime, especially when traditional means are unavailable.

00:02 - 08:04

How ElevenLabs' AI audio models achieve realistic emotional inflection

Early attempts at speech synthesis involved replicating the human vocal tract with analog machines or creating structured digital signals, as pioneered by Bell Labs. This progressed to stitching together individual phonemes, where the next sound was chosen probabilistically. Modern audio models now use neural networks, similar to other domains, to predict subsequent sounds based on previous audio context and accompanying text.

ElevenLabs, co-founded by Piotr, innovated by abstracting traditional speech processing steps, such as decoding text into mel spectrograms and then waveforms. A major breakthrough involved enhancing the prediction of the next phoneme and incorporating broader textual context. This allows the models to understand and reflect elements like dialect or emotional inflection that occur before and after a given segment, mimicking a human voice actor's awareness of the script.

Crucially, ElevenLabs' models move beyond hardcoded voice parameters for elements like accent or enthusiasm. Instead, the model deduces these characteristics as emergent properties from the input, allowing for more natural and diverse speech outputs. Achieving this level of human-like audio quality also required extensive investment in creating proprietary, highly-annotated audio datasets, as existing public data often lacked the detailed emotional and accent metadata necessary for training advanced voice models.

In our approach, effectively you would give the model open-ended ability to select what those parameters should be. So, it's not gonna be British, Polish, Spanish, English speaker, but the model will deduce them itself.

08:04 - 09:04

Von Kempelen's Mechanical Turk and the Illusion of Early AI

Mati Staniszewski recounts the story of von Kempelen, who is recognized for creating the first machine to represent speech. Kempelen spent decades working on an analog machine designed to mimic the human vocal tract and produce sounds.

Beyond his speech efforts, Kempelen also created the Mechanical Turk, a chess machine that gained viral fame. This automaton appeared to play chess intelligently, captivating audiences with its sophisticated moves.

The Turk's intelligence was an illusion; it was secretly operated by a human hidden inside. This historical example serves as an analogy, distinguishing between early attempts at simulating intelligence and modern, genuinely data-driven AI like that developed by ElevenLabs.

But the kind of crazy thing behind it: it was operated by a human.

09:04 - 12:05

ElevenLabs builds foundational audio models and an enterprise AI transformation platform

ElevenLabs operates as a research and product deployment company focused on creating foundational audio and voice models. Their technology includes advanced text-to-speech, speech-to-text capabilities in over 100 languages, and conversational models, alongside other audio domains like music.

Beyond the core models, ElevenLabs offers a platform designed to help businesses transform their communication with customers and employees. This involves deploying AI agents for horizontal use cases in areas such as customer support, sales, hiring, training, and marketing. The platform integrates these models with knowledge bases, telephony systems, and necessary integrations, while also providing tools for monitoring and safeguarding agent behavior, and creative tools for ad voiceovers and narrated articles.

ElevenLabs positions itself as a platform for horizontal business applications. They aim to power a broad ecosystem, encouraging other companies to develop vertical-specific applications built on ElevenLabs' underlying technology. This strategy ensures partners are always leveraging the latest and most advanced models as the technology evolves rapidly, rather than being stuck on outdated versions.

The company sees its role as providing the essential infrastructure for AI-driven communication, allowing domain-specific solutions to emerge from their partners. This approach helps them maintain focus on core model development and platform improvements, which are constantly being updated with new capabilities.

We build foundational audio and voice models. And then build a platform for businesses to transform how they communicate with their customers, with their employees.

12:04 - 16:05

Voice AI in Consumer Devices Lags Behind LLM Capabilities

Despite the widespread adoption and advanced capabilities of large language models like ChatGPT and Gemini, voice AI in everyday consumer devices such as phones and cars feels significantly outdated, operating as if a decade in the past. Users often struggle with basic tasks like getting a phone to accurately read a PDF or using a car's voice controls for navigation, which are widely advertised but generally perform poorly.

This perceived lag isn't primarily a technological limitation but rather a "deployment gap." While the underlying voice model technology has made considerable advancements, major companies, especially in sectors like automotive, have been slow to integrate these cutting-edge capabilities into their products. There are also numerous practical problems that need to be addressed during implementation.

The quality of voice models capable of narrating text has only reached a high standard in the last three years. Real-time versions of these models emerged about two years ago, and the crucial breakthrough for real-time, context-aware voice interaction—which understands user intent and specific content, like reading a particular document—is barely a year old. This recent innovation is driving adoption in enterprise settings.

Significant improvements are expected in consumer voice AI soon. On-cloud voice use cases in the automotive sector and other applications should see substantial upgrades this year. While fully in-car (without connectivity) capabilities will take a bit longer, likely within the next two to three years, the ability to perform tasks like having a phone competently read a PDF is anticipated to become widely functional.

We're like living ten years ago somehow.

16:05 - 18:05

Eleven Reader enables AI audiobooks and hints at a future consumer voice app

ElevenLabs developed 'Eleven Reader' to address a challenge faced by audiobook authors: the inability to sell AI-generated audiobooks on major platforms like Audible due to industry resistance. This tool was created to provide an avenue for authors to distribute their works when professional narration was financially out of reach or traditional channels were closed.

Eleven Reader functions by allowing users to upload PDFs or other text, which can then be read aloud using a selection of high-quality AI voices. Examples include celebrity impressions like Sir Michael Caine, offering an accessible way for authors to create narrated versions of their books without human voice talent.

The success of Eleven Reader prompts the question of a broader consumer-facing ElevenLabs app. Such an app could perform common voice tasks on a phone, like reading a PDF aloud, leveraging ElevenLabs' advanced voice technology. This concept addresses a perceived gap in current phone operating system capabilities, which often lack high-quality, customizable voice functions.

The potential for a popular ElevenLabs consumer app could eventually influence major OS makers like Apple and Google. A widely used third-party app for transcription and voice tasks might encourage these companies to integrate more advanced third-party voice engines directly into their systems, similar to how other rapid technological shifts have driven system-level changes.

Shouldn't you guys have a consumer app where I can just do the common voice things, like I want to be able to have an Eleven app on my phone, and then if I upload a PDF to it, it can do the common things that I would like, such as have it read it to me.

18:05 - 21:05

Voice AI faces complex orchestration challenges that prevent it from passing the Turing test

Consumers frequently attempt to use voice mode in AI applications like Gemini and ChatGPT, or with assistants like Siri, but the functionality remains largely ineffective despite high demand. This suggests a significant gap between user expectation and current technological capability in voice interaction.

The primary difficulty lies in orchestrating complex conversational elements. This includes determining when a speaker has finished their turn, executing commands, asking clarifying questions, and integrating speech-to-text with context-aware responses. Unlike text-based large language models, which have already passed the Turing test, voice-based AI still falls far short of human-like conversation.

Developing voice AI that can handle intricate scenarios, such as authenticating users or retrieving information from databases, requires sophisticated orchestration. For example, a system needs to decide if it should immediately respond, wait for more input, or query external tools. ElevenLabs aims to conquer these "voice Turing test" challenges within specific domains like customer support within the next year.
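The respond, wait, or query decision described above can be pictured as a small policy function. This is only a minimal sketch under stated assumptions: the `TurnState` fields, the `Action` names, and the threshold values are all hypothetical illustrations, not anything ElevenLabs has described.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    WAIT = "wait"            # keep listening; the user is probably not done
    CLARIFY = "clarify"      # ask a follow-up question
    CALL_TOOL = "call_tool"  # query an external system before answering
    RESPOND = "respond"      # answer immediately


@dataclass
class TurnState:
    transcript: str           # text recognized so far in this turn
    silence_ms: int           # time since the last detected speech
    intent_confidence: float  # 0.0 to 1.0, from a hypothetical intent classifier
    needs_lookup: bool        # True if answering requires external data


def next_action(state: TurnState,
                end_of_turn_ms: int = 700,
                min_confidence: float = 0.6) -> Action:
    """Pick the agent's next step for the current turn (illustrative thresholds)."""
    # Too little silence, or nothing heard yet: assume the user is still speaking.
    if state.silence_ms < end_of_turn_ms or not state.transcript.strip():
        return Action.WAIT
    # The turn looks finished but the intent is unclear: ask before acting.
    if state.intent_confidence < min_confidence:
        return Action.CLARIFY
    # The intent is clear but the answer lives in a database or other tool.
    if state.needs_lookup:
        return Action.CALL_TOOL
    return Action.RESPOND
```

In practice each of these signals is itself the output of a model, which is part of why the orchestration is hard; the sketch only shows how the decisions relate to one another.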

We passed the Turing test with text LLMs a long time ago, and we're actually nowhere near that on voice LLMs, and it's kind of interesting how that's a final frontier.

22:05 - 26:06

ElevenLabs is developing person-specific voice transcription for superhuman accuracy

Current global voice recognition models often struggle with unique accents and individual speech patterns, as demonstrated by Mati Staniszewski's own 'tricky' accent, which makes parsing difficult even for advanced systems. This challenge highlights the limitations of a one-size-fits-all approach to voice transcription.

ElevenLabs is addressing this by developing person- or voice-specific detection models. This research aims to significantly improve accuracy for individual speakers, not just for accents but also in challenging environments like crowded rooms. The new models are designed to incorporate advanced features such as enhanced speaker diarization, noise reduction, and keyword detection, allowing the system to focus on specific terms or commands.

The company plans to roll out this 'superhuman' transcription capability, which can be fine-tuned to an individual's voice with approximately an hour of audio, within the next few months. This represents a significant breakthrough expected this year, moving beyond general voice recognition to highly personalized and precise transcription.

The applications for this technology are critical and far-reaching, particularly in high-stakes environments. For instance, in healthcare, it would enable perfect transcription of a doctor's commands in an operating room, while in smart home devices, it could ensure accurate responses by prioritizing recognition of specific family members' voices.

No, solvable. We think we can roll it out in one of the next versions, hopefully in the next month. This year, for sure, we are doing person-specific transcription.

26:05 - 30:06

ElevenLabs develops controllable AI voice generation and optimizes cascaded models for enterprise applications

ElevenLabs introduced the v3 model, a significant innovation in controllable speech generation. Previously, AI models independently decided the best emotional performance for generated speech. Now, users can provide cues to control aspects like pace, dramatic pauses, and emotional delivery, enabled by architectural changes and extensive annotated data.

This breakthrough allows for advanced voice agent experiences, such as an "expressive mode" where an agent can detect a user's stress and respond with a reassuring tone. While this technology is still evolving, it marks a substantial step in making AI-generated speech more nuanced and responsive.

The company uses a cascaded approach for conversational AI agents, which involves transcription (speech-to-text), processing by an LLM, and then text-to-speech generation. This method provides high reliability, accuracy, and crucial visibility into each step of the pipeline, which is essential for business and enterprise applications.

In contrast, direct speech-to-speech models bypass the text layer entirely, going directly from speech input to speech output. While they respond significantly faster, these models sacrifice reliability and visibility and are generally "dumber" than cascaded models. They are better suited for companion applications where latency is paramount and occasional deviations or "hallucinations" might even be a feature.
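To make the "visibility into each step" point concrete, here is a minimal sketch of a cascaded pipeline in which every stage's intermediate text is recorded. The class name, the `trace` list, and the three stand-in stage functions are assumptions made for illustration; real stages would call actual speech-to-text, LLM, and text-to-speech models.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class CascadedAgent:
    # Each stage is a plain function so it can be swapped or tested on its own.
    transcribe: Callable[[bytes], str]   # speech-to-text
    reason: Callable[[str], str]         # LLM step: text in, text out
    synthesize: Callable[[str], bytes]   # text-to-speech
    trace: List[Tuple[str, str]] = field(default_factory=list)

    def handle(self, audio_in: bytes) -> bytes:
        user_text = self.transcribe(audio_in)
        self.trace.append(("stt", user_text))    # what the agent heard
        reply_text = self.reason(user_text)
        self.trace.append(("llm", reply_text))   # what the agent decided to say
        audio_out = self.synthesize(reply_text)
        self.trace.append(("tts", f"{len(audio_out)} bytes"))
        return audio_out


# Hypothetical stand-in stages that treat text bytes as "audio".
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


agent = CascadedAgent(fake_stt, fake_llm, fake_tts)
reply = agent.handle(b"where is my order")
```

Because the text between stages is inspectable, a business can log it, apply guardrails to it, or audit it. A direct speech-to-speech model has no equivalent intermediate text to inspect.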

They are definitely dumber.

30:06 - 34:06

Voice AI Transforms Human Interaction and Restores Personal Connections

Voice AI fundamentally changes how people interact with technology and businesses. When ElevenLabs experimented with offering voice interaction instead of a form, they found that people were much more keen to leave details and were more open-ended in describing their use cases. This shift helps businesses gather richer information and allows for better clarification through follow-up questions.

Beyond direct interaction, voice AI is breaking down language barriers. The technology enables high-quality, AI-generated dubbing for media, vastly improving upon past methods like using a single voice actor for all parts in countries like Poland. This expands content accessibility across different languages and cultures.

The technology also holds immense personal significance by restoring voices to individuals who have lost them due to medical conditions or injury. Examples include patients with ALS or throat cancer regaining their ability to speak, a Neuralink patient speaking with their own voice, and a woman recreating her voice for her wedding vows, highlighting its profound emotional impact.

Looking forward, voice AI could enable real-time language translation for travelers and lead to the development of personal voice agents that can assist on an individual's behalf, further integrating the technology into everyday life.

For the first time, she could replicate the marriage ceremony and speak the vows together, which was such a heartfelt moment. Probably the most important from all the work that we do.

34:06 - 36:06

Voice Agents Handle Proactive and Reactive Real-World Tasks

Voice agents are evolving into persistent, long-running entities designed to interact with the world through voice. These agents can operate on both reactive fronts, like handling customer service inquiries, and proactive fronts, such as making restaurant reservations by actually calling establishments.

A practical example is the 'gindex' app, developed using ElevenLabs technology. This app proactively called pubs across Ireland to check the price of a pint of beer. It also allowed pubs to report their prices, demonstrating both proactive information gathering and reactive reporting capabilities.

The utility of voice agents is further highlighted by their integration with other tools. For instance, ElevenLabs is a popular and recommended option for voice within OpenClaw, which frequently seeks out top tools for integration. This shows the growing adoption and importance of voice technology in various applications.

These applications demonstrate how voice agents move beyond simple commands to become interactive tools that can gather specific data and perform actions directly in real-world scenarios.

They were calling all the pubs in Ireland, checking the price of a pint of beer.

36:06 - 40:09

Voice Model Economics: Training, Pricing, and Future Scale

Voice models are substantially less expensive to train compared to large language models (LLMs) or image/video models, typically featuring parameter counts in the low tens of billions, unlike leading LLMs that reach hundreds of billions. Despite this, significant capital expenditure and top-tier research talent are still essential to develop and maintain cutting-edge voice synthesis technology.

ElevenLabs employs a usage-based pricing model, charging customers per text token for text-to-speech services and per minute for applications like voice agents or transcription. A key strategy involves offering access to newly developed models at cost. This encourages customers to experiment, provide crucial feedback, and help showcase the potential of the latest advancements, even if early versions are less reliable.
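The two billing units mentioned above (per text token for text-to-speech, per minute for agents or transcription) combine into a bill by simple arithmetic. The sketch below uses placeholder rates that are purely hypothetical and do not reflect ElevenLabs' actual prices; the function and field names are likewise assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    per_1k_tokens: float  # price per 1,000 text-to-speech tokens (placeholder)
    per_minute: float     # price per agent or transcription minute (placeholder)


def usage_cost(tts_tokens: int, agent_minutes: float, rates: Rates) -> float:
    """Total usage-based charge, rounded to cents."""
    token_cost = tts_tokens / 1000 * rates.per_1k_tokens
    minute_cost = agent_minutes * rates.per_minute
    return round(token_cost + minute_cost, 2)


# Hypothetical rates, chosen only to make the arithmetic easy to follow.
PLACEHOLDER_RATES = Rates(per_1k_tokens=0.10, per_minute=0.05)

# 50,000 tokens -> 50 * 0.10 = 5.00; 120 minutes -> 120 * 0.05 = 6.00
bill = usage_cost(tts_tokens=50_000, agent_minutes=120, rates=PLACEHOLDER_RATES)
```

Offering a new model "at cost" would correspond to setting these rates to the provider's own serving cost rather than a marked-up price.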

The future scale of voice models depends on their architecture. For cascaded approaches, where models are orchestrated for speed and reliability, dramatic increases in size are unlikely. However, fused models, which integrate both language model intelligence and voice capabilities, are anticipated to grow significantly, potentially reaching tens to hundreds of billions of parameters as they combine these complex functionalities.

In a cascaded approach, you probably will not see dramatic size changes. You inherently want the models to be quick and reliable, you want to orchestrate them in a smart way. In a fused approach, probably that will get into tens or hundreds of billions of parameters, because you combine the LM side and the voice side, so that will get bigger.

40:09 - 44:09

ElevenLabs Prioritizes Conversational AI for Full Business-Customer Interactions

ElevenLabs maintains a strong dual focus on both research and product development, enabling them to introduce technical breakthroughs that are then deployed to customers worldwide. This allows them to act as a partner in AI transformation, especially for new use cases in voice agent production, rather than operating purely as a SaaS vendor.

The company's primary strategic focus is on bringing conversational agents to businesses globally. They aim to be a partner for full interactions between businesses and their customers or audience, extending beyond traditional support to include proactive sales (like AI SDRs) and various marketing applications.

ElevenLabs emphasizes its unique value proposition in optimizing for voice-first interactions. While they solve for text, integrations, and knowledge within their platform, they explicitly state they do not optimize for deep reasoning or complex financial analysis tasks, focusing instead on where voice interactions are predominant.

The places where we know we will be able to provide the biggest value are where, ultimately, today you will have either a big portion or most of the interactions coming through voice.

44:09 - 48:09

ElevenLabs Achieves $450M+ ARR with Agile Teams and Land-and-Expand Strategy

ElevenLabs has demonstrated explosive growth, reaching over $450 million in annual recurring revenue (ARR) and securing a remarkable $100 million in net new ARR in a single quarter. This rapid expansion is largely attributed to their AI agent technology, which has evolved to be consistently reliable and high-quality over the past year and a half, driving significant enterprise adoption.

The company's success is rooted in a strategic 'land and expand' approach for enterprise clients. ElevenLabs makes it easy for new customers to try and test their technology by offering attractive initial economics. This strategy fosters increased usage within departments over time as the technology proves its value, making further commitment a natural progression.

Beyond initial departmental adoption, ElevenLabs actively pursues cross-departmental innovation. A notable example is their engagement with Deutsche Telekom, which began with marketing efforts for podcast generation and subsequently expanded to customer support and other operational areas, illustrating clear step changes in the technology's application.

A key organizational factor enabling this hypergrowth is ElevenLabs' structure of small, agile teams, typically consisting of fewer than ten people. These dedicated teams focus on specific product or research initiatives and go-to-market strategies, allowing them to deeply understand target industries and markets, thereby executing quickly and independently.

This quarter was one of the best for enterprise growth, where we had the first quarter hit one hundred million in additional ARR growth, which is crazy, in net new ARR.

48:09 - 52:10

ElevenLabs Prioritizes Self-Serve Adoption and Launches Pay-As-You-Go Billing

ElevenLabs champions a self-serve, product-led growth model, seeing it as crucial for obtaining immediate feedback on their technology's performance. This approach underscores their confidence in their AI models and allows a broad user base to experience the technology directly, fostering trust by minimizing friction in the adoption process.

This strategy enables developers and SMBs to access and experiment with cutting-edge AI features, even if the technology isn't yet fully compliant or scaled for large enterprise demands. These early adopters often reveal the future applications and trajectories for the technology, guiding ElevenLabs' product development.

To meet user demand for flexible consumption, ElevenLabs is launching a full pay-as-you-go billing system. This new model moves beyond traditional subscription plans with fixed limits, allowing users to pay for exactly what they use and avoiding the common frustration of hitting a usage cap when they want to keep using the product.

It's kind of very funny as a consumer to not have the option to pay more to use the product more.

52:10 - 56:13

ElevenLabs' AI-Native Organizational Design

ElevenLabs, founded in 2022, has adopted a distinctly AI-native organizational structure. Unlike traditional companies, it embraces a very flat hierarchy, with co-founders managing over fifteen direct reports each. This approach reflects a belief in maintaining small, efficient teams and a departure from the traditional narrower spans of control.

A core principle is fostering technical proficiency across all teams, even in non-technical functions. If teams lack direct technical expertise, they are supported by dedicated resources that automate workflows. This dual focus ensures that the entire organization can effectively leverage advanced tools and processes.

The company extensively uses AI to amplify human work. They employ AI to "LLMify" data, making it interactive and easily explorable for analysis, such as understanding sales pipelines or identifying effective strategies. This allows teams to quickly access insights and double down on successful approaches.

Furthermore, AI automates many manual tasks, bridging gaps where current agent skill sets might be insufficient. Examples include scraping profiles to identify suitable job candidates or detecting specific successful patterns for go-to-market strategies. AI acts as an additional amplifier, streamlining operations and boosting productivity.

LLMifying everything, like making the data explorable for you to be able to interact with it, of like who's in the pipeline, what worked, who does the best references, like all of that work so you can double down on that.

56:13 - 1:00:13

ElevenLabs' Agency Culture and Voice AI for Ukraine's Digital Government

ElevenLabs leverages its own voice AI internally to boost efficiency across various teams. This includes an AI-powered Sales Development Representative (SDR) experience, creating customized, pre-populated sales decks with relevant data, and a voice agent for new hires to explore company culture and prepare for interviews. These applications aim to amplify employee work and automate simpler tasks.

The company has also deployed its technology for critical public services in Ukraine. Amidst ongoing conflict, Ukraine's "Diia" digital government platform, which offers help with benefits, frontline information, education, and healthcare appointments, integrated ElevenLabs' voice AI. This ensures citizens can access vital information and services when traditional communication channels or physical access are disrupted.

Working with Ukraine's Diia platform validated ElevenLabs' organizational structure, which embeds technical resources within each team. Ukraine's ministries similarly have technical resources building "agentic" versions of their work, coordinated by a central digital transformation team. This parallel approach reinforced ElevenLabs' belief that decentralizing technical expertise across teams is an effective model.

A core tenet of ElevenLabs' success and scaling is fostering a culture of high agency. The company prioritizes individuals who embody first principles, take ownership, strive for excellence, and maintain humility. The emphasis on high agency is particularly vital in the rapidly evolving AI landscape, where individuals with the autonomy to explore and innovate are best positioned to leverage new advancements.

My biggest takeaway from all this has been that around agency, where I feel like high agency people are the winners of the advances in AI and within organizations, low agency people will lose out.
