The world of voice AI, with Mati Staniszewski of ElevenLabs

Uploaded: 2026-04-14T11:30:00.000Z
Duration: 2 h 33 s

How Does an Audio Model Work?

Overview of Audio Models

Mati Staniszewski, co-founder of ElevenLabs, discusses the evolution and mechanics of audio models, emphasizing their ability to replicate human speech with emotional inflection.

Early attempts at audio modeling built analog machines that mimicked the human vocal tract; later work at Bell Labs moved to structured digital representations of speech.

Phoneme Integration

Speech synthesis then moved to stitching together phonemes, the distinct sounds of human speech, using probabilistic approaches to predict the next sound from context.

Modern techniques utilize neural networks to predict sounds while considering both phonemic and contextual text elements.
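The probabilistic "predict the next sound from context" idea can be illustrated with a very small bigram model over phonemes. This is only a toy sketch: the corpus, the ARPAbet-style symbols, and the function names are assumptions made for illustration, and it says nothing about how ElevenLabs' neural models are actually built.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: each word is a list of ARPAbet-style phoneme symbols.
CORPUS = [
    ["HH", "AH", "L", "OW"],  # "hello"
    ["Y", "EH", "L", "OW"],   # "yellow"
    ["L", "OW"],              # "low"
    ["L", "AY", "T"],         # "light"
]


def train_bigram(sequences):
    """Count how often each phoneme is followed by each other phoneme."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts


def next_phoneme_probs(counts, prev):
    """Turn the raw follow-up counts for `prev` into probabilities."""
    total = sum(counts[prev].values())
    if total == 0:
        return {}  # phoneme never seen with a successor
    return {p: c / total for p, c in counts[prev].items()}


model = train_bigram(CORPUS)
probs = next_phoneme_probs(model, "L")
print(probs)  # distribution over phonemes that follow "L" in the toy corpus
```

A neural model replaces the count table with a learned function and conditions on far more context (the surrounding text, not just the previous phoneme), but the output is still a distribution over what sound comes next.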

Innovations in Voice Modeling

Piotr, Staniszewski's co-founder, introduced ideas from transformer models into voice modeling, enhancing reliability and quality in generating voice outputs.

The model operates across different representations: text, mel spectrogram (a time-frequency representation of sound on the perceptual mel scale, often displayed as an image), and waveform.
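To make the mel spectrogram representation more concrete, the sketch below converts between hertz and mels using one commonly used formula and spaces a few band centers evenly on the mel scale. The band count and frequency range are arbitrary assumptions; the summary does not say which features or parameters ElevenLabs uses.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """Convert frequency in Hz to mels (one widely used variant of the formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centers(n_bands: int, f_min: float, f_max: float) -> list:
    """Center frequencies (Hz) of n_bands bands spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_bands)]


# Assumed example values: 8 bands between 0 Hz and 8 kHz.
centers = mel_band_centers(8, 0.0, 8000.0)
for c in centers:
    print(f"{c:8.1f} Hz")
```

Bands that are evenly spaced in mels are packed closely at low frequencies and spread out at high frequencies, which roughly matches how human hearing resolves pitch.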

Key Components of Voice Models

Contextual Understanding

Effective voice models must account for context; for instance, whether a line of dialogue is happy or sad changes how it should be delivered.

Characteristics of Voice

Voice models incorporate characteristics such as accents and prosody. This allows for more natural-sounding speech by referencing how specific phrases should be spoken.

Advancements Over Traditional Methods

Parameter Flexibility

Unlike traditional methods that hard-code parameters (e.g., a British accent), ElevenLabs' approach allows models to deduce these properties dynamically from the input data.

Predictive Capabilities

The system stores voice embeddings for speakers and operates at both the phoneme level and the text level to improve prediction accuracy in real-time applications.
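A voice embedding is a fixed-length vector that summarizes a speaker's characteristics, so stored voices can be compared numerically. The sketch below looks up the closest stored voice by cosine similarity. The four-dimensional vectors, the speaker names, and the `VOICE_STORE` layout are hypothetical; real embeddings are learned by a model and are much larger.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


# Hypothetical store of per-speaker embeddings (made-up values).
VOICE_STORE = {
    "speaker_a": [0.9, 0.1, 0.0, 0.2],
    "speaker_b": [0.1, 0.8, 0.3, 0.0],
}


def closest_speaker(query, store):
    """Name of the stored voice whose embedding is most similar to `query`."""
    return max(store, key=lambda name: cosine_similarity(query, store[name]))


match = closest_speaker([0.85, 0.15, 0.05, 0.1], VOICE_STORE)
print(match)
```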

Challenges in Voice Technology Deployment

Current Limitations

Despite advancements in architecture and data collection methods, there remains a gap between technology readiness and deployment in consumer products like cars or mobile devices.

User Experience Gaps

Users are frustrated with existing voice technologies because of functional limitations (e.g., they cannot read PDFs aloud effectively).

Future Directions for Voice Technology

Anticipated Improvements

By 2025, it is expected that real-time interactions will improve significantly as systems learn user preferences better through contextual understanding.

Personalized Experiences

There is potential for personalized transcription services that adapt to individual accents or speaking styles, enhancing user interaction with technology.

Speech Recognition and Voice Models: Innovations and Challenges

Advances in Speech-to-Text Technology

The discussion begins with the introduction of a voice-specific model that enhances speech recognition, particularly in noisy environments or crowded spaces.

Emphasis is placed on keyword detection, which allows systems to monitor specific words relevant to user interactions, such as ordering coffee.

The technology supports real-time keyword detection, improving transcription accuracy even when the user's voice isn't clearly identifiable.
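As a rough illustration of real-time keyword detection, the sketch below scans a stream of transcribed words as they arrive and reports each keyword immediately. It assumes a speech-to-text stage already emits words one at a time; the keyword set and function name are hypothetical, and a production system might spot keywords directly in the audio instead.

```python
import string

# Hypothetical keywords for a coffee-ordering scenario.
KEYWORDS = {"coffee", "latte", "espresso"}


def detect_keywords(word_stream, keywords):
    """Yield (position, keyword) as soon as a keyword appears in the stream."""
    for position, raw in enumerate(word_stream):
        # Normalize case and strip surrounding punctuation before matching.
        word = raw.lower().strip(string.punctuation)
        if word in keywords:
            yield position, word


# An iterator stands in for words arriving incrementally from a recognizer.
partial_transcript = iter(
    "Hi, could I get a large Latte and one espresso please?".split()
)
hits = list(detect_keywords(partial_transcript, KEYWORDS))
print(hits)
```

Because the function is a generator, a caller can react to the first hit without waiting for the utterance to finish.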

Enhancements in Transcription Accuracy

There is potential for superhuman transcription performance by training models on specific audio samples before processing new inputs.

The speaker notes that achieving person-specific transcription is feasible and expected to be rolled out soon, enhancing accuracy through speaker diarization.

Diarization is highlighted as a challenging aspect of transcription; however, current capabilities are strong.
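Diarization answers "who spoke when". The sketch below covers only the final bookkeeping step: given speaker turns and word timestamps, it labels each word with the speaker whose turn overlaps it most. The `Segment` and `Word` types and all timings are invented for illustration; producing the speaker turns in the first place is the hard part and is not shown.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: str
    start: float  # seconds
    end: float


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_words(words, segments):
    """Attach to each word the speaker whose segment overlaps it the most."""
    labeled = []
    for w in words:
        best = max(segments, key=lambda s: overlap(w.start, w.end, s.start, s.end))
        has_overlap = overlap(w.start, w.end, best.start, best.end) > 0.0
        labeled.append((best.speaker if has_overlap else "unknown", w.text))
    return labeled


# Hypothetical diarization output and word timings.
segments = [Segment("doctor", 0.0, 2.0), Segment("nurse", 2.0, 4.0)]
words = [
    Word("scalpel", 0.5, 1.0),
    Word("please", 1.8, 2.1),  # straddles the turn boundary
    Word("ready", 2.5, 3.0),
    Word("beep", 5.0, 5.2),    # outside every speaker turn
]
labeled = label_words(words, segments)
print(labeled)
```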

Applications in Healthcare and Home Devices

In healthcare settings like operating rooms, precise command recognition from individual doctors is crucial for effective communication amidst background noise.

The conversation touches on ongoing research challenges within voice modeling and the excitement surrounding upcoming breakthroughs.

Speech Generation Innovations

Discussion shifts to speech generation technologies, including features akin to photo editing but applied to voice modulation (e.g., de-accenting filters).

Recent advancements allow users more control over speech output characteristics such as pacing and emotional tone during text-to-speech conversions.

Emotional Intelligence in Voice Agents

New models can generate responses based on detected emotions from users, allowing for more empathetic interactions during conversations.

Cascaded vs. Direct Speech Models

A cascaded approach involving speech-to-text followed by text-to-speech is contrasted with direct speech-to-speech models aimed at reducing latency but potentially sacrificing reliability.

User Interaction Dynamics

Businesses require visibility into each step of the interaction process; thus, maintaining a cascaded approach offers better insights into user engagement metrics.
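The visibility argument for the cascaded approach can be sketched as a pipeline that records the output and timing of every stage. All three stage functions below are hypothetical stand-ins that just pass text through; they are not real speech models or any vendor's API.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Trace:
    """Per-stage record of what each step produced and how long it took."""
    steps: list = field(default_factory=list)

    def record(self, name, output, started):
        elapsed = time.perf_counter() - started
        self.steps.append({"stage": name, "output": output, "seconds": elapsed})


# Hypothetical stand-ins for the three stages of a cascaded voice agent.
def speech_to_text(audio: bytes) -> str:
    return audio.decode("utf-8")  # pretend the audio is already text


def generate_reply(text: str) -> str:
    return f"You said: {text}"


def text_to_speech(text: str) -> bytes:
    return text.encode("utf-8")  # pretend bytes are synthesized audio


def run_cascade(audio: bytes):
    """Run the stages in order, logging every intermediate result."""
    trace = Trace()
    value = audio
    stages = [
        ("speech_to_text", speech_to_text),
        ("generate_reply", generate_reply),
        ("text_to_speech", text_to_speech),
    ]
    for name, stage in stages:
        started = time.perf_counter()
        value = stage(value)
        trace.record(name, value, started)
    return value, trace


reply_audio, trace = run_cascade(b"what time do you close")
for step in trace.steps:
    print(step["stage"], "->", step["output"])
```

A direct speech-to-speech model would collapse the three stages into one call: there are fewer hops to add latency, but there is no intermediate transcript or reply text to inspect or log.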

Second-order Effects of Voice Technology

The impact of voice interaction extends beyond simple tasks; it changes how users engage with businesses, because people are more open when speaking than when filling out forms.

Language Accessibility Through Voice Technology

Ubiquitous, high-quality text-to-speech technology could break down language barriers globally, enabling seamless communication across different languages.

Personal Stories Highlighting Impact

Real-life examples illustrate how individuals who lost their voices due to medical conditions can regain their ability to communicate using advanced voice synthesis technologies.

Understanding Voice Agents' Functionality

Voice agents serve both reactive customer service roles and proactive outreach functions—demonstrated through innovative applications like checking pub prices via automated calls.

Business Model Insights

Training costs for voice models are lower than those for large language models (LLMs); their smaller parameter counts reduce operational expenses while still delivering high-quality outputs.


How ElevenLabs Achieved Rapid Growth in ARR

Business Model and Growth Strategy

ElevenLabs has a dual business model with both self-serve and enterprise components, contributing significantly to its rapid growth of over $450 million in Annual Recurring Revenue (ARR).

The technology behind their services has become more reliable and high-quality over the past year, facilitating user engagement and account expansion.

The company employs a "land and expand" strategy, making it easy for customers to start using their services, which often leads to increased usage across departments.

They offer attractive pricing for their technology, allowing users to test its value before committing to larger purchases based on usage.

Cross-departmental collaboration is evident in client engagements, such as with Deutsche Telekom, where initial projects expanded into broader applications across various departments.

Team Structure and Company Culture

ElevenLabs maintains small teams of fewer than 10 people for each product or research initiative, promoting agility and deep industry understanding.

Their approach combines self-service Product-Led Growth (PLG) with high-touch engineering support for enterprise clients, enhancing distribution and customer experience.

The decision to implement a self-service model was a reaction to competitors' practices, which often require lengthy sales processes; self-service gives users immediate access to the technology.

A strong feedback loop from users helps improve the technology continuously while reinforcing confidence in its capabilities.

They aim to provide the best version of their technology freely available for developers and SMBs while ensuring reliability for enterprise use cases.

AI Integration in Organizational Structure

Discussions around integrating AI into organizational structures focus on whether companies need more senior or junior staff due to AI's impact on job roles.

Questions arise about team sizes—whether smaller or larger teams are more effective—and how finance teams can leverage AI tools like Claude effectively.

ElevenLabs benefits from being newly established without legacy systems; it emphasizes small teams with flat hierarchies that allow rapid decision-making and innovation.

Technical Resources Across Teams

Each team at ElevenLabs includes technical resources that help automate tasks traditionally done manually, enhancing efficiency across operations like talent acquisition.

They utilize advanced data analysis techniques to identify potential candidates effectively through automated scraping methods rather than manual searches.

Tools are developed internally that amplify employee productivity by automating routine tasks while still requiring human oversight for customization.

Cultural Insights from Collaboration with Ukraine

While collaborating with Ukraine during the ongoing conflict, ElevenLabs learned valuable lessons about embedding technical resources within government ministries for efficient service delivery.

This model demonstrated the effectiveness of having dedicated tech personnel within each ministry working towards common goals under a central digital transformation team.

Agency as a Key Factor in Success

At ElevenLabs, fostering agency among employees is crucial; individuals are encouraged to take ownership of their work, which enhances creativity and productivity.

High-agency individuals tend to thrive amid advancements in AI technologies while those lacking agency may struggle within organizations adapting to these changes.