The world of voice AI, with Mati Staniszewski of ElevenLabs

How Does an Audio Model Work?

Overview of Audio Models

  • Mati Staniszewski, co-founder of ElevenLabs, discusses the evolution and mechanics of audio models, emphasizing their ability to replicate human speech with emotional inflection.
  • Early attempts at audio modeling built analog machines that mimicked the human vocal tract; later work at Bell Labs moved to structured digital signals for speech.

Phoneme Integration

  • The process involves stitching together phonemes—distinct sounds in human speech—using probabilistic approaches to predict the next sound based on context.
  • Modern techniques utilize neural networks to predict sounds while considering both phonemic and contextual text elements.
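
The probabilistic idea in the bullets above can be sketched with a toy bigram model that estimates which phoneme is likely to follow another. This is only an illustration of "predict the next sound from context": the function names and the three-word ARPAbet-style corpus are assumptions made for the example, and production systems use neural networks over far richer context.

```python
from collections import Counter, defaultdict

def train_bigram(sequences):
    """Count how often each phoneme is followed by each other phoneme."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts

def next_phoneme_probs(counts, prev):
    """Return P(next | prev) as a dict, or {} if prev was never seen."""
    followers = counts.get(prev, Counter())
    total = sum(followers.values())
    if total == 0:
        return {}
    return {phoneme: n / total for phoneme, n in followers.items()}

# Toy ARPAbet-style transcriptions (illustrative data, not from the talk).
corpus = [
    ["HH", "AH", "L", "OW"],  # "hello"
    ["Y", "EH", "L", "OW"],   # "yellow"
    ["L", "AY", "T"],         # "light"
]

model = train_bigram(corpus)
probs = next_phoneme_probs(model, "L")  # distribution over sounds that follow "L"
```

In this corpus "L" is followed by "OW" twice and "AY" once, so the model assigns them probabilities of 2/3 and 1/3.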

Innovations in Voice Modeling

  • Piotr, Staniszewski's co-founder, introduced ideas from transformer models into voice modeling, enhancing reliability and quality in generating voice outputs.
  • The model operates across different representations: text, mel spectrogram (a time-frequency representation of sound on the perceptual mel scale, usually displayed as an image), and waveform.
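
To make the mel spectrogram representation more concrete, the sketch below converts between hertz and mels using one widely used formula, m = 2595 · log10(1 + f / 700), and computes filter band edges spaced evenly on the mel scale. The summary does not say which variant ElevenLabs uses, so the formula choice, function names, and the 0 to 8000 Hz range are assumptions for illustration.

```python
import math

def hz_to_mel(f_hz):
    """Convert a frequency in Hz to mels (common 2595 * log10 variant)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_band_edges(f_min, f_max, n_bands):
    """Return n_bands + 2 edge frequencies (Hz) spaced evenly in mels."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_bands + 2)]

# Edges for 4 bands between 0 Hz and 8 kHz (assumed range).
edges = mel_band_edges(0.0, 8000.0, 4)
```

Because the edges are uniform in mels rather than in hertz, the bands get wider as frequency rises, which roughly matches how human hearing resolves pitch.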

Key Components of Voice Models

Contextual Understanding

  • Effective voice models must account for context; for instance, recognizing when a dialogue is happy or sad influences pronunciation.

Characteristics of Voice

  • Voice models incorporate characteristics such as accents and prosody. This allows for more natural-sounding speech by referencing how specific phrases should be spoken.

Advancements Over Traditional Methods

Parameter Flexibility

  • Unlike traditional methods that hard-code parameters (e.g., British accent), ElevenLabs' approach allows models to deduce these properties dynamically based on input data.

Predictive Capabilities

  • The system stores voice embeddings for speakers and operates at both the phoneme level and the text level to enhance prediction accuracy during real-time applications.
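
A voice embedding can be thought of as a fixed-length vector that summarizes a speaker. The sketch below shows one plausible way to use stored embeddings: comparing a query vector against a small store with cosine similarity. The three-dimensional vectors, the `voice_store` dictionary, and the function names are hypothetical; real embeddings are learned by a model and have many more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def closest_voice(query, store):
    """Return the name of the stored voice most similar to the query."""
    return max(store, key=lambda name: cosine_similarity(query, store[name]))

# Hypothetical stored embeddings (toy 3-dimensional vectors).
voice_store = {
    "speaker_a": [0.9, 0.1, 0.0],
    "speaker_b": [0.0, 0.2, 0.9],
}

match = closest_voice([0.8, 0.2, 0.1], voice_store)
```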

Challenges in Voice Technology Deployment

Current Limitations

  • Despite advancements in architecture and data collection methods, there remains a gap between technology readiness and deployment in consumer products like cars or mobile devices.

User Experience Gaps

  • Users experience frustration with existing voice technologies due to limitations in functionality (e.g., reading PDFs aloud effectively).

Future Directions for Voice Technology

Anticipated Improvements

  • By 2025, it is expected that real-time interactions will improve significantly as systems learn user preferences better through contextual understanding.

Personalized Experiences

  • There is potential for personalized transcription services that adapt to individual accents or speaking styles, enhancing user interaction with technology.

Speech Recognition and Voice Models: Innovations and Challenges

Advances in Speech-to-Text Technology

  • The discussion begins with the introduction of a voice-specific model that enhances speech recognition, particularly in noisy environments or crowded spaces.
  • Emphasis is placed on keyword detection, which allows systems to monitor specific words relevant to user interactions, such as ordering coffee.
  • The technology supports real-time keyword detection, improving transcription accuracy even when the user's voice isn't clearly identifiable.
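
As a rough illustration of keyword monitoring, the sketch below scans a stream of transcript chunks for a watch list of terms. Note the substitution: the summary describes detection inside a voice-specific model, whereas this example matches plain text after transcription. The function name, the chunked input, and the coffee-order keywords are assumptions for the example.

```python
import re

def detect_keywords(transcript_chunks, keywords):
    """Yield (chunk_index, keyword) for each watched keyword found in a chunk."""
    # Whole-word, case-insensitive patterns so "latte" does not match "platter".
    patterns = {
        kw: re.compile(r"\b" + re.escape(kw) + r"\b", re.IGNORECASE)
        for kw in keywords
    }
    for index, chunk in enumerate(transcript_chunks):
        for kw, pattern in patterns.items():
            if pattern.search(chunk):
                yield index, kw

# Chunks arrive one at a time, as they might from a streaming transcriber.
chunks = ["I'd like a large", "Latte with oat milk please"]
hits = list(detect_keywords(chunks, ["latte", "oat milk"]))
```

A limitation of this text-level approach is that a keyword split across two chunks is missed; a real system would need to buffer across chunk boundaries or detect keywords in the audio itself.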

Enhancements in Transcription Accuracy

  • There is potential for superhuman transcription performance by training models on specific audio samples before processing new inputs.
  • The speaker notes that achieving person-specific transcription is feasible and expected to be rolled out soon, enhancing accuracy through speaker diarization.
  • Diarization is highlighted as a challenging aspect of transcription; however, current capabilities are strong.

Applications in Healthcare and Home Devices

  • In healthcare settings like operating rooms, precise command recognition from individual doctors is crucial for effective communication amidst background noise.
  • The conversation touches on ongoing research challenges within voice modeling and the excitement surrounding upcoming breakthroughs.

Speech Generation Innovations

  • Discussion shifts to speech generation technologies, including features akin to photo editing but applied to voice modulation (e.g., de-accenting filters).
  • Recent advancements allow users more control over speech output characteristics such as pacing and emotional tone during text-to-speech conversions.

Emotional Intelligence in Voice Agents

  • New models can generate responses based on detected emotions from users, allowing for more empathetic interactions during conversations.

Cascaded vs. Direct Speech Models

  • A cascaded approach involving speech-to-text followed by text-to-speech is contrasted with direct speech-to-speech models aimed at reducing latency but potentially sacrificing reliability.

User Interaction Dynamics

  • Businesses require visibility into each step of the interaction process; thus, maintaining a cascaded approach offers better insights into user engagement metrics.
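
The visibility argument for the cascaded approach can be sketched as a pipeline where every stage records its input, output, and latency. All three stage functions here are stand-in stubs, and the language-model step in the middle is an assumption about how such agents are commonly wired; the summary itself only names speech-to-text and text-to-speech.

```python
import time
from dataclasses import dataclass, field

@dataclass
class Trace:
    """Per-turn log of what each stage received, produced, and how long it took."""
    stages: list = field(default_factory=list)

def run_stage(trace, name, fn, payload):
    """Run one stage and append its input, output, and duration to the trace."""
    start = time.perf_counter()
    result = fn(payload)
    trace.stages.append({
        "stage": name,
        "input": payload,
        "output": result,
        "seconds": time.perf_counter() - start,
    })
    return result

# Stub stages (hypothetical): real ones would call actual models.
def fake_stt(audio):
    return audio.decode("utf-8")

def fake_llm(text):
    return "You said: " + text

def fake_tts(text):
    return text.encode("utf-8")

def cascaded_turn(audio):
    """Speech in, speech out, with an inspectable record of every step."""
    trace = Trace()
    text = run_stage(trace, "speech_to_text", fake_stt, audio)
    reply = run_stage(trace, "language_model", fake_llm, text)
    speech = run_stage(trace, "text_to_speech", fake_tts, reply)
    return speech, trace

speech, trace = cascaded_turn(b"one flat white")
```

A direct speech-to-speech model would collapse these stages into a single call, which can reduce latency but leaves no intermediate text to log or audit.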

Second-order Effects of Voice Technology

  • The impact of voice interaction extends beyond simple tasks; it changes how users engage with businesses, because people are more open when speaking than when filling in forms.

Language Accessibility Through Voice Technology

  • Ubiquitous, high-quality text-to-speech technology could break down language barriers globally, enabling seamless communication across different languages.

Personal Stories Highlighting Impact

  • Real-life examples illustrate how individuals who lost their voices due to medical conditions can regain their ability to communicate using advanced voice synthesis technologies.

Understanding Voice Agents' Functionality

  • Voice agents serve both reactive customer service roles and proactive outreach functions—demonstrated through innovative applications like checking pub prices via automated calls.

Business Model Insights

  • Training costs for voice models are lower compared to large language models (LLMs), with smaller parameter counts leading to reduced operational expenses while still delivering high-quality outputs.

How ElevenLabs Achieved Rapid Growth in ARR

Business Model and Growth Strategy

  • ElevenLabs has a dual business model with both self-serve and enterprise components, contributing significantly to its rapid growth of over $450 million in Annual Recurring Revenue (ARR).
  • The technology behind their services has become more reliable and high-quality over the past year, facilitating user engagement and account expansion.
  • The company employs a "land and expand" strategy, making it easy for customers to start using their services, which often leads to increased usage across departments.
  • They offer attractive pricing for their technology, allowing users to test its value before committing to larger purchases based on usage.
  • Cross-departmental collaboration is evident in client engagements, such as with Deutsche Telekom, where initial projects expanded into broader applications across various departments.

Team Structure and Company Culture

  • ElevenLabs maintains small teams of fewer than 10 people for each product or research initiative, promoting agility and deep industry understanding.
  • Their approach combines self-service Product-Led Growth (PLG) with high-touch engineering support for enterprise clients, enhancing distribution and customer experience.
  • The decision to implement a self-service model was influenced by competitors' practices that often require lengthy sales processes; this allows immediate access to technology for users.
  • A strong feedback loop from users helps improve the technology continuously while reinforcing confidence in its capabilities.
  • They aim to make the best version of their technology freely available to developers and SMBs while ensuring reliability for enterprise use cases.

AI Integration in Organizational Structure

  • Discussions around integrating AI into organizational structures focus on whether companies need more senior or junior staff due to AI's impact on job roles.
  • Questions arise about team sizes—whether smaller or larger teams are more effective—and how finance teams can leverage AI tools like Claude effectively.
  • ElevenLabs benefits from being newly established without legacy systems; the company emphasizes small teams with flat hierarchies that allow rapid decision-making and innovation.

Technical Resources Across Teams

  • Each team at ElevenLabs includes technical resources that help automate tasks traditionally done manually, enhancing efficiency across operations like talent acquisition.
  • They utilize advanced data analysis techniques to identify potential candidates effectively through automated scraping methods rather than manual searches.
  • Tools are developed internally that amplify employee productivity by automating routine tasks while still requiring human oversight for customization.

Cultural Insights from Collaboration with Ukraine

  • During collaboration efforts in Ukraine amidst ongoing conflict, ElevenLabs learned valuable lessons about embedding technical resources within government ministries for efficient service delivery.
  • This model demonstrated the effectiveness of having dedicated tech personnel within each ministry working towards common goals under a central digital transformation team.

Agency as a Key Factor in Success

  • At ElevenLabs, fostering agency among employees is crucial; individuals are encouraged to take ownership of their work, which enhances creativity and productivity.
  • High-agency individuals tend to thrive amid advancements in AI technologies while those lacking agency may struggle within organizations adapting to these changes.

Mati Staniszewski is the co-founder of ElevenLabs, the research company making audio accessible across languages and voices. He sits down with John to discuss the "voice Turing Test" and why AI has conquered text but still struggles with conversational speech. They discuss the future of human-computer interaction, including why we still can't get our phones to read a PDF properly and the massive potential for voice agents in everything from farming to healthcare. Mati also opens up about ElevenLabs’ rapid ascent to an $11 billion valuation and gives a behind-the-scenes look at how Ukraine is using their tech for digital government services.

Full transcript on Substack: https://open.substack.com/pub/cheekypint/p/the-world-of-voice-ai-with-mati-staniszewski

Subscribe to Cheeky Pint

  • Spotify: https://open.spotify.com/show/2IHbGJJ...
  • Apple Podcasts: https://podcasts.apple.com/us/podcast...

Key moments

  • 00:00:27 How audio models work
  • 00:08:52 ElevenLabs business model
  • 00:17:50 The conversational Turing Test
  • 00:21:01 Link by Stripe
  • 00:26:02 Cascaded vs speech-to-speech
  • 00:31:53 Universal translation
  • 00:51:41 Designing an AI-native org