# 136 Deepgram: From Dark Matter to Deep Learning Speech API With Scott Stephenson

The Impact of AI on Translation and Content Generation

Overview of AI in Translation

  • Good data labeling processes and customer satisfaction can boost investor confidence in a company's market potential.
  • AI is making translation more accessible, allowing users to generate specific content based on their prompts.

Introduction to SlatorPod

  • Hosts Esther and Florian welcome listeners back to SlatorPod, setting the stage for an engaging discussion.
  • Upcoming guest Scott Stephenson, CEO of Deepgram, is introduced as a leader in machine learning and automatic speech recognition.

Google Translation Hub Launch

  • Discussion shifts to Google's recent launch of its Translation Hub at a Cloud conference.
  • Google is packaging existing components into a suite for enterprises, including APIs and customizable machine translation solutions.

Key Features of Google’s Offering

  • The hub includes tools like the Google Translate API and AutoML Translation as part of a fully managed solution.
  • CEO Sundar Pichai highlights 135 target languages, emphasizing that "Translation is one way AI is becoming more accessible."

Competitive Landscape

  • Google's offerings are set to compete with Microsoft, Amazon Translate, and potentially DeepL in the language technology sector.
  • Language service providers (LSPs) should monitor these developments closely as they could impact industry dynamics.

Insights from Industry Experts

  • David McNamee from Straker stresses that Google's advancements are worth watching, even though its offering does not cover all of the industry's complexities.

Future Implications

  • Enterprises may build sophisticated workflows using Google's Translation Hub; it offers more capabilities than before but still has limitations.
  • Speculation about Google's market entry will be addressed over time; investors frequently inquire about this potential competition.

Netflix's Content Strategy and Performance

Netflix's Recent Successes

  • The discussion moves from Google to Netflix, whose shares have rebounded as the company continues to produce large amounts of content.

Highlights from Q3 Earnings Call

  • Analysis focuses on successful non-English original content; "Extraordinary Attorney Woo" leads with over 400 million viewing hours.

Comparison with English Content

  • In contrast, "Stranger Things" Season 4 amassed approximately 1.35 billion hours watched—three times that of the top non-English show.

Local Titles with Global Appeal

  • CFO Spencer Neumann discusses local titles' impact across various regions: 'Sintonia' (Brazil), 'The Empress' (Germany), 'High Water' (Poland), and 'Narco-Saints' (Korea).

Netflix's Global Strategy and Market Insights

Netflix's Focus on Local Content

  • Netflix is emphasizing the importance of local content to reach global audiences, showcasing a strategy that aligns with their vision for diverse programming.

Success of "Extraordinary Attorney Woo"

  • Co-CEO Ted Sarandos highlighted the potential of shows like "Extraordinary Attorney Woo," stating they can transform content perceived as culturally specific into globally appealing narratives.

Subscriber Growth Trends

  • In Q3, APAC emerged as Netflix's fastest-growing market, adding 1.43 million subscribers, while North America saw slower growth with only 100,000 new subscribers.

Dubbing and Language Availability

  • The discussion suggests exploring the availability of dubbing options across different regions, noting discrepancies in language offerings between the US and Europe.

ZOO Digital's Market Performance

Stock Price Surge

  • ZOO Digital has seen its share price nearly triple recently, reaching a market cap of almost $170 million due to profitability and growth driven by increased demand for dubbing and subtitling services.

Investor Engagement

  • ZOO Digital held an investor meeting recently; a recording is available on Vimeo for those interested in deeper insights into their business strategies.

Jasper: A Multilingual AI Writing Tool

Overview of Jasper

  • Jasper is valued at $1.5 billion after raising $125 million in Series A funding. It positions itself as a multilingual AI writing assistant capable of generating various types of content based on user prompts.

Technology Foundation

  • Built on GPT-3 technology, Jasper aims to help users overcome language barriers by producing creative content in over 25 languages.

Market Positioning and Valuation Concerns

  • The valuation raises questions about whether Jasper can dominate the B2B writing aid category given its early-stage status (Series A).

Future Prospects for AI Writing Tools

Industry Dynamics

  • There are concerns regarding the sustainability of high valuations in early-stage companies like Jasper amidst competition from other technologies such as Whisper.

Job Market Insights

  • Despite fluctuations in tech valuations, job opportunities within the language industry remain robust, indicating ongoing demand for linguistic services.

Job Index Trends and Market Insights

Job Index Performance

  • The job index experienced a decline in September 2022, marking its first significant drop of the year, aside from the typical January decrease. However, it rebounded to reach its highest level since records began in mid-2018 by November 2022.
  • In November alone, the index climbed six and a half points, contributing to a total increase of 15 points for the year and over 87 points since 2018.

Market Dynamics

  • Many listed language companies are struggling with share performance, yet unemployment remains low in the US, Europe, and various other countries.
  • There is an ongoing sense of anticipation regarding potential negative impacts on the economy; nevertheless, current indicators show positive trends in employment and GDP.

Deepgram's Innovations in Speech Recognition

Introduction to Deepgram

  • Scott Stephenson introduces Deepgram as a speech AI company specializing in multilingual automatic speech recognition (ASR) and real-time transcription services.
  • Deepgram provides API-based solutions for developers creating voice applications across diverse sectors including call centers, podcasts, streaming platforms, and food ordering.
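A minimal sketch of what an API-based transcription call might look like from a developer's side. The endpoint path, the `Token` authorization scheme, the `language` and `punctuate` query parameters, and the response nesting are assumptions based on Deepgram's public REST API and should be checked against the current documentation. The helper names `build_transcription_request` and `extract_transcript` are hypothetical. The request is built but not sent.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint for pre-recorded audio; confirm against Deepgram's docs.
DEEPGRAM_LISTEN_URL = "https://api.deepgram.com/v1/listen"


def build_transcription_request(api_key, audio_url, language="en", punctuate=True):
    """Build (but do not send) a request asking for a transcript of hosted audio."""
    # Query parameters are assumed names; they select language and punctuation.
    query = urllib.parse.urlencode(
        {"language": language, "punctuate": str(punctuate).lower()}
    )
    # The body points the service at an audio file reachable over HTTP.
    body = json.dumps({"url": audio_url}).encode("utf-8")
    return urllib.request.Request(
        f"{DEEPGRAM_LISTEN_URL}?{query}",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Token {api_key}",
            "Content-Type": "application/json",
        },
    )


def extract_transcript(response_json):
    """Return the first transcript, assuming this response nesting."""
    return response_json["results"]["channels"][0]["alternatives"][0]["transcript"]
```

To actually send the request, one would pass the object to `urllib.request.urlopen`, parse the JSON body, and hand it to `extract_transcript`.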

Technical Capabilities

  • The platform supports over 30 languages with low-latency processing that allows for near-human interaction speeds. This efficiency is particularly beneficial for handling large volumes of audio content quickly.

Understanding Beyond Transcription

  • Deepgram emphasizes not just transcribing words but also understanding context within audio—such as identifying topics discussed, speaker counts, sentiment analysis, and locating specific clips based on user interest.

Challenges in Automatic Speech Recognition

Key Challenges Identified

  • Scott highlights two main challenges:
      • Technical: The need for labeled data is critical; acquiring quality labeled audio data is costly compared to the text data available online.
      • Market: Unlike self-driving cars or image recognition, which consumers perceive as complex tasks, automatic speech recognition is not seen by users as equally difficult.

Data Acquisition Issues

  • Obtaining sufficient labeled audio data requires substantial financial investment (ranging from tens to hundreds of millions), making it one of the most significant barriers to advancing ASR technology.

Understanding the Evolution of Audio Technology and Market Dynamics

The Challenge of Perception in Audio Technology

  • The ease of understanding software capabilities is often undervalued, as people are more entertained by visual elements than audio functionalities.
  • Visual features like bounding boxes can captivate users, while accurate transcription and sentiment analysis are expected without much appreciation for their complexity.

Advancements in Transcription and Sentiment Analysis

  • Recent advancements have led to near-human accuracy in transcription, punctuation, sentiment analysis, automatic language detection, and translation.
  • After five challenging years, the market is beginning to recognize the value of these technologies as they approach human-level performance.

Expectations vs. Reality in Technological Adoption

  • Initial expectations were that breakthroughs would lead to rapid adoption; however, business dynamics require social proof and strategic marketing for success.
  • The transition from academic insights to practical business applications revealed complexities not initially anticipated by the founders.

Competitive Landscape and Market Strategy

  • Gaining traction requires convincing competitors' clients to switch over before larger companies will adopt new technologies.
  • Founders underestimated the importance of distribution strategies, sales approaches, and product packaging beyond just technical superiority.

Background Influences on Founding Team's Approach

  • The founding team comprised particle physicists with experience building sensitive detectors under unique conditions.
  • Their work involved real-time machine learning on complex waveforms similar to audio signals, which laid the groundwork for developing Deepgram’s technology.

Exploring the Origins of Deepgram

Innovative Audio Recording Techniques

  • Scott reflects on the unique experience of working in an underground setting, likening it to a "James Bond lair." He describes creating devices for continuous audio recording, resulting in over a thousand hours of data.
  • Initially seeking existing solutions for identifying interesting moments within their extensive recordings, they explored various companies like Nuance and IBM Watson but found no suitable products.

Challenges with Existing Technologies

  • Conversations with speech experts at major tech firms revealed skepticism about the feasibility of end-to-end deep learning for audio understanding, particularly regarding language complexities.
  • This skepticism motivated the speaker to start their company seven years ago, as they believed there was a significant opportunity in developing effective audio analysis technology.

Validation Through Industry Developments

  • The emergence of models like OpenAI Whisper serves as validation for their long-term efforts in end-to-end deep learning applications in audio processing.

Target Markets and Customer Segments

Voice AI Companies

  • Scott identifies three primary customer segments for Deepgram. The first is voice AI companies, typically young startups focused on integrating voice technology into their products.
  • These companies often begin by seeking transcription services or ways to analyze audio content before realizing the limitations and costs associated with open-source software or major cloud providers.

Transitioning to AI Compute Companies

  • As these voice AI companies scale and encounter challenges managing transcription infrastructure, they seek partnerships with specialized providers like Deepgram that can handle these needs efficiently.

Connectivity Companies

  • A second segment includes connectivity companies (10 to 25 years old), such as Twilio. These businesses view voice AI not as their core offering but as an avenue for market expansion and competitive advantage.

AI Automation in Enterprises and Open Source Models

The Role of AI in Customer Experience

  • Companies are seeking partners that provide platforms for building AI solutions, which makes platform providers such as Deepgram a strong choice for enterprises.
  • Enterprises, particularly older companies like banks and Comcast, aim to enhance customer service by removing human bottlenecks through AI automation.
  • On-premise solutions are essential for conservative enterprises that prioritize data security; flexibility between cloud and on-prem is crucial.

Whisper vs. Deepgram: Key Differences

  • Whisper represents a significant advancement in open-source models, improving aspects like capitalization and punctuation compared to previous models such as Wav2Vec and Kaldi.
  • While Whisper supports multiple languages, its accuracy diminishes beyond the top ten languages; it was trained on extensive public data using modern model architectures.
  • Despite its strengths, Whisper has limitations including slow performance, high operational costs, and lack of real-time capabilities which hinder certain applications.

Industry Impact of Whisper

  • Users may initially be impressed with Whisper's capabilities but will soon encounter limitations regarding language support and timing features necessary for specific applications.
  • The release of Whisper has sparked interest in speech recognition technology across the industry, prompting other companies to improve their offerings based on insights from Whisper's development.
  • Deepgram acknowledges learning from Whisper’s architecture while also integrating it into their services; this collaboration signifies a milestone in speech recognition advancements.

Understanding the Evolution of AI Models

The Lifecycle of AI Models

  • Awareness is created as users begin to build with initial models, leading to a demand for more sophisticated solutions over time.
  • The rapid obsolescence of models is highlighted; what works today may not be effective in six months or a year.
  • Companies like OpenAI have adopted a strategy of releasing initial versions but later restricting access due to concerns about misuse, as seen with GPT-2 and GPT-3.
  • Anticipation exists for future iterations of models like Whisper, which will likely follow similar release strategies while other companies may also develop competing technologies.

Translation Challenges in AI

  • Deepgram's approach to translation involves either sourcing from others or developing their own systems, emphasizing the importance of translation capabilities for users.
  • There are two approaches to translating speech: transcribing the spoken language first and then translating the transcript, which can compound errors, or translating directly from the audio into the target language, which avoids that intermediate step.
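The point about compounding errors in a transcribe-then-translate pipeline can be illustrated with a rough back-of-the-envelope sketch. It assumes the two stages fail independently, which is a simplification, and the accuracy figures are hypothetical numbers chosen only for illustration, not measurements from the episode.

```python
def cascade_accuracy(stage_accuracies):
    """Probability that every stage is correct, assuming independent errors."""
    result = 1.0
    for accuracy in stage_accuracies:
        result *= accuracy
    return result


# Hypothetical per-sentence accuracies, chosen only for illustration.
asr_accuracy = 0.90  # speech -> source-language transcript
mt_accuracy = 0.90   # transcript -> translated text

# Each stage inherits the mistakes of the one before it.
cascaded = cascade_accuracy([asr_accuracy, mt_accuracy])
print(f"cascaded pipeline: {cascaded:.2f}")
```

A direct model has only one stage, so nothing is multiplied, although that alone does not guarantee its single-stage accuracy is higher than the cascaded figure.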

Advancements in Real-Time Translation

  • Whisper can translate directly from speech in one language into another language without an intermediate source-language transcript, which can improve accuracy and efficiency.
  • While real-time translation is challenging due to structural differences between languages, advancements are being made towards achieving near-real-time capabilities.

Data Management and Model Training

  • The development of universal translators is underway as multimodal models become feasible; this progress relies on the availability of quality data rather than just model sophistication.
  • Managing low-resource languages presents challenges; acquiring high-quality data is essential for building effective models.

Building Effective Language Models

  • A combination of sourced data (both paid and free), along with real-world usage data collection and labeling, is necessary for creating robust language models.
  • The amount of training data required varies by language; foundational models trained on extensive datasets can support multiple languages effectively.

Understanding Fine-Tuning in AI Models

The Concept of Fine-Tuning

  • The process involves exposing a model to various languages and then allowing it to specialize in one, enhancing its proficiency through focused training.
  • This technique is often referred to as fine-tuning, transfer learning, or adaptation, where a base model is evolved using new data without needing extensive datasets.
  • The expectation is that advancements in these techniques will lead to significant improvements in underrepresented languages like Swiss-German.
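The idea of adapting a base model with a small amount of new data can be sketched with a toy example. This is not how Deepgram trains its models: the one-variable linear model, the data, and the `train` and `mse` helpers are all hypothetical stand-ins. A model is first fitted on plenty of "general" data, then only its offset is adjusted on three examples from a related task while the learned slope stays frozen.

```python
def train(w, b, data, lr=0.05, epochs=200, freeze_w=False):
    """Gradient descent on mean squared error for the model y = w*x + b."""
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        if not freeze_w:
            w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def mse(w, b, data):
    """Mean squared error of the model on a dataset."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


# "Pretraining": plenty of data from a general task (stand-in for many languages).
general = [(x / 10, 2.0 * (x / 10) + 1.0) for x in range(-20, 21)]
base_w, base_b = train(0.0, 0.0, general)

# "Fine-tuning": three examples from a related task (stand-in for one
# low-resource language). The slope is shared; only the offset differs.
specific = [(-1.0, -1.5), (0.0, 0.5), (1.0, 2.5)]  # follows y = 2x + 0.5
tuned_w, tuned_b = train(base_w, base_b, specific, freeze_w=True)

print("error before fine-tuning:", mse(base_w, base_b, specific))
print("error after fine-tuning: ", mse(tuned_w, tuned_b, specific))
```

The design choice mirrors the bullet points above: most of what the base model learned is reused, so the small dataset only has to teach what is different about the new task.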

Fundraising Landscape for AI Companies

  • In 2021, the company raised a Series B round; currently, AI sectors are performing well compared to other industries like B2B SaaS.
  • Building a successful AI company requires balancing aggressive growth with disciplined spending due to high operational costs associated with data labeling and research.
  • Investors expect AI companies to meet SaaS metrics while also managing higher expenses related to product performance.

Market Opportunities Amidst Economic Challenges

  • Companies excelling in customer satisfaction and innovative solutions attract investor confidence even during economic downturns.
  • There’s an increasing demand for automation and AI solutions as businesses seek cost-cutting measures while maintaining productivity.

Hiring Trends in the Tech Industry

  • Although the funding environment isn't exceptionally favorable for all companies, there are opportunities for growth within the AI sector amidst layoffs at larger tech firms.
  • Startups may find hiring easier now due to reduced competition from big tech companies.

Exciting Developments on the Product Roadmap

Innovations on the Horizon

  • Upcoming features include real-time translation capabilities which could revolutionize communication by making language barriers less significant.
  • Users will be able to hear their own speech rendered in different languages through text-to-speech technology and voice cloning, enhancing user engagement.

Strategic Goals for Cost Reduction

  • Deepgram aims to significantly reduce prices while expanding language offerings and improving service reliability.
  • Lowering costs could potentially increase market size dramatically by making services more accessible.

Exciting Frontiers in Technology

Emerging Applications in Voice and Language Technologies

  • The discussion highlights the excitement surrounding new applications such as voice cloning, text-to-speech, real-time translation, and sentiment analysis. These advancements are expected to evolve significantly over the next year and a half.
  • There is a low tolerance among users for ineffective live translation tools. Previous experiences with these plugins often lead to frustration when they fail to function properly.
  • The conversation emphasizes that we are on the brink of achieving more reliable translation technologies that will enhance user experience without disrupting communication flow.
  • As technology improves, users may soon find these tools not only helpful but eventually indispensable in their daily interactions.
  • The speaker expresses optimism about the future developments in language processing technologies, indicating a transformative potential for how people communicate across languages.

Video description

Scott Stephenson, CEO of Deepgram, joins SlatorPod to talk about his unique journey to co-founding the deep tech, automatic speech recognition (ASR) company and raising over USD 50m in funding. Scott recalls how his experience working with dark matter detectors as a particle physicist in China led to him becoming a deep-learning entrepreneur. He discusses some challenges in solving ASR, from labeling data for machine learning to formulating and executing an effective go-to-market strategy. The CEO gives his thoughts on Whisper, OpenAI's open-source ASR model, and how it may actually grow the total addressable market for voice AI companies. He shares the difficulties of translating a transcript versus translating straight from audio into another language. Scott gives his advice on how to build a successful AI company and appeal to investors. The pod rounds off with Deepgram's roadmap for the next year, with text to speech, voice cloning, real-time translation, and sentiment analysis being potential step changes in their growth trajectory.

First up, Florian and Esther catch up on the language industry news from the past month, with Google announcing the launch of Translation Hub, its enterprise-scale document translation service. Esther discusses some of the language highlights from Netflix's third-quarter earnings call, including the titles of some of the best-performing non-English content. Meanwhile, Zoo Digital's share price was at a near all-time high as they weighed in at an almost USD 170m market cap. The duo also talk about funding, where multilingual AI writer Jasper announced it had raised USD 125m in its unicorn-making Series A, which valued the startup at USD 1.5bn. And, after a dip in the Slator Language Industry Job Index in September, the LIJI defied expectations of a slowdown as it reached an all-time high in November.

Deepgram: https://deepgram.com/

Chapter Markers:
00:00:00 Intro and Agenda
00:01:35 Google's Translation Hub
00:04:43 Netflix Q3 Language Highlights
00:08:11 Zoo Digital's Share Price Triples
00:09:32 Multilingual AI Writer Jasper Valued at 1.5bn
00:13:46 Job Index Reaches All-Time High in November
00:16:25 Scott Stephenson Joins the Pod
00:16:54 Deepgram's Elevator Pitch
00:18:48 Biggest ASR Challenge
00:22:44 Transition From Dark Matter to Founding Deepgram
00:29:09 Key Target Markets and Industry Segments
00:34:33 Reaction to OpenAI's Whisper
00:41:18 Thoughts on Speech Translation
00:43:53 Managing Low-Resource Data
00:46:57 AI Funding Environment
00:50:03 Product Roadmap for 2022 and Beyond