# 136 Deepgram: From Dark Matter to Deep Learning Speech API With Scott Stephenson

Name: # 136 Deepgram: From Dark Matter to Deep Learning Speech API With Scott Stephenson
Uploaded: 2022-11-04T08:59:02.000Z
Duration: 1 h 46 min 39 s

The Impact of AI on Translation and Content Generation

Overview of AI in Translation

Good data labeling processes and customer satisfaction can boost investor confidence in a company's market potential.

AI is making translation more accessible, allowing users to generate specific content based on their prompts.

Introduction to SlatorPod

Hosts Esther and Florian welcome listeners back to SlatorPod, setting the stage for an engaging discussion.

Upcoming guest Scott Stephenson, CEO of Deepgram, is introduced as a leader in machine learning and automatic speech recognition.

Google Translation Hub Launch

Discussion shifts to Google's recent launch of its Translation Hub at a Cloud conference.

Google is packaging existing components into a suite for enterprises, including APIs and customizable machine translation solutions.

Key Features of Google’s Offering

The hub includes tools like the Google Translate API and AutoML Translation as part of a fully managed solution.

CEO Sundar Pichai highlights 135 target languages, emphasizing that "Translation is one way AI is becoming more accessible."

Competitive Landscape

Google's offerings are set to compete with Microsoft, Amazon Translate, and potentially DeepL in the language technology sector.

Language service providers (LSPs) should monitor these developments closely as they could impact industry dynamics.

Insights from Industry Experts

David McNamee from Straker emphasizes the importance of watching Google's advancements, even though Google's offering does not cover all of the industry's complexities.

Future Implications

Enterprises may build sophisticated workflows using Google's Translation Hub; it offers more capabilities than before but still has limitations.

Speculation about Google's market entry will be addressed over time; investors frequently inquire about this potential competition.

Netflix's Content Strategy and Performance

Netflix's Recent Successes

The discussion moves from Google to Netflix, whose shares have rebounded as the company produces significant amounts of content.

Highlights from Q3 Earnings Call

Analysis focuses on successful non-English original content; "Extraordinary Attorney Woo" leads with over 400 million viewing hours.

Comparison with English Content

In contrast, "Stranger Things" Season 4 amassed approximately 1.35 billion hours watched, roughly three times that of the top non-English show.

Local Titles with Global Appeal

CFO Spencer Neumann discusses local titles' impact across various regions: 'Sintonia' (Brazil), 'The Empress' (Germany), 'High Water' (Poland), and 'Narco-Saints' (Korea).

Netflix's Global Strategy and Market Insights

Netflix's Focus on Local Content

Netflix is emphasizing the importance of local content to reach global audiences, showcasing a strategy that aligns with their vision for diverse programming.

Success of "Extraordinary Attorney Woo"

Co-CEO Ted Sarandos highlighted the potential of shows like "Extraordinary Attorney Woo," stating they can transform content perceived as culturally specific into globally appealing narratives.

Subscriber Growth Trends

In Q3, APAC emerged as Netflix's fastest-growing market, adding 1.43 million subscribers, while North America saw slower growth with only 100,000 new subscribers.

Dubbing and Language Availability

The discussion suggests exploring the availability of dubbing options across different regions, noting discrepancies in language offerings between the US and Europe.

ZOO Digital's Market Performance

Stock Price Surge

ZOO Digital has seen its share price nearly triple recently, reaching a market cap close to $160 million due to profitability and growth driven by increased demand for dubbing and subtitling services.

Investor Engagement

ZOO Digital held an investor meeting recently; a recording is available on Vimeo for those interested in deeper insights into their business strategies.

Jasper: A Multilingual AI Writing Tool

Overview of Jasper

Jasper is valued at $1.5 billion after raising $125 million in Series A funding. It positions itself as a multilingual AI writing assistant capable of generating various types of content based on user prompts.

Technology Foundation

Built on GPT-3 technology, Jasper aims to help users overcome language barriers by producing creative content in over 25 languages.

Market Positioning and Valuation Concerns

The valuation raises questions about whether Jasper can dominate the B2B writing aid category given its early-stage status (Series A).

Future Prospects for AI Writing Tools

Industry Dynamics

There are concerns regarding the sustainability of high valuations in early-stage companies like Jasper amidst competition from other technologies such as Whisper.

Job Market Insights

Despite fluctuations in tech valuations, job opportunities within the language industry remain robust, indicating ongoing demand for linguistic services.

Job Index Trends and Market Insights

Job Index Performance

The job index declined in September 2022, its first significant drop of the year aside from the typical January decrease. However, by November 2022 it had rebounded to its highest level since records began in mid-2018.

In November alone, the index climbed six and a half points, contributing to a total increase of 15 points for the year and over 87 points since 2018.

Market Dynamics

Many language companies are facing difficulties with share performance, and employment rates in the US and Europe have fluctuated. However, unemployment remains low across various countries.

There is an ongoing sense of anticipation regarding potential negative impacts on the economy; nevertheless, current indicators show positive trends in employment and GDP.

Deepgram's Innovations in Speech Recognition

Introduction to Deepgram

Scott Stephenson introduces Deepgram as a speech AI company specializing in multilingual automatic speech recognition (ASR) and real-time transcription services.

Deepgram provides API-based solutions for developers creating voice applications across diverse sectors including call centers, podcasts, streaming platforms, and food ordering.

Technical Capabilities

The platform supports over 30 languages with low-latency processing that allows for near-human interaction speeds. This efficiency is particularly beneficial for handling large volumes of audio content quickly.

Understanding Beyond Transcription

Deepgram emphasizes not just transcribing words but also understanding context within audio—such as identifying topics discussed, speaker counts, sentiment analysis, and locating specific clips based on user interest.

Challenges in Automatic Speech Recognition

Key Challenges Identified

Scott highlights two main challenges:

Technical: The need for labeled data is critical; acquiring quality labeled audio data is costly compared to text data available online.

Market: Consumers perceive self-driving cars and image recognition as complex tasks, but they do not perceive automatic speech recognition as similarly difficult.

Data Acquisition Issues

Obtaining sufficient labeled audio data requires substantial financial investment (ranging from tens to hundreds of millions), making it one of the most significant barriers to advancing ASR technology.

Understanding the Evolution of Audio Technology and Market Dynamics

The Challenge of Perception in Audio Technology

Audio capabilities in software are often undervalued because people are more entertained by visual elements than by audio functionality.

Visual features like bounding boxes can captivate users, while accurate transcription and sentiment analysis are expected without much appreciation for their complexity.

Advancements in Transcription and Sentiment Analysis

Recent advancements have led to near-human accuracy in transcription, punctuation, sentiment analysis, automatic language detection, and translation.

After five challenging years, the market is beginning to recognize the value of these technologies as they approach human-level performance.

Expectations vs. Reality in Technological Adoption

Initial expectations were that breakthroughs would lead to rapid adoption; however, business dynamics require social proof and strategic marketing for success.

The transition from academic insights to practical business applications revealed complexities not initially anticipated by the founders.

Competitive Landscape and Market Strategy

Gaining traction requires convincing competitors' clients to switch over before larger companies will adopt new technologies.

Founders underestimated the importance of distribution strategies, sales approaches, and product packaging beyond just technical superiority.

Background Influences on Founding Team's Approach

The founding team comprised particle physicists with experience building sensitive detectors under unique conditions.

Their work involved real-time machine learning on complex waveforms similar to audio signals, which laid the groundwork for developing Deepgram’s technology.

Exploring the Origins of Deepgram

Innovative Audio Recording Techniques

The speaker reflects on the unique experience of working in an underground setting, likening it to a "James Bond lair." They describe creating devices for continuous audio recording, resulting in over a thousand hours of data.

Initially seeking existing solutions for identifying interesting moments within their extensive recordings, they explored various companies like Nuance and IBM Watson but found no suitable products.

Challenges with Existing Technologies

Conversations with speech experts at major tech firms revealed skepticism about the feasibility of end-to-end deep learning for audio understanding, particularly regarding language complexities.

This skepticism motivated the speaker to start their company seven years ago, as they believed there was a significant opportunity in developing effective audio analysis technology.

Validation Through Industry Developments

The emergence of models like OpenAI Whisper serves as validation for their long-term efforts in end-to-end deep learning applications in audio processing.

Target Markets and Customer Segments

Voice AI Companies

The speaker identifies three primary customer segments for Deepgram. The first is voice AI companies, which are typically young startups focused on integrating voice technology into their products.

These companies often begin by seeking transcription services or ways to analyze audio content before realizing the limitations and costs associated with open-source software or major cloud providers.

Transitioning to AI Compute Companies

As these voice AI companies scale and encounter challenges managing transcription infrastructure, they seek partnerships with specialized providers like Deepgram that can handle these needs efficiently.

Connectivity Companies

A second segment includes connectivity companies (10 to 25 years old), such as Twilio. These businesses view voice AI not as their core offering but as an avenue for market expansion and competitive advantage.

AI Automation in Enterprises and Open Source Models

The Role of AI in Customer Experience

Companies are seeking partners that provide platforms for building AI solutions, which makes such platform providers a strong choice for enterprises.

Enterprises, particularly older companies like banks and Comcast, aim to enhance customer service by removing human bottlenecks through AI automation.

On-premise solutions are essential for conservative enterprises that prioritize data security; flexibility between cloud and on-prem is crucial.

Whisper vs. Deepgram: Key Differences

Whisper represents a significant advancement in open-source models, improving aspects like capitalization and punctuation compared to previous models such as Wav2Vec and Kaldi.

While Whisper supports multiple languages, its accuracy diminishes beyond the top ten languages; it was trained on extensive public data using modern model architectures.

Despite its strengths, Whisper has limitations including slow performance, high operational costs, and lack of real-time capabilities which hinder certain applications.

Industry Impact of Whisper

Users may initially be impressed with Whisper's capabilities but will soon encounter limitations regarding language support and timing features necessary for specific applications.

The release of Whisper has sparked interest in speech recognition technology across the industry, prompting other companies to improve their offerings based on insights from Whisper's development.

Deepgram acknowledges learning from Whisper’s architecture while also integrating it into their services; this collaboration signifies a milestone in speech recognition advancements.

Understanding the Evolution of AI Models

The Lifecycle of AI Models

Awareness is created as users begin to build with initial models, leading to a demand for more sophisticated solutions over time.

The rapid obsolescence of models is highlighted; what works today may not be effective in six months or a year.

Companies like OpenAI have adopted a strategy of releasing initial versions but later restricting access due to concerns about misuse, as seen with GPT-2 and GPT-3.

Anticipation exists for future iterations of models like Whisper, which will likely follow similar release strategies while other companies may also develop competing technologies.

Translation Challenges in AI

Deepgram's approach to translation involves either sourcing from others or developing their own systems, emphasizing the importance of translation capabilities for users.

Two methods exist for translating audio: transcribing the spoken language and then translating the resulting text, which can compound errors, versus direct audio-to-audio translation that minimizes these issues.
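The compounding effect in the two-step approach can be illustrated with a rough calculation. The sketch below is not from the episode: it assumes each stage's errors are independent and uses made-up accuracy figures (0.90 for each stage) purely for illustration. The function name `pipeline_accuracy` is hypothetical.

```python
import math


def pipeline_accuracy(stage_accuracies):
    """Chance that every stage handles a segment correctly.

    Assumption (for illustration only): errors in each stage are
    independent, so per-stage accuracies simply multiply.
    """
    return math.prod(stage_accuracies)


# Illustrative numbers, not measurements of any real system.
asr_accuracy = 0.90  # speech-to-text stage
mt_accuracy = 0.90   # text-to-text translation stage

cascaded = pipeline_accuracy([asr_accuracy, mt_accuracy])
print(f"cascaded pipeline accuracy: {cascaded:.2f}")
```

Under these assumptions the cascaded pipeline is less accurate than either stage on its own, which is the motivation for models that skip the intermediate step.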

Advancements in Real-Time Translation

Whisper can translate speech in other languages directly into English text without first producing a transcript in the source language, which is intended to improve accuracy and efficiency.

While real-time translation is challenging due to structural differences between languages, advancements are being made towards achieving near-real-time capabilities.

Data Management and Model Training

The development of universal translators is underway as multimodal models become feasible; this progress relies on the availability of quality data rather than just model sophistication.

Managing low-resource languages presents challenges; acquiring high-quality data is essential for building effective models.

Building Effective Language Models

A combination of sourced data (both paid and free), along with real-world usage data collection and labeling, is necessary for creating robust language models.

The amount of training data required varies by language; foundational models trained on extensive datasets can support multiple languages effectively.

Understanding Fine-Tuning in AI Models

The Concept of Fine-Tuning

The process involves exposing a model to various languages and then allowing it to specialize in one, enhancing its proficiency through focused training.

This technique is often referred to as fine-tuning, transfer learning, or adaptation, where a base model is evolved using new data without needing extensive datasets.
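The idea of evolving a base model with a small amount of new data can be sketched with a toy example. This is not Deepgram's training code: a one-variable linear model stands in for a speech model, the "base" and "target" datasets are synthetic, and all names (`train`, `mse`, `base_data`, `target_data`) are hypothetical. It only shows the mechanic of continuing training from pretrained weights rather than starting from zero.

```python
def mse(w, b, data):
    """Mean squared error of the line y = w*x + b on (x, y) pairs."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def train(w, b, data, lr=0.1, epochs=20):
    """Full-batch gradient descent on MSE, starting from the given (w, b)."""
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b


# Large synthetic "base" dataset (stand-in for broad multilingual data): y = 2x
base_data = [(i / 50, 2 * (i / 50)) for i in range(50)]
# Small synthetic "target" dataset (stand-in for one low-resource language)
target_data = [(x, 2.2 * x + 0.1) for x in (0.1, 0.3, 0.5, 0.7, 0.9)]

# Pretrain on the large dataset, then fine-tune briefly on the small one.
base_w, base_b = train(0.0, 0.0, base_data, epochs=2000)
tuned = train(base_w, base_b, target_data, epochs=20)

# Baseline: same small dataset and step budget, but no pretrained start.
scratch = train(0.0, 0.0, target_data, epochs=20)

print("fine-tuned loss:  ", mse(*tuned, target_data))
print("from-scratch loss:", mse(*scratch, target_data))
```

With the same five examples and the same number of update steps, the model that starts from the pretrained weights ends with a lower loss on the target data than the one trained from scratch, because the base task is already close to the target task.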

The expectation is that advancements in these techniques will lead to significant improvements in underrepresented languages like Swiss-German.

Fundraising Landscape for AI Companies

In 2021, the company raised a Series B round; currently, AI sectors are performing well compared to other industries like B2B SaaS.

Building a successful AI company requires balancing aggressive growth with disciplined spending due to high operational costs associated with data labeling and research.

Investors expect AI companies to meet SaaS metrics while also managing higher expenses related to product performance.

Market Opportunities Amidst Economic Challenges

Companies excelling in customer satisfaction and innovative solutions attract investor confidence even during economic downturns.

There’s an increasing demand for automation and AI solutions as businesses seek cost-cutting measures while maintaining productivity.

Hiring Trends in the Tech Industry

Although the funding environment isn't exceptionally favorable for all companies, there are opportunities for growth within the AI sector amidst layoffs at larger tech firms.

Startups may find hiring easier now due to reduced competition from big tech companies.

Exciting Developments on the Product Roadmap

Innovations on the Horizon

Upcoming features include real-time translation capabilities which could revolutionize communication by making language barriers less significant.

Users will be able to experience their own speech in different languages through text-to-speech technology and voice cloning, enhancing user engagement.

Strategic Goals for Cost Reduction

Deepgram aims to significantly reduce prices while expanding language offerings and improving service reliability.

Lowering costs could potentially increase market size dramatically by making services more accessible.

Exciting Frontiers in Technology

Emerging Applications in Voice and Language Technologies

The discussion highlights the excitement surrounding new applications such as voice cloning, text-to-speech, real-time translation, and sentiment analysis. These advancements are expected to evolve significantly over the next year and a half.

There is a low tolerance among users for ineffective live translation tools. Previous experiences with these plugins often lead to frustration when they fail to function properly.

The conversation emphasizes that we are on the brink of achieving more reliable translation technologies that will enhance user experience without disrupting communication flow.

As technology improves, users may soon find these tools not only helpful but eventually indispensable in their daily interactions.

The speaker expresses optimism about the future developments in language processing technologies, indicating a transformative potential for how people communicate across languages.