#77 Synthesia CEO Victor Riparbelli on Personalized Video in 40+ Languages
Translation as a Subscription: Future or Limitation?
Introduction to AI Voice Generation
- Discussion of the potential of AI systems learning from voice clips, and the current limits of their language coverage (e.g., Mandarin).
- Introduction of Victor Riparbelli, Co-founder and CEO of Synthesia.io, emphasizing the ability to generate professional-looking AI videos quickly.
Poll Results on Translation Subscriptions
- Overview of a poll on translation as a subscription model; only 12% of respondents believe it is definitely the future.
- Breakdown of poll results: about one-third see the model as potentially promising, while another third view it as suitable only for limited use cases.
Industry Insights and Challenges
- Discussion of why enthusiasm for subscription models is lower than one might expect; challenges include variability in buyer needs.
- Explanation of what a subscription-based translation model entails, including metrics like word count or document count.
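To make the metering idea concrete, here is a minimal sketch of how a word-count-based subscription allowance might be tracked. The plan name, allowance, and overage rate are invented for illustration, not taken from any provider's actual pricing.

```python
from dataclasses import dataclass

# Hypothetical sketch of metering a word-count-based translation subscription.
# Tier name, allowance, and overage rate are made-up illustration values.

@dataclass
class SubscriptionPlan:
    name: str
    monthly_word_allowance: int
    overage_rate_per_word: float  # charged per word beyond the allowance

@dataclass
class UsageMeter:
    plan: SubscriptionPlan
    words_used: int = 0

    def record_job(self, word_count: int) -> None:
        self.words_used += word_count

    def overage_words(self) -> int:
        return max(0, self.words_used - self.plan.monthly_word_allowance)

    def monthly_overage_charge(self) -> float:
        return self.overage_words() * self.plan.overage_rate_per_word

plan = SubscriptionPlan("Starter", monthly_word_allowance=50_000, overage_rate_per_word=0.08)
meter = UsageMeter(plan)
meter.record_job(30_000)
meter.record_job(25_000)
print(meter.overage_words())           # → 5000
print(meter.monthly_overage_charge())  # overage charge for the month
```

The same structure would work with document count or minutes of video as the metered unit; the hard part, as the discussion notes, is choosing a volume metric clients find meaningful.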
Current Trends in Language Service Providers (LSPs)
- Increasing number of LSPs offering subscriptions but not as their primary model; often part of tiered pricing plans.
- Challenges faced by LSPs transitioning from traditional pricing models to subscriptions due to established practices and client expectations.
Considerations for Implementing Subscription Models
- Importance of defining meaningful volume metrics for subscription services; difficulty in shifting from long-standing word-based models.
- Potential advantages for LSPs with in-house linguists who can better manage costs and streamline processes within subscription frameworks.
Subscription Models in Localization Services
Benefits and Challenges of Subscription Models
- The discussion highlights the simplicity and predictability of subscription models for localization services, as mentioned in the Pricing and Procurement Report.
- A commentator from Welocalize noted that subscription models promote long-term collaboration and value generation, allowing for better retention of critical talent while incentivizing innovation in localization workflows.
- A key issue identified is the fluctuation in project volumes, which creates hesitance among clients to commit to subscriptions due to concerns about underutilization or exceeding monthly allowances.
- Buyers often misjudge their volume needs in either direction; an annual perspective can give a clearer picture of predictable volume flows.
- If LSPs can analyze client volume trends effectively, they could pool fluctuations across accounts to create more stable pricing structures.
Valuation Implications for LSPs
- The speaker suggests that understanding subscription valuation could benefit LSPs, as tech companies are typically valued on revenue multiples while service companies are valued on earnings multiples.
- Minimum fees associated with outsourcing can complicate discussions around subscriptions, particularly when dealing with small projects that freelancers may not find worthwhile.
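The valuation point can be illustrated with simple arithmetic. The figures and multiples below are made-up examples, not actual market data, but they show why a subscription (recurring-revenue) story can change what a company is worth.

```python
# Illustrative arithmetic with made-up figures: the same company valued
# on a revenue multiple (typical for tech/SaaS) versus an earnings
# multiple (typical for services).

revenue = 10_000_000          # annual revenue
margin = 0.15                 # net margin
earnings = revenue * margin   # 1.5M in earnings

revenue_multiple = 5          # assumed tech-style multiple
earnings_multiple = 10        # assumed services-style multiple

tech_style_valuation = revenue * revenue_multiple
services_style_valuation = earnings * earnings_multiple

print(tech_style_valuation)      # valuation on a revenue basis
print(services_style_valuation)  # valuation on an earnings basis
```

Under these assumed numbers the revenue-based valuation is several times the earnings-based one, which is the incentive the speaker alludes to for LSPs that can credibly present subscription revenue.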
Bill Ackman's Investment in KUDO
Overview of Bill Ackman and KUDO's Use Case
- Bill Ackman, a billionaire investor known for his activist investment strategies, recently tweeted about KUDO after using it for an investor presentation regarding Universal Music.
- Ackman previously shorted Herbalife stock over its controversial business model but has since shifted toward growth-stage funding of technology startups like KUDO, which specializes in remote simultaneous interpreting.
Impact of Technology on Language Services
- The use of KUDO allowed Ackman’s team to conduct an investor presentation in 11 languages, showcasing how technology expands opportunities within language services beyond traditional settings.
- This innovative approach illustrates how advancements enable multilingual meetings that were previously impractical outside large international events like FIFA congresses.
Discussion on Multilingual Technologies and Innovations
The Role of Technology in Investor Communication
- Companies are leveraging technology to enhance communication with their investor communities, particularly in regions like China, Hong Kong, and Korea. This is crucial as language fluency varies among these groups.
RWS and Cedat85 Partnership
- RWS has partnered with Cedat85 to develop a live subtitling and captioning solution for online meetings, which transcribes live speech into text.
- The innovative aspect of this solution is its application of neural machine translation to convert the transcribed text into over 130 languages.
Functionality of Live Captioning Solutions
- The integration allows captions to appear in real-time during virtual meetings, tailored to each attendee's language preference across various platforms like Zoom or Google Hangouts.
- There are significant limitations in existing technologies; many claim multilingual capabilities but often only provide original language captions without effective translation.
Competitors in the Market
- Key competitors include AI Media, Red Bee, and Verbit, who also offer translated live captions but face challenges regarding accuracy and functionality.
- Verbit has recently gained traction by acquiring VITAC, indicating a competitive landscape focused on improving live captioning services.
Challenges in Automatic Speech Recognition
- Current automated captioning solutions often produce incoherent translations even for widely spoken languages. This highlights ongoing challenges within the industry regarding quality assurance.
Swiss German Dialect Recognition
- A recent academic conference featured a competition focused on recognizing Swiss German dialect and translating it into standard German text. Microsoft’s team won this task.
- The discussion reflects personal insights from the speaker about Swiss German being their native dialect, emphasizing the importance of accurate dialect recognition technology.
Understanding Swiss German Dialects and AI Translation
The Complexity of Swiss German Dialects
- How well Germans understand Swiss German varies significantly by region: speakers from southern Germany follow it far more easily than those from the north, with a notable comprehension barrier around Dortmund.
- Swiss German itself comprises 10 to 20 distinct sub-dialects that can differ greatly, so a translation system trained on one dialect (e.g., Zurich) may not work effectively on another (e.g., Bern).
Challenges in Translating Swiss German
- A project aimed to develop a system for translating various Swiss German dialects into standard German using a labeled dataset of 293 hours from the Bern dialect and an unlabeled dataset of 1,200 hours primarily from Zurich.
- The low-resource nature of this task is highlighted by the scarcity of written texts in Swiss German, as most communication occurs informally through texting without standardized rules.
Commercial Applications and Use Cases
- There is significant commercial potential in translating spoken Swiss German into standard High German for media consumption, since many viewers prefer not to read subtitles written in their local dialect.
- The need for translated subtitles arises because reading Swiss German is mentally taxing compared to standard High German, creating a market opportunity for businesses.
Insights on Language Research
- The speaker expresses amusement at Microsoft's research focus on translating his own dialect into standardized forms, indicating personal relevance and interest in the topic.
Exploring AI Video Generation with Synthesia
Introduction to Synthesia
- Victor Riparbelli introduces himself as the CEO and Co-founder of Synthesia during the podcast discussion about AI video generation technology.
Overview of Synthesia's Mission
- Synthesia aims to operate the world's first large-scale platform for AI video generation that transforms text and assets into engaging videos.
Shifting Information Consumption Trends
- Emphasizing that most information today exists only as text, Riparbelli argues that video is becoming increasingly preferred over traditional textual formats among consumers.
Video Dominance in Modern Communication
- He notes a shift towards video-first interactions across all demographics, including enterprise environments where employees also favor video communication tools like Zoom over static text-based methods.
Evolution of Social Media Platforms
- Riparbelli highlights TikTok's rise as a purely video-centric platform compared to earlier social networks focused on text or images, illustrating the trend toward visual content consumption.
Video Production Revolutionized by AI
The Challenge of Traditional Video Production
- Producing video content at scale is challenging due to the need for physical resources like cameras, studios, and film crews.
- Demand for video content has outpaced production capabilities, prompting the development of an AI-driven platform that simplifies this process.
Simplifying Video Creation with AI
- Users can create videos easily using just a laptop and internet connection; they log into the platform, select an AI avatar, and type their script.
- The platform supports 50 languages and generates videos in minutes, making it accessible for various users.
Enhancing Content with Video
- The service does not replace traditional video production but transforms existing text materials into engaging video formats.
- It democratizes advanced visual effects technology, making it available to anyone at a starting price of $30 per month.
Future Possibilities in Media Production
- By abstracting video production as software, new possibilities arise such as personalized videos tailored to individual viewers based on data points.
- This shift mirrors trends seen in other media forms where technology has enabled high-quality creation from personal devices.
Personal Journey and Vision
- The speaker's vision includes creating Hollywood-level films on laptops within 10 to 15 years, reflecting a significant evolution in media production methods.
- The transition from traditional tools (cameras/microphones) to computer-based solutions represents a natural progression in media technology.
Background of the Innovator
- The speaker's interest in sci-fi and computers led them to explore tech careers early on by building webshops for local businesses.
- Their journey included studying at Stanford University and meeting influential figures in deep tech research which inspired their current venture.
A Glimpse into the Future of Content Creation
Founding Vision and Team Structure
- The speaker perceives a transformative shift in content creation technology, recognizing both its potential and associated challenges. This realization led to the formation of a strong founding team composed of academics and business professionals.
- The mission is to simplify video content creation for everyone, initiated nearly five years ago. The co-founders include notable figures from academia and industry.
Key Co-Founders
- Professor Matthias Niessner: A leading researcher in deep learning methods for media content generation, previously at Stanford and now at TUM in Munich. His work focuses on digital humans and environments.
- Professor Lourdes Agapito: Based at UCL in London, she has pioneered the integration of computer vision with deep learning technologies. Her contributions are significant in advancing this field.
- Steffen Tjerrild: Responsible for the business and finance side of the team, complementing the academic expertise with practical business acumen.
- Jonathan Starck (CTO): Although not a co-founder, he joined early on and has experience creating technology used extensively in Hollywood visual effects production, bridging high-end visual arts with academic research.
Technology Overview
Proprietary Tech Stack
- The company operates as a deep tech entity with approximately 45% of its staff holding PhDs or professorships focused on advanced research problems in their field. They aim to lead even within academic circles through innovative solutions.
Challenges in Product Development
- Transitioning from theoretical demonstrations to a scalable product takes significant effort, and consistent results are crucial for market viability. Thousands of videos are produced daily for clients using proprietary technology that balances quality and cost-effectiveness.
Compute Intensity Considerations
- The process is compute-intensive, but advancements have made it manageable enough to offer services starting at $30 per month for 10 minutes of generated video, a substantial reduction from previous costs while maintaining quality standards.
Avatar Technology Insights
Types of Avatars
- Two main types of avatars are being developed: real-person avatars created from user-submitted video footage, and fully artificial avatars generated in code. They serve different purposes but share common technological foundations.
Future Digital Representations
- In the next five to ten years, individuals will likely possess multiple digital representations (avatars) that facilitate communication much as social media profiles do today. This evolution underscores video's emergence as a dominant medium over static text formats like email or LinkedIn profiles.
Deep Learning and Synthetic Media Technology
Recording Process and Data Quality
- The recording process for creating synthetic videos can be done using standard devices like an iPhone, but quality data is crucial. A green screen studio with good lighting and a decent camera is recommended for optimal results.
- After recording, the data is sent to machines that analyze it to simulate new videos of the individual speaking any text inputted. This forms the core of their technology.
Synthetic Humans and Voice Generation
- The concept of "super synthetic humans" involves creating characters not based on real individuals but generated entirely through technology, similar to character customization in video games.
- Presenters on the platform are real people who earn royalties each time their likeness is used in a generated video, akin to stock photo actors earning from downloads.
Real People vs. Artificial Avatars
- There’s a discussion about the necessity of using real people instead of artificial avatars; realism plays a significant role in client preferences.
- Creating realistic synthetic humans remains challenging, with many clients preferring recognizable figures for branding or educational purposes.
Evolution of Text-to-Speech Technology
- Over the past decade, deep learning has moved voice synthesis from robotic-sounding output to far more natural "neural voices," greatly improving overall audio quality.
- Previously, voice generation involved concatenating recorded words which lacked natural intonations; now systems understand sentence structure better due to advancements in AI.
Voice Cloning Advancements
- Recent developments in "voice cloning" allow more diverse voice options with far less data than before, when extensive studio recordings were required.
- Current systems can produce a high-quality voice from as little as 15 minutes of audio, marking a shift toward scalable voice production.
Emotional Expression in TTS Voices
- In recent months, there has been progress in incorporating emotional nuances into text-to-speech (TTS), moving beyond basic vocalization towards more expressive outputs.
Exploring the Future of Voice Technology
Advancements in Voice Synthesis
- The evolution of voice technology allows for emotional expression, enabling voiceovers to convey different moods such as happiness or sadness, which is reflected in video content.
- Companies are beginning to create multilingual versions of voices, allowing an English speaker's voice to be replicated in Mandarin, although this presents significant challenges.
Challenges in Multilingual Voice Replication
- Creating synthetic voices tailored for specific brands involves adjusting parameters like pitch and speed, highlighting the need for unique voice profiles.
- The importance of style in speech delivery is emphasized; mismatched styles (e.g., a sales pitch sounding like an audiobook) can lead to ineffective communication.
Technical Difficulties with Neural Networks
- Teaching neural networks to replicate human voices across languages is complex due to the lack of existing data for certain language combinations.
- Generating a new version of a voice that speaks a different language requires disentangling various vocal signals, which is technically demanding and data-intensive.
User Experience and Accessibility
- As synthetic media evolves, understanding how users will interact with these technologies becomes crucial. There’s a focus on creating user-friendly experiences that cater to non-experts.
- Innovations include systems that analyze text emotions, determining whether sentences should sound happy or sad based on context.
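A toy sketch of the idea in that last bullet: choosing a speaking style per sentence from its wording. The word lists and style labels below are invented for illustration; production systems use trained classifiers rather than keyword lookups.

```python
# Toy sketch: pick a TTS speaking style from a sentence's wording.
# Word lists and style labels are invented; real systems would use a
# trained sentiment/emotion classifier, not keyword matching.

POSITIVE = {"great", "happy", "delighted", "congratulations", "welcome"}
NEGATIVE = {"sorry", "unfortunately", "sad", "regret", "problem"}

def speaking_style(sentence: str) -> str:
    words = {w.strip(".,!?").lower() for w in sentence.split()}
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "cheerful"
    if score < 0:
        return "empathetic"
    return "neutral"

print(speaking_style("Congratulations, we are delighted to welcome you!"))  # cheerful
print(speaking_style("Unfortunately, we regret the delay."))                # empathetic
print(speaking_style("The meeting starts at noon."))                        # neutral
```

Even this crude mapping shows why context matters: the same voice should not deliver an apology and a product launch in the same tone.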
Democratization of Video Creation
- The goal is to make video creation accessible for everyone, not just professionals. Testing usability with diverse age groups ensures broader market appeal.
- The comparison between writing and video production illustrates how video will become an integral part of daily communication without labeling individuals as "video producers."
Future Communication Trends
- Just as writing has become commonplace in professional settings over the past few decades, video communication is expected to follow suit as a superior method compared to traditional email exchanges.
- Building intuitive user interfaces remains essential as technology advances, ensuring that all users can effectively engage with new tools regardless of their technical background.
Understanding the Expansion of Video Technology in B2B
Addressable Market and Adoption
- The speaker highlights the ease of use of their video technology compared to traditional software like Premiere or Final Cut Pro, which broadens their market appeal.
- Rapid growth is noted since the launch of their self-service platform, indicating a significant adoption curve among early adopters in the B2B sector.
Differentiation from Traditional Video Production
- The company emphasizes that they are not replacing high-end video production but rather addressing a different need within organizations for scalable video content.
- They target large companies with global operations that require effective communication and training methods, recognizing that video is more impactful than text-based formats.
Enhancing Learning and Communication
- The focus is on transforming traditional learning materials into engaging video content, particularly for employees who may struggle with written language.
- Videos are presented as a more accessible alternative to lengthy documents, especially for non-native English speakers or blue-collar workers.
Internal and External Use Cases
- Current usage trends show internal applications for training; however, there’s potential for external customer engagement through educational videos about products or services.
- The technology allows companies to create numerous videos quickly, keeping content up-to-date without the delays associated with traditional filming.
Future Developments and Market Perception
- An upcoming API platform will enable users to generate personalized content easily by integrating with existing marketing tools without coding knowledge.
- There’s an acknowledgment that while synthetic media might seem experimental to some, user experience remains paramount over production quality.
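To illustrate what API-driven personalized video might look like, here is a hypothetical sketch that merges per-recipient data points into a generation request. The endpoint shape, template id, and field names are all assumptions for illustration; they do not describe Synthesia's actual API.

```python
import json

# Hypothetical sketch of building personalized video-generation requests.
# Template id and field names are invented; no real API is being called.

def build_video_request(template_id: str, recipient: dict) -> dict:
    """Merge one recipient's data points into a generation payload."""
    return {
        "template": template_id,
        "variables": {
            "first_name": recipient["first_name"],
            "company": recipient["company"],
        },
        "output": {"format": "mp4", "language": recipient.get("language", "en")},
    }

recipients = [
    {"first_name": "Ada", "company": "Example Corp", "language": "de"},
    {"first_name": "Lin", "company": "Acme GmbH"},
]

payloads = [build_video_request("welcome-video-v1", r) for r in recipients]
print(json.dumps(payloads[0], indent=2))
```

The point of the sketch is the workflow: a marketing tool iterates over a contact list and emits one request per viewer, which is what makes per-viewer personalization feasible without any manual editing.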
Market Dynamics in Media Localization
- The speaker discusses interactions with dubbing studios and media localization companies, hinting at varied responses ranging from interest to apprehension regarding competition.
- They express a desire for multilingual customer interaction videos, highlighting how traditional methods would be impractical compared to their innovative solutions.
Video Generation vs. Video Editing
Understanding the Shift from Video Editing to Video Generation
- The speaker references a video of one person speaking nine languages and introduces video editing: technology for translating existing video footage.
- Initial market success was achieved with video editing for advertisements, allowing AI to translate videos into multiple languages efficiently.
- Challenges arise when working with existing footage, particularly in Hollywood films where rapid cuts and complex scenes hinder AI performance.
Market Dynamics and Technology Limitations
- The current limitations of AI technology make it difficult to scale video translation, especially for dynamic scenes like action sequences.
- Despite interest from dubbing companies, the speaker's team chose not to pursue this market extensively due to its complexity and the need for high-quality visual effects.
Future of Content Creation
- Competing with established artists in Hollywood presents challenges; they prioritize pixel-perfect quality over adopting new technologies quickly.
- The future of filmmaking may diverge from traditional Hollywood styles, potentially leading to innovative forms created by younger generations using new tools.
Analogies with Music Evolution
- The speaker compares the emergence of synthesizers in music to potential developments in video creation technology, suggesting that new genres will emerge rather than simply replicating existing formats.
- Just as synthesizers led to electronic music genres like techno, advancements in video generation could create entirely new storytelling methods.
Collaboration with Translation Partners
- Localization and translation are seen as essential partnerships; current technology cannot fully automate these processes yet.
- Trust issues exist regarding automatically translated content among large corporations; human touch is crucial for maintaining brand voice during translations.
Building Partnerships within the Industry
- Translation companies have approached the speaker’s team after witnessing their work, recognizing an opportunity to expand translation markets significantly through collaboration.
Collaboration and Market Expansion in Translation Services
Synergies with Clients
- The collaboration with clients has been organic, with both parties benefiting from their distinct skill sets.
- Many clients already have existing translation agencies, allowing for integration into their value chain and enhancing service offerings.
Marketing Opportunities
- The podcast encourages listeners in the language translation space to leverage innovative tools to expand market reach.
- Presenting new tools to marketing heads can significantly enhance sales strategies and open up new avenues for business growth.
Hiring Challenges in Machine Learning
Talent Acquisition
- The company is rapidly expanding and actively seeking talent across various roles, including machine learning, web engineering, customer success, and sales.
- The competitive landscape for machine learning talent is driven by rapid technological advancements outpacing educational output.
Industry Competition
- Major tech companies offer substantial salaries to attract top-tier machine learning researchers, making it challenging for smaller firms to compete.
- Candidates often prefer roles that allow them to explore cutting-edge research rather than just software production.
Funding Journey of Synthesia
Initial Funding Experience
- Synthesia's first funding round was in 2017 with Mark Cuban after a long struggle to find investors who understood their vision.
- Raising funds is crucial for technology development due to lengthy R&D cycles; thus, finding the right investors is essential.
Investor Relationships
- Building relationships with visionary investors who appreciate bold ideas has been key; early skepticism about deepfake technology posed challenges.
- A serendipitous email exchange led to significant initial investment from Mark Cuban, marking a turning point for the company.
The Challenges of Innovating in Synthetic Media
Understanding Market Perception and Investor Relations
- The speaker discusses the ease for investors to analyze traditional markets, like accounting software, compared to innovative fields such as synthetic media, where they must envision a future where cameras are replaced by code.
- Emphasizing the importance of having a clear vision and initial technology, the speaker contrasts early fundraising challenges with their current success as a proven high-growth business.
- Relationships with investors are crucial; shared values and long-term collaboration (5-10 years) are highlighted as essential for successful partnerships.
- Misalignment between investor expectations and company vision can lead to negative outcomes over time. It's vital that both parties share similar goals.
- Early-stage investors often struggle to grasp complex visions in emerging tech sectors. The speaker reflects on how difficult it was to convey their innovative ideas effectively.
The Evolving Landscape of Venture Capital Understanding
- As of 2021, VC understanding of the components of SaaS and AI/ML businesses has improved over previous years, but significant gaps remain.
- Two primary approaches exist for starting companies: solving existing problems or presenting new visions. The latter is more challenging due to less data backing the concept.
- Building something entirely new requires strong belief in one's vision since it relies heavily on team capabilities rather than established market data.
Overcoming Skepticism in Niche Markets
- The speaker notes that their project was abstract and niche, making it hard for outsiders to appreciate its complexity and potential impact initially.
- They emphasize that although some competitors have been around longer, their own quality remains superior thanks to extensive experience and dedication over time.
- Despite media portrayals suggesting easy access to technology through open-source solutions, the reality involves significant effort and expertise that isn't widely understood outside the field.
Industry Comparisons: Self-driving Cars vs. Synthetic Media
- Comparing self-driving cars with synthetic media highlights how some ideas seem inherently valuable (like autonomous vehicles), while others require more convincing despite being equally promising.
- Initial pitches for synthetic media faced skepticism regarding scalability and feasibility; many questioned whether audiences would accept synthetic videos or if regulatory issues would arise.
Conclusion: Navigating Complexity in Innovation
The discussion illustrates the multifaceted challenges innovators face when introducing groundbreaking technologies like synthetic media. It underscores the necessity of aligning investor relationships with long-term visions while navigating a landscape filled with skepticism about novel concepts.
Deepfake Technology and Its Implications
Overview of Deepfake Technology
- The speaker discusses the potential benefits of deepfake technology, noting its growing traction and market interest.
- The term "deepfake" has evolved, often associated with negative connotations related to AI-generated content.
Negative Perceptions and Misuse
- Media narratives focus on the dangers of deepfakes, such as fake news and non-consensual pornography, which are valid concerns.
- Despite these negative uses, the majority of deepfake applications are benign and create significant value in business contexts.
Ethical Considerations in Technology Use
- Synthesia prioritizes ethical use by ensuring that their technology is not misused; they verify users and require consent for content creation.
- A critical question remains: how can we mitigate harmful uses of deepfake technology beyond individual company practices?
Solutions for Mitigating Harmful Uses
- Development of deepfake detectors aims to identify manipulated videos; however, their effectiveness is uncertain as synthetic content becomes more prevalent.
- Media provenance is proposed as a solution to track the origin of digital content, enhancing accountability.
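The provenance idea can be sketched in a few lines: a publisher signs a hash of the content so that later manipulation is detectable. This HMAC-based toy version is a simplification for illustration only; real provenance systems such as the C2PA standard embed signed manifests in the media file and use public-key signatures rather than a shared secret.

```python
import hashlib
import hmac

# Simplified sketch of media provenance: sign a content hash so that any
# later manipulation invalidates the signature. A shared secret stands in
# for the publisher's real private key.

SECRET_KEY = b"publisher-signing-key"  # illustration only

def sign_content(content: bytes) -> str:
    digest = hashlib.sha256(content).digest()
    return hmac.new(SECRET_KEY, digest, hashlib.sha256).hexdigest()

def verify_content(content: bytes, signature: str) -> bool:
    return hmac.compare_digest(sign_content(content), signature)

original = b"frame-bytes-of-the-original-video"
sig = sign_content(original)

print(verify_content(original, sig))                # True: content untouched
print(verify_content(original + b"tampered", sig))  # False: manipulation detected
```

This is the accountability mechanism the bullet describes: rather than trying to detect fakes after the fact, authentic content carries verifiable evidence of its origin.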
Future Directions in Content Authenticity
- Collaboration with Adobe on the Content Authenticity Initiative seeks to establish a system similar to Shazam for video identification.
- Current systems exist for music copyright detection on platforms like YouTube; similar mechanisms could be applied to video content.
Education and Public Awareness
- Educating the public about synthetic media is essential; exposure to such content will help people discern reality from manipulation.
- Campaign examples demonstrate how personalized messages using AI can raise awareness about synthetic media's capabilities.
Looking Ahead: Synthesia's Vision
- The speaker outlines future plans for Synthesia focused on improving AI algorithms and enhancing avatar technology.
Product Roadmap and Vision for Synthetic Media
Overview of the Product Development
- The platform aims to integrate emotional elements, gestures, and multiple participants in videos, enhancing user engagement.
- The goal is to create a simplified cloud-based video editor akin to Adobe Premiere, focusing on personalized content generation.
- Initial focus will be on business communications through video content and chatbots, with future plans for storytelling-oriented entertainment.
Managing Competing Ideas in Innovation
- The speaker emphasizes the importance of maintaining a clear long-term vision amidst numerous innovative ideas that arise daily.
- A structured approach is necessary; decisions should align with the overarching goals while addressing immediate user needs.
User-Centric Feature Development
- Features are driven by user requests; however, many requested features may seem mundane but significantly enhance workflow efficiency.
- There’s also a need to innovate beyond current user expectations by introducing capabilities they cannot yet envision due to technological limitations.
Strategic Planning and Execution
- A high-level plan has been established since the company's inception, guiding development through various milestones despite minor adjustments along the way.
- The company has successfully executed its strategy without major pivots, indicating effective management of innovation and direction.