# 158 How Game Localization is Trailblazing Speech Synthesis with Voiseed CEO

Uploaded: 2023-03-29T11:01:24.000Z
Duration: 1 h 18 min 30 s

Introduction to Voiseed and Andrea Ballista

Overview of Voiseed

The podcast introduces Andrea Ballista, CEO of Voiseed, a startup focused on AI-based virtual voice technology aimed at dubbing emotional content in multiple languages.

Andrea is based in Milano, Italy, and recently returned from San Francisco.

Elevator Pitch for Voiseed

Voiseed aims to create an engine that produces emotional virtual voices across various languages, allowing for diverse voice styles and expressions.

Andrea's Professional Journey

Background in Audio Dubbing

Andrea shares his journey starting from a passion for music and singing during his teenage years, leading him to explore computer music.

He founded his first company, Binari Sonori, in 1994. The company focused on game localization and grew significantly over the years.

Transition to Keywords

In 2014, following Keywords' IPO, Binari Sonori was acquired by Keywords. Andrea worked there as an Audio Director for four years before leaving to start Voiseed.

Insights from the Game Developers Conference (GDC)

Impressions of GDC

The GDC marked a return to normalcy post-COVID with high attendance and interest in new technologies related to game development.

Trends in Localization and Dubbing

There is growing interest in expressive voices following advancements in machine translation; many attendees are eager to learn about these developments.

Target Audience for Voiseed

Tech Innovations in Game Dubbing

Overview of Company Focus

The company emphasizes its identity as a tech innovator rather than a service provider, aiming to develop technology that enhances existing services in the industry.

Revoiceit is introduced as a specific product targeting language service providers for games, representing the company's core focus on creating new technology.

Introduction to Revoiceit Technology

Revoiceit is described as the first implementation of their innovative technology tailored for game dubbing, capable of understanding both voice and emotional profiles.

The platform supports multimodal input, allowing users to provide both spoken prompts and textual input for generating output voices.

Emotional Expression and Language Nuances

The system features autocasting, which generates "sound alike" voices that maintain emotional delivery while adapting to different languages.

Andrea acknowledges the challenge of conveying emotions across cultures and languages, noting that each language has its own ways of expressing them.

Research Foundations and Competitive Landscape

Extensive research into psychological emotions informs the development of vocal delivery profiles, enhancing the system's ability to understand human emotion through voice.

The concept of universal TTS (Text-to-Speech), redefined here as speech-to-speech systems, aims to emulate various voice profiles and emotional expressions across multiple languages.

User Interaction and Project Collaboration

Designed as a multi-user platform, Revoiceit allows collaboration among project teams where native speakers can approve translations and voice deliveries.

Similar to machine translation processes, it incorporates human-in-the-loop mechanisms where linguists refine machine-generated outputs based on cultural nuances.
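The human-in-the-loop step described above can be pictured as a small review record per line. The following is a minimal sketch only, not Revoiceit's actual data model or API: the names `DubbedLine` and `review`, the fields, and the sample text are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DubbedLine:
    """Hypothetical record for one machine-generated line awaiting review."""
    line_id: str
    language: str
    machine_text: str
    approved: bool = False
    final_text: str = ""
    history: list = field(default_factory=list)


def review(line, reviewer, revised_text=None):
    """Approve the machine output as-is, or replace it with the reviewer's revision."""
    was_revised = revised_text is not None
    line.final_text = revised_text if was_revised else line.machine_text
    line.approved = True
    # Keep a trail of who did what, so a project team can audit decisions.
    line.history.append((reviewer, "revised" if was_revised else "approved"))
    return line


line = DubbedLine("intro_001", "it", "Benvenuto, viaggiatore.")
review(line, reviewer="native_linguist", revised_text="Benvenuto, straniero.")
print(line.final_text, line.history)
```

The point of the sketch is that the machine output is never overwritten: the reviewer's decision is stored next to it, mirroring the post-editing pattern familiar from machine translation.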

Addressing Unvoiced Content in Gaming

The platform targets unvoiced projects within gaming content, aiming to add emotional depth and voice representation to stories that currently lack them.

Understanding the Role of Prompt Engineering in Voice Models

The Importance of User Engagement with Language Models

The discussion highlights the need for users to understand how to interact with large language models, particularly in adjusting voice pitch and tone.

Users must learn to prompt the system effectively to receive useful feedback, which can be fine-tuned based on their preferences.

Engaging with tools like ChatGPT helps users develop skills that enhance their ability to drive desired responses from AI systems.

Training and Integration of New Technologies

Tech companies are focusing on educating users about utilizing platforms effectively, creating new workflows tailored to diverse needs.

There is a growing trend towards integrating voice technology for both virtual characters (NPCs) and real actors in larger projects.

The challenge remains in balancing human involvement with machine capabilities, especially when rapid localization is required for game trailers.

Market Dynamics and Localization Challenges

The conversation shifts towards identifying ideal user scenarios within gaming, emphasizing mobile games where budget constraints limit voice integration.

Popularity can lead to increased investment in voice talent as games gain traction; however, many languages remain underrepresented due to cost issues.

Major markets such as China and the US dominate sales potential, but there are significant opportunities in Eastern European and Asian territories.

Voicing Unvoiced Content Across Industries

A vast number of games released annually lack proper voiceovers; identifying solutions for voicing unvoiced content is crucial for market success.

Expanding beyond gaming into media and entertainment presents opportunities for emotional engagement through adaptive voice technologies.

Future Directions: Media, Entertainment, and Beyond

Media and entertainment is seen as a natural next step for emotional voicing, since it involves complex narratives rather than simple lines.

Game development requires iterative processes that may utilize placeholder voices before final recordings with professional actors are made.

Text-to-Speech Systems and Emotional Scoring

The Role of Voice in Text-to-Speech

Emerging text-to-speech systems allow for the use of various voices, diminishing the importance of the original voice source.

Crafting emotional scores for segments is essential, akin to scoring music, requiring additional information that complicates the process.

Emotional Nuances in Speech

Efforts are being made to simplify emotional labeling by reducing the number of key emotions while still allowing the engine to interpret multiple nuances.
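One way to picture an emotional score built on a small set of key emotions is a list of segments, each tagged with one label and an intensity. This is a hedged sketch under stated assumptions: the episode does not name the actual emotion set or any scale, so the `KeyEmotion` labels, the 0.0 to 1.0 intensity range, and the names `ScoredSegment` and `emotional_score` are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class KeyEmotion(Enum):
    # Hypothetical reduced label set; the real set is not given in the episode.
    NEUTRAL = "neutral"
    JOY = "joy"
    ANGER = "anger"
    SADNESS = "sadness"
    FEAR = "fear"


@dataclass
class ScoredSegment:
    text: str
    emotion: KeyEmotion
    intensity: float  # assumed scale: 0.0 (flat) to 1.0 (extreme)

    def __post_init__(self):
        if not 0.0 <= self.intensity <= 1.0:
            raise ValueError("intensity must be between 0.0 and 1.0")


def emotional_score(segments):
    """Return (emotion, intensity) pairs in order, like reading a score bar by bar."""
    return [(s.emotion.value, s.intensity) for s in segments]


script = [
    ScoredSegment("Hi.", KeyEmotion.NEUTRAL, 0.2),
    ScoredSegment("You came back!", KeyEmotion.JOY, 0.8),
    ScoredSegment("Get out.", KeyEmotion.ANGER, 0.9),
]
print(emotional_score(script))
```

Keeping the label set small and pushing nuance into a continuous value is one possible reading of the trade-off described here: fewer labels for the person annotating, with the finer shading left to the engine.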

Different vocal emissions (e.g., whispering vs. shouting) can alter frequency ranges, yet our ears still perceive them as coming from the same voice.

Human Perception of Emotion

Humans can quickly detect changes in tone or expression over a phone call, showcasing our advanced ability to interpret emotional cues.

Conveying emotion through narration is complex; subjective interpretations vary among individuals and professionals involved in audio production.

Challenges in Audio Production

There are countless ways to express simple phrases like "hi," leading to challenges in achieving an acceptable performance within limited studio time.

Testing these systems is complicated due to the vast number of inputs and outputs generated, necessitating expertise from language service providers.

Implications of Large Language Models on Startups

Navigating Technological Advancements

The rise of large language models like ChatGPT presents both opportunities and challenges for startups at different stages of development.

Founders must assess whether to pivot their strategies based on new technologies while considering existing roadmaps and investments.

Building Generative Systems

The startup began developing generative systems before ChatGPT's emergence, focusing on creating expressive voice content across languages.

Their approach prioritizes solving specific problems rather than merely integrating existing technologies like Tacotron.

Funding and Development Challenges

Startup Journey and Funding Insights

Initial Challenges and Remote Operations

The startup began its journey in March 2020, facing significant challenges due to the COVID-19 lockdown in Italy.

They applied for the Berkeley SkyDeck JP program remotely, which provided valuable learning experiences despite the difficult start.

The team pursued funding through Horizon Europe, which offered grants and equity support; they successfully won a grant in October 2021.

Equity Funding and Team Development

The first equity round closed at the end of January, following their initial grant; this involved thorough due diligence from investors.

Hiring during lockdown was challenging; they focused on recruiting talent from local institutions like Politecnico in Milan.

Emphasis was placed on training new hires due to the specific skill set required for their technology, highlighting investment in human resources as crucial.

Hiring Strategy and Remote Work Adaptation

The startup seeks smart individuals willing to embrace challenges, recognizing that they are still defining their processes as a growing company.

While most team members are based locally, there is an increasing interest in remote hiring to adapt to changing work environments.

Product Development and Market Focus

Revoiceit is positioned as a B2B product with ongoing improvements aimed at emotional control and voice creation for various projects.

Their focus includes game localization and media/entertainment sectors while maintaining high emotional content quality.

Customer Engagement and Future Plans

The current model does not offer standard SaaS subscriptions, but it lets customers test the technology with a low barrier to entry.

Chat Summary with Andrea and Florian

Closing Thoughts on the Discussion

The conversation concludes with a sense of enthusiasm, as one participant expresses that their mind is "on fire" from the fascinating discussion.

Both participants acknowledge how engaging the chat was, pointing to its interesting content and stimulating ideas.

Gratitude is expressed towards Andrea for his contributions to the dialogue, indicating a positive interaction.

Florian reciprocates appreciation, emphasizing the pleasure derived from the conversation.