# 158 How Game Localization is Trailblazing Speech Synthesis with Voiseed CEO

Introduction to Voiseed and Andrea Ballista

Overview of Voiseed

  • The podcast introduces Andrea Ballista, CEO of Voiseed, a startup focused on AI-based virtual voice technology aimed at dubbing emotional content in multiple languages.
  • Andrea is based in Milan, Italy, and recently returned from San Francisco.

Elevator Pitch for Voiseed

  • Voiseed aims to create an engine that produces emotional virtual voices across various languages, allowing for diverse voice styles and expressions.

Andrea's Professional Journey

Background in Audio Dubbing

  • Andrea shares his journey starting from a passion for music and singing during his teenage years, leading him to explore computer music.
  • He founded his first company, Binari Sonori, in 1994. It focused on game localization and grew significantly over the years.

Transition to Keywords

  • In 2014, Binari Sonori was acquired by Keywords Studios following the latter's IPO. Andrea worked there as an Audio Director for four years before leaving to start Voiseed.

Insights from the Game Developers Conference (GDC)

Impressions of GDC

  • The GDC marked a return to normalcy post-COVID with high attendance and interest in new technologies related to game development.

Trends in Localization and Dubbing

  • There is growing interest in expressive voices following advancements in machine translation; many attendees are eager to learn about these developments.

Target Audience for Voiseed

Tech Innovations in Game Dubbing

Overview of Company Focus

  • The company emphasizes its identity as a tech innovator rather than a service provider, aiming to develop technology that enhances existing services in the industry.
  • Revoiceit is introduced as a specific product targeting language service providers for games, representing the company's core focus on creating new technology.

Introduction to Revoiceit Technology

  • Revoiceit is described as the first implementation of their innovative technology tailored for game dubbing, capable of understanding both voice and emotional profiles.
  • The platform supports multimodal input, allowing users to provide both spoken prompts and textual input for generating output voices.

Emotional Expression and Language Nuances

  • The system features autocasting, which generates "sound alike" voices that maintain emotional delivery while adapting to different languages.
  • Andrea acknowledges the challenge of conveying emotions across cultures and languages, noting that each language has its own ways of expressing them.

Research Foundations and Competitive Landscape

  • Extensive research into psychological emotions informs the development of vocal delivery profiles, enhancing the system's ability to understand human emotion through voice.
  • The concept of universal TTS (Text-to-Speech), redefined here as speech-to-speech systems, aims to emulate various voice profiles and emotional expressions across multiple languages.

User Interaction and Project Collaboration

  • Designed as a multi-user platform, Revoiceit allows collaboration among project teams where native speakers can approve translations and voice deliveries.
  • Similar to machine translation processes, it incorporates human-in-the-loop mechanisms where linguists refine machine-generated outputs based on cultural nuances.
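
The human-in-the-loop review described above can be pictured as a small status workflow. The following is only a minimal sketch under stated assumptions: the names `DubbedLine`, `ReviewStatus`, `review`, and `pending` are hypothetical and do not come from Revoiceit, whose actual data model is not described in the episode.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class ReviewStatus(Enum):
    """Hypothetical review states for a machine-generated line."""
    GENERATED = "generated"            # produced by the engine, not yet reviewed
    APPROVED = "approved"              # accepted by a native-speaker reviewer
    NEEDS_REVISION = "needs_revision"  # sent back with a reviewer note


@dataclass
class DubbedLine:
    """One generated line in a target language (illustrative structure only)."""
    line_id: str
    language: str
    text: str
    status: ReviewStatus = ReviewStatus.GENERATED
    note: str = ""


def review(line: DubbedLine, approved: bool, note: str = "") -> DubbedLine:
    """Record a reviewer's decision on a generated line."""
    line.status = ReviewStatus.APPROVED if approved else ReviewStatus.NEEDS_REVISION
    line.note = note
    return line


def pending(lines: List[DubbedLine]) -> List[str]:
    """Return the ids of lines that still need human attention."""
    return [l.line_id for l in lines if l.status is not ReviewStatus.APPROVED]


if __name__ == "__main__":
    batch = [DubbedLine("L1", "it", "Ciao"), DubbedLine("L2", "it", "Corri!")]
    review(batch[0], approved=True)
    review(batch[1], approved=False, note="delivery too calm for the scene")
    print(pending(batch))
```

In this sketch a line only leaves the pending list once a reviewer approves it, which mirrors the post-editing loop familiar from machine translation.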

Addressing Unvoiced Content in Gaming

  • The platform targets unvoiced projects within gaming content, aiming to add emotional depth and voice representation to stories that currently lack them.

Understanding the Role of Prompt Engineering in Voice Models

The Importance of User Engagement with Language Models

  • The discussion highlights the need for users to understand how to interact with large language models, particularly in adjusting voice pitch and tone.
  • Users must learn to prompt the system effectively to receive useful feedback, which can be fine-tuned based on their preferences.
  • Engaging with tools like ChatGPT helps users develop skills that enhance their ability to drive desired responses from AI systems.

Training and Integration of New Technologies

  • Tech companies are focusing on educating users about utilizing platforms effectively, creating new workflows tailored to diverse needs.
  • There is a growing trend towards integrating voice technology for both virtual characters (NPCs) and real actors in larger projects.
  • The challenge remains in balancing human involvement with machine capabilities, especially when rapid localization is required for game trailers.

Market Dynamics and Localization Challenges

  • The conversation shifts towards identifying ideal user scenarios within gaming, emphasizing mobile games where budget constraints limit voice integration.
  • Popularity can lead to increased investment in voice talent as games gain traction; however, many languages remain underrepresented due to cost issues.
  • Major markets such as China and the US dominate sales potential, but there are significant opportunities in Eastern European and Asian territories.

Voicing Unvoiced Content Across Industries

  • A vast number of games released annually lack proper voiceovers; identifying solutions for voicing unvoiced content is crucial for market success.
  • Expanding beyond gaming into media and entertainment presents opportunities for emotional engagement through adaptive voice technologies.

Future Directions: Media, Entertainment, and Beyond

  • Media entertainment is seen as a natural evolution for voicing emotions since it involves complex narratives rather than simple lines.
  • Game development requires iterative processes that may utilize placeholder voices before final recordings with professional actors are made.

Text-to-Speech Systems and Emotional Scoring

The Role of Voice in Text-to-Speech

  • Emerging text-to-speech systems allow for the use of various voices, diminishing the importance of the original voice source.
  • Crafting emotional scores for segments is essential, akin to scoring music, requiring additional information that complicates the process.

Emotional Nuances in Speech

  • Efforts are being made to simplify emotional labeling by reducing key emotions while allowing engines to interpret multiple nuances.
  • Different vocal emissions (e.g., whispering vs. shouting) can alter frequency ranges, yet our ears perceive them as a single voice.
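
To make the idea of an "emotional score" concrete, here is a minimal sketch of how per-segment labels might be stored, assuming a small fixed set of key emotions plus an intensity and a vocal emission type. The label set, the 0 to 1 intensity scale, and the names `SegmentScore` and `dominant_emotion` are assumptions for illustration; the episode does not specify Voiseed's actual labelling scheme.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical reduced set of key emotions; the real label set is not given in the episode.
KEY_EMOTIONS = ("neutral", "joy", "anger", "sadness", "fear")
# Hypothetical emission types, following the whisper vs. shout example.
EMISSIONS = ("whisper", "normal", "shout")


@dataclass
class SegmentScore:
    """Emotional annotation for one segment of a script (illustrative only)."""
    text: str
    emotion: str               # one of KEY_EMOTIONS
    intensity: float           # assumed scale: 0.0 (flat) to 1.0 (extreme)
    emission: str = "normal"   # one of EMISSIONS

    def __post_init__(self) -> None:
        # Reject labels outside the reduced vocabulary so scores stay consistent.
        if self.emotion not in KEY_EMOTIONS:
            raise ValueError(f"unknown emotion: {self.emotion!r}")
        if self.emission not in EMISSIONS:
            raise ValueError(f"unknown emission: {self.emission!r}")
        if not 0.0 <= self.intensity <= 1.0:
            raise ValueError("intensity must be between 0.0 and 1.0")


def dominant_emotion(score: List[SegmentScore]) -> str:
    """Return the emotion with the highest total intensity across all segments."""
    if not score:
        raise ValueError("score must contain at least one segment")
    totals: Dict[str, float] = {}
    for seg in score:
        totals[seg.emotion] = totals.get(seg.emotion, 0.0) + seg.intensity
    return max(totals, key=totals.get)


if __name__ == "__main__":
    scene = [
        SegmentScore("Hi there.", "joy", 0.4),
        SegmentScore("Run!", "fear", 0.9, emission="shout"),
        SegmentScore("We made it.", "joy", 0.3, emission="whisper"),
    ]
    print(dominant_emotion(scene))
```

Keeping the emotion vocabulary small while storing intensity and emission separately reflects the point above: a few key labels for the human annotator, with the finer nuances left to the engine.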

Human Perception of Emotion

  • Humans can quickly detect changes in tone or expression over a phone call, showcasing our advanced ability to interpret emotional cues.
  • Conveying emotion through narration is complex; subjective interpretations vary among individuals and professionals involved in audio production.

Challenges in Audio Production

  • There are countless ways to express simple phrases like "hi," leading to challenges in achieving an acceptable performance within limited studio time.
  • Testing these systems is complicated due to the vast number of inputs and outputs generated, necessitating expertise from language service providers.

Implications of Large Language Models on Startups

Navigating Technological Advancements

  • The rise of large language models like ChatGPT presents both opportunities and challenges for startups at different stages of development.
  • Founders must assess whether to pivot their strategies based on new technologies while considering existing roadmaps and investments.

Building Generative Systems

  • The startup began developing generative systems before ChatGPT's emergence, focusing on creating expressive voice content across languages.
  • Their approach prioritizes solving specific problems rather than merely integrating existing technologies like Tacotron.

Funding and Development Challenges

Startup Journey and Funding Insights

Initial Challenges and Remote Operations

  • The startup began its journey in March 2020, facing significant challenges due to the COVID-19 lockdown in Italy.
  • They applied for the Berkeley SkyDeck JP program remotely, which provided valuable learning experiences despite the difficult start.
  • The team pursued funding through Horizon Europe, which offered grants and equity support; they successfully won a grant in October 2021.

Equity Funding and Team Development

  • The first equity round closed at the end of January, following their initial grant; this involved thorough due diligence from investors.
  • Hiring during lockdown was challenging; they focused on recruiting talent from local institutions like Politecnico in Milan.
  • Emphasis was placed on training new hires due to the specific skill set required for their technology, highlighting investment in human resources as crucial.

Hiring Strategy and Remote Work Adaptation

  • The startup seeks smart individuals willing to embrace challenges, recognizing that they are still defining their processes as a growing company.
  • While most team members are based locally, there is an increasing interest in remote hiring to adapt to changing work environments.

Product Development and Market Focus

  • Revoiceit is positioned as a B2B product with ongoing improvements aimed at emotional control and voice creation for various projects.
  • Their focus includes game localization and media/entertainment sectors while maintaining high emotional content quality.

Customer Engagement and Future Plans

  • The current model does not offer standard SaaS subscriptions, but it lets customers test the technology with a low barrier to entry.

Chat Summary with Andrea and Florian

Closing Thoughts on the Discussion

  • The conversation concludes with a sense of enthusiasm, as one participant expresses that their mind is "on fire" from the fascinating discussion.
  • Both participants acknowledge how engaging the chat was, highlighting its interesting content and stimulating ideas.
  • Gratitude is expressed towards Andrea for his contributions to the dialogue, indicating a positive interaction.
  • Florian reciprocates appreciation, emphasizing the pleasure derived from the conversation.

Video description

Voiseed CEO and Co-founder Andrea Ballista joins SlatorPod to discuss the machine dubbing startup’s approach to operating and developing their AI-based virtual voice engine, Revoiceit. Andrea talks about how his passion for music as a child led him to found audio localization studio Binari Sonori, which he sold to Keywords Studios in 2014, and why he is now launching Voiseed.

He shares his impressions from this year’s Game Developers Conference, where there was a lot of interest in new technologies, voice cloning solutions, and the development of emotional synthetic voices. Andrea unpacks Revoiceit’s ability to understand the voice and emotional profile of the user and transfer both profiles into the target language. Voiseed has been profiling vocal delivery and creating a dataset, so the system can have a wider knowledge of human emotion in terms of voice and language.

On the topic of large language models (LLMs), Andrea is not worried about the implications of LLMs like ChatGPT, as Voiseed has had two years to build a dataset on vast amounts of voice content. He talks about Voiseed’s early financial backers and shares the story behind applying for blended finance (grants and equity) from the European Innovation Council Fund and LIFTT. Andrea shares his experience hiring during the lockdown and their approach to investing in ‘cool’ people and helping them grow through their incentive plan.

The pod wraps up with Voiseed’s product roadmap, where they aim to improve features with more emotion control and work on how to create new voices that can be used in multiple projects.

Voiseed: https://www.voiseed.com/

Chapter Markers:

  • 00:00:00 Intro and Agenda
  • 00:00:58 Voiseed Elevator Pitch
  • 00:01:47 Professional Background and Journey to Voiseed
  • 00:05:32 Insights from Game Developers Conference 2023
  • 00:08:36 Voiseed's AI-Based Virtual Voice Engine
  • 00:15:01 Revoiceit User Profile
  • 00:18:32 Revoiceit User Cases
  • 00:23:59 Data Labelling
  • 00:28:40 Thoughts on ChatGPT
  • 00:31:00 Financing and Funding
  • 00:33:25 Hiring Environment
  • 00:36:33 Product Roadmap for 2023