# 161 Microsoft’s Christian Federmann on the Translation Quality of Large Language Models

Introduction and Background

Christian Federmann, Principal Research Manager at Microsoft Translator, is introduced, with an overview of his career background and his focus on machine translation evaluation.

Christian's Career Path in NLP

  • Christian discusses his educational background in computational linguistics and computer science at Saarbrücken University.
  • Mentions working under Hans Uszkoreit at the German Research Center for AI, which influenced his path towards machine translation.
  • Shares how he took on programming work in computational linguistics with Hans, leading to his PhD and eventually to joining Microsoft.

Milestones in Career and Speech Translation

Highlights key milestones in Christian's career journey, including significant projects like Skype Translator and transitioning from statistical MT to neural MT.

Career Milestones

  • Recounts joining Microsoft and being involved in speech translation, leading to the development of Skype Translator as a major milestone.
  • Discusses the transition within the Microsoft Research team from statistical MT to neural MT, emphasizing the publication of a human parity paper for Chinese-English in 2018.
  • Reflects on the aftermath of the human parity paper publication and the impact it had on academia's perception of machine translation evaluation metrics.

Importance of Context in Machine Translation Evaluation

Explores the significance of context in evaluating machine translation quality over time and its relevance amidst advancements in large language models (LLMs).

Contextual Evaluation Evolution

  • Notes initial criticisms regarding context evaluation during the human parity paper release and highlights the ongoing importance of context even as metrics evolve.
  • Emphasizes the shift towards considering context as crucial for better evaluation, especially with advancements in large language models offering wider contextual windows.

Insights into Microsoft Translator Framework

Provides insights into Microsoft Translator's evolution from a research project to utilizing cutting-edge neural MT technology within the Marian framework.

Evolution of Microsoft Translator

  • Details Microsoft Translator's origins as a research project built on statistical machine translation before transitioning to neural MT using the Marian framework.

Innovations in Translation Models

In this section, the speaker discusses various translation models and tools offered by Microsoft, highlighting their accessibility and customization options.

Access to Translation Models

  • Microsoft provides API-based access to models for both first and third-party customers.
  • Two main translation APIs are available:
    • A generic, general-domain translation API.
    • Custom Translator, allowing customization with user-uploaded data.
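
As a rough illustration of the two access paths above, here is a minimal sketch of how a request to the Translator v3 REST `translate` endpoint might be assembled. The subscription key and the Custom Translator category ID are placeholders; routing to a custom model via the `category` parameter follows the public API documentation.

```python
# Sketch of building a request for the Microsoft Translator v3 REST API.
# Endpoint, headers, and the `category` parameter follow the public API docs;
# the key and category values below are placeholders.

def build_translate_request(texts, to_lang, key, category=None):
    """Return (url, params, headers, body) for a /translate call.

    Passing a Custom Translator category ID routes the request to a
    custom-trained model instead of the general-domain system.
    """
    params = {"api-version": "3.0", "to": to_lang}
    if category is not None:
        params["category"] = category  # Custom Translator model ID
    headers = {
        "Ocp-Apim-Subscription-Key": key,
        "Content-Type": "application/json",
    }
    body = [{"Text": t} for t in texts]
    url = "https://api.cognitive.microsofttranslator.com/translate"
    return url, params, headers, body

# General-domain call vs. a call routed to a (hypothetical) custom model:
url, params, headers, body = build_translate_request(
    ["Hello, world"], to_lang="de", key="<your-key>",
    category="my-custom-category-id",
)
```

Omitting `category` yields the generic general-domain system; supplying it selects the fine-tuned custom model.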

Document Translation Tool

  • Introduction of the Document Translation tool, which shifts translation from the segment level to whole documents in Microsoft 365 formats.
  • Maintains document formatting during translation process.
  • Emphasizes the importance of preserving formatting for practical use beyond academia.

Advancements in Language Models

  • Transition towards large language models like GPT due to their multilingual capabilities and potential for high-quality translations.
  • Initial skepticism regarding model quality resolved through internal investigations revealing strong performance in specific language pairs and domains.
  • Possibility that large language models could serve general-domain translation for top languages in the future.

Future of Large Language Models in Translation

This section delves into the potential role of large language models in revolutionizing translation services, focusing on scalability challenges and cost implications.

Role of Large Language Models

  • Large language models like GPT could potentially offer high-quality translations for various languages with sufficient training data.
  • Exploration ongoing to create hybrid translation models leveraging large language models' capabilities.
  • Cost efficiency remains a challenge compared to specialized models optimized for affordability and performance.

Scalability Challenges

  • Scaling up usage of large language models requires time, resources, and increased GPU capacity due to access limitations.
  • Comparison drawn with past transitions to neural models where improvements over time enhanced accessibility and speed of deployment.
  • Anticipated improvements in access over the coming years may address current scalability issues faced with large language models like GPT.

Latency and Decision Making in Language Models

The discussion delves into the challenges related to latency in language models and the decision-making process of determining when to use large language models versus specialized models.

Latency Challenges and Decision Criteria

  • Large multilingual models versus specialized models: The focus shifts from rule-based and Statistical Machine Translation (SMT) hybrids to deciding between large multilingual models or specialized ones, emphasizing the need for optimizing cost and quality.
  • Translation process simplification: Language model translation is simplified as predicting the next word. Questions arise regarding whether the source text is part of the prompt or separate, highlighting the encoder-decoder mechanism in large language models.
  • Prompt engineering significance: The prompt serves as an instruction for large language models, guiding them through encoding and decoding steps. This prompts a shift towards more principled approaches from current prompt hacking practices.
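
The idea of the prompt as an instruction can be made concrete with a minimal zero-shot translation prompt; the wording below is illustrative, not Microsoft's actual template.

```python
def translation_prompt(source_lang, target_lang, source_text):
    """Wrap the source text in a plain instruction. For a decoder-only LLM,
    the instruction and the source segment together form the prompt, and
    translation then proceeds by next-token prediction from that point."""
    return (
        f"Translate the following {source_lang} text into {target_lang}.\n"
        f"{source_lang}: {source_text}\n"
        f"{target_lang}:"
    )

print(translation_prompt("English", "German", "The weather is nice today."))
```

The trailing `"{target_lang}:"` cue is one common convention for signaling where the model should begin generating the translation.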

Exploration of Translation Prompts

  • Templatic versus variable prompts: Discussion on whether translation instructions are fixed templates or variable components within prompts, leading to ongoing prompt engineering developments for optimal translations.
  • Importance of optimized prompts: Emphasis on transitioning from ad hoc prompt creation to principled methods for determining effective prompts, acknowledging that optimal prompts may not always be intuitively human-readable.

Training Data Sources Impacting Language Models

Explores how training data sources influence large language models, potentially incorporating content from web scraping and publicly available APIs.

Influence of Training Data Sources

  • Web data incorporation: Large language model training often involves scraping extensive web data, which can include standard test sets from WMT or NIST competitions. Publicly accessible translation APIs may also contribute to training datasets.
  • Testing with simplistic examples: Evaluation using basic phrases common in machine translation services highlights the challenge of distinguishing between machine-generated and human-authored content for enhancing model quality.

Discussion on Translation Quality Evaluation and Large Language Models

In this section, the discussion revolves around the evaluation of translation quality using large language models like GPT-3.5 and ChatGPT, emphasizing the importance of utilizing current test sets such as WMT22 for comparison.

Importance of Current Data Sets for Translation Quality Evaluation

  • Last year's models like GPT-3.5 and ChatGPT were trained on data collected before the WMT22 test sets were publicly released.
  • Utilizing the WMT22 test sets is therefore crucial for fair comparison, since older test sets may have leaked into training data, a concern that grows as large language models gain plugin architectures.

Challenges in Model Evaluation

  • The difficulty arises when training data details are undisclosed, making it hard to ascertain if a model has encountered all necessary tasks, posing challenges in evaluating model performance accurately.
  • Introduction of plugins further complicates evaluation as it blurs distinctions between model-generated outputs and plugin-enhanced translations, raising concerns about transparency in assessing translation quality.

Evolution of Translation Quality Metrics

  • Historically, BLEU has been the predominant metric for machine translation evaluation; newer metrics like chrF gained traction due to better stability and correlation with quality, but faced adoption challenges.
  • The sacreBLEU tool by Matt Post standardized the computation and reporting of BLEU scores and also supports additional metrics such as chrF and Translation Error Rate (TER) for more comprehensive evaluation.
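
For intuition about what chrF measures, here is a bare-bones character n-gram F-score in the spirit of chrF (averaging over n = 1..6 with a recall-weighted beta of 2). Real evaluations should use the sacreBLEU implementation, which this sketch only approximates.

```python
from collections import Counter

def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Toy character n-gram F-score in the spirit of chrF.

    Averages n-gram precision and recall over n = 1..max_n, then combines
    them with an F-beta score (beta=2 weights recall twice as heavily as
    precision). Use sacreBLEU's chrF for real evaluation.
    """
    def ngrams(text, n):
        return Counter(text[i:i + n] for i in range(len(text) - n + 1))

    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = ngrams(hypothesis, n), ngrams(reference, n)
        overlap = sum((hyp & ref).values())  # clipped match counts
        if hyp:
            precisions.append(overlap / sum(hyp.values()))
        if ref:
            recalls.append(overlap / sum(ref.values()))
    p = sum(precisions) / len(precisions) if precisions else 0.0
    r = sum(recalls) / len(recalls) if recalls else 0.0
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)

print(round(chrf("the cat sat", "the cat sat"), 2))  # identical strings score 1.0
print(round(chrf("a cat sat", "the cat sat"), 2))
```

Operating on character n-grams rather than word n-grams is what gives chrF its robustness for morphologically rich languages.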

Evaluation Using GEMBA Metric

This segment delves into the development and success of the GEMBA metric in assessing translation quality through large language models like GPT.

Introducing GEMBA Metric

  • The GEMBA metric leverages any large language model to evaluate translation quality based on given input-output pairs for different languages, offering a simple yet effective assessment approach.
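
The flavor of a GEMBA-style direct-assessment prompt can be sketched as below; the wording is a paraphrase of the style described in the Kocmi and Federmann paper, not the exact published template.

```python
# Paraphrased GEMBA-style direct-assessment prompt template (illustrative,
# not the verbatim template from the paper).
GEMBA_DA_TEMPLATE = (
    "Score the following translation from {src_lang} to {tgt_lang} on a "
    "continuous scale from 0 to 100, where a score of zero means "
    '"no meaning preserved" and a score of one hundred means '
    '"perfect meaning and grammar".\n\n'
    '{src_lang} source: "{source}"\n'
    '{tgt_lang} translation: "{translation}"\n'
    "Score:"
)

def gemba_prompt(src_lang, tgt_lang, source, translation):
    """Fill the template; the resulting string can be sent to any capable
    LLM, and the numeric continuation is parsed as the quality score."""
    return GEMBA_DA_TEMPLATE.format(
        src_lang=src_lang, tgt_lang=tgt_lang,
        source=source, translation=translation,
    )

print(gemba_prompt("English", "German", "Good morning.", "Guten Morgen."))
```

Because the template is model-agnostic, the same prompt can be pointed at different LLM endpoints, which is what makes the metric easy to replicate.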

Experiences with GPT Versions

In this section, the speaker discusses their experience with various GPT versions and the challenges faced in accessing GPT-4 due to its slow speed.

Experience with Different GPT Versions

  • The team experimented with seven different GPT versions as outlined in the paper.
  • GPT-2 was found to lack a clear understanding of user requests, leading to confusion in models like Ada and Babbage.
  • Progress is seen with Curie, while Davinci models are highlighted as the optimal choice due to their ability to understand requests effectively.

Publishing Prompt Templates and Code

This part delves into the publication of prompt templates and proof of concept code for replicating prompts sent to GPT/other LLM endpoints.

Publication of Prompt Templates

  • Internal publication of prompt templates and proof of concept code aims at enabling replication by sending prompts to various language model endpoints.
  • Efforts are focused on leveraging this data for enhancing quality metrics internally, with plans for integration into production processes soon.

Circularity in Evaluating GPT Output

The discussion shifts towards evaluating GPT output translations and addressing concerns regarding circularity in model assessment.

Evaluating GPT Output

  • Evaluation involves assessing translation quality using outputs generated by the model itself, introducing a circularity challenge.
  • Concerns arise about LLM self-assessment within machine translation tasks, hinting at potential complexities ahead.

Comparing Evaluators Against Human Performance

The conversation transitions to comparing state-of-the-art evaluators against human performance and automated metrics.

State-of-the-Art Evaluation Metrics

  • State-of-the-art evaluation metrics are compared against human references in WMT's shared task, aiming for high correlation with human decisions.
  • Insights from human parity papers inform discussions on large language models' translation quality assessment without implying superiority over humans.

Quality Control and Prompting in Language Models

In this section, the discussion revolves around quality control processes for language models and the impact of prompts on model output quality.

Quality Control Processes

  • Language models may transition from assessing a few thousand samples to millions for quality control.
  • The GEMBA metric flags certain samples for human confirmation, indicating a potential application area.
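
The flagging workflow described above can be sketched as a simple threshold filter over GEMBA-style scores; the threshold of 70 and the segment IDs are arbitrary assumptions for illustration.

```python
def flag_for_review(scored_samples, threshold=70):
    """Route samples whose LLM-based quality score falls below the threshold
    to human reviewers; the rest pass automated QC.

    `scored_samples` is an iterable of (segment_id, score) pairs; the
    threshold value is illustrative, not a recommended setting.
    """
    needs_review = [sid for sid, score in scored_samples if score < threshold]
    auto_passed = [sid for sid, score in scored_samples if score >= threshold]
    return needs_review, auto_passed

needs_review, auto_passed = flag_for_review(
    [("seg-1", 92), ("seg-2", 55), ("seg-3", 78), ("seg-4", 64)]
)
print(needs_review)  # → ['seg-2', 'seg-4']
```

Only the low-scoring minority goes to humans, which is how an LLM-based metric could make QC feasible at millions of samples.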

Impact of Prompts on Model Output

  • Discussion on prompting language models by claiming to be a machine translation system and its influence on output quality.
  • Importance of prompt design and measuring its quality highlighted through experiments with different prompt types.

Training Language Models for Machine Translation

This segment delves into training language models for machine translation, exploring methods like knowledge distillation and few-shot learning.

Training Approaches

  • Knowledge distillation approach as a current method for training language models in machine translation.
  • Crafting custom translation prompts using publicly available resources but noting the high cost involved.

Few-Shot Learning

  • Discussion on the potential of few-shot learning to enhance model performance significantly.
  • Exploring domain optimization through few-shot examples for building specialized translation engines.
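
Domain optimization through few-shot examples amounts to prepending in-domain translation pairs to the prompt. In this sketch the automotive example pairs and the prompt wording are invented for illustration.

```python
def few_shot_prompt(pairs, source, src_lang="English", tgt_lang="German"):
    """Build a few-shot translation prompt from in-domain example pairs.

    Choosing examples from the target domain (e.g. automotive, eCommerce)
    is what steers the model toward that domain's terminology.
    """
    lines = [f"Translate {src_lang} to {tgt_lang}."]
    for src, tgt in pairs:
        lines.append(f"{src_lang}: {src}\n{tgt_lang}: {tgt}")
    lines.append(f"{src_lang}: {source}\n{tgt_lang}:")
    return "\n\n".join(lines)

# Hypothetical in-domain (automotive) examples:
prompt = few_shot_prompt(
    [("Check the brake fluid level.", "Prüfen Sie den Bremsflüssigkeitsstand."),
     ("Replace the air filter.", "Ersetzen Sie den Luftfilter.")],
    "Tighten the wheel bolts.",
)
print(prompt)
```

Swapping in a different set of example pairs is, in effect, a lightweight way to build a specialized translation engine without retraining.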

Custom Translator and Continuous Model Updating

This part focuses on Custom Translator tools, user profiles utilizing them, and the process of continuous model updating.

Custom Translator Tools

  • Description of Custom Translator tool allowing users to fine-tune translations based on specific domains or interests.
  • Benefits of fine-tuning models with customer-specific data leading to improved quality within specific domains.

Continuous Model Updating

  • Discussion on the cycle of data supply, fine-tuning custom translation models, and iterative deployment for optimal performance.

API Access to Custom Models

In this section, the speaker discusses how companies can access and utilize custom translation models through an API, emphasizing the importance of data collection and model training for improved translation quality.

Accessing Custom Translation Models

  • Companies can access custom translation models via an API by pointing to their specific models.
  • Data collection is crucial as companies send more input documents for human translation processes.
  • Regular data updates are essential for training updated models in custom translation.

Enabling New Languages with Local Governments

This part focuses on the collaboration with local governments to enable new languages in Custom Translator, highlighting the continuous data collection process and model improvement cycle.

Collaboration with Local Governments

  • Local governments contribute to building custom translation models by providing translation data.
  • Continuous human translation processes help collect data for quarterly updates and annual model training.
  • Feedback from local government trainings aids in upgrading publicly available models.

Enterprise Users of Custom Translator

The discussion shifts towards key users of Custom Translator within enterprises, focusing on engagement with localization/globalization teams and the need for domain-specific high-quality data.

Key Users of Custom Translator

  • Localization/globalization teams are key users due to their need for high-quality domain-specific translations.
  • Adding known high-quality data on top of existing suppliers enhances translation quality.
  • Engagement with select customers aims at resolving domain-specific compatibility issues for improved quality.

Expanding Language Coverage

This segment delves into language additions in MS Translator, emphasizing efforts to scale languages aggressively while engaging with diverse language communities.

Language Additions and Community Engagement

  • MS Translator has expanded from 42 to 125 languages over ten years, with ongoing efforts to add more languages.
  • Language community engagement plays a significant role in language addition decisions and model refinement processes.
  • Collaborative efforts ensure rough language models evolve into usable experimental versions through community feedback and additional data collection.

Discussion on Language Support and Data Availability

The discussion revolves around the challenges of providing language support for smaller communities due to limited data availability.

Prioritizing Language Support

  • Challenges in supporting languages with small speaker populations, such as Inuktitut in Canada, due to insufficient data.
  • Commitment to roll out support for African languages such as Hausa, Igbo, and Yoruba following discussions with the Nigerian President.
  • Opportunistic approach in adding languages based on available data and community support like Upper Sorbian.
  • Importance of active language communities in expediting language support rollout.

Language Preservation and Industry Efforts

Focuses on the significance of preserving endangered languages and industry efforts towards language preservation.

Language Preservation Efforts

  • Concern over the decline of small languages among young people due to lack of business incentives.
  • Collective industry effort to prevent extinction of languages by digitizing them for preservation.
  • Personal anecdote about witnessing Romansh-speaking children in Switzerland, highlighting the importance of language continuity.

Translation Model Strategies and Quality Improvement

Discusses translation model strategies, quality improvement challenges, and potential future directions.

Translation Model Strategies

  • Pivoting via English for most language pairs to avoid a quadratic explosion in the number of models.
  • Customer-driven approach in building direct models based on market demand and data availability.
  • Exploration of challenges in achieving incremental gains in machine translation quality over time.
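
The quadratic explosion mentioned above is easy to quantify: with n languages, direct models for every ordered pair require n·(n−1) systems, while pivoting through a single hub language such as English requires only 2·(n−1).

```python
def model_counts(n_languages):
    """Number of MT systems needed to cover all ordered language pairs:
    direct models vs. pivoting every pair through one hub language."""
    direct = n_languages * (n_languages - 1)  # one model per ordered pair
    pivot = 2 * (n_languages - 1)             # X->English and English->X
    return direct, pivot

# At Microsoft Translator's scale of roughly 125 languages:
direct, pivot = model_counts(125)
print(direct, pivot)  # → 15500 248
```

The two orders of magnitude between the counts explain why direct pair models are built only where market demand and data availability justify them.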

Future Directions in Translation Quality Enhancement

Explores future directions for enhancing translation quality through domain-specific training data.

Translation Quality Enhancement

  • Consideration of investing efforts into creating high-quality training data for specific domains rather than focusing solely on improving existing models.
  • Shift towards exploring new domains like automotive or eCommerce for quicker quality gains compared to saturated domains like news translation.

Leveraging Learning from Less Examples

In this section, the discussion revolves around the need to find ways to leverage and learn from fewer examples in machine learning models.

Leveraging Learning Efficiency

  • There is a need to explore methods for learning efficiently from limited examples, as humans do not require billions of examples for learning.
  • The conversation anticipates advancements in learning architectures that can extract more knowledge from smaller datasets, potentially impacting large language models (LLMs) and machine translation (MT) predictions within six months to a year.
  • Expectations are set for a potential decline in hype surrounding large language models as their limitations become apparent, leading researchers to develop new measurement approaches. This shift towards realism may prompt the identification of problems these models struggle with.

Future of Machine Translation Solutions

This segment delves into predictions regarding machine translation solutions and the evolving landscape of large language models in the market.

Predictions for Machine Translation

  • Forecasts suggest that within a year, accessible large language model-enhanced MT solutions will be available commercially, contingent on cost considerations.
  • While GPT-enhanced translation services may be effective, their high costs currently limit widespread adoption. However, it is anticipated that public availability of cost-effective large language models tailored for translation tasks will increase usage.
  • Specialized model providers face the challenge of justifying higher costs or demonstrating added value compared to generic alternatives like OpenAI's offerings. Competition drives providers to enhance service quality while maintaining competitive pricing strategies.

Challenges and Opportunities in Model Integration

This part explores contrasting perspectives on data privacy and integration challenges when utilizing OpenAI's models versus developing specialized model services.

Data Privacy and Service Differentiation

  • OpenAI's approach emphasizes model development without disclosing how input data is utilized, potentially incorporating user data into training sets over time.
  • Companies integrating OpenAI's models must address enterprise-grade expectations regarding data privacy and offer additional guarantees beyond basic translation services. This differentiation strategy aims to justify premium pricing or attract customers seeking enhanced service features.

Video Description

In this week’s SlatorPod, we are joined by Christian Federmann, Principal Research Manager at Microsoft, where he works on machine translation (MT) evaluation and language expansion.

Christian recounts his journey from working at the German Research Center for Artificial Intelligence under the guidance of AI pioneer Hans Uszkoreit to joining Microsoft and building out Microsoft Translator. He shares how Microsoft Translator evolved from using statistical MT to neural MT and why they opted for the Marian framework.

Christian expands on Microsoft’s push into large language models (LLMs) and how his team is now experimenting with NMT and LLM machine translation systems. He then explores how LLMs translate and the role of various prompts in the process. Christian discusses the key metrics historically and currently used to evaluate machine translation. He also unpacks the findings from a recent research paper he co-authored investigating the applicability of LLMs for automated assessment of translation quality.

Christian describes how Microsoft’s Custom Translator fine-tunes and improves the user’s MT model through customer-specific data, which degrades more general domain performance. He shares Microsoft’s approach to expanding its support for languages with the recent addition of 13 African languages; collaboration with language communities is an integral step in improving the quality of the translation models.

To round off, Christian believes that the hype around LLMs may hit a wall within the next six months, as people realize the limitations of what they can achieve. However, in a year or two, there will be better solutions available, including LLM-enhanced machine translation.
Christian Federmann: https://www.microsoft.com/en-us/research/people/chrife/

Chapter Markers:

00:00:00 Intro and Agenda
00:01:03 Background and Career Milestones
00:05:39 Microsoft Translator in a Nutshell
00:09:23 GPT Use at Microsoft
00:12:29 GPT Technical Issues
00:16:06 How do LLMs translate?
00:20:03 LLMs and Machine Translation
00:25:15 Translation Quality of LLMs
00:38:49 Prompt Engineering
00:40:51 Training LLMs to Machine Translate
00:43:33 Microsoft Translator User Profile
00:45:26 Data Training Cycle
00:48:01 Government Usage
00:49:45 Language Coverage
00:57:25 Pivot Languages
00:58:51 Improving Machine Translation Quality
01:02:17 LLMs and MT Predictions