Overview of the Challenges in Multilingual Model Training
Sushko introduces the topic of multilingual language models, noting that a growing number of them are being released.
He notes that pretraining a multilingual model is costly and that fine-tuning monolingual models for multilingual tasks often yields subpar quality.
A referenced paper shows that transformer embeddings pass through three distinct stages during the forward pass.
Stages of Transformer Embeddings
In the first stage, in roughly the first third of the model, sentence embeddings from different languages are not yet grouped by language.
In the second stage, a multilingual manifold forms; however, underrepresented languages remain clustered outside this manifold.
The third stage sees clusters forming based on shared tokens across languages.
Proposed Solutions to Enhance Multilingual Models
Innovations Introduced by Sushko
Sushko's approach includes two key innovations: an auxiliary contrastive loss and a learnable skip connection.
The contrastive loss aims to pull underrepresented languages into the main manifold.
The learnable skip connection retains essential information about the source language from earlier layers and passes it to the final layers for better mapping.
Evaluation of the Proposed Approach
The method was evaluated on translation using the Flores-200 dataset in three directions: English to German, English to Turkish, and English to Chinese.
Video description
Introduction
Recently, multilingual Decoder Transformer models such as XGLM[1], Aya-27[2], and Vikhr[3] have started to emerge. Approaches to training such models fall into three groups: pretraining on a diverse multilingual dataset, applying multilingual instruction tuning, and optimizing the tokenizer for the target language. Unfortunately, all three are suboptimal: balancing languages in pretraining datasets is expensive, multilingual SFT hurts performance, and tokenizer optimization does not yield substantial benefits. This calls for additional research into multilingual LLM generalization.
Additionally, the paper “Do Llamas Work in English?”[4] shows that the latent embeddings of Decoder Transformer language models evolve through a series of steps during the forward pass. In the first third of the model, the embeddings of sentences in different languages are not grouped by language. In the middle of the model, they evolve into a “multilingual manifold”, where all languages are pushed closer to the language that was most prevalent in the pretraining dataset. In the final layers, the embeddings contract into clusters based on how many tokens the languages share. Additional experiments show that embeddings of sentences in languages that were underrepresented in the pretraining dataset form clusters outside the main multilingual manifold and are therefore not mapped correctly.
These experiments suggest a way to increase the multilingual performance of decoder transformers: by introducing an auxiliary contrastive loss in the first part of the language model, we can mitigate the problem of incorrect embedding mapping. Additionally, to better retain information about the source language, we propose adding a learnable skip connection. It passes the embeddings through a linear layer with an activation function, which filters out everything except the information needed for reconstruction, and then adds the result to the residual stream near the start of the layers that contribute to the third stage of the embedding evolution process.
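A minimal sketch of such a skip connection is shown here in plain Python for readability; a real implementation would use a tensor library and train the weights jointly with the model. The class name LearnableSkip, the choice of GELU as the activation, the weight initialization, and the single-vector interface are illustrative assumptions and are not taken from the paper.

```python
import math
import random


def gelu(x):
    # Tanh approximation of GELU; the specific activation is an assumption.
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


class LearnableSkip:
    """Hypothetical skip branch: linear layer + activation on an early
    hidden state, added to the residual stream at a later layer."""

    def __init__(self, hidden_size, seed=0):
        rng = random.Random(seed)
        # In training these parameters would be learned; here they are
        # only initialized so the forward pass can be demonstrated.
        self.weight = [
            [rng.gauss(0.0, 0.02) for _ in range(hidden_size)]
            for _ in range(hidden_size)
        ]
        self.bias = [0.0] * hidden_size

    def project(self, early_hidden):
        # Linear map followed by the activation: the "filter" step.
        return [
            gelu(sum(w * h for w, h in zip(row, early_hidden)) + b)
            for row, b in zip(self.weight, self.bias)
        ]

    def __call__(self, early_hidden, late_residual):
        # Add the filtered early-layer information to the later residual.
        filtered = self.project(early_hidden)
        return [r + f for r, f in zip(late_residual, filtered)]


# Usage: one token's hidden state from an early layer and the residual
# stream at the (hypothetical) layer where the skip is re-injected.
skip = LearnableSkip(hidden_size=4)
early = [0.5, -1.0, 2.0, 0.0]
late = [1.0, 1.0, 1.0, 1.0]
updated_residual = skip(early, late)
```

Because the branch is additive, a skip whose projection outputs zeros leaves the residual stream unchanged, so the model can learn how much early-layer information to pass forward.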
Methods
For the experiments, we selected the XGLM-564M model, which was pretrained on a large and diverse multilingual dataset. For the training task, we used translation, which makes the contrastive loss easy to apply, is fast to train, and is easy to evaluate. For the training dataset, we used the training subset of the Flores-200[5] dataset. For the contrastive loss, we used InfoNCE, which pulls together the embeddings of the anchor and positive examples and treats all other examples in the batch as negatives. In our case, the positive example for a text is its direct translation, which shares the same lexical meaning but is in a different language. The layers for applying the contrastive loss and the skip connection were selected by grid search and evaluated with the BLEU[6] metric on the test dataset. The main experiments were evaluated on the En-De translation direction using the BLEU, XComet[7], ChrF1, and ROUGE metrics.
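The in-batch InfoNCE objective described above can be sketched as follows, again in plain Python. The use of cosine similarity, the temperature value, and the assumption that each sentence is already reduced to a single embedding vector (for example by pooling hidden states at the chosen layer) are illustrative choices, not details confirmed by the paper.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.1):
    """Mean in-batch InfoNCE loss.

    anchors[i] and positives[i] are embeddings of a sentence and its
    translation; positives[j] for j != i act as negatives for anchor i.
    """
    if len(anchors) != len(positives):
        raise ValueError("anchors and positives must have the same length")
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [
            cosine_similarity(anchor, candidate) / temperature
            for candidate in positives
        ]
        # Numerically stable log-sum-exp over all candidates in the batch.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        # Cross-entropy with the true translation as the target class.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Usage with toy 2-dimensional "sentence embeddings".
english = [[1.0, 0.0], [0.0, 1.0]]
german_aligned = [[0.9, 0.1], [0.1, 0.9]]
german_shuffled = [[0.1, 0.9], [0.9, 0.1]]
aligned_loss = info_nce(english, german_aligned)
shuffled_loss = info_nce(english, german_shuffled)
```

In this toy setup the loss is lower when each anchor is closest to its own translation than when the translations are swapped, which is the behavior the auxiliary loss relies on to pull parallel sentences together.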
After selecting the best layer for the contrastive loss and the best layers for the skip connection, we ablated the training process to explore how each addition to the training recipe and model architecture changes the final translation metrics. First, we removed the skip connection and trained a model using only the auxiliary contrastive loss. Second, we trained the model directly to translate into the target language, without additional losses or architecture modifications. Third, we added translation directions into languages from two additional language groups: En-Zh and En-Tr.
Results and Discussion
The final results are promising but mixed: the method scored measurably better than the baselines for the En-De translation direction, worse for En-Tr, and on par with the baselines for En-Zh. We attribute the mixed results to the very small training dataset used for translation and to the limited amount of Turkish data in pretraining, because of which the contrastive loss did not fully converge by the end of training.
The resulting paper[8] was accepted to the NAACL 2025 conference and was presented at the SRW poster session in Albuquerque, New Mexico, in April 2025.
Sushko SkipCLM Abstract Video | YouTube Video Summary | Video Highlight