Giving a Voice to the Unheard: Multi-Language Support for Least Spoken Languages in LLMs

Data Science

Enhancing Global Communication with Multi-Language Support in Language Models

Large language models (LLMs) are making waves in the field of artificial intelligence with their ability to process and generate human-like text. We at ProCogia believe that, applied well, LLMs can also play a significant role in breaking down language barriers and transforming communication on a global scale.

Closed-source LLMs like GPT-3.5 already do a good job on machine translation tasks [1]. The story is not the same for current open-source models. These are trained predominantly on English corpora, with little to no multilingual data, and have yet to show a significant improvement in translation and related tasks.

However, when we assess translation performance on the least spoken languages, such as Zulu, for which less written material is available online, both closed- and open-source models score lower. A possible solution is to separate the task of translating from the task of reasoning by adding a closed-source machine translation layer that handles the communication between the user and the model.
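The idea of separating translation from reasoning can be sketched as a thin wrapper around the model. This is a minimal illustration, not the implementation used in this work: the function name `answer_in_language`, the `translate` and `generate` callables, and the choice of English as the pivot language are all assumptions, and the stub functions stand in for a real translation service and a real LLM.

```python
from typing import Callable

# Hypothetical signatures: the article does not name the translation
# service or the LLM client, so both are passed in as plain callables.
Translate = Callable[[str, str, str], str]  # (text, source_lang, target_lang) -> text
Generate = Callable[[str], str]             # prompt in pivot language -> answer


def answer_in_language(
    user_text: str,
    user_lang: str,
    translate: Translate,
    generate: Generate,
    pivot_lang: str = "en",  # assumption: the LLM reasons best in English
) -> str:
    """Translate the request to the pivot language, let the LLM answer,
    then translate the answer back to the user's language."""
    if user_lang == pivot_lang:
        return generate(user_text)
    prompt = translate(user_text, user_lang, pivot_lang)
    answer = generate(prompt)
    return translate(answer, pivot_lang, user_lang)


# Stubs for illustration only; they just tag the text they receive.
def fake_translate(text: str, source: str, target: str) -> str:
    return f"[{source}->{target}] {text}"


def fake_generate(prompt: str) -> str:
    return f"answer to: {prompt}"


print(answer_in_language("Sawubona", "zu", fake_translate, fake_generate))
```

Because the LLM only ever sees pivot-language text in this sketch, the translation component can be swapped or evaluated independently of the model.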

One common way to compare the translation ability of the proposed machine translation layer with the natural ability of LLMs to work in different languages is the Bilingual Evaluation Understudy (BLEU) score [2]. BLEU measures precision over n-grams (sequences of n adjacent words), that is, how many n-grams in a candidate translation also appear in a reference translation, and it has been found to correlate well with human judgements of machine translation [3]. In this case, it makes sense to evaluate 4-grams, which indicate whether the model captures and conveys the underlying message of the text, and unigrams (1-grams), which indicate whether important information is preserved during translation.
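To make the difference between BLEU-1 and BLEU-4 concrete, here is a minimal sentence-level sketch in plain Python. It assumes a single reference, whitespace tokenization, no smoothing, and that BLEU-4 is the cumulative score (the geometric mean of 1-gram to 4-gram precisions, as in the original BLEU formulation [2]); the article does not state which tool or variant produced the reported scores, and the English sentences are illustrative only.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against one reference."""
    cand_counts = ngrams(candidate, n)
    ref_counts = ngrams(reference, n)
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # A candidate n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total


def bleu(candidate, reference, max_n):
    """Cumulative BLEU up to max_n on a 0-100 scale, without smoothing."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # the geometric mean is zero if any order has no match
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # The brevity penalty discourages candidates shorter than the reference.
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return 100 * brevity_penalty * math.exp(log_mean)


reference = "the cat sat on the mat".split()
candidate = "the cat sat on a mat".split()
bleu_1 = bleu(candidate, reference, max_n=1)
bleu_4 = bleu(candidate, reference, max_n=4)
print(f"BLEU-1: {bleu_1:.1f}  BLEU-4: {bleu_4:.1f}")
```

A single wrong word leaves most unigrams intact but breaks every longer n-gram that spans it, so BLEU-4 drops much faster than BLEU-1. For reported results, an established implementation is preferable to a hand-rolled one.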

In our use case, the model was required to answer in several South African languages, so the solution was tested on Xhosa, Northern Sotho, Afrikaans and Zulu. Findings are shown in Tables 1 and 2.

BLEU-1
Language	Machine Translation Layer	GPT4-turbo	GPT3.5-turbo	Mistral 7B Instruct	Llama 2 13B Chat	Zephyr 7B Beta	Meditron 7B
Xhosa	16.5	13.4	3.8	0.2	0.5	0.2	0.2
Northern Sotho	63	42.5	18.6	0.1	7.8	2.6	1.2
Afrikaans	62.2	61.3	56.5	31.7	41.4	11	0.8
Zulu	29.1	25.2	5.1	0	0.7	0.4	0

Table 1 – BLEU-1 scores

BLEU-4
Language	Machine Translation Layer	GPT4-turbo	GPT3.5-turbo	Mistral 7B Instruct	Llama 2 13B Chat	Zephyr 7B Beta	Meditron 7B
Xhosa	1.8	1.2	0	0	0	0	0
Northern Sotho	35.5	12.5	0	0	0	0	0
Afrikaans	32	29.6	27.5	4.7	11.7	2.1	0
Zulu	7.6	3.7	0	0	0	0	0

Table 2 – BLEU-4 scores

The findings are in line with current literature, showing that closed-source models outperform open-source ones when tasked with machine translation. Even so, our machine translation layer improved on them, outperforming GPT3.5-turbo and GPT4-turbo in all four languages on both BLEU-1 and BLEU-4.
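The size of that improvement can be read directly from the tables. The short sketch below copies the Machine Translation Layer, GPT4-turbo and GPT3.5-turbo columns from Tables 1 and 2 and computes, per language, the gap in BLEU points between the layer and the better of the two GPT models. The helper name `margins` is ours, introduced only for this illustration.

```python
# Scores copied from Tables 1 and 2: (MT layer, GPT4-turbo, GPT3.5-turbo).
bleu_1 = {
    "Xhosa": (16.5, 13.4, 3.8),
    "Northern Sotho": (63.0, 42.5, 18.6),
    "Afrikaans": (62.2, 61.3, 56.5),
    "Zulu": (29.1, 25.2, 5.1),
}
bleu_4 = {
    "Xhosa": (1.8, 1.2, 0.0),
    "Northern Sotho": (35.5, 12.5, 0.0),
    "Afrikaans": (32.0, 29.6, 27.5),
    "Zulu": (7.6, 3.7, 0.0),
}


def margins(table):
    """BLEU-point gap between the MT layer and the best GPT model, per language."""
    return {
        lang: round(layer - max(gpt4, gpt35), 1)
        for lang, (layer, gpt4, gpt35) in table.items()
    }


for name, table in (("BLEU-1", bleu_1), ("BLEU-4", bleu_4)):
    print(name, margins(table))
```

The gap is positive for every language on both metrics, and it is widest for Northern Sotho.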

In conclusion, implementing a closed-source machine translation layer improved translation performance over the LLMs' native ability in all four South African languages tested, with the largest gain for Northern Sotho. The cost of adding this layer is minimal: the solution operates as a wrapper around the LLM, making the translation step seamless to the end user and adding only milliseconds of latency to the results. This approach paves the way for more inclusive AI development that empowers all languages, not just the dominant ones.

References

[1] Wu, Y.; Hu, G. (2023). Exploring Prompt Engineering with GPT Language Models for Document-Level Machine Translation: Insights and Findings.
[2] Papineni, K.; Roukos, S.; Ward, T.; Zhu, W. J. (2002). BLEU: a Method for Automatic Evaluation of Machine Translation.
[3] “BLEU-Human Correlation is Increasing: What does this Mean?”. Archived from the original on Jun 14, 2018. Retrieved Apr 10, 2024.

Author

Gabriel Brock
