The Problem: Large Language Models (LLMs) perform well in data-rich environments but struggle where training data is scarce; the problem is most acute in non-English languages, especially native African languages, where little digitized text exists. This linguistic disparity limits the practical impact of LLMs in high-growth, underserved populations. This study demonstrates how we improved the performance of LLMs in African languages by developing new datasets and approaches for retraining AI systems. While our focus was on African languages, the methods and insights gained have broader implications for enhancing LLM performance in other languages and domains with limited data.
Our Approach: We developed a two-step process to bridge the performance gap between English and African languages.
The Impact: Our work addresses the lack of AI tools for African languages by creating over 1 million human-translated words in eleven underserved languages spoken by more than 230 million people. By incorporating high-quality, culturally relevant data, we improved the performance of best-in-class LLMs. While our focus was on African languages, our approach offers a scalable solution for improving AI performance in any domain with limited data. The translated benchmarks, datasets, and retraining tools are publicly available, providing resources to help others build more inclusive and effective AI across diverse languages and low-data environments.
Large Language Models (LLMs) are approaching human-level performance on many tasks, with each new generation narrowing the gap. However, LLM capabilities drop significantly in non-English languages, especially African languages, where limited data and benchmarks make it difficult to assess (and later improve) performance. The gap between LLM performance in English and African languages reflects a broader issue of inequality. Despite being spoken by millions, African languages lack the data and tools needed to train AI effectively. This leaves LLMs least useful for the communities that stand to benefit the most: 66% of the world’s poor live in Africa, and 80% of them don’t speak English, so they couldn’t use an English-only LLM even if they wanted to. As AI continues to improve for English speakers, vulnerable non-English speakers are left further behind, reinforcing existing inequalities and limiting access to technology-driven opportunities.
We started by creating tools to measure the performance gap between English and African languages on a variety of tasks; we then used what we learned to retrain existing LLMs and close the gap.
1. A new way to assess LLMs in African languages: To address the underperformance of LLMs in African languages, we first identified the types of content and questions where the performance gap (compared to English) was greatest. Specifically, we generated the first-ever translations of widely used AI benchmarks designed to assess both the general reasoning (WinoGrande) and expert knowledge (MMLU sections on clinical knowledge, college medicine, and virology) of AI systems in eleven widely spoken African languages: Afrikaans, Xhosa, Zulu, Amharic, Bambara, Igbo, Sepedi, Shona, Sesotho, Setswana, and Tsonga (see Fig. 1). The translation process involved multiple layers of validation by native speakers, professional translators, and domain experts to ensure accuracy, cultural relevance, and the preservation of technical meaning. A minimal sketch of how such parallel benchmark items can be stored and sanity-checked is shown below.
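To make the setup concrete, here is one way (in Python) to represent a translated multiple-choice item and run a basic structural validation pass. The field names and the `validate_item` helper are illustrative assumptions, not the project's actual schema; structural checks like these complement, rather than replace, the human review described above.

```python
from dataclasses import dataclass, field

@dataclass
class BenchmarkItem:
    """One multiple-choice item kept aligned across English and a target language.

    Hypothetical schema for illustration only.
    """
    item_id: str
    language: str               # e.g. "zul" for Zulu
    question_en: str            # original English question
    question_translated: str    # human translation, validated by native speakers
    choices_en: list = field(default_factory=list)
    choices_translated: list = field(default_factory=list)
    answer_index: int = 0       # index of the correct choice (language-invariant)

def validate_item(item: BenchmarkItem) -> list:
    """Return structural problems that must be fixed before an item enters the benchmark."""
    errors = []
    if len(item.choices_en) != len(item.choices_translated):
        errors.append("choice count changed during translation")
    if not (0 <= item.answer_index < len(item.choices_en)):
        errors.append("answer index out of range")
    if not item.question_translated.strip():
        errors.append("empty translation")
    return errors
```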
2. Measuring the LLM performance gap between English and African languages:
We used our translated benchmarks to compare how well the world’s best LLMs answer the same question when only the language it is asked in changes (e.g. from English to Zulu). The LLMs tested included private models (GPT-3.5, GPT-4, GPT-4o), open-source models (Llama 3 70B and 8B), and models touted for their multilingual prowess (Aya 23, Aya 101, BLOOMZ 7b1); a selection of LLMs from each category is shown in Fig. 2. Comparing results across multiple models, languages, and tasks revealed a sizable LLM performance gap on African languages. Even the best-in-class LLM (GPT-4o) showed a sizable gap relative to its English performance.
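At its core, this measurement reduces to scoring identical items in each language and subtracting accuracies. The sketch below assumes a generic `query_model` callable (a stand-in for whatever API each model exposes, returning a chosen option index) and reuses the hypothetical `BenchmarkItem` records from the earlier sketch; a real harness would also need per-model prompt templating and answer parsing.

```python
def format_prompt(item) -> str:
    """Render a question and its options in the target language (illustrative template)."""
    options = "\n".join(f"{i}. {c}" for i, c in enumerate(item.choices_translated))
    return f"{item.question_translated}\n{options}\nAnswer with the option number."

def accuracy(items, query_model) -> float:
    """Fraction of multiple-choice items the model answers correctly."""
    correct = sum(1 for item in items
                  if query_model(format_prompt(item)) == item.answer_index)
    return correct / len(items)

def performance_gap(items_by_language, query_model) -> dict:
    """English accuracy minus each target language's accuracy, per language.

    `items_by_language` maps a language code (e.g. "eng", "zul") to the same
    benchmark items rendered in that language.
    """
    scores = {lang: accuracy(items, query_model)
              for lang, items in items_by_language.items()}
    english = scores["eng"]
    return {lang: english - acc for lang, acc in scores.items() if lang != "eng"}
```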
3. Closing the Gap: After analyzing how existing LLMs performed across a variety of tasks and languages, we ran training experiments (producing over 400 new LLMs) to methodically identify how LLMs can be improved when data is limited. Our experiments revealed a novel way to train LLMs in data-scarce environments generally, and for African languages specifically: combine a small, high-quality, culturally relevant dataset from a target African language with a much larger dataset (even if it’s lower quality) from its linguistic relatives (i.e. “cross-lingual instruction”). A sketch of this data-mixing step follows.
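The following is a minimal sketch of the cross-lingual mixing idea, assuming instruction examples are plain records and treating the target-to-relative ratio as a tunable knob. The `relative_multiple` value and function name are illustrative assumptions, not the settings used in the study.

```python
import random

def cross_lingual_mix(target_examples, relative_examples,
                      relative_multiple=10, seed=0):
    """Combine a small, high-quality target-language instruction set with a
    larger pool drawn from linguistically related languages.

    target_examples:   e.g. ~100 curated, culturally relevant Zulu instructions
    relative_examples: a larger, noisier pool from relatives such as Xhosa
    relative_multiple: related-language examples to add per target example
                       (an illustrative knob, not the study's value)
    """
    rng = random.Random(seed)
    n_relatives = min(len(relative_examples),
                      relative_multiple * len(target_examples))
    mixed = list(target_examples) + rng.sample(relative_examples, n_relatives)
    rng.shuffle(mixed)  # interleave so batches see both sources
    return mixed
```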
By emphasizing data quality, our retraining process delivered significant performance improvements across all eleven African languages (see Fig. 3), even with 100 or fewer high-quality samples per language; this demonstrates the impact of retraining LLMs on a strategic combination of high-quality, culturally relevant data supplemented by larger, less targeted datasets. The benchmarks, translations, and code required to reproduce the findings of this study are publicly available, contributing to continued advancements in LLM performance for African languages and other low-resource scenarios.
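For completeness, here is a minimal fine-tuning sketch using the Hugging Face `transformers` Trainer on a mixed instruction set. The base model name, hyperparameters, record fields, and text formatting are placeholder assumptions; the publicly released code is the authoritative reference for the actual retraining recipe.

```python
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE_MODEL = "meta-llama/Meta-Llama-3-8B"  # placeholder base model

# Toy stand-in for the output of cross_lingual_mix() from the earlier sketch.
mixed = [{"prompt": "<instruction in target language>",
          "response": "<reference answer>"}]

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

def to_text(example):
    # Concatenate prompt and response into one training string (illustrative format).
    return {"text": example["prompt"] + "\n" + example["response"]}

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

dataset = (Dataset.from_list(mixed)
           .map(to_text)
           .map(tokenize, batched=True,
                remove_columns=["prompt", "response", "text"]))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="retrained-llm",
                           per_device_train_batch_size=2,
                           num_train_epochs=3,      # placeholder hyperparameters
                           learning_rate=2e-5),
    train_dataset=dataset,
    # mlm=False gives standard causal-LM (next-token) labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```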
Copyright © Ghamut Corporation