Improving LLM Performance in African Languages

Summary

The Problem: Large Language Models (LLMs) perform well in data-rich environments but struggle where training data is scarce. The issue is most acute in non-English languages, and especially in native African languages, where little digitized text exists. This linguistic disparity limits the practical impact of LLMs among high-growth, underserved populations. This study demonstrates how we improved the performance of LLMs in African languages by developing new datasets and approaches for retraining AI systems. While our focus was on African languages, the methods and insights gained have broader implications for enhancing LLM performance in other languages and domains with limited data.

Our Approach: We developed a two-step process to bridge the performance gap between English and African languages.

  1. Measuring the Performance Gap: We translated gold-standard reasoning and knowledge benchmarks into eleven African languages that are widely spoken but have limited AI training resources. We used the translated benchmarks to measure how well frontier LLMs perform in these languages and to identify targeted strategies for improving them.

  2. Closing the Performance Gap: We developed a method to retrain LLMs that improves their accuracy in limited-data contexts generally, and for African languages specifically. Our approach combines small, high-quality datasets from a target African language with larger, lower-quality datasets from related languages, enabling LLMs to extract broad knowledge from the larger data while developing domain-specific reasoning from the higher-quality sources.

The Impact: Our work addresses the lack of AI tools for African languages by creating over 1 million human-translated words in eleven underserved languages spoken by more than 230 million people. By incorporating high-quality, culturally relevant data, we improved the performance of best-in-class LLMs. While our focus was on African languages, our approach offers a scalable solution for improving AI performance in any domain with limited data. The translated benchmarks, datasets, and retraining tools are publicly available, providing resources to help others build more inclusive and effective AI across diverse languages and low-data environments.




The Problem

AI models perform well in English but struggle significantly in African languages, limiting their usefulness for millions of disadvantaged people who stand to benefit from AI the most.

Large Language Models (LLMs) are approaching human-level performance on many tasks, with each new generation narrowing the gap. However, LLM capabilities drop significantly in non-English languages, and especially in African languages, where limited data and benchmarks make it difficult to assess (and later improve) performance. The gap between LLM performance in English and in African languages reflects a broader issue of inequality. Despite being spoken by millions, African languages lack the data and tools needed to train AI effectively. This leaves LLMs least useful for the communities that stand to benefit the most: 66% of the world’s poor live in Africa, and 80% of them do not speak English, so they could not use an English LLM even if they wanted to. As AI continues to improve for English speakers, vulnerable non-English speakers are left further behind, reinforcing existing inequalities and limiting access to technology-driven opportunities.

Our Approach

We started by creating tools to measure the gap between LLM performance in English and in African languages across a variety of tasks; we then used what we learned to retrain existing LLMs and close the performance gap.


1. A new way to assess LLMs in African languages: To address the underperformance of LLMs in African languages, we first identified the types of content and questions where the performance gap (compared to English) was greatest. Specifically, we generated the first-ever translation of widely used AI benchmarks designed to assess both the general reasoning (WinoGrande) and expert knowledge (MMLU sections on clinical knowledge, college medicine, and virology) of AI systems in eleven widely spoken African languages: Afrikaans, Xhosa, Zulu, Amharic, Bambara, Igbo, Sepedi, Shona, Sesotho, Setswana, and Tsonga (see Fig. 1). The translation process involved multiple layers of validation by native speakers, professional translators, and domain experts to ensure accuracy, cultural relevance, and the preservation of technical meaning.

Fig. 1: Countries where our 11 selected African languages are spoken, with the primary languages and estimated number of speakers in each highlighted country.
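
To make the benchmark-translation workflow concrete, the sketch below shows one way translated items could be checked mechanically alongside the human review layers. The JSONL layout, the field names (question, choices, answer), and the file names are illustrative assumptions, not the exact format used in this study.

    import json

    def load_items(path):
        """Load benchmark items stored one JSON object per line (assumed layout)."""
        with open(path, encoding="utf-8") as f:
            return [json.loads(line) for line in f]

    def check_translation(english_items, translated_items):
        """Flag structural mismatches between the English source and a translation.

        Native speakers, professional translators, and domain experts still review
        the content; this only catches mechanical errors such as missing items,
        changed choice counts, or altered answer keys.
        """
        problems = []
        if len(english_items) != len(translated_items):
            problems.append(f"item count differs: {len(english_items)} vs {len(translated_items)}")
        for i, (en, tr) in enumerate(zip(english_items, translated_items)):
            if len(en["choices"]) != len(tr["choices"]):
                problems.append(f"item {i}: number of answer choices changed")
            if en["answer"] != tr["answer"]:
                problems.append(f"item {i}: answer key changed")
            if not tr["question"].strip():
                problems.append(f"item {i}: empty question")
        return problems

    # Example: compare the English MMLU clinical-knowledge items with a Zulu translation.
    issues = check_translation(
        load_items("mmlu_clinical_knowledge_en.jsonl"),
        load_items("mmlu_clinical_knowledge_zu.jsonl"),
    )
    print("\n".join(issues) or "no structural issues found")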


2. Measuring the LLM performance gap between English and African languages: We used our translated benchmarks to compare how well the world’s best LLMs answer the same question when only the language it is asked in changes (e.g., from English to Zulu). The LLMs tested included private models (GPT-3.5, GPT-4, GPT-4o), open-source models (Llama 3 70B and 8B), and models touted for their multilingual prowess (Aya 23, Aya 101, BLOOMZ 7b1); a selection of LLMs from each category is shown in Fig. 2. Comparing results across multiple models, languages, and tasks revealed a sizable LLM performance gap in African languages. Even the best-in-class LLM (GPT-4o) showed a sizable gap relative to English.

Fig. 2: Performance (%) of gold-standard LLMs in African languages (and English) on our translated benchmarks, reported per benchmark. Languages are abbreviated as follows: English (en), Afrikaans (af), Zulu (zu), Xhosa (xh), Amharic (am), Bambara (bm), Igbo (ig), Sepedi (nso), Shona (sn), Sesotho (st), Setswana (tn), Tsonga (ts).
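
As a rough illustration of how the comparison in Fig. 2 can be scored, the sketch below computes per-language accuracy on a set of translated multiple-choice items. The query_model function is a placeholder for whichever API or locally hosted model is being evaluated, and the item fields mirror the assumed layout above rather than the exact evaluation harness used in this study.

    from collections import defaultdict

    def query_model(prompt: str) -> str:
        """Placeholder: send a multiple-choice prompt to the LLM under test
        (via an API client or a local model) and return its raw text answer."""
        raise NotImplementedError

    def first_letter(answer_text: str) -> str:
        """Crude normalization: take the first A-D letter in the model's reply."""
        for ch in answer_text.upper():
            if ch in "ABCD":
                return ch
        return ""

    def accuracy_by_language(benchmark):
        """benchmark maps a language code (e.g. 'en', 'zu') to a list of items,
        each with 'question', 'choices', and 'answer' (a letter). Because every
        language contains the same questions, any accuracy difference is
        attributable to the language alone."""
        scores = defaultdict(float)
        for lang, items in benchmark.items():
            correct = 0
            for item in items:
                options = "\n".join(f"{letter}. {choice}" for letter, choice in zip("ABCD", item["choices"]))
                prompt = f"{item['question']}\n{options}\nAnswer with a single letter."
                if first_letter(query_model(prompt)) == item["answer"]:
                    correct += 1
            scores[lang] = correct / len(items)
        return dict(scores)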


3. Closing the Gap: After analyzing how existing LLMs performed across a variety of tasks and languages, we ran training experiments spanning over 400 new LLM models to methodically identify how LLMs can be improved when data is limited. Our experiments revealed a novel way to train LLMs in data-scarce environments generally, and for African languages specifically: combine a small, high-quality, culturally relevant dataset from a target African language with a much larger dataset (even if it is lower quality) from its linguistic relatives (i.e., “cross-lingual instruction”).
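
The sketch below illustrates the cross-lingual data-mixing idea in its simplest form. The oversampling ratio, the dataset names, and the choice of Xhosa as a linguistic relative of Zulu are illustrative assumptions; the resulting mixture would then feed any standard instruction-tuning pipeline.

    import random

    def mix_cross_lingual(high_quality, related, target_ratio=0.25, seed=0):
        """Combine a small, high-quality target-language instruction set with a
        much larger, lower-quality set drawn from related languages.

        The small set is oversampled (with replacement) so it makes up roughly
        target_ratio of the mixture; the ratio here is an assumption for
        illustration, not the value used in the study.
        """
        rng = random.Random(seed)
        n_target = int(target_ratio * len(related) / (1 - target_ratio))
        oversampled = [rng.choice(high_quality) for _ in range(max(n_target, len(high_quality)))]
        mixture = oversampled + list(related)
        rng.shuffle(mixture)
        return mixture

    # Hypothetical usage: ~66 curated Zulu instruction pairs mixed with a much
    # larger machine-translated set from a related language such as Xhosa.
    # training_data = mix_cross_lingual(zulu_curated, xhosa_related)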

The Impact

By emphasizing data quality, our LLM retraining process improved performance across all 11 African languages.


Our LLM retraining process led to significant performance improvements (see Fig. 3) with 100 or fewer high-quality samples; this demonstrates the impact of retraining LLMs using a strategic combination of high-quality, culturally relevant data supplemented by larger, less relevant data assets. The benchmarks, translations, and code required to reproduce the findings of this study are publicly available, contributing to continuous advancements in LLM performance for African languages and other low-resource scenarios.


Fig. 3: Performance (%) of LLMs assessed on African language benchmarks when retrained with varying amounts of high- and low-quality data, shown per language. 25% and 100% represent 17 and 66 samples used for LLM retraining, respectively.

Copyright © Ghamut Corporation