Local AI Initiatives Challenge Global Language Model Bias

Researchers in Latin America, Africa, and Southeast Asia are building culturally aware language models to counter the English-dominated bias of mainstream AI systems.

Decoded

Published Nov 27, 2025

Large language models are systematically biased toward English and Western cultural contexts, a problem that researchers across the global south are now tackling with locally trained AI systems. Of the world's 7,000 languages, fewer than 5% are meaningfully represented online, and this data imbalance translates directly into AI models that fail to understand regional nuances and cultural contexts. When ChatGPT incorrectly answered questions about Latin American literature for Chilean researcher Álvaro Soto, it highlighted how models trained primarily on English data struggle with culturally specific information.

Three major regional initiatives are pushing back against this imbalance. Chile's CENIA is developing Latam-GPT, training it on university theses, digitized local books, and transcripts from the Colombian Congress to capture how Latin American culture and politics actually work. The team has spent two years curating data and plans to release the open-source model by January 2026, incorporating Indigenous languages such as Mapudungun, Náhuatl, and Quechua.

In Southeast Asia, AI Singapore's SEA-LION model now handles 30,000 to 50,000 requests a month after the team increased Southeast Asian language content from just 0.5% to 40% of its training data. The approach has proven both effective and efficient, achieving a 10- to 100-fold cost reduction compared with training from scratch. Meanwhile, Africa's Masakhane initiative has united 1,000 participants from 30 countries, recently completing 9,000 hours of transcribed conversations in 18 African languages through the $2.2 million, Gates Foundation-backed African Next Voices project.

Testing reveals the scope of the problem. Cultural benchmark tests at CENIA show that while mainstream models easily identify Buenos Aires, they fail to recognize common regional foods or important local figures. Berkeley researchers found that GPT models routinely default to American spelling and produce condescending responses to non-standard English dialects, reflecting what linguists call "standard language ideology."

The grassroots nature of these projects distinguishes them from corporate localization efforts. "We're not competing with the global models," says Soto. "Our goal is to build a tool from and for Latin America." Computational linguist Mpho Primus frames the stakes clearly: "The global push to develop AI is no longer just about computing power or algorithmic breakthroughs. It's about who gets to speak and who gets left out in the digital future."

However, challenges remain. Even with localized training data, these models often build on architectures developed in the United States, potentially carrying embedded biases. Researchers also warn that dominant languages will continue shaping AI systems regardless of regional data collection. And the technical challenge of catastrophic forgetting means models can lose existing capabilities, such as coding, when exposed to new cultural data, requiring careful balancing of old and new information during training.


Decoded Take

These regional AI initiatives signal a fundamental shift from passive consumption to active co-creation in the global AI landscape. The emergence of well-funded, technically sophisticated projects across Latin America, Africa, and Southeast Asia challenges the assumption that meaningful AI development requires Silicon Valley resources.

More importantly, these efforts expose how current AI systems encode not just language preferences but entire value systems and power structures. As AI increasingly mediates information access globally, from education to healthcare to governance, the question of whose knowledge gets preserved in these "repositories of cultural memory" becomes existential.

The success of models like SEA-LION, which outperform larger global models on regional tasks despite smaller scale, suggests that localized data quality beats generalized data quantity. This could force big tech companies to either genuinely partner with regional initiatives or watch their products become irrelevant in major markets. The January 2026 release of Latam-GPT will be a critical test of whether open-source, community-driven AI can provide a viable alternative to corporate models for the billions of people underserved by current systems.
