Distillations/Constellations #4: language justice and LLMs
At the weekend, I made my first appearance on German public radio – I was part of a panel discussion on Deutschlandfunk Kultur, talking about challenges that arise online, from the dominance of Big Tech to how myopic and Western-centric many digital rights debates are.
There are few things I find more humbling than trying to explain a complex topic in a language where – though I feel pretty comfortable – I sometimes lack the precise words I need to say what I want. Sometimes I'm left reaching for a broad catch-all word instead of the specific one that would convey my meaning exactly, speaking all the while with a British accent, and – often depending on how people react to me – I'm reminded how much value we assign to being 'articulate', what that means, and how unfair that equation can be. As novelist Angie Kim writes, "we equate verbal fluency with intelligence" – a deep-seated prejudice, she accurately identifies, that so many of us carry.
It's just one example of how injustice and discrimination can be perpetuated through language and communication. In 2022, I conducted a quick piece of research for Numun Fund, looking into what meaningful language justice would look like in the context of their work. That piece of work didn't end up finding a public home (yet? I'm open to ideas for where to publish!) – but I wanted to share some thoughts from it.
First: what do we mean by language justice? Here's what I wrote back then.
For many, the words ‘language justice’ conjure up images of simultaneous interpretation happening at events, or materials translated into multiple languages. But in practice, it is much, much more than this – it is an ongoing practice of acknowledging the social and political impacts of language choices and formats, and working to make these as just as possible. English supremacy, as perpetuated through colonialism, has had layered impacts throughout societies in different ways, from affecting how people value communications to how we even describe the problems that arise from this issue.
Language justice in the context of technology is particularly interesting in this moment, given the flurry of attention around Large Language Models (LLMs), the best known of which is ChatGPT. But from what I could find, there's relatively little out there looking into LLMs in languages other than English. The main exception is this report written by Gabriel Nicholas and Aliya Bhatia for the Center for Democracy and Technology, called Lost in Translation: Large Language Models in Non-English Content Analysis.
One expected finding from that report: "Languages with the worst quality web data are disproportionately those written in non-Latin scripts (e.g. Urdu, Japanese, Arabic) and those spoken in the Global South (e.g. African languages, minority languages in the Middle East, non-Mandarin Chinese languages)." This gives us yet another way that technological approaches will exacerbate existing inequalities: LLMs need vast amounts of text to be trained on to be 'helpful', so where that text doesn't exist, a model effectively can't be trained or used.
But there are, it seems, some ways of mitigating this issue, one of which is multilingual language models. As Nicholas and Bhatia write: "Instead of being trained on text from only one language, multilingual language models are trained on text from dozens or hundreds of languages at once."
And thanks to the various ways that languages are related – "shared vocabulary, genetic relatedness or contact relatedness" – it's hoped that languages without much text data (called "low-resource languages") can benefit from the wealth of information out there in "high-resource languages", as Doddapaneni and Ramesh explain in this Primer on Pretrained Multilingual Language Models.
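To make the "shared vocabulary" idea a little more concrete, here's a toy sketch. This is nothing like how real multilingual models work – they learn shared subword vocabularies over billions of tokens – and the word lists below are purely illustrative, not real corpora. It just shows the underlying intuition: the more surface vocabulary two languages share (cognates, loanwords), the more lexical knowledge from the high-resource one could plausibly transfer.

```python
# Toy sketch of the "shared vocabulary" intuition behind cross-lingual
# transfer. Real multilingual models learn shared *subword* vocabularies
# over huge corpora; here we just measure surface word overlap between
# tiny, illustrative word lists (not real data).

def vocab_overlap(high_resource_vocab, low_resource_vocab):
    """Fraction of the low-resource vocabulary also seen in the
    high-resource vocabulary -- a crude proxy for how much lexical
    knowledge could transfer between the two."""
    high = {w.lower() for w in high_resource_vocab}
    low = {w.lower() for w in low_resource_vocab}
    return len(high & low) / len(low) if low else 0.0

# Illustrative word lists: a language with many shared loanwords/cognates
# versus one with almost none.
english = ["internet", "computer", "music", "taxi", "house", "water"]
related = ["internet", "computer", "musik", "taxi", "haus", "wasser"]
distant = ["mtandao", "kompyuta", "muziki", "teksi", "nyumba", "maji"]

print(f"related-language overlap: {vocab_overlap(english, related):.2f}")
print(f"distant-language overlap: {vocab_overlap(english, distant):.2f}")
```

On these toy lists the "related" language shares half its words with English while the "distant" one shares none – which is exactly the asymmetry in how much a multilingual model's knowledge can be expected to carry over.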
But of course, it's not that simple. Multilingual language models are already being used by Big Tech for content analysis and moderation on social media, but as we've seen over and over again, context is everything – LLMs work well in contexts they've encountered before, and badly in ones they haven't. Language changes fast – new insults, new forms of hate speech, new slang – and models will struggle to detect all of these. And what about language-specific forms of hate speech? Here in Germany, for example, 88 is a white supremacist numerical code for 'Heil Hitler' – H being the 8th letter of the alphabet. If a symbol like this were only present in a low-resource language, there's little chance that a multilingual language model would come across it often enough to detect it as a form of hate speech online.
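That detection gap can be sketched in a few lines. The "filter" below is a deliberately crude hypothetical – real moderation systems are far more sophisticated – but the failure mode is the same: a pattern that never appears in the training data, like a language-specific code, simply cannot be flagged.

```python
# Toy sketch of a detection gap: a keyword "filter" built only from
# patterns seen in its training data. Real moderation systems are far
# more sophisticated, but the failure mode is the same -- a language- or
# community-specific code absent from the training data goes undetected.

def build_filter(flagged_examples):
    """Collect every token seen in known-flagged training examples."""
    return {tok.lower() for text in flagged_examples for tok in text.split()}

def is_flagged(text, patterns):
    """Flag a post if any of its tokens was seen in flagged training data."""
    return any(tok.lower() in patterns for tok in text.split())

# Hypothetical training data from a well-resourced context; the German
# white-supremacist code "88" never appears in it.
training = ["some known slur", "another known slur"]
patterns = build_filter(training)

print(is_flagged("post containing a known slur", patterns))  # True
print(is_flagged("greetings 88", patterns))                  # False: unseen code
```

The second post sails through, not because the system is careless, but because nothing in its training data ever taught it that "88" means anything at all.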
And then we have all the other issues of automated content analysis, which many others have written about before, such as the bias and discrimination encoded into the text that LLMs learn from, or the environmental and financial harms outlined in the now-seminal paper On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? by Bender, Gebru et al.
But back to the language justice issue behind all of this. Nicholas and Bhatia clearly outline many of the problems and concerns with multilingual language models. There's a lot that's still not known about how these models perform, yet they're already being deployed in high-stakes situations. We've seen this go wrong before – for years, hate speech and violence against the Rohingya was actively amplified on Facebook (flagged over and over again by civil society in Myanmar), which Amnesty International says "substantially contributed to the atrocities perpetrated by the Myanmar military against the Rohingya people in 2017" – and yet, in 2018 it was reported that Facebook still had zero employees in the country of 50 million people.
Here's what I'd like to see more of: more funding going to groups like Rising Voices, who are doing incredible work supporting and bringing together language activists who speak those 'low-resource' languages we talked about earlier. Research collaborations led by institutions based in the Larger World, bringing together people who speak medium- to low-resource languages to figure out what an intersectional language justice approach looks like in the context of today's LLM-focused world. More linguists and translators involved in these discussions, and a broader acknowledgement of the art of translation, and of how automated translation cannot replace it.
Here's some more interesting takes and articles on the topic:
- A couple of years ago, I read and loved this book: The Fall of Language in the Age of English, by Minae Mizumura, translated by Mari Yoshihara and Juliet Winters Carpenter. It's a thoughtful analysis of the rise of English as the world's 'universal language', and its consequences for other major languages, including Japanese. It's a good reminder that some things simply cannot be translated.
- Teaching New Worlds/New Words, by bell hooks. She writes of the ways in which English is the “oppressor’s language”, describing how “the very sound of English had to terrify…How to describe what it must have been like for Africans whose deepest bonds were historically forged in the place of shared speech to be transported abruptly to a world where the very sound of one’s mother tongue had no meaning.”
- Colonialingualism: colonial legacies, imperial mindsets, and inequitable practices in English language education by Scottish Gaelic language activist Paul Meighan-Chiblow. He writes "Colonial languages carry colonial legacies and can perpetuate an imperialistic and neoliberal worldview. Languages can be disembodied from place and commodified as mere “resources”, important only for economic “value” rather than cultural importance, in a “modern” global, neoliberal empire.”
- The AI project pushing local languages to replace French in Mali's schools, in Rest of World.