March 19, 2024

English and LLMs

The British Empire was once a global hegemon, and during its rise to power its influence touched every continent. This culminated in the Pax Britannica, a period when British power was unquestioned. During this time, it was said to be “the empire on which the sun never sets”, since it was always daylight in at least some British territory.

The sun, of course, did eventually set (figuratively) on the British Empire. World War II and its aftermath accelerated its decline, with many remaining colonies gaining independence. However, the legacy of this empire remains today in the form of the culture and customs left in its former colonies. One of the most important of these is the English language.

The influence of English on the Internet (and the wider tech industry) is difficult to overstate. Because the Internet originated in various organizations in the United States, it was all but inevitable that English became its lingua franca. Indeed, almost all programming languages and protocol specifications are defined in English terms, so the emergent tech industry (and Silicon Valley especially) became very English-centric. This made the early Internet a (mostly) English-only phenomenon, and the effect persists today, with English remaining the most popular language for web content.

This has carried over to the era of LLMs. Because most of the content on the web is in English, and the popular pre-training datasets were assembled (scraped) from the web, those datasets tend to be majority English. For example, Common Crawl (and its derivatives like C4) and The Pile¹ are two of the largest and most commonly used datasets, and both are English-centric. A brief survey of other open-source datasets also shows English to be the most common language.

While details of the training process and training data used for SOTA LLMs from the frontier AI labs (e.g. OpenAI, Anthropic, Google DeepMind, etc.) are increasingly opaque, there are important mentions of this “English bias”, mainly in terms of its impact on non-English performance. Some selected highlights:

Footnote 27 of the GPT-4 Technical Report, on harm mitigation during the alignment phase:

Mitigations and measurements were mostly designed, built, and tested primarily in English and with a US-centric point of view. The majority of pretraining data and our alignment data is in English. While there is some evidence that safety mitigations can generalize to other languages, they have not been robustly tested for multilingual performance. This means that these mitigations are likely to produce errors, such as mistakenly classifying text as hateful when it may not be in other cultural or linguistic settings.

The Claude 2 Model Card mentions that only ~10% of the training data was non-English:

Training Data
Claude models are trained on a proprietary mix of publicly available information from the Internet, datasets that we license from third party businesses, and data that our users affirmatively share or that crowd workers provide. Some of the human feedback data used to finetune Claude was made public [12] alongside our RLHF [2] and red-teaming [4] research.
Claude 2’s training data cuts off in early 2023, and roughly 10 percent of the data included was non-English.

Claude 3’s Model Card makes no mention of the language distribution in its training data, but it has an entire section devoted to the model’s multilingual capabilities, indicating that substantial effort was made to combat this “English bias”. Its multilingual MMLU performance (section: Multilingual MMLU) appears to be close to that of GPT-4.² In total, the term multilingual appears 21 times in the Claude 3 Model Card (excluding references), and the abstract mentions “improved fluency in non-English languages”, indicating that this is an area of concern for Anthropic. But they still hedge with this statement under Areas for Improvement:

Claude models possess multilingual reasoning capabilities, but their performance is less robust when it comes to low-resource languages.

Whatever the multilingual capabilities of these LLMs, how we interact with them is still mostly in English. Most prompt engineering guides are written for English, including the “official” ones from Anthropic and OpenAI, meaning that proficiency in English is still important for getting the most out of LLMs.

Will LLMs influence natural language and the way we communicate?

While this discrepancy in multilingual capabilities (and the effort going into correcting it³) is an interesting topic, it’s also interesting to think about what impact LLMs will have on English and the way we communicate as they become more prevalent. (Despite what you may feel, it does not appear that LLMs have gained general traction in society yet.)

We typically think of influence as a one-way street when it comes to ML models. That is, the training data directly influences how the model works. But when that model is widely deployed, it can in turn affect the behavior of the people (or systems) that interact with it, creating a feedback loop. A typical example is the recommender systems that shape most of what you see on the modern web and can thus affect people’s future behavior. So, it’s interesting to think about how LLMs, while being artifacts built from natural language, may end up influencing that natural language. I can think of at least two ways:

Inputs to an LLM: Prompt engineering guides have more or less coalesced around a certain set of principles. For example, the OpenAI and Anthropic prompt engineering guides are remarkably similar (be clear in your instructions, etc.), and as more people adopt these strategies, it’s likely that other LLMs will be aligned so that they function well under the same prompting, to ensure “interoperability” with these models. In some sense, the advice given by these guides can be seen as a distinct style of English: a sort of instructive prose that has a certain set of rules but lacks the strictness of a formal language.⁴ In order to get the most out of LLMs currently, you need to know English, and you need to know the style of written communication that works well for the LLM.
Outputs from an LLM: LLMs tend to have a default “style” if you don’t specify a role or persona. This style tends to reflect how the LLMs were aligned to be helpful, clear, and concise (though it may be part of the LLM’s system prompt as well). This can result in certain idiosyncrasies in the model’s output, such as ChatGPT’s use of the phrase "Certainly, here is …" As these outputs become more common, this style may influence how people write, especially those who are new to English.

At this point, this is all just speculation, and while the above claims may seem far-fetched, I think they’re plausible. Just as English influenced the Internet and the web, usage of the Internet and the web ended up influencing English and communication as well. The most obvious example is Internet slang, mostly acronyms or various short forms, but things like Internet memes can also be seen as a change in the way we communicate with one another.

This, of course, depends on how we end up interacting with LLMs, or how they are deployed. The typical “chatbot” or “assistant” experience seems obvious and here to stay, but whether it will become widespread remains to be seen. Additionally, it’s likely that the details behind “prompt engineering” will be abstracted away for the most common use cases (e.g. how ChatGPT generates the detailed prompt for DALL·E 3 so that you don’t have to write it yourself).

Other use cases for LLMs may not even expose the prompt to the end user. There, the input to the model is essentially hidden, an implementation detail no different from the wire format of a network protocol. The ability of the LLM to influence natural language would then be limited to just this narrow technical domain.
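
As a minimal sketch of what such a hidden prompt might look like, the snippet below wraps user-supplied text in a fixed template before it would be sent to a model. Everything here is hypothetical: the function name `build_summary_prompt`, the wording of the template, and the `<review>` tag are made up for illustration, and no real LLM API is called.

```python
from string import Template

# Hypothetical prompt template (illustrative wording only). The end user
# never sees this text; they only supply the review itself.
_SUMMARY_PROMPT = Template(
    "Summarize the customer review inside the <review> tags in one sentence.\n"
    "<review>\n$review\n</review>"
)


def build_summary_prompt(review: str) -> str:
    """Return the full prompt an application might send to a model."""
    # Strip surrounding whitespace so the delimiters sit directly
    # around the user's text.
    return _SUMMARY_PROMPT.substitute(review=review.strip())


prompt = build_summary_prompt("  Great battery life, but the screen scratches easily.  ")
print(prompt)
```

The person using such an application only ever types the review; the instructive prose around it is fixed by the developer, much like a message header that the user never sees.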

¹ The Pile was assembled by EleutherAI. One of its components (albeit the smallest) is the Enron Emails Corpus, some 600K emails taken from Enron during the discovery process when the company was under investigation. (Enron infamously collapsed in 2001 after perpetrating numerous frauds.) These emails were released into the public domain by the US Federal Government.
² Though the GPT-4 Technical Report also has a benchmark on MMLU across different languages, I don’t believe the results are directly comparable, for at least two reasons. First, the GPT-4 benchmark uses 3-shot, and Claude 3 appears to use 5-shot. Second, the questions in MMLU are in English, so they have to be translated into different languages for a multilingual benchmark. The GPT-4 Technical Report says they used Azure Translate (Appendix F) to avoid using GPT-4 itself for the translations. In contrast, the Claude 3 Model Card references this paper for its multilingual MMLU benchmark. That paper mentions (in Section 2.4 Evaluation Data Creation) that MMLU was translated into different languages using ChatGPT!
³ For more about the “English bias” in LLMs and its potential impacts, see this Wired article.
⁴ You can, of course, interact with an LLM using a formal language like a programming language, but you’re not required to. Additionally, many LLMs support using things like pseudo-XML to structure and delimit your input.