AI and Copyright
For the past couple of months, one of the things I have been working on is how to prevent Pi, Inflection’s personal AI assistant product, from reproducing copyrighted material. The most immediate reason this is important is that several of our competitors are being sued, and we would like 1) not to be sued, if possible, and 2) if we are sued, to be able to say that we are working very hard on not reproducing copyrighted material. The more fundamental reason, as argued in these lawsuits, is that creators put their own time and effort into creating copyrighted works: music, art, books, essays, articles. The New York Times, for example, is the plaintiff in a suit against OpenAI; they point out that OpenAI’s training data includes millions of articles from the Times, and claim that occasionally their large language models will “regurgitate,” or reproduce verbatim, Times material, without citation or reference. Thus, the theory goes, a would-be New York Times subscriber could instead ask GPT-4 for the top news stories of the day, receive some amount of professional journalism for free (or at least, by paying OpenAI rather than the New York Times), and have no further need for the paper of record. So, the Times is calling for “billions of dollars in statutory and actual damages,” and the destruction of any training data containing copyrighted Times works, as well as any models trained on that data (read: all of them).
It is worth separating the complaint into two parts, which are matched in more recent suits with additional plaintiffs and defendants, and considering each individually. The first part concerns whether or not the use of such material as news articles in training data for large language models, including perhaps storing reproductions, is itself a copyright infringement. When I first considered this question, as a reasonably informed non-lawyer, it struck me as similar to the Google Books case brought before the Second Circuit in 2015. In order to index as much of the literature produced by humanity as possible, Google scanned millions of books without permission from copyright owners. When said owners protested, Google argued, successfully, that its use was transformative (it provided a search function over previously unsearchable texts) and that it never produced more than a matching snippet: it helped people find relevant books but would not substitute for reading those books. Therefore, I thought, training machine learning models on copyrighted articles is almost certainly fair use; regrettably, content producers don’t have much leverage, because any given provider, even a behemoth like the Times or the Associated Press, makes up such a small portion of the overall dataset; considering that people already excerpt these pieces all over the internet, it’s probably not possible to filter them out anyway; and so on. Unfortunate, but a fundamental characteristic of the technology, which of course has many, many more significant uses than getting around a publication’s paywall. And I still think that when push comes to shove, AI companies are well-positioned to win on this front, although the growing number of licensing deals seemingly indicates that companies are getting more and more nervous about it.
The second part of the complaint is the reproduction or regurgitation of copyrighted text, which on its face seems like it certainly is a copyright violation, almost a textbook example. And so an interesting question here is how AI companies can prevent it. Broadly speaking, the interventions take place either at the model level or in postprocessing. At the model level, you might train the model to identify queries for copyrighted material and refuse to provide it. This is achievable through a variety of strategies used for safety more generally: supervised fine-tuning, reinforcement learning. In practice, you might have the model conduct conversations with people who ask it for quotations from song lyrics, books, papers, or any other form of media, and label whether the responses complied with some specific copyright policy (never quoting more than a sentence, for example, or always including a citation). The main drawback of this approach is that it can lead to refusals that make the model less helpful to users. The model cannot readily learn what text is copyrighted and what text isn’t, so if it learns not to quote documents directly, it very well might refuse to produce the lyrics to “The Star-Spangled Banner,” or Bible verses. For the most part, an LLM would not have access to the publication year of the documents included in its training data, so even the most basic heuristic, such as “Is it more than 100 years old?”, would be impossible to apply. So most developers seemingly do not do this, because of the inevitable side effects.
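For concreteness, here is a minimal sketch of what that labeling step might look like. Everything specific in it is an assumption of mine: the Response structure, the field names, and the policy thresholds are invented for illustration, and a real pipeline would label many thousands of sampled conversations rather than one hardcoded example.

```python
from dataclasses import dataclass

@dataclass
class Response:
    text: str
    longest_quote_sentences: int  # longest verbatim quoted run, in sentences
    has_citation: bool            # does the response attribute its source?

# Illustrative policy, not any company's actual one: never quote more than
# one sentence verbatim, and always cite when quoting at all.
MAX_QUOTED_SENTENCES = 1

def complies(r: Response) -> bool:
    """Label one response as policy-compliant or not; batches of these
    labels would then feed supervised fine-tuning or a reward model."""
    if r.longest_quote_sentences > MAX_QUOTED_SENTENCES:
        return False
    if r.longest_quote_sentences > 0 and not r.has_citation:
        return False
    return True

# One labeled conversation turn:
turn = Response(
    text='The song opens, "Is this the real life?" (Queen, "Bohemian Rhapsody").',
    longest_quote_sentences=1,
    has_citation=True,
)
print(complies(turn))  # True under this toy policy
```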
The other option, postprocessing, is nice because it does not impact the performance of the model, but it has its own challenges. For an LLM developer, this might entail constructing new engineering systems to overwrite the model’s original response or generation if it is found to contain a copyright violation. Finding a copyright violation, of course, is a nontrivial problem — you can train classifiers to identify violations, but then you run into the same problem of identifying copyrighted material without any signal about its legal status. Otherwise, you can index the data that is used to train the LLM — specifically the pre-training data, the vast swathes of text pulled from the internet — and then look for any too-close matches between what the model generates and that data. “Too close” is another nontrivial implementation detail: how do you make sure that you don’t flag every common phrase or idiom as copyrighted? How long does a match have to be to count? In my experience, these parameters on match length and frequency are best determined empirically.
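As a rough illustration of the matching step, here is a toy version of such a check. The in-memory dictionary index is a stand-in for what would realistically be a suffix array or similar structure over terabytes of pre-training text, and the eight-token threshold is a placeholder for the empirically tuned parameter just described.

```python
from collections import defaultdict

N = 8  # minimum token run to flag; in practice tuned empirically

def build_index(corpus, n=N):
    """Map every n-gram of tokens in the training corpus to the set of
    documents containing it."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        tokens = text.split()
        for i in range(len(tokens) - n + 1):
            index[tuple(tokens[i:i + n])].add(doc_id)
    return index

def flag_matches(generation, index, n=N):
    """Return (position, span, source docs) for each n-gram of the model's
    output that also appears verbatim in the indexed training data."""
    tokens = generation.split()
    return [
        (i, " ".join(tokens[i:i + n]), sorted(index[tuple(tokens[i:i + n])]))
        for i in range(len(tokens) - n + 1)
        if tuple(tokens[i:i + n]) in index
    ]
```

Even this toy version shows where the tuning pressure comes from: a lower N catches more borderline reproductions, but it also starts flagging idioms and boilerplate that appear in thousands of unrelated documents.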
Such a system can prevent exact reproductions, but not plagiarism. An adjacent question is what Timothy B. Lee and James Grimmelmann have termed the “Italian plumber problem.” They prompted the text-to-image model Stable Diffusion to create an image of an Italian plumber, and the results were not merely Mario-inspired, but very clearly Mario, in his red top and denim overalls, complete with a cap marked “M.” I like to do a similar exercise with our language models when testing them: I ask for a story about a boy wizard who goes to wizarding school, and count how many times the model happens to name the boy Harry. I think the Italian plumber problem will be very, very hard to fix. In effect, to do so is to instruct the model that in some cases, it should ignore its training data, and that can be extremely risky in its own right. It actually reminds me of the more recent dustup over Google Gemini’s image generation models.
In the unlikely event that anyone reading this missed that story, what happened was that people began to notice that when you asked Gemini to generate images of people, it was hard — quite hard, actually — to get it to generate images of white people. This resulted in extremely unfortunate depictions of, for example, “1944 Nazi soldiers” that included Black and Asian men and women, “founding fathers of the United States” that resembled the colorblind cast of Hamilton, and so on. Now, obviously Google did not do this because of copyright risks; they did it because for years, AI ethicists, feminist theorists, and others have noted that these systems reproduce and therefore perpetuate biases. They instructed the model that, when faced with a request for images of people, it should provide a diverse group of individuals. This successfully avoids the very common failure of generating a host of white men when asked for images of “doctors” or “computer programmers,” but it is still clearly a model failure. The model cannot distinguish between a general request, where it might safely deviate from the most prevalent images in its training data, and a specific request, where the user actually wants one specific thing from the model’s data.
For plagiarism, it’s easier from a policy standpoint, because you can also refuse the specific requests, but it’s still a deviation from what LLMs are designed to do, which is generate the most likely tokens. It’s like telling the model, “Give me a story about a boy wizard who goes to wizarding school, but not the one that you’ve seen hundreds of copies of.” But reproducing concepts from its data is how the model generates all of its responses, so any attempt to avoid certain stories or syntaxes might result in less coherent responses overall. Based on external observations only, it seems like most AI developers use postprocessing to prevent outright copyright violations, but don’t put much effort into preventing plagiarism, likely both because it is trickier to define from a legal or policy perspective and because it is technically very difficult.
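If a developer did want to refuse those specific requests, the crudest version is a lookup of the request against descriptions of well-known copyrighted characters. The sketch below is entirely an assumption on my part: the character list, cue words, and overlap threshold are invented, and a production system would more plausibly use an embedding-based classifier than keyword overlap.

```python
# Toy entries; a real list would be far larger and curated with legal input.
KNOWN_CHARACTERS = {
    "Mario": {"italian", "plumber"},
    "Harry Potter": {"boy", "wizard", "wizarding", "school"},
}

def risky_request(prompt, min_overlap=2):
    """Return the character a request seems to be circling, if any."""
    words = set(prompt.lower().split())
    for character, cues in KNOWN_CHARACTERS.items():
        if len(words & cues) >= min_overlap:
            return character
    return None

print(risky_request("a story about a boy wizard who goes to wizarding school"))
# -> "Harry Potter"; the model could then steer away from that name and plot
```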
Most of the existing lawsuits focus on direct reproduction, and the premise that underlies suits in the vein of the Times’ suit against OpenAI is that AI will only do further damage to the already decimated media industry. In my non-work life, I have served on the board of the Austin Monitor, a local nonprofit newsroom, for the past three or so years. Local media has been especially hollowed out, in my view not by social media itself — which I would argue primarily serves a different function — but by online advertising, dominated almost exclusively by Facebook and Google. Tech journalists Kevin Roose and Casey Newton recently interviewed the founder and CEO of the buzzy AI startup Perplexity, Aravind Srinivas, on an episode of their podcast Hard Fork. Perplexity positions itself as an alternative to web search. If you type in a query like “best tote bags for carrying laptops,” “did Texas win their basketball game last night,” or “ways to kill vampires,” the idea is that instead of getting a page full of hopefully relevant links for you to click through and skim, the model will retrieve those results for you and generate, based on their content, a paragraph-style response to your question. The overlap with copyright is that, to take the first example, if The Strategist recommends a tote bag based on testing and research, Perplexity might regurgitate the recommendation and the journalistic reasoning behind it verbatim, robbing The Strategist of the clicks it would otherwise have received had the article not been copied.
In his interview, Srinivas argued that Perplexity would not actually take readers away from journalism, because the answers produced by Perplexity included citations and links. He again compared it to Google Search and suggested that it would help readers find articles relevant to the questions they wanted answered. Roose and Newton, fairly, were skeptical. Newton put it succinctly when he asked Srinivas whether it wasn’t an explicit goal of Perplexity to answer the user’s questions without them needing to rely on outside sources. In other words, if the model does its job, the user shouldn’t need to click the links — the key points and takeaways that some poor journalist spent time on (or an SEO specialist or marketer, or another AI, but you get the idea) are served up for free. Maybe some percentage of people have their curiosity piqued and decide to read more, but it certainly doesn’t seem that it would result in more readers, or even as many. In a world increasingly dominated by decontextualized snippets and soundbites, it’s another shortcut, but this one cuts out the content provider entirely. And that is an issue practically begging for legal intervention.
In that context, the licensing deals struck by OpenAI make a lot of sense. You can virtually guarantee high-quality data from these publishers (especially compared with everything else on the open web), and models could prioritize results from those sources, leading to a potentially symbiotic relationship wherein the model does summarize or even quote the text, cites and links to it, and the model developer pays the publisher for its use.