The essay in which I set out the Eleven Freedoms of free AI describes the freedoms that should exist, from a moral perspective. Those are not necessarily the same as the freedoms actually protected by present-day law in any specific jurisdiction. In particular, the Seventh Freedom - to run the program on any data you have, and even to train models on copyrighted material without permission - might seem to go against present-day copyright law. Let's consider whether it really does.
In the USA, the doctrine of "fair use" is fundamental to copyright law; and in places inheriting from the English legal tradition, such as Canada, there is a similar doctrine called "fair dealing." There are differences between the two, but both concepts, as embodied in current legislation and court decisions, say that it is legal to copy copyrighted works, in a way that would otherwise be copyright infringement, when the copying is done for purposes of research or study.
The exact boundaries of how far fair use or fair dealing go (from here on, I will just say "fair use" to cover both) vary with jurisdiction, but learning from or "studying" copyrighted material is always at the centre of fair use, not at the margins where there may be disagreement between jurisdictions. Everywhere with copyright laws even roughly in line with the consensus of Western civilization, you are legally allowed to copy something in order to learn from it. That applies even if you wouldn't otherwise be allowed to copy the work - and the power of the copyright holder to forbid you from such copying is narrowly limited or non-existent.
Fair use only becomes relevant at all when works are copied. You are free to read a book and learn from it, and that right is not protected as fair use because it doesn't need to be. The bundle of legislatively-created privileges called "copyright" simply does not include a right of the copyright holder to forbid you from reading a book or benefiting from the information in the book. The copyright holder has no claim on reading and learning.
But if you write a critical review of a book, you might want to include a quotation from the book. If you want to give a lecture about the book, you might copy a page onto a slide to show it to the audience. If you want to learn about the author's writing style, you might type or scan the book's text into a computer so you can run software that counts the uses of different words. Each of those cases involves something that might be claimed to be "copying" the words of the book, and the copyright bundle includes a privilege of the holder to restrict copying. So the fair use doctrine operates to say that you are still allowed to do these things - the copyright holder's ability to limit copying is itself limited. You do not need permission for fair use, and the copyright holder is never allowed to opt out of it. The fact that fair use is allowed and protected is part of the balance of protections necessary for copyright law to be useful to society.
Let's consider that balance. Copyright, unlike true human rights, has not always existed. It is a creature of legislation. And when created by legislation, copyright has always existed for some purpose. The Copyright Clause of the US Constitution, giving authority for Congress to make copyright and patent laws, grants that power specifically "To promote the Progress of Science and useful Arts[.]" The 1710 law commonly called the Statute of Anne, creating the first author-centred copyright law in the English tradition, has a lengthy full title that begins with "An Act for the Encouragement of Learning[.]" Copyright is protected to serve such purposes and should only be protected to the extent it actually does serve them. And legislation recognizes that fact by limiting copyright protection, with the fair use doctrine and in other ways, when overextended copyright would not promote science and useful arts, or would not encourage learning. So when considering whether training of generative models should be called a protected fair use, we need to consider its effect on the purposes of copyright.
Figuring out how training a generative model relates to the purposes of copyright raises interesting questions because, although systems that can be called "generative models" have existed for centuries, the ones that have become salient in the current decade seem to work so much better in certain ways that they feel like really new things with really new consequences. Maybe there are new questions here. For copyright law and model training in particular, I think the most important questions are these: is training a generative model copying of the training data? And if it is copying to some extent, is that copying fair use?
I think the answers respectively are "no," and (given the qualification) "yes." Both admit of grey areas and exceptions, but the answers are clear enough in the most important cases that these categorical single-word answers are usable generally; and it is not appropriate to permanently fix their technical boundaries in legislation or policy.
Training as copying
Defining "copying" can be surprisingly difficult. If I read a book that is written in English and then I write a book in English myself, I will probably use the same alphabet that was used in the book I read. Did I copy the alphabet? Well, not in any way that would be considered to infringe copyright. I can't be sued for infringement by all the previous users of the English/Latin alphabet just because I used the same alphabet they did, not even if I learned it from them.
But maybe I also used whole words that others have used before. In fact every word in this article has appeared before in others' writings. Many multi-word phrases in this article have appeared before in others' work. Did I copy them? In a sense, yes. In order for my language to be useful at all, I need to use words and phrases that my readers will understand because of having encountered them before. And I learned my vocabulary by reading books in English and listening to English speakers. But writing a book in English is not usually seen as infringing the copyrights of all earlier books in English solely because of the shared language. You just can't copyright single words. A new work in English is likely to consist entirely of words and phrases that have appeared before and that's okay.
Then suppose I read ten books and create a new book in which chapter 1 is word for word identical to a chapter from the first book, chapter 2 is word for word identical to a chapter from the second, and so on. Should that be allowed? Such a thing would usually be seen as "copying" the earlier works, and would put me in line to be sued by their authors. There is a spectrum here: I can have letters, words, and short phrases in my work that others have used before, and it will still be a new work of my own, but when "my" work consists of entire chapters reproduced verbatim it will be seen as a "copy" and not really my own original work after all. Somewhere between these points is the boundary of what constitutes a "copy" for the purposes of copyright.
A similar spectrum with a boundary exists in visual art. If I make a drawing with an HB pencil, that by itself isn't "copying" in an infringing way from the millions of other artists who have done so before, even if all HB pencil lines look more or less the same. But if I do it by tracing over an existing drawing made by someone else, it'll probably be called a "copy." If you draw a portrait of someone and then after looking at your art I also draw a portrait of the same person, that's less clear. Was I looking more at your art, or at the human model? Were there creative decisions that I made the same way in imitation of you? It's even less clear if I make a drawing of subject matter that you have never drawn, but I'm imitating your style in a visually recognizable way.
These questions are nebulous. They cannot be answered with a single consistent rule. There are extremes where it is obvious, from the artifact itself, that something is or is not a copy of something else, but the boundary is fuzzy, and copyright law in practice rightly does not specify exactly how much similarity counts as copying beyond saying that it depends. Despite a certain amount of folk law to the contrary, there is no set number of consecutive words that are definitely allowed to be identical without copyright infringement, and no set number of consecutive words that is definitely an infringement of copyright. There is no fixed length of a musical fragment, measured in seconds or notes or beats, that will or will not be copyright infringement when used in a new song. In a real copyright case the court will look at other factors to determine whether infringement occurred, if the amount "copied" does not settle it beyond all question.
One of those factors is the idea of a derivative work, and that is interesting for generative machine learning models: we can inquire into where I got the material alleged to have been copied. Whether it's copying is not determined by how much material is identical, but by whether an act called "copying" did or did not occur.
If I had never read Jabberwocky and I started a descriptive sentence with "'Twas brillig," then I might say in court, credibly or not, that I'd come up with those words just because I was factually describing a particular Summer day when, here look at these weather records, it gosh-darn was brillig that day. Anybody might independently choose those two words to describe the factual situation. On the other hand, if the Lewis Carroll estate could establish that before writing those words I had read Jabberwocky, and moreover the word "brillig" didn't exist in English until Carroll used it, then it would look like I probably "copied" those words from Jabberwocky and I might be liable (if copyright still applied to the poem) for infringing copyright.
The issue in that case isn't how many words are identical between my work and Carroll's, but how the words came to be identical. Copyright infringement is associated with the act of "copying," not solely with the final artifact. Identical words might or might not be infringing depending on the relation of causality. This shift in emphasis is discussed in my 2004 article on Colour, social beings, and undecidability, which is the follow-up to What Colour are your Bits?. The law cares a lot about causality.
Applying a causality analysis to machine learning training looks bad for the machine at first glance, because if we train a model on terabytes to petabytes of Web pages, and the model cannot be trained without doing that, and then the model turns around and produces output that is identical to what's in the training, then we certainly know where that output came from. Or do we?
It's a common misconception that current language models operate solely by reproducing verbatim quotes from the training data. This misconception underlies proposals to have a model "cite the sources" it is quoting, tacitly assuming that all of the output does come from quoting "sources." The misconception that models work by quoting underlies the idea of copyright holders being able to claim both the model and the model's output as derivative works from the training data. You couldn't have built that model without using my work, so I own the model too; pay up!
Simple information theory makes clear that that cannot really be exactly how the models work, because the training data is so much bigger than the model as to make such operation impossible. Models may be gigabytes, but training data is terabytes or petabytes: thousands or millions of times the size of the models. There is not enough space inside the model to store all the training data. Rather, what the model stores amounts to factual information about the training data: descriptions of general trends.
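To make the size argument concrete, here is a back-of-the-envelope calculation in Python. The figures are illustrative round numbers of my own choosing, not measurements of any real model:

```python
# Hypothetical round figures: a 10 GiB model trained on 10 TiB of text.
model_bytes = 10 * 2**30
training_bytes = 10 * 2**40

# The training data outweighs the model a thousandfold.
print(f"{training_bytes / model_bytes:.0f}x")  # 1024x

# Even if every byte of the model were devoted to storing training text
# verbatim, there would be well under one bit of model capacity per byte
# of training data - nowhere near enough for wholesale memorization.
print(f"{8 * model_bytes / training_bytes:.3f} bits per training byte")  # 0.008
```

With a petabyte-scale corpus the budget shrinks a thousandfold again. Whatever the model retains, it cannot be a verbatim archive; it has to be a drastically compressed summary of trends.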
The model weights encode information like:
- "The words 'and' and 'the' are very common;
- 'twas' is not common but when it occurs, it's often at the start of a sentence;
- 'brillig' is seldom used at all, but might describe the weather; and
- when we're talking about 'toves' they are likely to be 'slithy.'"
Even these paraphrases are more elaborate and closer to human language than the opaque matrix of weights that the model actually records.
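As an illustration of the difference between storing statistics and storing text, here is a toy sketch in Python. It is a deliberately crude bigram counter, nothing like a real neural network, but it shows how a trained artifact can consist of aggregate trends rather than retained text:

```python
from collections import Counter, defaultdict

def train(corpus_sentences):
    # All this toy "model" retains from its training text is aggregate
    # counts of which word follows which - not the text itself.
    follows = defaultdict(Counter)
    for sentence in corpus_sentences:
        words = sentence.lower().split()
        for a, b in zip(words, words[1:]):
            follows[a][b] += 1
    return follows

# A made-up miniature corpus for the example.
model = train([
    "'twas brillig and the slithy toves",
    "'twas a dark and stormy night",
    "the night was dark",
])

# The stored "weights" record facts like: after "and," the word "the"
# occurred once and "stormy" occurred once. No one source sentence is
# recoverable as such from the counts alone.
print(model["and"])  # Counter({'the': 1, 'stormy': 1})
```

A model like this can emit a word sequence identical to a training sentence, and can equally well emit sequences that appear in no training sentence; either way it is only following the pooled statistics of the whole corpus.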
From such information it's easy to determine that the word sequence "'TWAS BRILLIG AND THE SLITHY TOVES" is very much consistent with the training data. So is "'TWAS A DARK AND SLITHY NIGHT." It is not possible to look at "'TWAS A DARK AND SLITHY NIGHT" and say exactly where in the training each word in it came from. Many millions of sentences start with "twas." A "dark and stormy night" is a scène à faire that occurs in countless works. The device of substituting a different word in a well-known phrase, like changing stormy to slithy, is a common thing that many authors do. Sappho famously did it 2600 years ago with "βροδοδάκτυλος σελάννα" ("rosy-fingered moon," alluding to Homer's rosy-fingered dawn). At most we may be able to guess who first wrote the word "slithy," but a single word isn't copyrightable either, and others have used that one since Carroll. With "'TWAS BRILLIG AND THE SLITHY TOVES" we can pretend that the model generated it as an exact quote from Jabberwocky, but that's probably not really what the model did - instead, both of these fragments came out simply because they were rated as consistent with the training data as a whole, not as quotes from individual works.
Generative models operate not by memorizing and repeating exact quotes, but by learning and combining patterns which are usually more complicated than verbatim quotes. In some cases a pattern could possibly be a verbatim quote, but if so, it would usually be one that appeared many times in many different parts of the training. Each pattern generalizes something that was seen multiple times: there would rarely if ever be space inside the model to store single-occurrence quotes. So when the output of a model does sometimes include verbatim quotes, causality is difficult to determine. Which of the many sources that used those words was the model quoting? If you own the copyright in just one of them, why should you uniquely be "the" copyright holder with a claim on the model's output? And if there are many sources that all say the same thing, that's evidence the words in question were not so creative after all, and the model is infringing none of them by joining the crowd.
That models work by combining previously-seen patterns is not so surprising, because it's how human creativity works, too. Everything we ever say or write is influenced by the language we have previously ingested, without necessarily being "copying" of our previous experience. In general, I am allowed to read books and then write on the same subject matter, including using the same factual information, the same words, and even sometimes extended verbatim quotes. As a human I am expected to navigate the minefield of copyright's fuzzy definitions. I am likely to be sued if words I represent as "mine" coincide with someone else's too much. But just because I read a book on the same subject matter before doesn't mean the author of that book automatically has a claim on my subsequent writings.
The definition of "copying" for the purposes of copyright law is fuzzy in the human world, and that may be annoying but it is absolutely necessary. A rigid and precise definition would be worse. It should all the more be fuzzy when it comes to language models because so much is changing and unknown about them at this point. The only reasonable way to evaluate whether a language model's output infringes copyright of training data is case by case on specific examples. Models should be treated the same way as human beings in this respect. Humans are allowed to learn from copyrighted work, even when it's to inform their subsequent creation of new work. We might be sued if we screw up, but we're not preemptively forbidden to read and write just because of that possibility. The same should go for computers.
There is an additional swindle when copyright maximalists jump from language models to image models. More than once I've seen people argue that models do nothing but quote training data, citing research preprints on the feasibility of coaxing language models to reproduce verbatim quotes - that is, on the extent of the small exceptions that may exist to what I said above about models not having enough storage space to memorize quotations. Even to the extent such research results are true, they relate to words and text.
Copyright maximalists will cite these early and speculative results, specific to language, as solid proof that creators not of writings but of images should be able to exert copyright claims through the training process on the outputs of image models. That's a Hell of a jump. It's not even clear what a "verbatim quote" would mean in the context of images; they certainly want to claim more than only bit-for-bit identical images. This is motivated reasoning again: those who fear economic competition from generative models will see whatever they need to see, whether it's factual or not, in order to see a reason why the models ought to be forbidden.
At best it might be said that an image model can closely imitate the human-recognizable features of some well-known images; but copyright is not infringed by imitating style or subject matter, only by copying. We should be oh so hesitant to ever consider extending copyright to cover style or subject matter.
The bottom line on "copying" is that although there are grey areas in what is or is not "copying," training of generative models is more like human learning and analysis than like copying. And we shouldn't attempt to define this boundary too precisely in advance. To the extent training is not copying, copyright is just not relevant to training.
Training as fair use
What about the extent to which training is still somewhat like copying? There we need to think about fair use.
The fair use doctrine goes beyond the causal relationships between works themselves, and the concept of whether "copying" occurred, to examine the purpose behind a specific instance of copying. There are two important purposes at play when someone copies work under fair use: the individual purpose the copier hopes to achieve, which, depending on what it is, may make the copying allowable; and the overarching purpose of copyright law, which makes it appropriate to have the policy of conditionally allowing copying.
The simplest and most solidly protected individual purpose of fair use is what's often called "private study": you want to learn factual information contained in a work. For instance, you might read a book. Merely reading a book is not copying it, but it's easy to construct situations where copying might be involved in private study. For instance, you might photocopy a chapter of a library book to read repeatedly later, rather than needing to come back to the library or take out the book. Fair use allows doing this.
Another more subtle kind of "study" applicable to copyrighted works is study of the works themselves, as opposed to factual information contained in them. If you're interested in how different translators deal with the unusual Ancient Greek adjective "βροδοδάκτυλος," you might copy a page each out of several different translations of Sappho to line them up and compare them. You might even use a computer to make the comparison in a systematic way. The incidental copying necessary to learn about the copyrighted work itself is allowed as fair use.
But there is another form of fair use especially important in relation to generative models: you're allowed to copy work as part of creating new work. There are limitations, and a lot of details in both legislation and court decisions specifying the ways in which copyrighted work can be copied as part of creating new work. To qualify as fair use, the new work can't be only a mechanical reproduction of earlier work - there needs to be creativity involved too. So it becomes an interesting question: to what extent are generative models creative?
There are tantalizing hints in the scientific results that what present-day generative models do may be basically the same thing humans do when creating copyrightable work, but really, it is unknowable at this point to what extent generative models are creative, let alone how creative they will become in the future. Too many questions are unanswered today about how they work, and how human creativity works. In this uncertain and rapidly changing landscape it is completely inappropriate to jump to conclusions and legislate on the basis of those conclusions. We should err on the side of caution; and erring on the side of caution with respect to fair use means allowing as much use as possible to be counted fair. Remember that fair use is the default. Copyright's legislated monopoly on copying is properly an exception to the natural right to make copies, and it should be a limited exception.
The inherently limited nature of copyright is mentioned in the US Constitution's Copyright Clause, where it says "for limited Times." That clause, the Statute of Anne, and subsequent copyright laws have emphasized the nature of copyright as intended to serve social goals - especially, supporting creation of new work. Economic copyright, that is the privilege of preventing others from using creative work, exists for the purpose of incentivizing the creation of that work in the first place: it gives authors a fictitious good they can sell in order to make money from their efforts. But society should and will protect this monopoly only to the extent it really is necessary for the creation of new work. If copyright instead becomes a barrier to the creation of new work, then copyright is not serving its important purpose.
Generative models, if allowed to reach their potential, represent vast new sources of creative material of great value to society. The role of copyright law should be to encourage the creation of new and better generative models, not to strangle them in the crib. If the economic monopolies of copyright, used to protect the hypothetical income streams of entrenched interests, prevent the training of new models by imposing unrealistic burdens of payment and permission on whoever would try, then copyright has been perverted from its fundamental purpose. One way to prevent such a travesty is to make it clear that training, to the limited extent that training is even copying at all, is fair use.
I'd like to mention the ideas of "opt-in" and "opt-out" that have sometimes come up in discussions of machine-learning "ethics." The privileges of copyright do not include a right for the copyright holder to forbid fair use. On the contrary, many of the most socially important applications of fair use, in such things as parody and critical review, depend on fair use exercised specifically against the wishes of copyright holders. The important purposes of fair use could not be served if it were subject to veto. So to treat the training of generative models as fair use means that, just like any other fair use, training is allowable even against the wishes of copyright holders. Offering copyright holders a chance to opt out from having their work included in training corpora would be neither legally nor ethically appropriate.
Pragmatic considerations
Even someone who disagrees with my view of the importance of fair use, and of the reasons for copyright law to exist at all, should think carefully about the likely practical consequences of using copyright claims to block machine learning. Look at what happened with Google Books: despite multiple claims, with varying levels of legal validity, that the Google Books service violates copyrights, it still exists. Google made deals with its largest opponents and can afford to out-fight its smaller opponents in court; subject to minor restrictions that don't much affect the project as a whole, Google gets to keep running Google Books the way it wants to.
An enterprise as big as Google can effectively buy its way past any obstacles created by copyright law. But a small and non-profit effort to digitize books that were still under copyright, and make them searchable on the Web, would be dead in the water because of the lawsuits it would immediately face. The side deals among large organizations, that they will agree to stop suing each other, create a private club - indeed, a trust in the sense contemplated by antitrust law - that can keep out smaller participants. Extending copyright to block fair use in the AI context would, in practice, only have the effect of reinforcing large corporate monopolies.
There is a line of reasoning that goes, "I charge $100 for a commissioned drawing, so if somebody trains a model on millions of images including one I drew, and the model generates ten images in a style similar to my style, then I should get $1000 because the model's output is copied from my drawing and is really my creation." Note that that's the jump from language models to images again. And regardless of the moral status of such reasoning, it just isn't going to happen as a practical matter. That $1000 payment doesn't exist in any real sense. Large corporations just don't care about you, and they are powerful enough not to have to. Copyright law is a low speedbump to the likes of Google while it's an insurmountable barbed-wire electric fence to smaller players. So how many new and wonderful things do you want to destroy with extended copyright regulation of everybody except the large corporations, in a fruitless effort to chase after a payment of fairy gold that you'll never really get?
Extending copyright law to block training of machine learning models does not mean writers and artists will be paid real money when their work is used as training data. It does not mean that they can really choose to have their work excluded from use as training data, either. There will always be loopholes available to corporate monopolies, not least by way of foisted non-negotiable contracts in the Web services you depend upon. Post work on Twitter, Facebook, or TikTok, and it will turn out that you've "opted in" for these companies to train models on it. Even send work privately by email and you'll be "agreeing" to its use for training purposes. Google is already known to train AI models on email sent through Gmail. Their terms of service include enough vague language about licensing "for the purposes of improving the service" that even a copyright extension allowing copyright holders to forbid training would likely fail to stop Google.
Copyright extension purporting to create a right to forbid training will only keep smaller and especially non-profit players out of the machine learning game, leaving AI completely dominated by corporate monopolies. That is the worst outcome, and we shouldn't wish for it.