Software versions and extreme reproducibility

The Eighth Freedom of free AI is the freedom to run the same program again.

Being able to run the same program again is important in any kind of research context, because of the need to repeat experiments. If you observe an effect using one program, and then try to confirm it with a new experiment but discover that the software you first used can no longer run, then the original experiment loses much of its value. You end up searching for a replacement for the original software, and having to control for the possibility that differences between the original and the replacement may change the outcome of the experiment. So much of the work in machine learning and AI is research that the freedom to run the same program again is especially important in this field.

As a first step to the freedom of running the same program again, it is important that you should be able to freeze the version of a program. A new version is properly understood as a different program, even if it happens to share its name and most of its code with the old version. You should be free not to "upgrade" to a new version, whether the upgrade comes via automatic downloads or at the behest of some other package demanding global installation of a new version. And as I've said about other user-hostile features, automatic updating if offered should be optional, not default, and not too easy to turn on. By no means should software versions have expiry dates.

But in the other direction it is also important that you should be allowed to install new versions of things if you want to, without breaking your system. That creates an obligation of backward and forward compatibility on software in general, not only one package: everything should be designed to expect, and to be able to function on, a system where some parts are old and some are new.

My general objection to dependencies bears on this point because a program with few dependencies will have fewer ways it can be broken by unexpectedly new or old versions of its dependencies. But any software, even if it tightly controls its own dependencies, might itself become used as a dependency by something else. For that reason, new versions have an obligation to continue providing the interfaces the old versions provided, to whatever extent makes sense.

There's a simple guideline that might help to prevent versioning-associated breakage in many cases: that it should be forbidden to make dependency versions moving targets. If a project's policy is to work with "the latest version of Foo Library" then it is pushing users to make changes all the time. Real projects, aware of the problems with such a policy, might more likely say they will work with "the latest three major versions of Foo Library"; then users are less often obligated to make changes, but still at the mercy of Foo Library's variable release schedule. So the next level up is to make it based on time instead of version numbers: "we work with Foo Library releases from the last two years." That at least limits the frequency of breakage to a predictable schedule; but it is still a moving target forcing likely-unwanted breakage onto users. Moving targets are a problem.

Instead of setting a moving target, any policy on dependency versions should name a fixed version in the policy, not one that will automatically change: we will work with Foo Library "versions 7.32 and later." From time to time it may be necessary to change the policy to name a newer version if there is an important reason to use the features of a newer version; but when that happens it should be an explicit choice by the project, something done for a specific reason every time it happens. Users with systems that work should be able to expect their systems to continue working indefinitely if they don't make changes themselves - not only until a planned obsolescence date. And "no moving targets," although probably not usable in every project, is a simple guideline that is easy to describe and evaluate.
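The difference between a fixed target and a moving one can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Foo Library and its version 7.32 are the hypothetical example from above, the function names are invented for this sketch, and real version strings may need a more careful parser than a split on dots.

```python
from datetime import date

def parse_version(text):
    """Turn a dotted version string such as "7.32" into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))

# Fixed target: the floor is written down once, and it changes only when
# someone deliberately edits this line. "7.32" is the example from the text.
MINIMUM_FOO_VERSION = parse_version("7.32")

def supported_fixed(installed):
    """The answer for a given installed version never changes by itself."""
    return parse_version(installed) >= MINIMUM_FOO_VERSION

def supported_moving(release_date, today, window_days=730):
    """Moving target: "releases from the last two years".

    The answer depends on today's date, so a working system can become
    unsupported without anyone having changed anything.
    """
    return (today - release_date).days <= window_days
```

With the fixed policy, `supported_fixed("7.32")` is true today and stays true. With the moving policy, a release dated 2020-01-01 passes when checked on 2021-01-01 and fails when checked on 2023-01-01, although the installed software is the same in both cases.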

"But security!" is the cry when anybody tries to force others to use newer versions of software. I've several times witnessed attempts to shut down discussions of compatibility with an old version of a program on the part of people who seemed to think "That version has security advisories!" was a trump overriding any reason anyone could ever have for continuing to use the version in question, or even for continuing to talk about it. On the contrary, if a given version of a program "has security advisories" and somebody still wants to use it and not a newer version, then that is strong evidence the maintainers have not done their jobs well. They broke the program so badly in the newer version, that users would rather risk the security advisories than accept the breakage. And if you are holding yourself out as the authoritative source of critical security fixes for a program, then you have a responsibility to provide those fixes in a way that allows them to be used in isolation - without breaking the program in other ways.

Unpaid volunteers don't owe the world anything, and this is a context where idiots frequently misuse the word "entitled"; but actually volunteers do owe the world something sometimes, and users are entitled in that word's true sense to expect certain things of maintainers. Maintainers of software shared with the public gain status by being maintainers, even if they aren't paid in money. They voluntarily take on responsibilities that go with their status; all the more so if they choose to represent what they're releasing as "free software," which has a meaning and is a promise.

If maintainers claim the software they maintain is free software, then it should really be such - including the freedom to continue using the software without losing its usefulness to breaking changes. If maintainers, despite not being obligated to continue being maintainers and release patches at all, nonetheless choose to release patches they think are critical for security, then they have a responsibility to make the patches clean ones, capable of being adopted without breaking unrelated features. Otherwise, they are knowingly forcing users to choose between security and breakage, and knowingly contributing to the problem of some installations remaining insecure. Conversely, if maintainers choose to release a new program that has the same name as an old one but doesn't work like the old one did, calling it an "update," then they must respect the freedom of users to choose not to switch to the new program.

Note that "security" is much less an issue for software running in isolation, and that's another reason the freedom to run software in isolation is important.

Web browsers, with their telemetry, constantly growing lists of new JavaScript features, feature removals, and frequent breaking UI redesigns, are especially obnoxious with respect to updates. Web sites are designed to use bleeding-edge JavaScript features, and to detect browser versions and scold or lock out users of older browsers even if the older browsers actually do support all features the sites use, so there is third-party pressure on users from the sites, not only from the browser maintainers. Note that major Web browsers today are maintained almost entirely by paid employees of large corporate entities, who don't have the "volunteer" excuse even if we thought such an excuse was valid. Because Web browsers are so far from the Fourth and Eighth Freedoms, it is especially a problem for free AI software to depend on JavaScript-based user interfaces and documentation. Your whole system breaks if you depend on a browser that breaks, even if your own code as such is still able to run.

Running the same program again means not only running the same program again that you have run before, but also being able to run again the same program that someone else ran before. This aspect of the Eighth Freedom is necessary for users to properly exercise the Second and Third Freedoms, which relate to sharing programs. The way it becomes an issue at all is that some programs may be sensitive to who is running them. Years ago I used to be able to tell people "Search for such and such on Google, it's the third link from the top," because Google results were the same for everybody. Now, Google for me is different from Google for you, because individual search history informs the ranking algorithm. My third-ranked result today will be different from yours, and probably also different from my third-ranked result last time I searched.

The specific example of Google search as a moving target is relatively unimportant, is a service rather than a program, and does not claim to be "free software." But similar issues appear with locally-run computer programs when they depend excessively on local configuration and history. An example more specific to machine learning would be binary snapshot formats "sharded" for the specific GPU configuration on a given computer. They cannot be used, even by what purports to be the same software, on a computer with a different GPU configuration unless carefully converted to the new computer's format. The portability described in the Fifth Freedom should extend not only to the software being able to run on different systems at all, but also to running in the same way on different systems.

The freedom to run the same program again, combined with the need for reproducing scientific experiments, makes for yet another reason to discourage implementation of AI systems as online services. There have been many incidents where someone observed an interesting effect in a system like DALL-E or ChatGPT, shared their results, and then others tried to reproduce it and couldn't - because the operators of the service made changes without telling anybody or providing a way to roll back to the older version. Then commentators were left to argue without hope of resolution about whether the effect was real or not, and whether the person who originally saw it was lying.

When scientific experiments are to any degree adversarial, and have public interest stakes, such as critical review of the ChatGPT political filter, then lack of reproducibility is all the more problematic. If I see ChatGPT do something I think demonstrates political tampering, and I publish transcripts of the misbehaviour, then it is reasonable to expect OpenAI will change ChatGPT to stop visibly doing the bad thing. Then others can't reproduce the behaviour I saw. It doesn't happen anymore, and you have only my word that it did happen once in the past.

What is worse, I could be lying and that would be equally unverifiable. I could fake a transcript, respond to claims that it was fake with "Well, they must have changed the filter again," and it would be difficult to prove they hadn't. With local software that I can freely copy, on the other hand, I could show you exactly the model I used to get the result I said I got, and you could verify it yourself, or be suspicious should I refuse to provide such evidence. But even local software is susceptible to creeping changes through package management.
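One modest way to show that the file you hold is exactly the file I used is to compare cryptographic digests. The following is a minimal sketch, assuming the model or program is a single local file and that I have published its SHA-256 digest alongside my results; the function names are invented for this illustration.

```python
import hashlib

def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def same_artifact(path, published_digest):
    """True if the local file matches the digest published with the result."""
    return sha256_of_file(path) == published_digest.strip().lower()
```

A matching digest says only that the bytes are the same. It says nothing about whether the surrounding packages on your system match mine, which is where the creeping changes come in.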

The freedom to run the same program again is especially important for science, so that experiments may be reproduced; and this freedom is difficult or impossible to have when software is offered only as a service. To the limited extent academic journals are still relevant, the timelines of journal publication mean that any online service discussed in a research article is likely to have changed a lot before the article is published.

There are voices within the free software community, especially within projects like Debian, who advocate for what I choose to call extreme reproducibility. Extreme reproducibility is pure functional programming at the scale of entire software packages, especially within build systems: it is the requirement that running a program repeatedly with identical input should produce identical output. There is advocacy for the ideas that Linux distribution build systems should produce bit-for-bit identical packages on separate compilation attempts, and that scientific papers should come with zipped-up virtual machines allowing reviewers to reproduce, bit-for-bit identically, the computations reported in the papers. There are even attempts to mandate such things as a condition of participation in a distribution or a journal. Followers of this school of thought redefine the word "reproducibility" by itself to refer to what I call the extreme version, dismissing any non-extreme form of "reproducibility" as not really reproducible at all. There is no doubt that having programs be reproducible in the extreme sense makes the Eighth Freedom easier to exercise.

It is easy to reason about extreme reproducibility, by which I mean it is easy to prove that some statements about extreme reproducibility are true or false. We can draw inferences like "if program A is extremely reproducible and program B is extremely reproducible, then the pipeline that routes the output of A into B is also extremely reproducible." And we can also easily test extreme reproducibility by running a program repeatedly and comparing the outputs, though this is technically only a one-sided test: outputs that differ prove the program is not extremely reproducible, but outputs that match do not prove that it is. Testing software in other ways becomes easier when the software has the extreme reproducibility property; all the usual advantages of pure functional programming apply. On the other hand reproducibility as such, meaning only that we can run the same program again, even if the output might not be bit-for-bit identical, is harder to test and may involve subjective judgment of what counts as "the same program." These properties make extreme reproducibility nice to have.
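The repeated-run test and the pipeline inference can be sketched as follows. This is an illustration only: the "programs" are stand-in Python functions from bytes to bytes, all the names are invented for the sketch, and `stamp` uses a run counter as a substitute for something like a timestamp.

```python
import hashlib
import itertools

def looks_reproducible(program, data, runs=3):
    """Run program(data) several times and compare digests of the outputs.

    One-sided: False proves the program is not extremely reproducible,
    while True is only evidence that it might be.
    """
    digests = {hashlib.sha256(program(data)).hexdigest() for _ in range(runs)}
    return len(digests) == 1

def pipeline(first, second):
    """Route the output of one program into another."""
    return lambda data: second(first(data))

def upper(data):
    return data.upper()          # pure: output depends only on the input

def reverse(data):
    return data[::-1]            # pure: output depends only on the input

_run_counter = itertools.count()

def stamp(data):
    # Impure: appends a value that differs on every call, standing in for
    # a creation timestamp or similar per-run detail.
    return data + str(next(_run_counter)).encode("ascii")
```

Here `looks_reproducible(pipeline(upper, reverse), b"abc")` passes, as the inference about pipelines predicts, and `looks_reproducible(stamp, b"abc")` fails on the first comparison.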

However, extreme reproducibility requires extreme tradeoffs that are especially likely to cause problems in the AI context. First, it basically forbids files from ever containing metadata describing the circumstances of their creation - because that will be different each time the same file is re-created. Extreme reproducibility either demands careful design of file formats to exclude metadata, or grudgingly allows that files in what advocates will try to have us call "legacy formats" might contain metadata, but then the metadata is not allowed to be truthful. Advocates of extreme reproducibility design kludges for forcing metadata fields to constant placeholder values so that files can end up identical. The real metadata for the file has to be kept separately, outside the boundary of the extreme reproducibility criterion, and then it will likely end up being lost. Although a problem in any system, these constraints on metadata are especially likely to make trouble in AI work where people are more than usually concerned about the provenance or the Colour of files - where did a file come from, what part did humans play in creating it, how much were the humans paid, and so on.
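The gzip format gives a small concrete case of the placeholder kludge, because its header carries a modification time. The sketch below uses the `mtime` argument of Python's standard `gzip.compress` (available since Python 3.8); the payload and the `pack` helper are invented for the illustration.

```python
import gzip

payload = b"model weights would go here"

def pack(data, created_at):
    """Compress data, recording created_at (seconds since the epoch) in the header."""
    return gzip.compress(data, mtime=created_at)

# Truthful metadata: the same payload packed a minute apart gives
# two different files.
truthful_a = pack(payload, 1700000000)
truthful_b = pack(payload, 1700000060)

# Placeholder metadata: forcing the timestamp to zero makes the files
# bit-for-bit identical, at the cost of the header no longer saying
# anything true about when the file was made.
placeholder_a = pack(payload, 0)
placeholder_b = pack(payload, 0)
```

All four files decompress to the same payload. Only the pair with the untruthful timestamp satisfies the bit-for-bit criterion.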

Extreme reproducibility causes problems for software that uses randomized algorithms. Many randomized algorithms by their nature do not produce the same output every time. Present-day machine learning software in particular often depends on such algorithms and there are strong theoretical reasons why it would. It is usually possible to substitute seeded pseudorandom number generators for true randomness and then carefully handle the seeds in such a way as to guarantee identical output on repeated runs while preserving most of the advantages of randomization; but such techniques are easy to get wrong. Successfully derandomizing a large system is a large-scale effort that may require low-level auditing of the entire system. It becomes even more difficult when a computation is parallel or distributed across multiple computers communicating over a network, possibly even when distributed among multiple cores in GPU or TPU-style hardware. Forcing the entire program to be purely functional at every point in the computation is a large demand to make in comparison to the benefits of extreme reproducibility.
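A minimal sketch of both halves of that argument, using only Python's standard `random` module: a privately seeded generator makes a randomized computation repeatable, while the order of floating-point additions, which a parallel computation may not control, can change the bits of the result even when no randomness is involved. The function name and the seed values are arbitrary choices for the illustration.

```python
import random

def noisy_sum(seed, count=1000):
    """Sum pseudorandom numbers drawn from a generator private to this call."""
    rng = random.Random(seed)    # not the shared module-level generator
    return sum(rng.random() for _ in range(count))

# Same seed, same sequence, same result on repeated runs.
repeatable = noisy_sum(42) == noisy_sum(42)

# Floating-point addition is not associative, so the grouping of the
# additions matters. A parallel reduction that groups terms differently
# from run to run can produce different bits from identical inputs.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
order_matters = left_to_right != right_to_left
```

Both `repeatable` and `order_matters` come out true. The first is the easy part of derandomization; the second hints at why auditing a parallel system for bit-identical output is so much more work than choosing a seed.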

Extreme reproducibility is also a problem for security, because of the prevalence of randomization in cryptographic protocols. Many cryptographic operations use random numbers and depend for their security expectation on just the opposite of extreme reproducibility: an assumption that a random number will never be bit-for-bit identical on different occasions. There are, again, strong theoretical reasons to expect that this randomization will always be necessary, and in the cryptographic context derandomization is not a reasonable option. Such operations cannot be done inside the boundaries of a system that is meant to be extremely reproducible. If we would secure a system with cryptographic protocols and also demand that parts of it are extremely reproducible, then we must design the system with a carefully agreed-upon boundary between the two realms, which is a significant imposition on the kinds of systems we can build.
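The conflict is easy to see with a nonce, a value that a cryptographic protocol expects never to repeat. The sketch below contrasts Python's standard `secrets` module with a seeded generator; the 12-byte size and the function names are assumptions made for the illustration, and the seeded version is shown only as what not to do.

```python
import random
import secrets

def fresh_nonce(size=12):
    """Draw a nonce from the operating system's secure random source."""
    return secrets.token_bytes(size)

def reproducible_nonce(seed, size=12):
    """What extreme reproducibility would demand: the same bytes every run.

    This is exactly the property a nonce must not have.
    """
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(size))
```

Two calls to `reproducible_nonce(7)` return identical bytes, so any protocol relying on them would reuse its nonce on every run. Two calls to `fresh_nonce()` differ, which is what the protocol needs and what the bit-for-bit criterion forbids.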

Finally, extreme reproducibility may actually be detrimental to the looser definition of reproducibility applicable to scientific experiments. If you do an experiment on a sample of one hundred files and report the results of your analysis, and you share your code and your files in a way consistent with extreme reproducibility demands, so that I can run your code and get bit-for-bit identical results, then it might be said that I have "reproduced" your experiment. But in such a case all I really know is that you truthfully reported the result of running your code on your data (with one choice of random seed, if applicable). If we think the experiment may be measuring a phenomenon that occurs in real life, then the much more interesting question is not whether you are honest, but whether the phenomenon really occurs. Science is about the world, not about scientists. To reproduce your experiment, I should be running the code on a different data sample, to control against the possibility you were unlucky and got a non-representative sample; or different code of my own that implements the same function to control against errors you might have made in your code; or the randomized algorithms with a different seed, which is another way of trying a different sample.
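The two senses of "reproduce" can be put side by side in a toy example. Everything here is hypothetical: the "experiment" merely estimates the mean of a uniform random variable, the seed stands in for the choice of data sample, and the tolerance is an arbitrary threshold picked for the sketch.

```python
import random
import statistics

def experiment(seed, sample_size=10_000):
    """Toy experiment: estimate the mean of a uniform [0, 1) variable.

    The seed plays the role of the particular sample that was collected.
    """
    rng = random.Random(seed)
    return statistics.fmean(rng.random() for _ in range(sample_size))

def replicates(original, replication, tolerance=0.05):
    """Scientific sense: substantially the same phenomenon, not the same bits."""
    return abs(original - replication) <= tolerance

yours = experiment(seed=1)

# Extreme sense: rerunning your code on your sample gives your exact number,
# which shows only that you reported it truthfully.
rerun = experiment(seed=1)

# Scientific sense: a new sample gives a different number that should still
# agree with yours if the phenomenon is real.
mine = experiment(seed=2)
```

Here `rerun == yours` holds exactly, while `mine` differs from `yours` in its digits and still satisfies `replicates(yours, mine)`. Only the second comparison says anything about the quantity being measured.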

Note that until computers came along, scientists never even thought of software-style extreme reproducibility of experiments as a possibility, let alone a hard requirement. If you experimented on one hundred guinea pigs, instead of files, and wrote a paper about it, nobody would say that to reproduce your experiment I need to have those same individual guinea pigs, in the same state of mind they were in at the start of your experiment. They may be long dead by the time your paper makes it through the publication process. As a matter of course I would get one hundred new animals, and I would expect not to observe and report exactly the same numbers you reported. A successful replication of your experiment would mean I observed substantially the same phenomenon, not the identical numbers, that you said you observed.

Really reproducing an experiment should mean doing the work again myself, and the fact I might get a different answer is exactly why it's valuable. Merely verifying that you did get the result you said you did is the wrong kind of reproducibility for science but the only kind contemplated by an extreme reproducibility requirement on software. Science involving computer programs is best served by the Eighth Freedom as I define it: that I should be able to run the same program you did.