Fwd: My views on GitHub Copilot

Valentino Giudice valentino.giudice96 at gmail.com
Sat Feb 26 05:53:19 UTC 2022


I submitted the following email to the LibrePlanet-discuss mailing
list of the Free Software Foundation.
Since this mailing list is specifically about the topic, I will also
forward it here.

---------- Forwarded message ---------
From: Valentino Giudice <valentino.giudice96 at gmail.com>
Date: Sat, 26 Feb 2022 at 06:45
Subject: My views on GitHub Copilot
To: LibrePlanet-discuss <libreplanet-discuss at libreplanet.org>


Hi,
I am an Italian student of Computer Science (AI, in particular).

Today the FSF published the selected whitepapers on GitHub Copilot
(OpenAI Codex), but it has not yet settled on a position of its own.

I haven't submitted a whitepaper myself, but, as others will no doubt
do as well, I will publish my views on the topic.

These are the views of a non-lawyer and they are not legal advice.

To summarize:
- Training on a dataset does not require a copyright license from the
copyright holders of the dataset or of its entries.

- A trained model is not copyrightable.

- A pre-trained model, in any form, and without any additional data or
software, is its own source code.

- The output of a machine learning model is generally in the public
domain, except when it contains significant portions of its input.

- Any other view would be harmful for the free software community.

An article which is not among the whitepapers, but which is just as
interesting, and which I expressly endorse, is the following: [1].

Here is my full-length opinion on the subject:

First, I will assume the reader is already familiar with GitHub
Copilot and OpenAI Codex, so I will not explain what they are and what
they do.
Before discussing laws and ethics, it's important to clarify what a
neural network is essentially designed to do.
A neural network "bends space": its input represents a point, the
output does too, and the inner computations map one space onto
another. The input and output could represent more than one point,
too.
The parameters of the neural network need to be trained on a learning
dataset, and the learning dataset samples some distribution. The
parameters are such that the behaviour of the neural network reflects
that distribution: assuming each individual of the overall population
is an input-output pair, then, given an input, the neural network will
return the expected value of the output, or the most likely output for
that input, or the probability of each possible output, or something
of the sort, depending on the kind of neural network and how it was
trained. If the individuals of the overall population are in some
other form, the purpose of the neural network might be to take some
noise with a known distribution and output an individual of the
population, with a probability that reflects the distribution.

It's apparent, then, that the purpose of the parameters isn't to
encode the training dataset, but rather to represent the information
about the population that can be induced from the samples in the
dataset.
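
To make this concrete, here is a rough sketch in plain Python with
NumPy (my own illustration, not part of any whitepaper; the data and
numbers are made up): a tiny model is fitted to noisy samples of
y = 2x + 1, and the learned parameters end up close to (2, 1). They
summarize the underlying distribution, not any individual sample.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, size=1000)
    y = 2 * x + 1 + rng.normal(scale=0.1, size=x.shape)  # the "dataset"

    w, b = 0.0, 0.0  # trainable parameters
    lr = 0.1
    for _ in range(500):  # plain gradient descent on mean squared error
        pred = w * x + b
        w -= lr * 2 * np.mean((pred - y) * x)
        b -= lr * 2 * np.mean(pred - y)

    print(w, b)  # approximately 2 and 1: the distribution, not the samples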

It is possible that the parameters will contain more information about
the samples in the dataset, but that isn't the intention: it is
accidental and due to the imperfections of current technology or to
the low quality of the dataset itself.

In the EU there are specific copyright exceptions (Directive 2019/790,
Articles 3 and 4) which, although unfortunately limited, provide a
legal framework for training neural networks.
In the US, I argue that training neural networks is fair use. OpenAI
has argued the same in [2]. The purpose of copyright is to "promote
the progress of science and useful arts" and the general idea is that
allowing authors to reserve some rights creates an incentive for
creating some works.
But training neural networks requires an extremely large number of
works, and there is no business model in preventing individual works
from being used as part of a training dataset: no author creates a
work in order to be paid by those wishing to include it in a training
dataset.
All four main factors in evaluating fair use clearly weigh in favour
of allowing the training of neural networks, meaning that a copyright
license to do so is not needed in the US.

A neural network well trained on a good dataset should be effectively
independent of any individual entry of the dataset (since a single
entry doesn't change the distribution in any significant way).

The strictest possible criterion to determine whether distribution of
a trained neural network requires a license from the copyright holder
of an entry in the dataset could be an approach similar to
differential privacy. However, given that distributing such a model
doesn't harm the copyright holder in any way (and it wouldn't replace
the original work), and that any failure to meet this criterion would
be effectively accidental, I suggest that much laxer criteria should
be used.
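
As a rough illustration of that strict criterion (a leave-one-out
check in the spirit of differential privacy, not an actual legal
test), one could retrain the model with a single entry removed and
measure how much its behaviour changes; the function names below are
placeholders of mine, not any real library's API:

    def leave_one_out_influence(dataset, index, train, distance, probes):
        """How much does removing dataset[index] change the trained model?"""
        full_model = train(dataset)
        reduced = dataset[:index] + dataset[index + 1:]
        reduced_model = train(reduced)
        # Compare the two models' behaviour on some probe inputs.
        return distance(full_model, reduced_model, probes)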

Not everything is copyrightable. Trained neural networks, unlike
computer programs, are not literary works, and they are even further
away from any other category of copyrightable works. What about
database sui generis rights in the EU? Well, trained neural networks
are not databases either: the parameters (which would be the
individual data entries) would have to be meaningful on their own and
qualify as "independent works", but that is clearly not the case.

And indeed, trained neural networks are very different from the kinds
of work which are *meant* to be copyrightable: all kinds of
copyrightable works are original and creative forms of expression. But
the parameters of a neural network are determined through a mostly
automatic process and, while they do encode useful information, they
merely act as the constants of a very large mathematical formula which
determines the behaviour of the neural network. It's not just that
individual neural networks are trained automatically; it's that their
very nature is vastly different from that of anything which is
considered copyrightable.

So, if patents aren't in the way, then there should be no licensing
issues when it comes to neural networks. But what about source code?
For something to qualify as "free software", or to be acceptable as a
module of free software, source code must be available. What even is
source code?
According to the GPL, source code is "the preferred form of the work
for making modifications to it".
Note that:
- It doesn't have to be the *original* form of the work. Usually it
is, since programmers will work in the form that allows modification
from the beginning. But if I were to, for instance, write code on
paper, then scan it, use OCR, and then compile it, the source code of
that program would be the text files before compilation, not the
scanned images.
- Making modifications doesn't have to be *easy*. There is no kind of
digital work which doesn't have source code, no matter how hard it is
to modify, because source code is simply one of the forms in which the
same work can be provided.

Now, in the case of a trained neural network, what is the source code?
Neural networks are widely criticized for being "black boxes". I will
not get into details, but I will say this is true to some extent. The
"meaning" of each individual parameter is not known, modification
isn't easy, and sometimes we don't actually fully know why certain
techniques work. And this has raised questions about what the source
code of a trained neural network is.

But note that, in the case of software, it always exists in a form
which is relatively easy to modify. In the case of neural networks, it
is no one's fault that this is harder: it's just what neural networks
are.
The training dataset and the training code are not part of its source
code, because they are not part of the trained neural network at all,
regardless of the form in which it is provided. And, unlike software,
and unlike many other works, the parameters themselves are practically
the same thing regardless of the form they are provided in, and can be
converted from one format to another.

Therefore, a trained neural network is its own source code if it is
provided in a free format.
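
As a small, purely illustrative sketch of that point (the file names
and weight values here are made up), the parameters are just arrays of
numbers and can be dumped from one container format into another,
free one, without losing anything:

    import json
    import numpy as np

    # Hypothetical parameters of a tiny model.
    weights = {"layer1_w": np.random.randn(4, 3), "layer1_b": np.zeros(3)}

    # A documented container anyone can implement freely (NumPy's .npz).
    np.savez("model.npz", **weights)

    # Or even plain JSON, at the cost of file size.
    with open("model.json", "w") as f:
        json.dump({k: v.tolist() for k, v in weights.items()}, f)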

But what about the output of a neural network? Often it is
non-copyrightable information, but what about when it's images or
text? Unfortunately, in the UK (Copyright, Designs and Patents Act
1988, Section 9) such works are copyrightable by "the person by whom
the arrangements necessary for the creation of the work are
undertaken". This is utterly unreasonable, just as it would be for
works made by animals (Naruto, et al. v. Slater, et al., no. 16-15469
[3]).
Luckily, however, this is not the case in most of the world [4], and
it isn't the case in the US [5].
Sometimes, however, the output of a neural network will contain
significant portions of the input: in these cases, it's clear that it
constitutes a modification of, and thus a derivative of, the original
work.

There is, however, a point I haven't mentioned yet. What if the
trained neural network contains so much information about individual
entries of the dataset that it will actually generate significant
portions of such samples?
In that case it isn't completely unreasonable to argue that the
generated works are derivatives of such entries and, even, that
distributing copies of the parameters is effectively a form of
distribution of the works themselves. Those making the latter
argument, however, should consider that this effect is purely
accidental, and simply due to a lack of generalization by the
algorithm. It's akin to taking a selfie on the street when a poster
happens to be in the background: the mere fact that a significant
portion of the poster could be extracted from the photograph doesn't
mean that the photograph is a derivative work of the poster, if the
poster plays no significant role in the photograph itself and the
photograph is thus not based on the poster.
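
As a hedged illustration of the check implied above (a toy heuristic
of mine, not a legal test; the tokenization and any threshold are
arbitrary choices), one could look for long verbatim runs shared
between a generated snippet and a specific training entry:

    def longest_shared_run(generated: str, training_entry: str) -> int:
        """Length, in tokens, of the longest verbatim run shared by the texts."""
        gen = generated.split()
        src = training_entry.split()
        best = 0
        for i in range(len(gen)):
            for j in range(len(src)):
                k = 0
                while (i + k < len(gen) and j + k < len(src)
                       and gen[i + k] == src[j + k]):
                    k += 1
                best = max(best, k)
        return best  # a long run suggests the output reproduces the entry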

Training neural networks is and should be a legal activity, and it is
an ethical activity. It doesn't hurt copyright holders and is fully
compatible with the framework of free software. And while copyright
law should be changed and better adapted to allow for this task, it is
not incompatible with current copyright law.

Companies such as Microsoft, Google and OpenAI are very involved in
neural networks. But the mere fact that laws which hamper the field
would hamper those companies doesn't mean they wouldn't hamper the
free software community as well, in a similar way to how software
patents harm free and proprietary software programmers alike.

This is a new field, which may be crippled by copyright law, or which
may become essentially free from it. I believe it is not the job of
the FSF or the free software community to make sure that copyright law
extends beyond its current reach.

If the FSF were to declare the training of neural networks to be
incompatible with free software (for instance because of the "source
code" problem, which I addressed previously), this would create an
unprecedented schism within the free software community and it would
exclude the community from a large, growing and promising field. Not
only that: it would be the wrong decision.

If the FSF were to try and argue that training neural networks
infringes copyright, that would support an extremely broad
interpretation of copyright law, one which doesn't help anyone.

And even in the case of Copilot, consider that GitHub doesn't just
host free software. It hosts software, generally, in source code
format. A lot of it is non-free. And a lot of works in general,
software and non-software, are non-free: that is the default, not the
exception. We do not need to "protect" them to an even more
unreasonable extent, one which doesn't even help their authors
anyway.

There *are* problems in the task of training neural networks. The
biggest issue is that the drivers and firmware for the most powerful
GPUs are non-free. And that's an issue, because computational power is
essential for the job.

The FSF needs to endorse free trained neural networks, available to
all. Recently, GPT-NeoX-20B was released by EleutherAI. Before that,
we got GPT-J-6B. They have 20 billion and 6 billion parameters
respectively; they are provided in free formats and, in case they
turn out to be copyrightable, under a free software license.
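
For instance, assuming the Hugging Face transformers library and the
model identifier "EleutherAI/gpt-j-6B" (and enough memory to hold the
6 billion parameters), anyone can download and run GPT-J along these
lines; this is a sketch of typical usage, not an official example:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
    model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

    inputs = tokenizer("def fizzbuzz(n):", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=40)
    print(tokenizer.decode(outputs[0]))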

The problem of drivers and firmware for GPUs is the biggest one, and
the hardest to solve. I don't have any strategy for solving it, but
some smarter people might. It's important not to give up, however, and
not to throw the whole field under the bus because of it.

[1] https://felixreda.eu/2021/07/github-copilot-is-not-infringing-your-copyright/
[2] https://www.uspto.gov/sites/default/files/documents/OpenAI_RFC-84-FR-58141.pdf
[3] http://cdn.ca9.uscourts.gov/datastore/opinions/2018/04/23/16-15469.pdf
[4] https://www.leexe.it/en/magazine/artificial-intelligence-computer-generated-works-and-dispersed-authorship-spectres-are-haunting-copyright
[5] https://www.copyright.gov/rulings-filings/review-board/docs/a-recent-entrance-to-paradise.pdf

