Should there be a clause for AI?

Richard Fontana fontana at sharpeleven.org
Sat Jul 12 15:49:50 UTC 2025


On Fri, Jul 11, 2025 at 4:01 AM Vasileios Valatsos <me at aethrvmn.gr> wrote:
>
> Normally, when you use any software under a copyleft license, you must
> disclose any modifications, and release them under said license. However
> recently, with the training of language models and other generative
> methods, there is a laundering effect, where the end user is "handed"
> copyleft licensed code generated by the generative model, which
> effectively bypasses the copyleft clause. To my understanding (I am not
> a USA citizen) this is because, in the USA, the products of non-humans
> are not copyrightable (or copyleftable).

No, that is not correct, though I've heard confusion even from US
lawyers on this point.

The emerging notion (in the US and some other jurisdictions) that
generative model output is not copyrightable is based on the
assumption that the output would otherwise meet the minimum
requirements for copyrightability if the output *had* been created by
a human: in the US, basically, minimum originality and minimum
creativity (to the extent those are different things). In other words,
for example, if I use an LLM to generate a novel, and it's indeed as
far as anyone can tell an entirely original novel, no substantial
resemblance to any existing works, I can't claim I am the copyright
holder of the novel, because it was generated by a machine. At the
same time, the creator or vendor of the LLM or LLM product that
generated the novel also can't claim copyright ownership of the novel,
for the same reason.

(As an aside, this rationale is somewhat curious, given that we've
assumed for decades that things like aleatoric music compositions are
copyrightable.)

However, none of that means that the output couldn't possibly infringe
the copyright of some third party. You're right that a problem with
such models is that they can in effect 'launder' copyrighted material
(possibly copylefted) contained in their training data. But that's
not because of some special legal situation, and it's really no
different from other modes of copyright infringement. If I write a
novel, and it's used to train a model (let's assume I don't have a
copyright infringement claim based on the act of training, an issue
that has been raised in a number of current lawsuits in the US), and
the model can be shown to produce output that's substantially similar
to my novel, I might have a copyright infringement claim against
someone in connection with the use of that model.

I will also say that I have not seen a credible argument since the
initial rollout of GitHub Copilot that an LLM product was emitting
copylefted code (that is, code that was substantially similar to some
existing third party code under a copyleft license). It's obviously
*possible*, and indeed the possibility is unavoidable, but I have to
wonder whether it is uncommon (and/or even practically detectable),
since I think we'd otherwise be hearing more accusations of specific
misappropriations from copyleft code authors.

I assume you're not suggesting that the mere obvious fact that popular
LLMs are trained, in part, on copyleft code means that any output of
such LLMs has to be assumed to be a copy, in some sense, or a
derivative work, of such copyleft training data even if there is no
apparent similarity between training data and output. That's a
separate issue, anyway.

> To this extent, I myself personally use a second license, specifically
> for using projects as training data:
>
> ---
> Training Public License (TPL) v1.0
>
> Copyright © 2025
>
> This code and content is licensed under the GPLv3 or later, with the
> following special condition:
>
> If you use any part of this code, notes, or data for training,
> fine-tuning, or evaluating a machine learning system (including but not
> limited to neural networks, large language models, or any algorithm
> where this content influences the resulting system), you must release
> all resulting models, weights, and related code under the GPLv3 or later.
>
> All other uses are governed by the regular terms of the GPLv3 or later.
> ---
>
> I wonder if (a) this would make sense in copyleft-next, (b) if it even
> belongs in the conversation, (c) if there is a better way to tackle this.

Your suggestion raises a threshold question which we haven't even
addressed with this reboot of copyleft-next. Your license is not a
free software license -- by current standards, in my opinion. I think
that's fairly obvious, but I wonder if you disagree, since you're
structuring this as a "condition" on GPLv3 that requires the resulting
work to be licensed under GPLv3.

The threshold question is, should copyleft-next be a free software
license? I think the answer has to be "yes" since I have no interest
in drafting a non-FLOSS license and I think it was always assumed that
copyleft-next would be a free software license.

Of course, the definition of free software, or software freedom, is
not set in stone. I think the most basic principles are permanent but
the interpretation of those principles will have to evolve with
changing conditions. For example, in another thread the Affero concept
was mentioned. In 1991, the notion of an Affero clause might commonly
have been regarded in the early GPL community (not to mention the
large anti-copyleft camp in the early free software community) as
conflicting with the principles of free software, though maybe that
community wouldn't have had a good way of expressing the idea.
In 2025, there are *still* some people in the free software community
who regard an Affero clause as being categorically in conflict with
software freedom, but I'd contend this hasn't been a mainstream view
since the early 2000s.

So all that is to say that maybe the definition of free software, or
our application of that definition, needs to evolve specifically
because of the challenges being created by the rise of LLMs trained on
vast quantities of free software and being used (a) potentially,
though hopefully very rarely, to produce output that violates the
licenses of that training data, and (b) commonly to produce output that
may not resemble such training data closely at all and yet in some
justifiable (to some people) sense feels like a misappropriation of
that work. I don't know whether extending the frontier of software
freedom is within the scope of this project, though. I've
kind of thought we have a more modest goal, which is to assume a
fairly conservative view of what software freedom is (basically,
software freedom is what the FLOSS community generally thought it was
around 2012) and draft a copyleft license that fits within that view.

Richard
