Should there be a clause for AI?

Sat Jul 12 17:32:33 UTC 2025

On 12/7/25 17:49, Richard Fontana wrote:
 > But that's not because of some special legal situation, and it's really
 > no different from other modes of copyright infringement. If I write a
 > novel, and it's used to train a model (let's assume I don't have a
 > copyright infringement claim based on the act of training, an issue
 > that has been raised in a number of current lawsuits in the US), and
 > the model can be shown to produce output that's substantially similar
 > to my novel, I might have a copyright infringement claim against
 > someone in connection with the use of that model.

Yes, I fully agree. My point is that with the current state of things, 
it is very problematic to figure out *who* that someone is.

It obviously can't be the end user, because they have no control over the
stochastic output of the model, and they can't possibly compare that
output against existing works to figure out if it may violate any
copyright/copyleft.

 > I assume you're not suggesting that the mere obvious fact that popular
 > LLMs are trained, in part, on copyleft code means that any output of
 > such LLMs has to be assumed to be a copy, in some sense, or a
 > derivative work, of such copyleft training data even if there is no
 > apparent similarity between training data and output. That's a
 > separate issue, anyway.

Obviously not. A large language model is a graph with weighted edges;
the training data only alters the values of the weights, so it can't be
a derivative.

There are two different arguments made with respect to "copyright 
abuse"; I won't go into depth here because as you said, it is a separate 
issue. For completeness though, the first is what the NYT claims against
OpenAI and Microsoft: that because the model can output verbatim copies
of copyrighted content, it is a copyright violation which hurts the 
author financially. The second one is the one that artists use in the 
case of generative AI for media (images/video/audio) generation, and has 
to do with the fact that .

Neither of those really matters when it comes to copyleft, since copyleft
grants freedoms rather than restricting them. However, in both those
cases it is not the model that is treated as a copy or a derivative, but
its capacity to output copyrighted (or copylefted) material (exactly
because it has been trained on it, rather than accidentally generating a
similar output). I am also not a lawyer, so I can't confidently
interpret the EU AI Act
(https://digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai),
the US Copyright Office reports (https://www.copyright.gov/ai/), or
any other legal framework that might come up in other parts of the world.

 > I will also say that I have not seen a credible argument since after
 > the initial rollout of GitHub Copilot that an LLM product was emitting
 > copylefted code (that is, code that was substantially similar to some
 > existing third party code under a copyleft license). It's obviously
 > *possible* and indeed the possibility is unavoidable, but I have to
 > wonder whether it's uncommon (and/or practically detectable) since I
 > think we'd otherwise be hearing more accusations about specific
 > misappropriations from copyleft code authors.

I don't know if this derails the discussion, but my personal opinion and
observation is that this is because there are two "groups". One "group"
has moved off GitHub and either self-hosts or uses Codeberg and other
FOSS solutions (on both Codeberg and SourceHut, as well as freedesktop's
GitLab, Anubis is deployed to block crawlers). The other one is either
apathetic or at least waiting to see what the law has to say.

Regarding your criticism of my patchwork of a license, I fully agree; as
stated in another response in the thread, I made it and use it as a bad
deterrent against scraping, and it is there only until a proper solution
is found. I would never consider myself capable of coming up with a
license from scratch, which is why this thread was started in the first
place.

Regarding your statement that any new copyleft license should be libre,
I obviously also agree. I was the person who mentioned the Affero clause,
as an example of something that *could* be interpreted as non-free
whilst actually being free; I did so because the claim was that a
clause regarding the use of the software as training data would be seen
as violating Freedom 0, while I hold the position that it is of a
similar nature to the AGPL, the LGPL, and the GPL with a linking
exception (all three single out a specific use of the software, but are
still libre, because they don't prevent the use of the software; they
just extend or retract the copyleft clause).

 > So all that is to say that maybe the definition of free software, or
 > our application of that definition, needs to evolve specifically
 > because of the challenges being created by the rise of LLMs trained on
 > vast quantities of free software and being used (a) potentially,
 > though hopefully very rarely, to produce output that violates the
 > licenses of that training data, (b) commonly to produce output that
 > may not resemble such training data closely at all and yet in some
 > justifiable (to some people) sense feels like a misappropriation of
 > that work.

If we are to accept that the definition of free software is 
community-derived, and not mandated by an entity such as the FSF or the 
OSI, then there is no need to "evolve" the definition, since this 
process is happening organically. It is, however, a problem to try to
freeze and formalize the "current" or "2012" definition such that it can
be used as a basis for writing new licenses, since while everyone agrees
on the spirit of FLOSS, people vehemently disagree on the letter of FLOSS.

 > I don't know whether extending the frontier of what
 > software freedom is is within the scope of this project, though. I've
 > kind of thought we have a more modest goal, which is to assume a
 > fairly conservative view of what software freedom is (basically,
 > software freedom is what the FLOSS community generally thought it was
 > around 2012) and draft a copyleft license that fits within that view.

This is exactly why the subject of the thread is in the form of a
question, and why at the end I even questioned the sensibility of
such a clause in the first place. To be honest, I believe that the mere
act of writing a new license is more than a "modest" goal, since the
hope is that it gets adopted and preferred over other copyleft licenses,
whether for legibility/clarity, better legal interpretability, etc. This
means that, at least to some extent, the underlying goal of
copyleft-next would be to replace the GPL, at least in some use cases.
Most probably this task on its own is ambitious enough for now, and
aiming for a "perfect, all-encompassing license that tries to tackle
every single issue at the same time" on the first go might be too much.

At the same time, I remember a discussion in another thread about the
ambition of copyleft-next having a structure similar to CC; perhaps in
place of that there could be a family of licenses closer to the
(A/L)GPL family?

- Vasileios Valatsos