Should there be a clause for AI?
Vasileios Valatsos
me at aethrvmn.gr
Sat Jul 12 17:32:33 UTC 2025
On 12/7/25 17:49, Richard Fontana wrote:
> But that's not because of some special legal situation, and it's really
> no different from other modes of copyright infringement. If I write a
> novel, and it's used to train a model (let's assume I don't have a
> copyright infringement claim based on the act of training, an issue
> that has been raised in a number of current lawsuits in the US), and
> the model can be shown to produce output that's substantially similar
> to my novel, I might have a copyright infringement claim against
> someone in connection with the use of that model.
Yes, I fully agree. My point is that with the current state of things,
it is very problematic to figure out *who* that someone is.
It obviously can't be the end user, because they have no control over the
stochastic output of the model, and they can't possibly compare the
output against existing works to figure out whether it may violate any
copyright/copyleft.
> I assume you're not suggesting that the mere obvious fact that popular
> LLMs are trained, in part, on copyleft code means that any output of
> such LLMs has to be assumed to be a copy, in some sense, or a
> derivative work, of such copyleft training data even if there is no
> apparent similarity between training data and output. That's a
> separate issue, anyway.
Obviously not. A large language model is a graph with weighted nodes;
the training data only alters the values of the weights, so it can't be
a derivative.
There are two different arguments made with respect to "copyright
abuse"; I won't go into depth here because as you said, it is a separate
issue. For completeness though, the first is what the NYT claims against
OpenAI and Microsoft: that because the model can output verbatim copies
of copyrighted content, it is a copyright violation which hurts the
author financially. The second one is the one that artists use in the
case of generative AI for media (images/video/audio) generation, and has
to do with the fact that .
Neither of those really matters when it comes to copyleft, since copyleft
grants freedoms rather than restricting them. However, in both those
cases it is not the model that is treated as a copy or a derivative, but
its capacity to output copyrighted (or copylefted) material (exactly
because it has been trained on it, rather than accidentally generating a
similar output). I am also not a lawyer, so I can't confidently
interpret the EU AI Act
(https://digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai),
the US Copyright Office reports (https://www.copyright.gov/ai/), or
any other legal framework that might come up in other parts of the world.
> I will also say that I have not seen a credible argument since after
> the initial rollout of GitHub Copilot that an LLM product was emitting
> copylefted code (that is, code that was substantially similar to some
> existing third party code under a copyleft license). It's obviously
> *possible* and indeed the possibility is unavoidable, but I have to
> wonder whether it's uncommon (and/or practically detectable) since I
> think we'd otherwise be hearing more accusations about specific
> misappropriations from copyleft code authors.
I don't know if this derails the discussion, but my personal opinion and
observation is that this is because there are two "groups". One "group"
has moved off GitHub and either self-hosts or uses Codeberg and other
FOSS solutions (on both Codeberg and SourceHut, as well as freedesktop's
GitLab, Anubis is deployed to block crawlers). The other one is either
apathetic or at least waiting to see what the law has to say.
Regarding your criticism of my patchwork of a license, I fully agree; as
stated in another response in the thread, I made it and use it as a bad
deterrent against scraping, and it is there only until a proper solution
is found. I would never consider myself capable of coming up with a
license from scratch, which is why this thread was started in the first
place.
Regarding your statement that any new copyleft license should be libre,
I obviously also agree. I was the person who mentioned the Affero clause,
as an example of something that *could* be interpreted as non-free
whilst actually being free. This was done because the claim was that a
clause regarding the use of the software as training data would be seen
as violating Freedom 0, while I hold the position that it is of a
similar nature to the AGPL, the LGPL, and the GPL with linking exception
(all three discriminate based on a specific use of the software, but are
still libre, because they don't prevent the use of the software, just
extend or retract the copyleft clause).
> So all that is to say that maybe the definition of free software, or
> our application of that definition, needs to evolve specifically
> because of the challenges being created by the rise of LLMs trained on
> vast quantities of free software and being used (a) potentially,
> though hopefully very rarely, to produce output that violates the
> licenses of that training data, (b) commonly to produce output that
> may not resemble such training data closely at all and yet in some
> justifiable (to some people) sense feels like a misappropriation of
> that work.
If we are to accept that the definition of free software is
community-derived, and not mandated by an entity such as the FSF or the
OSI, then there is no need to "evolve" the definition, since this
process is happening organically. It is, however, a problem to try to
freeze and formalize the "current" or "2012" definition such that it can
be used as a basis for writing new licenses, since while everyone agrees
on the spirit of FLOSS, people vehemently disagree on the letter of FLOSS.
> I don't know whether extending the frontier of what
> software freedom is is within the scope of this project, though. I've
> kind of thought we have a more modest goal, which is to assume a
> fairly conservative view of what software freedom is (basically,
> software freedom is what the FLOSS community generally thought it was
> around 2012) and draft a copyleft license that fits within that view.
This is exactly why the subject of the thread is in the form of a
question, and why at the end I questioned even the sensibility of
such a clause in the first place. To be honest, I believe that the mere
act of writing a new license is more than a "modest" goal, since the
hope is that it gets adopted and preferred over other copyleft licenses,
whether for legibility/clarity, better legal interpretability, etc. This
means that, at least to some extent, the underlying goal of
copyleft-next would be to replace the GPL at least in some use cases.
Most probably this task on its own is ambitious enough for now, and
aiming for a "perfect, all encompassing license that tries to tackle
every single issue at the same time" on the first go might be too much.
At the same time, I remember a discussion in another thread about the
ambition of copyleft-next having a structure similar to CC; perhaps, in
place of that, there could be a family of licenses closer to the
(A/L)-GPL family?
- Vasileios Valatsos