Minutes of the AI Assist Committee on Tuesday 2022-03-22, 18:00-19:00 UTC

Mon May 23 19:14:55 UTC 2022

I want to thank all of you who subscribed to this list when we announced it
back in March for your patience.  I was out of the office for half of April
due to a wrist injury that made it difficult for me to type, and as such I
did not complete the minutes from the first meeting of the Committee.

We will make minutes from the Committee meetings public on this list.  We
welcome discussion from the public, and some members of the Committee are
watching this mailing list and will bring useful points raised in public
discussion back to the Committee.

Since I'll be the primary one doing this, I'd ask that when you reply to the
list, you be mindful to keep the Subject line descriptive.  A long thread
titled “Re: Minutes of the AI Assist Committee” will make it difficult for me
to pull together public discussions for the Committee to digest, so while
using threading is great and useful, making sure that the Subject line
matches the content is also much appreciated.

BEGIN MINUTES, AI Assist Committee on Tuesday 2022-03-22, 18:00-19:00 UTC

Committee reached consensus that the goal of the Committee is not to come to
a legal conclusion about what existing copyleft licenses do or do not
require with regard to AI Programming Assistance Software (APAS) [0], but
rather, what we believe are users' rights of software freedom regarding
APAS.  While we'll consider how the existing copyleft licenses work with
regard to these issues and what license drafting might now be needed due to
the advent of APAS, we will focus initially on a policy statement/document
that explains our conclusions about what rights users deserve regarding
APAS.  There are very few lawyers on the Committee by design, as this isn't
primarily a legal endeavor; it's a policy endeavor.

We framed the usual APAS system as three parts:

0. a Corpus, which is software, and is under some license.

1. That Corpus is fed into a software system (the “Trainer”) that trains on
   it and creates some “Model”.  The Model is the output of this process.

2. The Model is then *input* to a third piece of software, which is the
   system that the Programmer uses directly.

We considered the question in terms of the standard Free Software Definition
(the four freedoms).  First, the “freedom to run” didn't seem highly
relevant to any of the parts, since they are unlikely to take the “freedom
to run” away.

Second, the Trainer exercises its own “freedom to study”, and it seems
reasonable that users should be able to use software tools to study other
software (cf: static analysis tools).

Third, the issue of “freedom to modify” is the heart of the issues with all
parts of an APAS.  The fundamental question may well be whether the
Model is “based on” or a “derivative work” of the software, which quickly
drags us into legal details we want to avoid, or (at least) defer.

The example was given of the first copylefted program that was likely to
take other FOSS as its inputs and outputs: The GPL'd GNU implementation of
the Unix-like 'cp' command.  When we look at 'cp', its operation has three
parts: input (the file to be copied), output (the copy of the file), and 'cp'
itself, which is the program that does the copying activity.  Every software
rights activist agrees that 'cp' ought to be FOSS for all who use it — even
if they use it with inputs and outputs that are under a variety of terms
(including some copylefted and some not).  The question of whether 'cp'
should be FOSS has always been considered orthogonal to terms/restrictions
of its inputs and outputs.

When we analogize 'cp' to the Trainer, we conclude three things:

(a) activists have never argued that 'cp' should be engineered to refuse to
    copy non-FOSS,

(b) 'cp' has never enforced any sort of license-checking to assure that the
    copylefted software it copies is copied in compliance (i.e., you
    could easily use cp to install copylefted binaries without bringing the
    source code along),

(c) copyleft as a policy matter has never restricted anyone from using
    proprietary Unix 'cp' commands on copylefted software.

Similarly, programmers have always had permissions to use proprietary tools
to modify copylefted software (e.g., using Visual Studio).

Therefore, by analogy, software freedom ethicists might conclude, under
historical policy precedent, that:

   * A Trainer can operate freely on both FOSS and non-FOSS (as inputs) to
     create Models (as outputs).
   * A Trainer can allow proprietary-licensed Models to be created when the
     inputs are proprietary.
   * Proprietary Trainers are not particularly worse than any other
     proprietary software; IOW, they are a moral injustice to software
     rights, but not necessarily more so than other proprietary software.

This in some way creates a “lower bound threshold” on what the license of
the Model should be.  If we agree with this analysis, it indicates that we
should not seek to influence the license constraints of the Model via the
terms of FOSS Trainers, but rather should solely focus on the license of the
inputs.  While we're avoiding the legal question of whether a Model is
*already* a “work based on the Program” (or otherwise triggers copyleft
requirements), we could conclude, as a moral matter, that copylefted inputs
should mean a copylefted Model.

Conversation at this point descended into legal details of fair use and
requirements for derivative works.  We found it difficult as a Committee to
avoid thinking in these legalistic terms multiple times during the meeting —
likely because so much discussion about copyleft historically has focused on
the question of what situations trigger source code disclosure.  

There seemed to be some consensus that proprietarizing any Model where
the inputs are FOSS is morally problematic and that the Model *ought* to be
FOSS.

 * * *

We then considered a more fundamental question: is there a set of social
concerns at a meta-level that has hitherto not been considered?  Namely, we
now have “programmers writing programs that write programs and help others
to write programs”.  What are our concerns, given that programmers can write
software that they themselves don't understand because an AI generated that
code for them?

We wondered about the analogy to the now long-standing tradition of writing
code by cut-and-paste.  We consider that situation well understood (again,
heavily influenced by various legal conclusions about what types of
cut-and-paste are fair use and what types require
modification/redistribution permissions).  We want to avoid the risk of
losing sight of Free Culture issues.  Any policy conclusions we come to must
have coherent outcomes for Free Culture (i.e., AI-assisted systems that help
make creative works) and Free Software (APAS).

The committee reviewed the article from Creative Commons, entitled
“Beginning of Creative Commons consideration on AI training using CC
content” available at
https://creativecommons.org/2021/03/04/should-cc-licensed-content-be-used-to-train-ai-it-depends/
We note that CC's work focuses in the exact direction we're trying to avoid:
thinking purely in terms of what copyright rules require or don't in this
regard, rather than considering the policy implications for rights and
freedoms.  One Committee member noted that CC's mission has historically
been focused on “some rights reserved”, so it's unsurprising they don't seem
to have a policy contribution here.  Also, CC has not done anything publicly
on this issue since then.  Nevertheless, we do hope to include more
participants from the Free Culture community, and are open to inviting more.
We noted, however, that a high-ranking person from CC itself has been
invited to the Committee but has not responded to multiple invitations.

We then moved on to discussion about whether *all* Trainers are
currently proprietary.  Some thought was given to the idea that the Trainer
is analogous to a source-to-binary compiler (e.g., GCC).

We discussed whether we wanted to fund or make a Trainer that did respect
software freedom.  Are proprietary Trainers already so far technologically
ahead that it is hopeless for us to make a FOSS one?  We suspect that there
may not be anything particularly technologically advanced about them, and
there may well be existing FOSS that operates as Trainers.  However,
Trainers are somewhat similar to Internet search problems in terms of
computing resources.  Trainers require large amounts of data storage, CPU,
and RAM.  Thus, the main issue is not learning and implementing the Training
algorithms, but being able to run them.

We pondered: “is that a freedom-to-run issue?”  Is it fair to users that
they cannot rerun the Trainer and produce Models without expensive hardware?
Even if all Trainers were/are FOSS, the non-commercial FOSS community would
remain at a disadvantage.

 * * *

Conversation then moved to the question of the Trainer and the Model together,
as a unit.  The Model is the most interesting artifact, and the Trainer
rarely is (although it's admittedly useful).  The Model is also
non-deterministic, which means byte-for-byte reproducibility is basically
impossible, but near reproducibility should be possible.

Meanwhile, the Model is inherently tied to the software that builds it, but
it is not a situation where a programmer is likely to be able to make a
change to the inputs and the Trainer and understand the resulting changes to
the Model.  It's inherently different from the compiler example, where you
can change the source and have a clear understanding as a programmer of what
the source code change will yield after running it through the compiler.
Perhaps a better analogy is a spell checker and its data: the spell checker
(as the Trainer) is useless without a correctly-spelled-word list.

All this raises the question: is the Model really more like code than data?
And, if we treat the Model as software, then what is its complete
corresponding source (CCS)?  It seems the CCS *must* be all the input data,
as well as the code that trains it, as well as the steps to do the Training.

Nevertheless, a code/data dichotomy may well be completely insufficient for
this analysis.  The APAS is trying to do “learning” in the way that humans
do.  The human analogy would cast the Trainer as the human's life
experience in programming, and the “Model” as the neural net in their own
brain.  We are left asking: “Is the Model the philosophical and moral
equivalent of a human programmer's brain?”

Computers, however, allow some sort of culpable deniability that humans
don't have.  Humans make a moral choice to do literal copying, as humans
(usually) *know* when they've memorized something and typed it in again
vs. writing something new.  Intent matters for humans, but there is no
presumption of malice on the part of an APAS.

Furthermore, human memory is rarely all that good, whereas computers, even a
Model, have the ability to memorize and reiterate verbatim.

 * * *

Discussion returned to the legal details of copyright and literal copying.

That led us to wonder: “Does the copyright-centered nature of traditional
FOSS thinking continue to cause great difficulty in our ability to
understand the implications of APAS?  Is our vocabulary lacking because we
keep deferring to copyright concepts even as we try to discuss a purely
policy analysis separate from copyright implications?”  The “four freedoms”
may themselves fail us here, since they are targeted at the same framework
of activities as copyright.  Considering AI usage more broadly (thinking
more of the racial bias in AIs other than APAS), other people (who are
neither the user, the author, nor the programmer) are often impacted by AIs
and their behavior.

No one understands yet — neither ethically, morally, nor legally — what “AI
authorship” is, and how it compares to “human authorship”.  As such, the
problem of collaborative “co-authorship” by a human and an AI (particularly
when that AI has learned what it knows by studying primarily human-written
FOSS with both licensing and moral imperatives to assure future software
rights) is even more difficult to consider.

We ended noting that copyleft has expanded in the past in the face of new
technology (e.g., GPL ⇒ Affero GPL).  We can and should consider changes to
both software freedom/rights philosophy and copyleft in the face of these
new technologies.

Further Reading Proposed (on policy questions of studying AIs):

James Grimmelmann (Cornell Law School; Cornell Tech) and Daniel Westreich
(University of North Carolina (UNC) at Chapel Hill - Department of
Epidemiology) article “Incomprehensible Discrimination”:
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2950018

END MINUTES, AI Assist Committee on Tuesday 2022-03-22, 18:00-19:00 UTC

[0] I'm using the acronym APAS for lack of a better abbreviation.  The
    acronym is intended to include all parts of an AI Programming Assistance
    Software stack.
-- 
Bradley M. Kuhn - he/him
Policy Fellow & Hacker-in-Residence at Software Freedom Conservancy