Minutes of AI Assist Committee TUE 2022-06-22, 18:00-19:00 UTC

Bradley M. Kuhn bkuhn at sfconservancy.org
Wed Jun 29 17:43:10 UTC 2022


We make minutes from the Committee meetings public on this list.  We welcome
discussion from the public, and some members of the Committee are watching
this mailing list and will bring useful points raised in public discussion
back to the committee.

We ask that when you reply to the list, you be mindful to keep the subject
line descriptive.  In a long thread where every message is titled
“Re: Minutes of the AI Assist Committee”, it is difficult to follow the
different threads of response that come from the Committee's minutes.  Thus,
while using threading (via In-Reply-To:, References: and other
RFC-5322-compliant headers) is still useful, changing Subject: to match the
content is much appreciated for delineating conversations.  Thanks!

BEGIN MINUTES, AI Assist Committee on Tuesday 2022-06-22, 18:00-19:00 UTC

The meeting began with a summary of the public mailing list discussion,
including the level of Computer Science academic interest in this particular
problem.

We discussed whether to invite academic Computer Science researchers focused
on APAS, either as full committee members or as guest attendees.  We noted
that most researchers focus on studying the utility and viability of APAS
for developers, so there are few (if any) academics studying the *ethics*
related to that (as we are).  Nevertheless, their input may be useful (at
least) for complementary knowledge.

We discussed that such engagement might also help educate the academics
about community concerns — even if they wouldn't necessarily take us
seriously.  Committee members knowledgeable on the topic did note that
researchers tend to focus on the technology itself — not on whether the
technology is ethical.

We also briefly considered the possibility that, when the Committee is a bit
further along, some subset of the committee should write an academic paper
ourselves.  We thought that such a paper might well have an impact on the
debate, and would potentially be of interest to the (likely minority of)
academics who are concerned about the ethics of technology.

ACTION ITEM: Two committee members agreed to review the academic links
             posted to the list and follow up on these issues.

We had previously talked about the plausibility of FOSS communities
reproducing the results of large companies in creating APAS's.  Some
committee members thought it may in fact not be so expensive to train our own
APAS using 100% FOSS.  One committee member felt this was on the order of
tens of thousands of USD — which isn't a hobbyist amount — but is an amount
that we could actually raise effectively through donations.  The community
and this committee itself might well benefit from a system that's completely
FOSS to better determine if APAS's are an existential threat to FOSS, or if,
in fact, APAS is merely another key task that users want and that we simply
need to replace with FOSS (i.e., perhaps APAS's *are* no different from any
other proprietary systems).

As a counterpoint, it's been reported that GPT-J is estimated to have cost
*hundreds* of thousands of dollars in retail compute time to train, so the
costs of running the software may greatly outweigh even the cost to pay
people to develop and maintain it.
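
To make the orders of magnitude concrete, a minimal back-of-envelope sketch
follows.  Every figure in it (the GPU-hour counts and the hourly rate) is an
illustrative assumption rather than a number anyone reported; it only shows
how one arrives at “tens” versus “hundreds” of thousands of USD:

    # Back-of-envelope only: every number here is an assumed, illustrative
    # figure, not something reported to the committee.
    def training_cost_usd(gpu_hours, usd_per_gpu_hour):
        """Retail compute cost of a single training run."""
        return gpu_hours * usd_per_gpu_hour

    # A modest run over a curated FOSS corpus vs. a GPT-J-scale run:
    print(training_cost_usd(20_000, 2.00))   # 40,000 USD  ("tens of thousands")
    print(training_cost_usd(250_000, 2.00))  # 500,000 USD ("hundreds of thousands")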

A key additional task is curating the training set, since the model will
likely only be particularly useful if good choices are made about what
software is fed in.  APAS's (and machine learning systems in general) are
well known to suffer from traditional “garbage in, garbage out” problems.
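
As a purely hypothetical illustration of what such curation could mean in
practice, consider a minimal filter like the sketch below; the SPDX
allowlist, the repository fields, and the thresholds are all assumptions
made for the example, not anything the committee decided:

    # Hypothetical training-set curation filter.  The field names, the
    # license allowlist, and the thresholds are illustrative assumptions.
    FOSS_ALLOWLIST = {"GPL-3.0-or-later", "Apache-2.0", "MIT"}  # SPDX identifiers

    def include_in_training_set(repo):
        """Keep only clearly-licensed FOSS repos that meet a basic quality bar."""
        return (
            repo["spdx_license"] in FOSS_ALLOWLIST  # provenance/licensing is known
            and repo["has_test_suite"]              # crude "not garbage" signal
            and repo["stars"] >= 50                 # crude popularity proxy
        )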

We discussed that there is still the threshold question of whether the
software itself (regardless of whether it is FOSS or proprietary) is ethical
at all.  A committee member compared APAS to FOSS implementations of Digital
Restrictions Management (DRM): while it's *possible* to write FOSS that does
the job of DRM, no one who writes FOSS wants to create such software, as most
people who support FOSS are also opposed to DRM entirely.  Should activists
argue that APAS's are just as dangerous to the future of computing as DRM?

However, a *FOSS* APAS (end-to-end — that spans from training, to model
tweaking tools, to APIs, to editor plugins — quite a bit of software) *would*
allow the Committee to more easily delineate the problems with these
systems.  It could well turn out that there is no existential threat: we
simply need FOSS solutions for this “new task in town”.

Furthermore, a FOSS system could be designed to annotate licensing
information (e.g., tagging it) in the model itself, which could then be used
as part of the output to help users.  AI training systems are notoriously
bad at telling you “Why!?!?” they gave a particular answer.  But licensing
information
is generally well-marked with the code, so including the information as part
of the metadata may be a subset of the “Why!?!?” problem that we *can* solve.
This is difficult, but might not be impossible.  It *is*, however, an active
area of research as it's asking roughly the same scientific question as is
asked regarding “explainability of machine learning solutions” — so engaging
in this approach would require us to be involved in PhD-level CS work.
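
A minimal sketch of what “annotating licensing information in the model”
could look like at the data level appears below.  The structures and field
names are hypothetical, made up purely for illustration; attributing a
suggestion to the specific training samples that influenced it is exactly
the open explainability problem described above:

    # Hypothetical data model for carrying license metadata through an APAS.
    # Nothing here describes an existing system; it only illustrates the idea
    # that each training sample keeps its license and provenance, so that a
    # suggestion could report the licensed material that influenced it.
    from dataclasses import dataclass, field

    @dataclass
    class TrainingSample:
        code: str
        spdx_license: str   # e.g. "GPL-2.0-only", as marked in the source repo
        origin_url: str     # where the code came from

    @dataclass
    class Suggestion:
        code: str
        licenses: list[str] = field(default_factory=list)  # licenses of influencing samples
        origins: list[str] = field(default_factory=list)   # provenance for attribution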

Nevertheless, if we were successful at annotating in this way, it would also
set the narrative and industry standard that we *do* expect licensing
information to be carried through AI models.  A committee member noted that
this approach may be more doable than we think — while researchers and others
are saying that carrying this information along is very difficult, their
motivation is *not* to carry this information along as annotations, and as
such,
there has been limited scientific work in this area thus far (compared to
machine learning in general).  Another member noted that the motivations
might also be that, given that creating models is expensive, they want to
have *one* model, and not separate copyleft and non-copyleft models.

Multiple committee members noted that if we *do* write a FOSS system, it
definitely needs to be copylefted itself.  It's notable that the license of
the model is also in question.  Generally speaking, inputs and outputs aren't
governed by most copyleft licenses on software, so the license of the model
probably flows from its inputs through to its outputs (which is what many are
already saying is true of Copilot anyway).

We moved to a tangential discussion about how the training set license might
impact the output license.  It's an easy assumption to make that the entire
model is impacted by the license of all the data that's input.  (Microsoft
and GitHub obviously argue the opposite.)  However, the most conservative and
easiest legal analysis would obviously be that the input and output must be
under the same license.  Nevertheless, the attribution and patent clauses of
licenses make it even more difficult to determine the licensing terms of the
model.  (In some sense, copyleft is easier because it requires the whole work
(including derivatives and works based on it) to carry the same license.
IOW, ASLv2 and attribution-only FOSS licenses introduce metadata complexities
that copyleft doesn't.)

We could reach the simple conclusion as a committee that “Our Rule” is: “if
License A is in the training set, then License A's terms apply to the model”.
We could then build an alternative system that respects this rule, and simply
conclude that APAS's that fail to do this *are* violating the licenses.  That
conclusion is, at least, no more or less reasonable than GitHub's position
that “if License A is in the training set, the user of the APAS can ignore
License A no matter what happens”.  It's harder to make the claim that “Our
Rule” is correct if we don't have an APAS to offer that follows “Our Rule”.
But if we produce a working APAS that follows “Our Rule”, it gives us a
non-hypothetical example of the conservative licensing approach.  As an
activist matter, this turns it back to others to argue that “Our Rule” is
too strict.  In other words, given that these issues are entirely novel,
there is at least just as much chance that “Our Rule” is correct as any
other, and as such, we should really create an APAS that follows “Our Rule”.
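
Stated as code, “Our Rule” is simple.  The sketch below is only an
illustration, reusing the hypothetical TrainingSample.spdx_license field from
the earlier sketch; it does not describe any real system:

    # "Our Rule": every license present in the training set also applies to
    # the model, and therefore constrains what a user may do with its output.
    # Purely illustrative; assumes each sample carries a hypothetical
    # spdx_license attribute.
    def licenses_governing_model(training_set):
        return {sample.spdx_license for sample in training_set}

    def output_obligations(training_set):
        # Under "Our Rule", an APAS user must satisfy the terms (attribution,
        # copyleft, patent clauses, ...) of all of these licenses.
        return sorted(licenses_governing_model(training_set))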

The license compliance industry *now* generally encourages companies to
respect other people's licenses.  They are in some ways unlikely allies to
“Our Rule” here.  While the compliance industry is often full of
fear-mongers, they do have a valid point that the external inputs to your
software development process *do* impact the licensing conclusions, and so
must be tracked; copyleft activists agree with the compliance industry on
that point.  We should
frankly ask the compliance industry: “Do you really think Microsoft and
GitHub know some legal decision we don't?  Can you really be sure that using
their system will not cause an infringement problem for you?”  We expect that
is the key manner in which the compliance industry is an unlikely ally.

Regardless, these are hard issues to raise when there is nothing to compare
Copilot to, but if we build this FOSS APAS that follows “Our Rule”, it's much
easier to reject APAS's that *aren't* license-respecting.  In any case, open
compliance is central: a proprietary APAS would never give enough information
about compliance because it's a black box, just as proprietary scanning tools
from the compliance industry are useless black boxes.

[In conclusion on this line of thinking: we have more power if we have built
the tool that does it right, and we can create an industry standard that you
*should* be tracking licenses in your APAS's.  Ultimately, this is the
“hacktivist” way to approach the problem.]

However, are we looking to the solution of writing new FOSS because writing
new FOSS is a task our community knows how to do?  Is there a different
approach we could or should take that we're not seeing because we're enticed
by an (albeit difficult) problem that we already know how to solve?

A counter-argument to that worry is that, with a FOSS APAS following “Our
Rule”, we can poke at very specific parts of the system and examine its
ethics outside of the fact that the APAS itself is proprietary.  IOW, none of
us likes Copilot no matter *what* its job is, since it is, itself, a
proprietary/trade-secret system from top to bottom.  Without looking at a
FOSS system that does this task, it's hard to consider the broader ethical
problems that are unique to APAS's.

A committee member changed topics to ask a fundamental question: “Is it
morally — not legally, but morally/ethically — wrong to create new
proprietary software with an APAS?”

One ethical argument that's been made is that a company that makes money
from APAS's is profiting from the labor of others without following the
wishes of those who did the labor (e.g., the license terms).

Another approach is a standard Free Software purist argument, with logic
like this: since we agree that any proprietary software is a moral affront to
users' rights, then *any* system (be it FOSS, proprietary or otherwise) that
assists someone to write more proprietary software (particularly one that
helps the user write proprietary software better, faster and/or more easily)
is, itself, a morally wrong system.  (By comparison: if you're opposed to
more fossil fuel usage, you would usually oppose any system that can pump oil
out of the ground faster.)  However, there are very few people radical enough
to agree with the premise that “any proprietary software is a moral affront
to users' rights”.  We could be splitting the fragile coalition of FOSS
activists if our ethical arguments rely on agreement with that premise.

Meanwhile, it could be that APAS's make writing software *so* easy that it's
easier to write FOSS, too.  We could easily theorize a science-fiction-like
system that makes software so easy to write that it would be strange to write
any proprietary software.  In such a world, writing new code would always be
easier than using someone else's proprietary software.  This hypothetical
points to the fact that our view is always skewed by the fact that
proprietary software is the *norm*, while FOSS is a rarity.  We should make
an effort to at least consider how various ideas and arguments play out *if*
FOSS were to become the norm — even though that doesn't seem likely in the
future, could APAS's be the innovation that turns that tide?

On a related point, historically, we *have* written a lot of FOSS that helps
people write proprietary software, and people do write proprietary software
with those tools: Emacs, GCC, Eclipse, etc.  One committee member suggested
that perhaps it's just a
weird side-effect and/or a licensing bug that Emacs' license allows the user
to write proprietary software with it.  Is “input/output excluded from
copyleft” (i.e., that copyleft licenses went to great effort to avoid having
the license govern the input and output of the software) a principle, or
simply a historical accident?  The fact that we think it's problematic to
restrict “field of use” (to use the OSD framing of the same question) may be
more “cargo cult” than it is based on some central principle; and, it could
be true that our historical thinking was overly influenced by the question
of how copyright covers software, and the various derivative works standards
in different legal systems.  IOW, maybe “no field of use restriction” was
simply inherent in the license design because the licenses were contemplated
first as copyright licenses, so “field of use” including “proprietary
software development” was more a practicality of license design rather than
an ethically-founded moral principle.

The problem, however, is that if we impose ethical restrictions that go too
far (in any direction), we will have a hard time building/maintaining a broad
coalition of FOSS supporters.  Even we on this very committee don't
necessarily agree that all inputs and outputs of FOSS should be FOSS, yet we
all consider ourselves strong software freedom activists.

In response to this point, one member attempted to narrow the question:
Could we all agree, as an ethical matter (not necessarily a legal matter),
that:

   If the inputs to a training set for an APAS are FOSS, *then* the output
   the user gets from the APAS should (as a moral matter) be FOSS?

If we did agree to that, purely as a moral/ethical matter, would we be losing
a coalition that exists elsewhere in FOSS?

Unfortunately, we realized, the fragile FOSS coalition may well be built more
around what the legal conclusion is than around what activists believe is the
correct moral conclusion.  There may not actually be an overriding moral
principle that binds FOSS activism together; it may merely be that we all
have simply historically agreed on the legal conclusions about copyleft's
scope.  This issue could well be testing that coalition, since it may be the
first time the copyleft scope is not obvious to everyone in the broader FOSS
coalition.

Some also noted we'd probably need to use terms of service to keep the thing
FOSS if we wanted to enforce a moral requirement (as opposed to, say, having
the copyleft license alone “take care of it”).  Terms of service have not
been used to mandate software freedom in the past, and doing so may be
another type of activity that would dilute the coalition between more and
less strident activists.

As a final matter for the meeting, we discussed what one committee member
dubbed the “creepiness factor” of AI-assistive systems.  We've found there to
be creepy and systemic bias issues in, for example, AI systems that assist
with hiring, or those that decide which alleged criminals receive bail.  We
considered: do these kinds of problems exist with APAS's?

The general consensus of the committee was that *if* we are training only on
FOSS anyway, then everything in the training set (even comments or other
personal information) would have been made public already.  The output is
unlikely to be racially/ethnically biased, violent, or triggering to the
user.  The worst we could imagine would be curse words and the like.  Once
the data is public, it's unlikely it could be rehashed in a way that would
upset developers.

There *is* a lot of personal stuff in code repositories, and that's been well
confirmed by others.  However, the problem here may ultimately be identical
to the problem of accidentally committing data to a Git repository that you
didn't mean to, which means similar solutions that solve the problems there
will need to be applied here.  The big hurdle is how to remove items from a
model without complete retraining.

However, if we decide the right approach to responding to APAS is to put the
full force of our support behind a copylefted, FOSS APAS (end-to-end) that
follows
“Our Rule”, then these problems *will* be our problems.  It's admittedly much
easier to criticize bad actors if you are not trying to also solve the same
problems.

END MINUTES, AI Assist Committee on Tuesday 2022-06-22, 18:00-19:00 UTC

