Minutes of AI Assist Committee TUE 2022-06-22, 18:00-19:00 UTC

Wed Jun 29 19:17:42 UTC 2022

An interesting read. Thank you for sharing, bkuhn.

> Some also noted we'd probably need to use terms of service to keep the thing
> FOSS if we wanted to enforce a moral requirement (as opposed to, say, having
> the copyleft license alone “take care of it”).  Using terms of service to
> mandate software freedom has not been done in the past, and may be another
> type of activity that would dilute the coalition between more and less
> strident activists.

IIUC, the terms of service would be used to mandate that this new FOSS
Trainer be exclusively used on FOSS inputs to produce a FOSS Model. Is
that correct?

> As a counter point, it's been reported that GPT-J is estimated to have cost
> *hundreds* of thousands of dollars in retail compute time to train, so the
> costs of running the software may greatly outweigh even the cost to pay
> people to develop and maintain it.

*If* the primary cost to produce a FOSS Model is compute time and not
developer time in creation and maintenance of the Trainer, then perhaps
no TOS would be necessary to prevent the creation of a proprietary Model
with the FOSS Trainer. IOW, if donations are given with the explicit
goal of creating a FOSS Model (and Trainer), and we create a good FOSS
Model, then perhaps the cost to create an alternative proprietary Model
from the FOSS Trainer would be prohibitively high.

Thanks,

Joseph

"Bradley M. Kuhn" <bkuhn at sfconservancy.org> writes:

> We make minutes from the Committee meetings public on this list.  We welcome
> discussion from the public, and some members of the Committee are watching
> this mailing list and will bring useful points raised in public discussion
> back to the committee.
>
> We ask that when you reply to the list, be mindful to keep the subject line
> descriptive.  A long thread with “Re: Minutes of the AI Assist Committee” is
> going to make it difficult to follow the different threads of response that
> come from the Committee's minutes.  Thus, while using threading (via
> In-Reply-To:, References: and other RFC-5322-compliant headers) is still
> useful, making sure that you change Subject: to match the content is much
> appreciated for delineation of conversation.  Thanks!
>
> BEGIN MINUTES, AI Assist Committee on Tuesday 2022-06-22, 18:00-19:00 UTC
>
> The meeting began with a summary of the public mailing list discussion,
> including the level of Computer Science academic interest in this particular
> problem.
>
> We discussed whether to invite academic Computer Science researchers focused
> on APAS, either as full committee members or as guest attendees.  We noted
> that most researchers would focus on studying the utility and viability of
> APAS for developers, so there are few (if any) academics studying the
> *ethics* related to that (as we are).  Nevertheless, their input may be
> useful (at least) for complementary knowledge.
>
> We discussed that such engagement might also help to educate the academics about
> community concerns — even if they wouldn't necessarily take us seriously.
> Committee members knowledgeable on the topic did note that researchers tend to
> always focus on the technology, not on whether or not the technology is ethical.
>
> We also considered briefly the possibility that when the Committee is a bit
> further along, some subset of the committee should write an academic paper
> ourselves.  We thought that such a paper might well have an impact on the
> debate, and would potentially be of interest to the (likely minority of)
> academics who are concerned about the ethics of technology.
>
> ACTION ITEM: Two committee members agreed to follow up on the academic links
>              posted to the list and follow up on these issues.
>
> We had previously talked about the plausibility of FOSS communities
> reproducing the results of large companies in creating APAS's.  Some
> committee members thought it may in fact not be so expensive to train our own
> APAS using 100% FOSS.  One committee member felt this was on the order of tens
> of thousands of USD, which isn't a hobbyist amount, but is an amount that we
> could actually raise through donations.  The community and this committee
> itself might well benefit from a system that's completely FOSS to better
> determine if APAS's are an existential threat to FOSS, or if, in fact, APAS is
> merely another key task that users want that we simply need to replace with
> FOSS (i.e., perhaps APAS's *are* no different from any other proprietary
> systems).
>
> As a counter point, it's been reported that GPT-J is estimated to have cost
> *hundreds* of thousands of dollars in retail compute time to train, so the
> costs of running the software may greatly outweigh even the cost to pay
> people to develop and maintain it.
>
> A key additional task is curating the training set, since the model will
> likely only be particularly useful if the training set includes actually good
> choices of what software was fed in.  APAS (and machine learning in general)
> are well-known to suffer from traditional “garbage in, garbage out” problems.
>
> We discussed that there is still the threshold question of whether the
> software itself (regardless of whether it is FOSS or proprietary) is ethical
> at all.  A committee member compared APAS to FOSS implementations of Digital
> Restrictions Management (DRM): while it's *possible* to write FOSS that does
> the job of DRM, no one who writes FOSS wants to create such software, as most
> people who support FOSS are also opposed to DRM entirely.  Should activists
> argue that APAS's are just as dangerous to the future of computing as DRM?
>
> However, a *FOSS* APAS (end-to-end — that spans from training, to model
> tweaking tools, to APIs, to editor plugins — quite a bit of software) *would*
> allow the Committee to more easily delineate the problems with these
> systems.  It could well turn out that there is no existential threat: we
> simply need FOSS solutions for this “new task in town”.
>
> Furthermore, a FOSS system could be designed to annotate licensing
> information (e.g., tagging it) in the model itself, which could then be used
> as part of output to help users.  AI training systems are notoriously bad at
> telling you “Why!?!?” they got a particular answer.  But licensing information
> is generally well-marked with the code, so including the information as part
> of the metadata may be a subset of the “Why!?!?” problem that we *can* solve.
> This is difficult, but might not be impossible.  It *is*, however, an active
> area of research as it's asking roughly the same scientific question as is
> asked regarding “explainability of machine learning solutions” — so engaging
> in this approach would require us to be involved in PhD-level CS work.
>
> Nevertheless, if we were successful at annotating in this way, it would also
> set the narrative and industry standard that we *do* expect licensing
> information to be carried through AI models.  A committee member noted that
> this approach may be more doable than we think — while researchers and others
> are saying carrying this information along is very difficult, their
> motivation is *not* to carry this information along as annotations, and as such,
> there has been limited scientific work in this area thus far (compared to
> machine learning in general).  Another member noted that the motivations
> might also be that, given that creating models is expensive, they want to
> have *one* model, and not separate copyleft and non-copyleft models.
>
> Multiple committee members noted that if we *do* write a FOSS system, it
> definitely needs to be copylefted itself.  It's notable that the license of
> the model is also in question.  Generally speaking, inputs and outputs aren't
> impacted by most copyleft licenses of software, so the license of the model
> probably stems from input to output (which is what many are already saying is
> true with Copilot anyway).
>
> We moved to a tangential discussion about how the training set license might
> impact the output license.  It's an easy assumption to make that the entire
> model is impacted by the license of all the data that's input.  (Microsoft
> and GitHub obviously argue the opposite.)  However, the most conservative and
> easiest legal analysis would obviously be that the input and output must be
> under the same license.  Nevertheless, the attribution and patent clauses of
> licenses are making it even more difficult to determine the licensing terms
> of the model.  (In some sense, copyleft is easier because it requires the
> whole work (including derivatives and works based on it) to carry the same
> license.  IOW, ASLv2 and attribution-only FOSS licenses provide complexities
> of metadata that copyleft doesn't.)
>
> We could make the simple conclusion as a committee that “Our Rule” is: “if
> License A is in the training set, then License A terms apply to the model”.
> If we build an alternative system that respects this rule, we can simply
> conclude that APAS's that fail to do this *are* violating the licenses.  That
> conclusion is, at least, no more or less reasonable than GitHub's position
> that “if License A is in the training set, the user of the APAS can ignore
> license A no matter what happens”.  It's harder to make the claim that “Our
> Rule” is correct if we don't have an APAS to offer that follows “Our Rule”.
> But, if we produce a working APAS that follows “Our Rule”, it gives us a
> non-hypothetical example of the conservative licensing approach.
> As an activist matter, this turns it back to others to argue that “Our Rule”
> is too strict.  In other words, given that these issues are entirely novel,
> there is at least just as much chance that “Our Rule” is correct as any
> other, and as such, we should really create an APAS that follows “Our Rule”.
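
A minimal sketch of the logic of “Our Rule” as stated above, assuming each
training file carries a single license identifier. The names `TrainingFile`,
`licenses_applying_to_model` and `output_respects_rule` are hypothetical, and
the sketch deliberately ignores license compatibility, attribution and patent
clauses; it is an illustration of the rule, not a legal analysis or an
existing tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingFile:
    """One file in a hypothetical training set (illustrative only)."""
    path: str
    license_id: str  # assumed: one SPDX-style identifier per file


def licenses_applying_to_model(training_set):
    """Apply the rule: every license in the training set applies to the model."""
    return {f.license_id for f in training_set}


def output_respects_rule(training_set, declared_output_licenses):
    """True if the licenses declared for the output cover every training license."""
    return licenses_applying_to_model(training_set) <= set(declared_output_licenses)


training_set = [
    TrainingFile("lib/parser.c", "GPL-3.0-or-later"),
    TrainingFile("util/log.py", "Apache-2.0"),
    TrainingFile("util/fmt.py", "Apache-2.0"),
]

print(sorted(licenses_applying_to_model(training_set)))
print(output_respects_rule(training_set, ["Apache-2.0"]))
```

Under these assumptions, output declared as Apache-2.0 only would not respect
the rule, because GPL-licensed code is also in the training set.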
>
> The license compliance industry *now* generally encourages companies to
> respect other people's licenses.  They are in some way unlikely allies to
> “Our Rule” here.  While the compliance industry are often fearmongers,
> they do have a valid point that the external inputs to your software
> development process *do* impact the licensing conclusions; copyleft
> activists agree with the compliance industry on that point.  We should
> frankly ask the compliance industry: “Do you really think Microsoft and
> GitHub know some legal decision we don't?  Can you really be sure that using
> their system will not cause an infringement problem for you?”  We expect that
> is the key manner in which the compliance industry is an unlikely ally.
>
> Regardless, these are hard issues to raise when there is nothing to compare
> Copilot to, but if we build this FOSS APAS that follows “Our Rule”, it's
> much easier to reject APAS's that *aren't* license-respecting.  Regardless,
> open compliance is central: a proprietary APAS would never give enough
> information about compliance because it's a black box, just as proprietary
> scanning tools from the compliance industry are useless black boxes, too.
>
> In conclusion on this line of thinking: we have more power if we have built
> the tool that does it right, and we can create an industry standard that you
> *should* be tracking the license in your APAS's.  Ultimately, this is the
> “hackavist” way to approach the problem.
>
> However, are we looking to the solution of writing new FOSS because writing
> new FOSS is a task our community knows how to do?  Is there a different
> approach we could or should take that we're not seeing because we're enticed
> by an (albeit difficult) problem that we already know how to solve?
>
> A counter-argument to that worry is that, with a FOSS APAS following “Our
> Rule”, we can poke at very specific parts of the system and examine its
> ethics outside of the fact that the APAS itself is proprietary.  IOW, we
> all don't like Copilot no matter *what* its job since it is, itself, a
> proprietary/trade-secret system from top to bottom.  Without looking at a FOSS
> system that does this task, it's hard to consider the broader ethical
> problems that are unique to APAS's.
>
> A committee member changed topics to ask a fundamental question: “Is it
> morally — not legally, but morally/ethically — wrong to create new
> proprietary software with an APAS?”
>
> One ethical argument that's been made is that a company that makes money from
> APAS's is profiting from the labor of others without following the wishes of
> those who did the labor (e.g., the license terms).
>
> Another approach is a standard Free Software purist argument, with logic
> like this: since we agree any proprietary software is a moral affront to
> users' rights, then *any* system (be it FOSS, proprietary or otherwise) that
> assists someone to write more proprietary software (particularly one that
> helps the user write proprietary software better, faster and/or more easily)
> is, itself, a morally wrong system.  (By comparison: if you're opposed to
> more fossil fuel usage, you usually would oppose any system that can pump oil
> faster out of the ground.)
> However, there are very few people radical enough such that they agree with
> the premise that “any proprietary software is a moral affront to users'
> rights”.  We could be splitting the fragile coalition of FOSS activists if
> our ethical arguments rely on agreement to that premise.
>
> Meanwhile, it could be that APAS's make writing software *so* easy that it's
> easier to write FOSS, too.  We could easily theorize a science-fiction-like
> system that makes software so easy to write, that it would be strange to
> write any proprietary software.  In such a world, writing new code would
> always be easier than using someone else's proprietary software.  This
> hypothetical points to the fact that our view is always skewed by the fact that
> proprietary software is the *norm*, while FOSS is the rarity.  We should make
> an effort to at least consider how various ideas and arguments play out *if*
> FOSS were to become the norm — even though that doesn't seem likely in the
> future, could APAS's be the innovation that turns that tide?
>
> On a related point, historically, we *have* written a lot of FOSS that helps
> people write proprietary software, and they do so with those tools: Emacs,
> GCC, Eclipse, etc.  One committee member suggested that perhaps it's just a
> weird side-effect and/or a licensing bug that Emacs' license allows the user
> to write proprietary software with it.  Is “input/output excluded from
> copyleft” (i.e., that copyleft licenses went to great effort to avoid having
> the license govern the input and output of the software) a principle, or
> simply a historical accident?  The fact that we think it's problematic to
> restrict “field of use” (to use the OSD framing of the same question) may be
> more “cargo cult” than it is based on some central principle; and, it could
> be true that our historical thinking was overly influenced by the question
> of how copyright covers software, and the various derivative works standards
> in different legal systems.  IOW, maybe “no field of use restriction” was
> simply inherent in the license design because the licenses were contemplated
> first as copyright licenses, so “field of use” including “proprietary
> software development” was more a practicality of license design rather than
> an ethically-founded moral principle.
>
> The problem, however, is if we put the ethical restrictions that go too far
> (in any direction), we have a hard time building/maintaining a broad
> coalition of FOSS supporters.  Even we on this very committee don't
> necessarily agree that all inputs and outputs of FOSS should be FOSS, yet we
> all consider ourselves strong software freedom activists.
>
> In response to this point, one member attempted to narrow the question: Could
> we all agree that, as an ethical matter (not necessarily a legal matter) that:
>
>    If the inputs to a training set for an APAS are FOSS, *then* the output the
>    user gets from the APAS should (as a moral matter) be FOSS?
>
> If we did agree to that, purely as a moral/ethical matter, would we be losing
> a coalition that exists elsewhere in FOSS?
>
> Unfortunately, we realized, the fragile FOSS coalition may well be built more
> around what the legal conclusion is rather than what activists believe is the
> correct moral conclusion.  There may not actually be an overriding moral
> principle that binds FOSS activism together; it may merely be that we all
> have simply historically agreed on the legal conclusions about copyleft's
> scope.  This issue could well be testing that coalition, since it may be the
> first time the copyleft scope is not obvious to everyone in the broader FOSS
> coalition.
>
> Some also noted we'd probably need to use terms of service to keep the thing
> FOSS if we wanted to enforce a moral requirement (as opposed to, say, having
> the copyleft license alone “take care of it”).  Using terms of service to
> mandate software freedom has not been done in the past, and may be another
> type of activity that would dilute the coalition between more and less
> strident activists.
>
> As a final matter for the meeting, we discussed what one committee member
> dubbed the “creepiness factor” of AI-assistive systems.  We've found there to
> be creepy and systemic bias issues in, for example, AI systems that assist
> with hiring, or those that decide what alleged criminals receive bail.  We
> considered: do these kinds of problems exist with APAS's?
>
> The general consensus of the committee was that *if* we are training on all
> FOSS anyway, then everything in the training set, even comments or other
> personal information, would have been made public already.  It's unlikely
> to produce racially/ethnically biased, violent, or triggering output for the
> user.  The worst we could imagine would be curse words and the like.  Once the
> data is public, it's unlikely something could be rehashed in a way that would
> likely upset developers.
>
> There *is* a lot of personal stuff in code repositories, and that's been well
> confirmed by others.  However, the problem here may ultimately be identical
> to the problem of accidentally committing data to a Git repository that you
> didn't mean to, which means similar solutions that solve the problems there
> will need to be applied here.  The big hurdle is how to remove items from a
> model without complete retraining.
>
> However, if we decide the right approach to respond to APAS is to put full
> force of support behind a copylefted, FOSS APAS (end-to-end) that follows
> “Our Rule”, then these problems *will* be our problems.  It's admittedly much
> easier to criticize bad actors if you are not trying to also solve the same
> problems.
>
> END MINUTES, AI Assist Committee on Tuesday 2022-06-22, 18:00-19:00 UTC