Comparing perspectives: Trainer/Model and Input

Tue Jul 5 01:03:25 UTC 2022

One line of discussion has followed the topic of possible additional restrictions beyond GPL that could be applied to the Model in order to restrict its use in production of proprietary code from copylefted inputs.

In that discussion, there was concern that those additional restrictions might violate Freedom 0. Another thing to consider is an additional licensing restriction on the input code which states something to the effect of "if you use my code to produce a Model, then anything produced by the Model must also carry my code's license." IIUC, a clause like this would clarify the legal definition of a derivative work.
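
As a rough sketch of what such a clause would imply, assuming every input file declares exactly one license: each license present in the training set would carry through to whatever the Model produces. The names `SourceFile` and `licenses_carried_by_model` are hypothetical, and the SPDX-style identifiers are only example values; this illustrates the rule, not any existing tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceFile:
    """One input to the Trainer; both fields are illustrative assumptions."""
    path: str
    license_id: str  # e.g. an SPDX-style identifier such as "GPL-3.0-or-later"


def licenses_carried_by_model(training_set):
    """Return every license whose terms would attach to the Model's output.

    Under the hypothetical clause, each input license carries through,
    so the result is the set of distinct licenses found in the inputs.
    """
    return {source.license_id for source in training_set}


training_set = [
    SourceFile("foo.c", "GPL-3.0-or-later"),
    SourceFile("bar.py", "Apache-2.0"),
    SourceFile("baz.c", "GPL-3.0-or-later"),
]
print(sorted(licenses_carried_by_model(training_set)))
```

Note that the sketch only collects the licenses; whether the collected terms are compatible with one another is a separate question that it does not attempt to answer.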

The Model's license is still a relevant concern here, but I think it's interesting to consider the issue from both the perspective of the rights and wishes of the author of software used as input to the Model and also that of the rights and wishes of the author of the Trainer/Model.

Joseph Turner

On June 29, 2022 10:43:10 AM PDT, "Bradley M. Kuhn" <bkuhn at sfconservancy.org> wrote:
>We make minutes from the Committee meetings public on this list.  We welcome
>discussion from the public, and some members of the Committee are watching
>this mailing list and will bring useful points raised in public discussion
>back to the committee.
>
>We ask that when you reply to the list, be mindful to keep the subject line
>descriptive.  In a long thread titled “Re: Minutes of the AI Assist Committee”,
>it is going to be difficult to follow the different threads of response that
>come from the Committee's minutes.  Thus, while using threading (via
>In-Reply-To:, References: and other RFC-5322-compliant headers) is still
>useful, making sure that you change Subject: to match the content is much
>appreciated for delineation of conversation.  Thanks!
>
>BEGIN MINUTES, AI Assist Committee on Tuesday 2022-06-22, 18:00-19:00 UTC
>
>The meeting began with a summary of the public mailing list discussion,
>including the level of Computer Science academic interest in this particular
>problem.
>
>We discussed whether to invite academic Computer Science researchers focused
>on APAS, either as full committee members or as guest attendees.  We noted
>that most researchers would focus on studying the utility and
>viability of APAS for developers, so there are few (if any) academics
>studying the *ethics* related to that (as we are).  Nevertheless, their input
>may be useful (at least) for complementary knowledge.
>
>We discussed that such engagement might also help to educate the academics about
>community concerns — even if they wouldn't necessarily take us seriously.
>Committee members knowledgeable on the topic did note that researchers tend to
>always focus on the technology, not on whether or not the technology is ethical.
>
>We also considered briefly the possibility that when the Committee is a bit
>further along, some subset of the committee should write an academic paper
>ourselves.  We thought that such a paper might well have an impact on the
>debate, and would potentially be of interest to the (likely minority of)
>academics who are concerned about the ethics of technology.
>
>ACTION ITEM: Two committee members agreed to follow up on the academic links
>             posted to the list and follow up on these issues.
>
>We had previously talked about the plausibility of FOSS communities reproducing
>the results of large companies in creating APAS's.  Some committee members
>thought it may in fact not be so expensive to train our own APAS using 100%
>FOSS.  One committee member felt this was on the order of tens of thousands
>of USD — which isn't hobbyist amounts — but is amounts that we could actually
>effectively raise through donations.  The community and this committee itself
>might well benefit from a system that's completely FOSS to better determine
>if APAS's are an existential threat to FOSS, or if, in fact, an APAS is merely
>another key task that users want that we simply need to replace with FOSS
>(i.e., perhaps APAS's *are* no different from any other proprietary system).
>
>As a counterpoint, it's been reported that GPT-J is estimated to have cost
>*hundreds* of thousands of dollars in retail compute time to train, so the
>costs of running the software may greatly outweigh even the cost to pay
>people to develop and maintain it.
>
>A key additional task is curating the training set, since the model will
>likely only be particularly useful if the training set includes actually good
>choices of what software was fed in.  APAS (and machine learning in general)
>are well-known to suffer from traditional “garbage in, garbage out” problems.
>
>We discussed that there is still the threshold question of whether the
>software itself (regardless of whether it is FOSS or proprietary) is ethical
>at all.  A committee member compared APAS to FOSS implementations of Digital
>Restrictions Management (DRM): while it's *possible* to write FOSS that does
>the job of DRM, no one who writes FOSS wants to create such software, as most
>people who support FOSS are also opposed to DRM entirely.  Should activists
>argue that APAS's are just as dangerous to the future of computing as DRM?
>
>However, a *FOSS* APAS (end-to-end — that spans from training, to model
>tweaking tools, to APIs, to editor plugins — quite a bit of software) *would*
>allow the Committee to more easily delineate the problems with these
>systems.  It could well turn out that there is no existential threat: we
>simply need FOSS solutions for this “new task in town”.
>
>Furthermore, a FOSS system could be designed to annotate licensing
>information (e.g., tagging it) in the model itself, which could then be used
>as part of output to help users.  AI training systems are notoriously bad at
>telling you “Why!?!?” it got a particular answer.  But licensing information
>is generally well-marked with the code, so including the information as part
>of the metadata may be a subset of the “Why!?!?” problem that we *can* solve.
>This is difficult, but might not be impossible.  It *is*, however, an active
>area of research as it's asking roughly the same scientific question as is
>asked regarding “explainability of machine learning solutions” — so engaging
>in this approach would require us to be involved in PhD-level CS work.
>
>Nevertheless, if we were successful at annotating in this way, it would also
>set the narrative and industry standard that we *do* expect licensing
>information to be carried through AI models.  A committee member noted that
>this approach may be more doable than we think — while researchers and others
>are saying carrying this information along is very difficult, their
>motivation is *not* to carry this information along as annotations, and as such,
>there has been limited scientific work in this area thus far (compared to
>machine learning in general).  Another member noted that the motivations
>might also be that, given that creating models is expensive, they want to
>have *one* model, and not separate copyleft and non-copyleft models.
>
>Multiple committee members noted that if we *do* write a FOSS system, it
>definitely needs to be copylefted itself.  It's notable that the license of
>the model is also in question.  Generally speaking, inputs and outputs aren't
>impacted by most copyleft licenses of software, so the license of the model
>probably stems from input to output (which is what many are already saying is
>true with Copilot anyway).
>
>We moved to a tangential discussion about how the training set license might
>impact the output license.  It's an easy assumption to make that the entire
>model is impacted by the license of all the data that's input.  (Microsoft
>and GitHub obviously argue the opposite.)  However, the most conservative and
>easiest legal analysis would obviously be that the input and output must be
>under the same license.  Nevertheless, the attribution and patent clauses of
>licenses are making it even more difficult to determine the licensing terms
>of the model.  (In some sense, copyleft is easier because it requires the
>whole work (including derivatives and works based on it) to carry the same
>license.  IOW, ASLv2 and attribution-only FOSS licenses provide complexities
>of metadata that copyleft doesn't.)
>
>We could make the simple conclusion as a committee that “Our Rule” is: “if
>License A is in the training set, then License A terms apply to the model”.
>We could then build an alternative system that respects this rule, and simply
>conclude that APAS's that fail to do so *are* violating the licenses.  That
>conclusion is, at least, no more or less reasonable than GitHub's position
>that “if License A is in the training set, the user of the APAS can ignore
>License A no matter what happens”.  It's harder to make the claim that “Our
>Rule” is correct if we don't have an APAS to offer that follows “Our Rule”.
>But, if we produce a working APAS that follows “Our Rule”, it gives us a
>non-hypothetical example of the conservative licensing approach.
>As an activist matter, this turns it back to others to argue that “Our Rule”
>is too strict.  In other words, given that these issues are entirely novel,
>there is at least just as much chance that “Our Rule” is correct as any
>other, and as such, we should really create an APAS that follows “Our Rule”.
>
>The license compliance industry *now* generally encourages companies to
>respect other people's licenses.  They are in some way unlikely allies to
>“Our Rule” here.  While the compliance industry are often fear-mongerers,
>they do have a valid point that the external inputs to your
>software development process *do* impact the licensing conclusions; copyleft
>activists agree with the compliance industry on that point.  We should
>frankly ask the compliance industry: “Do you really think Microsoft and
>GitHub know some legal decision we don't?  Can you really be sure that using
>their system will not cause an infringement problem for you?”  We expect that
>is the key manner in which the compliance industry is an unlikely ally.
>
>Regardless, these are hard issues to raise when there is nothing to compare
>Copilot to, but if we build this FOSS APAS that follows “Our Rule”, it's
>much easier to reject APAS's that *aren't* license-respecting.  Regardless,
>open compliance is central: a proprietary APAS would never give enough
>information about compliance because it's a black box, just as proprietary
>scanning tools from the compliance industry are useless black boxes, too.
>
>[In conclusion on this line of thinking: we have more power if we have built
>the tool that does it right, and we can create an industry standard that you
>*should* be tracking the license in your APAS's.  Ultimately, this is the
>“hacktivist” way to approach the problem.]
>
>However, are we looking to the solution of writing new FOSS because writing
>new FOSS is a task our community knows how to do?  Is there a different
>approach we could or should take that we're not seeing because we're enticed
>by an (albeit difficult) problem that we already know how to solve?
>
>A counter-argument to that worry is that, with a FOSS APAS following “Our
>Rule”, we can poke at very specific parts of the system and examine its
>ethics outside of the fact that the APAS itself is proprietary.  IOW, we
>all don't like Copilot no matter *what* its job since it is, itself, a
>proprietary/trade-secret system from top to bottom.  Without looking at a FOSS
>system that does this task, it's hard to consider the broader ethical
>problems that are unique to APAS's.
>
>A committee member changed topics to ask a fundamental question: “Is it
>morally — not legally, but morally/ethically — wrong to create new
>proprietary software with an APAS?”
>
>One ethical argument that's been made is that a company that makes money from
>APAS's is profiting from the labor of others without following the wishes of
>those who did the labor (e.g., the license terms).
>
>Another approach is a standard Free Software purist
>argument, with logic like this: since we agree any proprietary software is a
>moral affront to users' rights, then *any* system (be it FOSS, proprietary or
>otherwise) that assists someone to write more proprietary software
>(particularly one that helps the user write proprietary software better,
>faster and/or more easily) is itself a morally wrong system.  (By
>comparison: if you're opposed to more fossil fuel usage, you usually would
>oppose any system that can pump oil faster out of the ground.)
>However, there are very few people radical enough such that they agree with
>the premise that “any proprietary software is a moral affront to users'
>rights”.  We could be splitting the fragile coalition of FOSS activists if
>our ethical arguments rely on agreement to that premise.
>
>Meanwhile, it could be that APAS's make writing software *so* easy that it's
>easier to write FOSS, too.  We could easily theorize a science-fiction-like
>system that makes software so easy to write, that it would be strange to
>write any proprietary software.  In such a world, writing new code would
>always be easier than using someone else's proprietary software.  This
>hypothetical points to the fact that our view is always skewed by the fact that
>proprietary software is the *norm*, while FOSS is a rarity.  We should make
>an effort to at least consider how various ideas and arguments play out *if*
>FOSS were to become the norm — even though that doesn't seem likely in the
>future, could APAS's be the innovation that turns that tide?
>
>On a related point, historically, we *have* written a lot of FOSS that helps
>people write proprietary software, and they do so with those tools: Emacs,
>GCC, Eclipse, etc.  One committee member suggested that perhaps it's just a
>weird side-effect and/or a licensing bug that Emacs' license allows the user
>to write proprietary software with it.  Is “input/output excluded from
>copyleft” (i.e., that copyleft licenses went to great effort to avoid having
>the license govern the input and output of the software) a principle, or
>simply a historical accident?  The fact that we think it's problematic to
>restrict “field of use” (to use the OSD framing of the same question) may be
>more “cargo cult” than it is based on some central principle; and, it could
>be true that our historical thinking was overly influenced by the question
>of how copyright covers software, and the various derivative works standards
>in different legal systems.  IOW, maybe “no field of use restriction” was
>simply inherent in the license design because the licenses were contemplated
>first as copyright licenses, so “field of use” including “proprietary
>software development” was more a practicality of license design rather than
>an ethically-founded moral principle.
>
>The problem, however, is that if we impose ethical restrictions that go too far
>(in any direction), we have a hard time building/maintaining a broad
>coalition of FOSS supporters.  Even on this very committee, we don't all
>necessarily agree that all inputs and outputs of FOSS should be FOSS, yet we
>all consider ourselves strong software freedom activists.
>
>In response to this point, one member attempted to narrow the question: Could we
>all agree, as an ethical matter (not necessarily a legal matter), that:
>
>   If the inputs to a training set for an APAS are FOSS, *then* the output the
>   user gets from the APAS should (as a moral matter) be FOSS?
>
>If we did agree to that, purely as a moral/ethical matter, would we be losing
>a coalition that exists elsewhere in FOSS?
>
>Unfortunately, we realized, the fragile FOSS coalition may well be built more
>around what the legal conclusion is than around what activists believe is the
>correct moral conclusion.  There may not actually be an overriding moral
>principle that binds FOSS activism together; it may merely be that we all
>have simply historically agreed on the legal conclusions about copyleft's
>scope.  This issue could well be testing that coalition, since it may be the
>first time the copyleft scope is not obvious to everyone in the broader FOSS
>coalition.
>
>Some also noted we'd probably need to use terms of service to keep the thing
>FOSS if we wanted to enforce a moral requirement (as opposed to, say, having
>the copyleft license alone “take care of it”).  Using terms of service to
>mandate software freedom has not been done in the past, and may be another type
>of activity that would dilute the coalition between more and less strident
>activists.
>
>As a final matter for the meeting, we discussed what one committee member
>dubbed the “creepiness factor” of AI-assistive systems.  We've found there to
>be creepy and systemic bias issues in, for example, AI systems that assist
>with hiring, or those that decide what alleged criminals receive bail.  We
>considered: do these kinds of problems exist with APAS's?
>
>The general consensus of the committee was that *if* we are training on all
>FOSS anyway, then everything in it (even comments or other personal
>information) would have been made public already.  The output to the user is
>unlikely to be racially/ethnically biased, violent, or triggering.  The worst we could
>imagine would be curse words and the like.  Once the data is public, it's
>unlikely something could be rehashed in a way that would likely upset
>developers.
>
>There *is* a lot of personal stuff in code repositories, and that's been well
>confirmed by others.  However, the problem here may ultimately be identical
>to the problem of accidentally committing data to a Git repository that you
>didn't mean to, which means similar solutions that solve the problems there
>will need to be applied here.  The big hurdle is how to remove items from a
>model without complete retraining.
>
>However, if we decide the right approach to respond to APAS is to put the full
>force of support behind a copylefted, FOSS APAS (end-to-end) that follows
>“Our Rule”, then these problems *will* be our problems.  It's admittedly much
>easier to criticize bad actors if you are not trying to also solve the same
>problems.
>
>END MINUTES, AI Assist Committee on Tuesday 2022-06-22, 18:00-19:00 UTC
>-- 
>ai-assist mailing list
>ai-assist at lists.copyleft.org
>https://lists.copyleft.org/mailman/listinfo/ai-assist