RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
Hello,

Given the recent spread of the "AI" bubble, I think we really need to
look into formally addressing the related concerns. In my opinion,
at this point the only reasonable course of action would be to safely
ban "AI"-backed contribution entirely. In other words, explicitly
forbid people from using ChatGPT, Bard, GitHub Copilot, and so on, to
create ebuilds, code, documentation, messages, bug reports and so on for
use in Gentoo.

Just to be clear, I'm talking about our "original" content. We can't do
much about upstream projects using it.


Rationale:

1. Copyright concerns. At this point, the copyright situation around
generated content is still unclear. What's pretty clear is that pretty
much all LLMs are trained on huge corpora of copyrighted material, and
all fancy "AI" companies don't give shit about copyright violations.
In particular, there's a good risk that these tools would yield stuff we
can't legally use.

2. Quality concerns. LLMs are really great at generating plausibly
looking bullshit. I suppose they can provide good assistance if you are
careful enough, but we can't really rely on all our contributors being
aware of the risks.

3. Ethical concerns. As pointed out above, the "AI" corporations don't
give shit about copyright, and don't give shit about people. The AI
bubble is causing huge energy waste. It is giving a great excuse for
layoffs and increasing exploitation of IT workers. It is driving
enshittification of the Internet, it is empowering all kinds of spam
and scam.


Gentoo has always stood out as something different, something that
worked for people for whom mainstream distros were lacking. I think
adding "made by real people" to the list of our advantages would be
a good thing — but we need to have policies in place, to make sure shit
doesn't flow in.

Compare with the shitstorm at:
https://github.com/pkgxdev/pantry/issues/5358

--
Best regards,
Michał Górny
Re: RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
Michał Górny <mgorny@gentoo.org> writes:

> Hello,
>
> Given the recent spread of the "AI" bubble, I think we really need to
> look into formally addressing the related concerns. In my opinion,
> at this point the only reasonable course of action would be to safely
> ban "AI"-backed contribution entirely. In other words, explicitly
> forbid people from using ChatGPT, Bard, GitHub Copilot, and so on, to
> create ebuilds, code, documentation, messages, bug reports and so on for
> use in Gentoo.
>
> Just to be clear, I'm talking about our "original" content. We can't do
> much about upstream projects using it.
>
>
> Rationale:
>
> 1. Copyright concerns. At this point, the copyright situation around
> generated content is still unclear. What's pretty clear is that pretty
> much all LLMs are trained on huge corpora of copyrighted material, and
> all fancy "AI" companies don't give shit about copyright violations.
> In particular, there's a good risk that these tools would yield stuff we
> can't legally use.
>
> 2. Quality concerns. LLMs are really great at generating plausibly
> looking bullshit. I suppose they can provide good assistance if you are
> careful enough, but we can't really rely on all our contributors being
> aware of the risks.
>
> 3. Ethical concerns. As pointed out above, the "AI" corporations don't
> give shit about copyright, and don't give shit about people. The AI
> bubble is causing huge energy waste. It is giving a great excuse for
> layoffs and increasing exploitation of IT workers. It is driving
> enshittification of the Internet, it is empowering all kinds of spam
> and scam.
>
>
> Gentoo has always stood out as something different, something that
> worked for people for whom mainstream distros were lacking. I think
> adding "made by real people" to the list of our advantages would be
> a good thing — but we need to have policies in place, to make sure shit
> doesn't flow in.
>
> Compare with the shitstorm at:
> https://github.com/pkgxdev/pantry/issues/5358

+1. All I've seen from "generative" (read: auto-plagiarizing) A"I" is
spam and theft, and I have every intention of blocking it wherever my
vote counts.
--
Arsen Arsenović
Re: RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
On 24/02/27 03:45PM, Michał Górny wrote:
> Hello,
>
> Given the recent spread of the "AI" bubble, I think we really need to
> look into formally addressing the related concerns. In my opinion,
> at this point the only reasonable course of action would be to safely
> ban "AI"-backed contribution entirely. In other words, explicitly
> forbid people from using ChatGPT, Bard, GitHub Copilot, and so on, to
> create ebuilds, code, documentation, messages, bug reports and so on for
> use in Gentoo.
>
> Just to be clear, I'm talking about our "original" content. We can't do
> much about upstream projects using it.
>
>
> Rationale:
>
> 1. Copyright concerns. At this point, the copyright situation around
> generated content is still unclear. What's pretty clear is that pretty
> much all LLMs are trained on huge corpora of copyrighted material, and
> all fancy "AI" companies don't give shit about copyright violations.
> In particular, there's a good risk that these tools would yield stuff we
> can't legally use.
>
> 2. Quality concerns. LLMs are really great at generating plausibly
> looking bullshit. I suppose they can provide good assistance if you are
> careful enough, but we can't really rely on all our contributors being
> aware of the risks.
>
> 3. Ethical concerns. As pointed out above, the "AI" corporations don't
> give shit about copyright, and don't give shit about people. The AI
> bubble is causing huge energy waste. It is giving a great excuse for
> layoffs and increasing exploitation of IT workers. It is driving
> enshittification of the Internet, it is empowering all kinds of spam
> and scam.
>
>
> Gentoo has always stood out as something different, something that
> worked for people for whom mainstream distros were lacking. I think
> adding "made by real people" to the list of our advantages would be
> a good thing — but we need to have policies in place, to make sure shit
> doesn't flow in.
>
> Compare with the shitstorm at:
> https://github.com/pkgxdev/pantry/issues/5358
>
> --
> Best regards,
> Michał Górny
>

I completely agree.

Your rationale hits the most important concerns I have about these
technologies in open source. There is a significant opportunity for
Gentoo to set the example here.

--
Kenton Groombridge
Gentoo Linux Developer, SELinux Project
Re: RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
On Tue, 27 Feb 2024 at 15:21, Kenton Groombridge <concord@gentoo.org> wrote:
>
> On 24/02/27 03:45PM, Michał Górny wrote:
> > Hello,
> >
> > Given the recent spread of the "AI" bubble, I think we really need to
> > look into formally addressing the related concerns. In my opinion,
> > at this point the only reasonable course of action would be to safely
> > ban "AI"-backed contribution entirely. In other words, explicitly
> > forbid people from using ChatGPT, Bard, GitHub Copilot, and so on, to
> > create ebuilds, code, documentation, messages, bug reports and so on for
> > use in Gentoo.
> >
> > Just to be clear, I'm talking about our "original" content. We can't do
> > much about upstream projects using it.
> >
> >
> > Rationale:
> >
> > 1. Copyright concerns. At this point, the copyright situation around
> > generated content is still unclear. What's pretty clear is that pretty
> > much all LLMs are trained on huge corpora of copyrighted material, and
> > all fancy "AI" companies don't give shit about copyright violations.
> > In particular, there's a good risk that these tools would yield stuff we
> > can't legally use.
> >
> > 2. Quality concerns. LLMs are really great at generating plausibly
> > looking bullshit. I suppose they can provide good assistance if you are
> > careful enough, but we can't really rely on all our contributors being
> > aware of the risks.
> >
> > 3. Ethical concerns. As pointed out above, the "AI" corporations don't
> > give shit about copyright, and don't give shit about people. The AI
> > bubble is causing huge energy waste. It is giving a great excuse for
> > layoffs and increasing exploitation of IT workers. It is driving
> > enshittification of the Internet, it is empowering all kinds of spam
> > and scam.
> >
> >
> > Gentoo has always stood out as something different, something that
> > worked for people for whom mainstream distros were lacking. I think
> > adding "made by real people" to the list of our advantages would be
> > a good thing — but we need to have policies in place, to make sure shit
> > doesn't flow in.
> >
> > Compare with the shitstorm at:
> > https://github.com/pkgxdev/pantry/issues/5358
> >
> > --
> > Best regards,
> > Michał Górny
> >
>
> I completely agree.
>
> Your rationale hits the most important concerns I have about these
> technologies in open source. There is a significant opportunity for
> Gentoo to set the example here.
>
> --
> Kenton Groombridge
> Gentoo Linux Developer, SELinux Project

A thousand times yes.
Re: RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
On 2024-02-27 14:45, Michał Górny wrote:

> In my opinion, at this point the only reasonable course of action
> would be to safely ban "AI"-backed contribution entirely. In other
> words, explicitly forbid people from using ChatGPT, Bard, GitHub
> Copilot, and so on, to create ebuilds, code, documentation, messages,
> bug reports and so on for use in Gentoo.

I very much support this idea, for all the three reasons quoted.

> 2. Quality concerns. LLMs are really great at generating plausibly
> looking bullshit. I suppose they can provide good assistance if you
> are careful enough, but we can't really rely on all our contributors
> being aware of the risks.

https://arxiv.org/abs/2211.03622

> 3. Ethical concerns.

...yeah. Seeing as we failed to condemn the Russian invasion of Ukraine
in 2022, I would probably avoid quoting this as a reason for banning
LLM-generated contributions. Even though I do, as mentioned above, very
much agree with this point.

--
Marecki
Re: RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
Marek Szuba <marecki@gentoo.org> writes:

> On 2024-02-27 14:45, Michał Górny wrote:
>
>> In my opinion, at this point the only reasonable course of action
>> would be to safely ban "AI"-backed contribution entirely. In other
>> words, explicitly forbid people from using ChatGPT, Bard, GitHub
>> Copilot, and so on, to create ebuilds, code, documentation, messages,
>> bug reports and so on for use in Gentoo.
>
> I very much support this idea, for all the three reasons quoted.
>
>> 2. Quality concerns. LLMs are really great at generating plausibly
>> looking bullshit. I suppose they can provide good assistance if you
>> are careful enough, but we can't really rely on all our contributors
>> being aware of the risks.
>
> https://arxiv.org/abs/2211.03622
>
>> 3. Ethical concerns.
>
> ...yeah. Seeing as we failed to condemn the Russian invasion of
> Ukraine in 2022, I would probably avoid quoting this as a reason for
> banning LLM-generated contributions. Even though I do, as mentioned
> above, very much agree with this point.

That's not a technical topic and we had an extended discussion about
what to do in -core, which included the risks of making life difficult
for Russian developers and contributors.

I don't think that's a helpful intervention here, sorry.
Re: RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
On Tuesday, 27 February 2024, 15:45:17 CET, Michał Górny wrote:
> Hello,
>
> Given the recent spread of the "AI" bubble, I think we really need to
> look into formally addressing the related concerns. In my opinion,
> at this point the only reasonable course of action would be to safely
> ban "AI"-backed contribution entirely. In other words, explicitly
> forbid people from using ChatGPT, Bard, GitHub Copilot, and so on, to
> create ebuilds, code, documentation, messages, bug reports and so on for
> use in Gentoo.

Fully agree and support this.

>
> Just to be clear, I'm talking about our "original" content. We can't do
> much about upstream projects using it.
[...] or implementing it.

So, also, no objections against someone (a real person, by his own mental
means) packaging AI software for Gentoo.


--
Andreas K. Hüttel
dilfridge@gentoo.org
Gentoo Linux developer
(council, toolchain, base-system, perl, libreoffice)
Re: RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
On Tue, Feb 27, 2024 at 03:45:17PM +0100, Michał Górny wrote:
> Hello,
>
> Given the recent spread of the "AI" bubble, I think we really need to
> look into formally addressing the related concerns. In my opinion,
> at this point the only reasonable course of action would be to safely
> ban "AI"-backed contribution entirely. In other words, explicitly
> forbid people from using ChatGPT, Bard, GitHub Copilot, and so on, to
> create ebuilds, code, documentation, messages, bug reports and so on for
> use in Gentoo.

+1 from me, a clear stance before it really starts hitting Gentoo sounds
good.
--
ionen
Re: RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
On Tue, Feb 27, 2024 at 9:45 AM Michał Górny <mgorny@gentoo.org> wrote:
>
> Given the recent spread of the "AI" bubble, I think we really need to
> look into formally addressing the related concerns.

> 1. Copyright concerns.

I do think it makes sense to consider some of this.

However, I feel like the proposal is redundant with the existing
requirement to signoff on the DCO, which says:

>>> By making a contribution to this project, I certify that:

>>> 1. The contribution was created in whole or in part by me, and
>>> I have the right to submit it under the free software license
>>> indicated in the file; or

>>> 2. The contribution is based upon previous work that, to the best of
>>> my knowledge, is covered under an appropriate free software license,
>>> and I have the right under that license to submit that work with
>>> modifications, whether created in whole or in part by me, under the
>>> same free software license (unless I am permitted to submit under a
>>> different license), as indicated in the file; or

>>> 3. The contribution is a license text (or a file of similar nature),
>>> and verbatim distribution is allowed; or

>>> 4. The contribution was provided directly to me by some other person
>>> who certified 1., 2., 3., or 4., and I have not modified it.

Perhaps we ought to just re-advertise the policy that already exists?

> 2. Quality concerns.

As far as quality is concerned, I again share the concerns you raise,
and I think we should just re-emphasize what many other industries are
already making clear - that individuals are responsible for the
quality of their contributions. Copy/pasting it blindly from an AI is
no different from copy/pasting it from some other random website, even
if it is otherwise legal.

> 3. Ethical concerns.

I think it is best to just avoid taking a stand on this. Our ethics
are already documented in the Social Contract.

I think everybody agrees that what is right and wrong is obvious and
clear and universal. Then we're all shocked to find that large
numbers of people have a universal perspective different from our own.
Even if 90% of contributors agree with a particular position, if we
start lopping off parts of our community 10% at a time we'll probably
find ourselves alone in a room sooner or later. We can't make every
hill the one to die on.

> I think adding "made by real people" to the list of our advantages
> would be a good thing

Somehow I doubt this is going to help us steal market share from the
numerous other popular source-based Linux distros. :)

To be clear, I don't think it is a bad idea to just reiterate that we
aren't looking for help from people who want to create scripts that
pipe things into some GPT API and pipe the output into a forum, bug,
issue, PR, or commit. I've seen other FOSS projects struggling with
people trying to be "helpful" in this way. I just don't think any of
this actually requires new policy. If we find our policy to be
inadequate I think it is better to go back to the core principles and
better articulate what we're trying to achieve, rather than adjust it
to fit the latest fashions.

--
Rich
Re: RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
On Tue, Feb 27, 2024, at 08:45 CST, Michał Górny <mgorny@gentoo.org> wrote:

> Given the recent spread of the "AI" bubble, I think we really need to
> look into formally addressing the related concerns. In my opinion,
> at this point the only reasonable course of action would be to safely
> ban "AI"-backed contribution entirely. In other words, explicitly
> forbid people from using ChatGPT, Bard, GitHub Copilot, and so on, to
> create ebuilds, code, documentation, messages, bug reports and so on for
> use in Gentoo.

+1


> 2. Quality concerns. LLMs are really great at generating plausibly
> looking bullshit. I suppose they can provide good assistance if you are
> careful enough, but we can't really rely on all our contributors being
> aware of the risks.

This is my main concern, but all of the other points are valid as well.


Best,
Matthias
Re: RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
On 2024.02.27 14:45, Michał Górny wrote:
> Hello,
>
> Given the recent spread of the "AI" bubble, I think we really need to
> look into formally addressing the related concerns. In my opinion,
> at this point the only reasonable course of action would be to safely
> ban "AI"-backed contribution entirely. In other words, explicitly
> forbid people from using ChatGPT, Bard, GitHub Copilot, and so on, to
> create ebuilds, code, documentation, messages, bug reports and so on
> for
> use in Gentoo.
>
> Just to be clear, I'm talking about our "original" content. We can't
> do
> much about upstream projects using it.
>
>
> Rationale:
>
> 1. Copyright concerns. At this point, the copyright situation around
> generated content is still unclear. What's pretty clear is that
> pretty
> much all LLMs are trained on huge corpora of copyrighted material, and
> all fancy "AI" companies don't give shit about copyright violations.
> In particular, there's a good risk that these tools would yield stuff
> we
> can't legally use.
>
> 2. Quality concerns. LLMs are really great at generating plausibly
> looking bullshit. I suppose they can provide good assistance if you
> are
> careful enough, but we can't really rely on all our contributors being
> aware of the risks.
>
> 3. Ethical concerns. As pointed out above, the "AI" corporations
> don't
> give shit about copyright, and don't give shit about people. The AI
> bubble is causing huge energy waste. It is giving a great excuse for
> layoffs and increasing exploitation of IT workers. It is driving
> enshittification of the Internet, it is empowering all kinds of spam
> and scam.
>
>
> Gentoo has always stood out as something different, something that
> worked for people for whom mainstream distros were lacking. I think
> adding "made by real people" to the list of our advantages would be
> a good thing — but we need to have policies in place, to make sure
> shit
> doesn't flow in.
>
> Compare with the shitstorm at:
> https://github.com/pkgxdev/pantry/issues/5358
>
> --
> Best regards,
> Michał Górny
>
>

Michał,

An excellent piece of prose setting out the rationale.
I fully support it.

--
Regards,

Roy Bamford
(Neddyseagoon) a member of
elections
gentoo-ops
forum-mods
arm64
Re: RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
Michał Górny <mgorny@gentoo.org> writes:

> Hello,
>
> Given the recent spread of the "AI" bubble, I think we really need to
> look into formally addressing the related concerns. In my opinion,
> at this point the only reasonable course of action would be to safely
> ban "AI"-backed contribution entirely. In other words, explicitly
> forbid people from using ChatGPT, Bard, GitHub Copilot, and so on, to
> create ebuilds, code, documentation, messages, bug reports and so on for
> use in Gentoo.
>
> Just to be clear, I'm talking about our "original" content. We can't do
> much about upstream projects using it.
>

I agree with the proposal, just some thoughts below.

I'm a bit worried this is slightly performative - which is not a dig at
you at all - given we can't really enforce it, and it requires honesty,
but that's also not a reason to not try ;)

>
> Rationale:
>
> 1. Copyright concerns. At this point, the copyright situation around
> generated content is still unclear. What's pretty clear is that pretty
> much all LLMs are trained on huge corpora of copyrighted material, and
> all fancy "AI" companies don't give shit about copyright violations.
> In particular, there's a good risk that these tools would yield stuff we
> can't legally use.
>

It also creates risk for anyone basing products or tools on Gentoo if
we're not confident about the integrity / provenance of our work.

> 2. Quality concerns. LLMs are really great at generating plausibly
> looking bullshit. I suppose they can provide good assistance if you are
> careful enough, but we can't really rely on all our contributors being
> aware of the risks.
>
> 3. Ethical concerns. As pointed out above, the "AI" corporations don't
> give shit about copyright, and don't give shit about people. The AI
> bubble is causing huge energy waste. It is giving a great excuse for
> layoffs and increasing exploitation of IT workers. It is driving
> enshittification of the Internet, it is empowering all kinds of spam
> and scam.
>
>
> Gentoo has always stood out as something different, something that
> worked for people for whom mainstream distros were lacking. I think
> adding "made by real people" to the list of our advantages would be
> a good thing — but we need to have policies in place, to make sure shit
> doesn't flow in.
>
> Compare with the shitstorm at:
> https://github.com/pkgxdev/pantry/issues/5358
Re: RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
>>>>> On Tue, 27 Feb 2024, Rich Freeman wrote:

> On Tue, Feb 27, 2024 at 9:45 AM Michał Górny <mgorny@gentoo.org> wrote:
>>
>> Given the recent spread of the "AI" bubble, I think we really need to
>> look into formally addressing the related concerns.

First of all, I fully support mgorny's proposal.

>> 1. Copyright concerns.

> I do think it makes sense to consider some of this.

> However, I feel like the proposal is redundant with the existing
> requirement to signoff on the DCO, which says:

>>>> By making a contribution to this project, I certify that:

>>>> 1. The contribution was created in whole or in part by me, and
>>>> I have the right to submit it under the free software license
>>>> indicated in the file; or

>>>> 2. The contribution is based upon previous work that, to the best of
>>>> my knowledge, is covered under an appropriate free software license,
>>>> and I have the right under that license to submit that work with
>>>> modifications, whether created in whole or in part by me, under the
>>>> same free software license (unless I am permitted to submit under a
>>>> different license), as indicated in the file; or

>>>> 3. The contribution is a license text (or a file of similar nature),
>>>> and verbatim distribution is allowed; or

>>>> 4. The contribution was provided directly to me by some other person
>>>> who certified 1., 2., 3., or 4., and I have not modified it.

I have been thinking about this aspect too. Certainly there is some
overlap with our GLEP 76 policy, but I don't think that it is redundant.

I'd rather see it as a (much needed) clarification of how to deal with
AI-generated code. All the better if the proposal happens to agree with
policies that are already in place.

Ulrich
Re: RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
On 24/02/27 07:07PM, Ulrich Mueller wrote:
> >>>>> On Tue, 27 Feb 2024, Rich Freeman wrote:
>
> > On Tue, Feb 27, 2024 at 9:45 AM Michał Górny <mgorny@gentoo.org> wrote:
> >>
> >> Given the recent spread of the "AI" bubble, I think we really need to
> >> look into formally addressing the related concerns.
>
> First of all, I fully support mgorny's proposal.
>
> >> 1. Copyright concerns.
>
> > I do think it makes sense to consider some of this.
>
> > However, I feel like the proposal is redundant with the existing
> > requirement to signoff on the DCO, which says:
>
> >>>> By making a contribution to this project, I certify that:
>
> >>>> 1. The contribution was created in whole or in part by me, and
> >>>> I have the right to submit it under the free software license
> >>>> indicated in the file; or
>
> >>>> 2. The contribution is based upon previous work that, to the best of
> >>>> my knowledge, is covered under an appropriate free software license,
> >>>> and I have the right under that license to submit that work with
> >>>> modifications, whether created in whole or in part by me, under the
> >>>> same free software license (unless I am permitted to submit under a
> >>>> different license), as indicated in the file; or
>
> >>>> 3. The contribution is a license text (or a file of similar nature),
> >>>> and verbatim distribution is allowed; or
>
> >>>> 4. The contribution was provided directly to me by some other person
> >>>> who certified 1., 2., 3., or 4., and I have not modified it.
>
> I have been thinking about this aspect too. Certainly there is some
> overlap with our GLEP 76 policy, but I don't think that it is redundant.
>
> I'd rather see it as a (much needed) clarification of how to deal with
> AI-generated code. All the better if the proposal happens to agree with
> policies that are already in place.
>
> Ulrich

This is my interpretation of it as well, especially when it comes to
para. 2:

>>> 2. The contribution is based upon previous work that, to the best of
>>> my knowledge, is covered under an appropriate free software license,
>>> [...]

It is extremely difficult (if not impossible) to verify this with some of
these tools, and that's assuming that the users of these tools know
enough about how they work for this to be a concern to them. I would
argue it's best to stay away from these tools, at least until there is a
clearer and more concise legal interpretation of their usage in relation
to copyright.

--
Kenton Groombridge
Gentoo Linux Developer, SELinux Project
Re: RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
On Tuesday, 27 February 2024, 18:50:15 CET, Roy Bamford wrote:
> On 2024.02.27 14:45, Michał Górny wrote:
> > Hello,
> >
> > [...]
> >
> > Gentoo has always stood out as something different, something that
> > worked for people for whom mainstream distros were lacking. I think
> > adding "made by real people" to the list of our advantages would be
> > a good thing — but we need to have policies in place, to make sure
> > shit
> > doesn't flow in.
> >
> > Compare with the shitstorm at:
> > https://github.com/pkgxdev/pantry/issues/5358
>
> Michał,
>
> An excellent piece of prose setting out the rationale.
> I fully support it.

I would like to add the following:

Last year we had a chatbot in our Gentoo forum that made 76 posts on
2023-12-19. An inexperienced moderator (me) then asked his colleagues which
forum rules would allow us to ban this chatbot:

"Do we have a rule somewhere that an AI and a chatbot are not allowed to log
in? I have read our Guidelines ( https://forums.gentoo.org/viewtopic-t-525.html )
and found no such prohibition. On what basis could we even block a chatbot?"

The answer from two experienced colleagues was that this is already covered by
our forum rules, because chatbots usually cannot (yet) fulfill the requirements
of a forum post and therefore violate our Guidelines.

To be honest, I asked myself at the time what would happen if we had a clearly
recognizable AI as a user that made (reasonably) sensible posts. We would then
have no chance of banning this AI user without an explicit prohibition. I
would be much more comfortable if we clearly communicated that we do not
accept an AI as a user.

Yes, I would also be very happy to see this proposal implemented.

--
Best regards,
Peter (aka pietinger)
Re: RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
On 2/27/24 9:45 AM, Michał Górny wrote:
> Hello,
>
> Given the recent spread of the "AI" bubble, I think we really need to
> look into formally addressing the related concerns. In my opinion,
> at this point the only reasonable course of action would be to safely
> ban "AI"-backed contribution entirely. In other words, explicitly
> forbid people from using ChatGPT, Bard, GitHub Copilot, and so on, to
> create ebuilds, code, documentation, messages, bug reports and so on for
> use in Gentoo.


No constructive or valuable contributions will fall afoul of the new ban.

Seems reasonable to me.


--
Eli Schwartz
Re: RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
On Tue, Feb 27, 2024 at 15:45:17 +0100, Michał Górny wrote:
> Hello,
>
> Given the recent spread of the "AI" bubble, I think we really need to
> look into formally addressing the related concerns. In my opinion,
> at this point the only reasonable course of action would be to safely
> ban "AI"-backed contribution entirely. In other words, explicitly
> forbid people from using ChatGPT, Bard, GitHub Copilot, and so on, to
> create ebuilds, code, documentation, messages, bug reports and so on for
> use in Gentoo.
>
> Just to be clear, I'm talking about our "original" content. We can't do
> much about upstream projects using it.
>

I agree.

But for the sake of discussion:

What about cases where someone, say, doesn't have an excellent grasp of
English and decides to use, for example, ChatGPT to aid in writing
documentation/comments (not code) and puts a note somewhere explicitly
mentioning what was AI-generated so that someone else can take a closer
look?

I'd personally not be the biggest fan of this if it wasn't in something
like a PR or ml post where it could be reviewed before being made final.
But the most important part IMO would be being up-front about it.

>
> Rationale:
>
> 1. Copyright concerns. At this point, the copyright situation around
> generated content is still unclear. What's pretty clear is that pretty
> much all LLMs are trained on huge corpora of copyrighted material, and
> all fancy "AI" companies don't give shit about copyright violations.
> In particular, there's a good risk that these tools would yield stuff we
> can't legally use.
>

I really dislike the lack of audit trail for where the bits and pieces
come from. Not to mention the examples from early on where Copilot was
filling in incorrect attribution.

- Oskari
Re: RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
On Tue, 2024-02-27 at 21:05 -0600, Oskari Pirhonen wrote:
> What about cases where someone, say, doesn't have an excellent grasp of
> English and decides to use, for example, ChatGPT to aid in writing
> documentation/comments (not code) and puts a note somewhere explicitly
> mentioning what was AI-generated so that someone else can take a closer
> look?
>
> I'd personally not be the biggest fan of this if it wasn't in something
> like a PR or ml post where it could be reviewed before being made final.
> But the most important part IMO would be being up-front about it.

I'm afraid that wouldn't help much. In my experience, it would be
less effort for us to help write it from scratch than to try to
untangle whatever verbose shit ChatGPT generates. Especially since
a person with a poor grasp of the language could have trouble telling
whether the generated text is actually meaningful.

--
Best regards,
Michał Górny
Re: RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
>>>>> On Wed, 28 Feb 2024, Michał Górny wrote:

> On Tue, 2024-02-27 at 21:05 -0600, Oskari Pirhonen wrote:
>> What about cases where someone, say, doesn't have an excellent grasp of
>> English and decides to use, for example, ChatGPT to aid in writing
>> documentation/comments (not code) and puts a note somewhere explicitly
>> mentioning what was AI-generated so that someone else can take a closer
>> look?
>>
>> I'd personally not be the biggest fan of this if it wasn't in something
>> like a PR or ml post where it could be reviewed before being made final.
>> But the most important part IMO would be being up-front about it.

> I'm afraid that wouldn't help much. In my experience, it would be
> less effort for us to help write it from scratch than to untangle
> whatever verbose shit ChatGPT generates, especially since a person
> with a poor grasp of the language could have trouble telling whether
> the generated text is actually meaningful.

But where do we draw the line? Are translation tools like DeepL [1] allowed?
I don't see much of a copyright issue for these.

Ulrich

[1] https://www.deepl.com/translator
Re: RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
On Tue, 2024-02-27 at 15:45 +0100, Michał Górny wrote:
> Hello,
>
> Given the recent spread of the "AI" bubble, I think we really need to
> look into formally addressing the related concerns.  In my opinion,
> at this point the only reasonable course of action would be to safely
> ban "AI"-backed contribution entirely.  In other words, explicitly
> forbid people from using ChatGPT, Bard, GitHub Copilot, and so on, to
> create ebuilds, code, documentation, messages, bug reports and so on
> for use in Gentoo.
>
> Just to be clear, I'm talking about our "original" content.  We can't
> do much about upstream projects using it.
>
>
> Rationale:
>
> 1. Copyright concerns.  At this point, the copyright situation around
> generated content is still unclear.  What's pretty clear is that
> pretty much all LLMs are trained on huge corpora of copyrighted
> material, and all fancy "AI" companies don't give shit about copyright
> violations.  In particular, there's a good risk that these tools would
> yield stuff we can't legally use.
>
> 2. Quality concerns.  LLMs are really great at generating plausibly
> looking bullshit.  I suppose they can provide good assistance if you
> are careful enough, but we can't really rely on all our contributors
> being aware of the risks.
>
> 3. Ethical concerns.  As pointed out above, the "AI" corporations
> don't give shit about copyright, and don't give shit about people.
> The AI bubble is causing huge energy waste.  It is giving a great
> excuse for layoffs and increasing exploitation of IT workers.  It is
> driving enshittification of the Internet, it is empowering all kinds
> of spam and scam.
>
>
> Gentoo has always stood out as something different, something that
> worked for people for whom mainstream distros were lacking.  I think
> adding "made by real people" to the list of our advantages would be
> a good thing — but we need to have policies in place, to make sure
> shit doesn't flow in.
>
> Compare with the shitstorm at:
> https://github.com/pkgxdev/pantry/issues/5358
>

+1

Can we get this added to the agenda for the next council meeting?
Re: RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
> But where do we draw the line? Are translation tools like DeepL
> allowed? I don't see much of a copyright issue for these.

I'd also like to jump in and play devil's advocate. There's a fair
chance that this is because I just got back from a
supercomputing/research conf where LLMs were the hot topic in every keynote.

As mentioned by Sam, this RFC is performative. Any users that are going
to abuse LLMs are going to do it _anyway_, regardless of the rules. We
already rely on common sense to filter these out; we're always going to
have BS/Spam PRs and bugs - I don't really think that the content being
generated by LLM is really any worse.

This doesn't mean that I think we should blanket allow poor quality LLM
contributions. It's especially important that we take into account the
potential for bias, factual errors, and outright plagiarism when these
tools are used incorrectly. We already have methods for weeding out low
quality contributions and bad faith contributors - let's trust in these
and see what we can do to strengthen these tools and processes.

A bit closer to home for me, what about using LLMs as an assistive
technology / to reduce boilerplate? I'm recovering from RSI - I don't
know when (if...) I'll be able to type like I used to again. If a model
is able to infer some mostly salvageable boilerplate from its context
window I'm going to use it and spend the effort I would writing that to
fix something else; an outright ban on LLM use will reduce my _ability_
to contribute to the project.

What about using a LLM for code documentation? Some models can do a
passable job of writing decent quality function documentation and, in
production, I _have_ caught real issues in my logic this way. Why should
I type that out (and write what I think the code does rather than what
it actually does) if an LLM can get 'close enough' and I only need to do
light editing?

In line with the above, if the concern is about code quality / potential
for plagiarised code, what about indirect use of LLMs? Imagine a
hypothetical situation where a contributor asks a LLM to summarise a
topic and uses that knowledge to implement a feature. Is this now
tainted / forbidden knowledge according to the Gentoo project?

As a final not-so-hypothetical, what about a LLM trained on Gentoo docs
and repos, or more likely trained on exclusively open-source
contributions and fine-tuned on Gentoo specifics? I'm in the process of
spinning up several models at work to get a handle on the tech / turn
more electricity into heat - this is a real possibility (if I can ever
find the time).

The cat is out of the bag when it comes to LLMs. In my real-world job I
talk to scientists and engineers using these things (for their
strengths) to quickly iterate on designs, to summarise experimental
results, and even to generate testable hypotheses. We're only going to
see increasing use of this technology going forward.

TL;DR: I think this is a bad idea. We already have effective mechanisms
for dealing with spam and bad faith contributions. Banning LLM use by
Gentoo contributors at this point is just throwing the baby out with the
bathwater.

As an alternative I'd be very happy to see some guidelines for the use of LLMs
and other assistive technologies like "Don't use LLM code snippets
unless you understand them", "Don't blindly copy and paste LLM output",
or, my personal favourite, "Don't be a jerk to our poor bug wranglers".

A blanket "No completely AI/LLM generated works" might be fine, too.

Let's see how the legal issues shake out before we start pre-emptively
banning useful tools. There's a lot of ongoing action in this space - at
the very least I'd like to see some thorough discussion of the legal
issues separately if we're making a case for banning an entire class of
technology.

A Gentoo LLM project formed of experts who could actually provide good
advice / some actual guidelines for LLM use within the project (and
engaging some real-world legal advice) might be a good starting point.
Are there any volunteers in the audience?

Thanks for listening to my TED talk,

Matt
Re: RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
On Wed, 2024-02-28 at 11:08 +0100, Ulrich Mueller wrote:
> > > > > > On Wed, 28 Feb 2024, Michał Górny wrote:
>
> > On Tue, 2024-02-27 at 21:05 -0600, Oskari Pirhonen wrote:
> > > What about cases where someone, say, doesn't have an excellent grasp of
> > > English and decides to use, for example, ChatGPT to aid in writing
> > > documentation/comments (not code) and puts a note somewhere explicitly
> > > mentioning what was AI-generated so that someone else can take a closer
> > > look?
> > >
> > > I'd personally not be the biggest fan of this if it wasn't in something
> > > like a PR or ml post where it could be reviewed before being made final.
> > > But the most important part IMO would be being up-front about it.
>
> > I'm afraid that wouldn't help much. In my experience, it would be
> > less effort for us to help write it from scratch than to untangle
> > whatever verbose shit ChatGPT generates, especially since a person
> > with a poor grasp of the language could have trouble telling whether
> > the generated text is actually meaningful.
>
> But where do we draw the line? Are translation tools like DeepL allowed?
> I don't see much of a copyright issue for these.

I have a strong suspicion that these translation tools are trained
on copyrighted translations of books and other copyrighted material.

--
Best regards,
Michał Górny
Re: RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
On 27/02/2024 16.45, Michał Górny wrote:
> Hello,
>
> Given the recent spread of the "AI" bubble, I think we really need to
> look into formally addressing the related concerns. In my opinion,
> at this point the only reasonable course of action would be to safely
> ban "AI"-backed contribution entirely. In other words, explicitly
> forbid people from using ChatGPT, Bard, GitHub Copilot, and so on, to
> create ebuilds, code, documentation, messages, bug reports and so on for
> use in Gentoo.
>
> Just to be clear, I'm talking about our "original" content. We can't do
> much about upstream projects using it.

I support this motion.

>
> Rationale:
>
> 1. Copyright concerns. At this point, the copyright situation around
> generated content is still unclear. What's pretty clear is that pretty
> much all LLMs are trained on huge corpora of copyrighted material, and
> all fancy "AI" companies don't give shit about copyright violations.
> In particular, there's a good risk that these tools would yield stuff we
> can't legally use.

I know that GitHub Copilot can be limited to licenses, and even to just
the current repository. Even so, I'm not sure that the copyright can
be attributed to "me" and not the "AI", so it's still a gray area.

> 2. Quality concerns. LLMs are really great at generating plausibly
> looking bullshit. I suppose they can provide good assistance if you are
> careful enough, but we can't really rely on all our contributors being
> aware of the risks.

Let me tell a story. I was curious whether I could teach an LLM the ebuild
format, as a possible helper tool for devs/non-devs. My prompt grew
huge, as I was teaching it all the stuff of ebuilds, where to input
the source code (eclasses), and such. At one point, it even managed to
output a close-enough Python distutils-r1 ebuild, at the same level as what
`vim dev-python/${PN}/${PN}-${PV}.ebuild` creates using the Gentoo
template. Yes, my long work resulted in no gain.

For every other ebuild type (cmake, meson, go, rust) I always got a
garbage ebuild. Yes, it generated a good DESCRIPTION and HOMEPAGE
(simple stuff to copy from upstream) and even reached 60% accuracy for
LICENSE. But did you know we have an "intel80386" arch for KEYWORDS? That
we can RESTRICT="install"? That we can use "^cat-pkg/pkg-1" syntax in deps?
PATCHES with http URLs inside? And the list goes on. Sometimes it was even
funny.

So until a good prompt can be created for Gentoo, upon which we *might*
reopen the discussion, I strongly support banning AI-generated
ebuilds. Currently, good per-category templates, copying other ebuilds
as a starting point, and even plain skel.ebuild all bring much better
results and waste less developer time.

> 3. Ethical concerns. As pointed out above, the "AI" corporations don't
> give shit about copyright, and don't give shit about people. The AI
> bubble is causing huge energy waste. It is giving a great excuse for
> layoffs and increasing exploitation of IT workers. It is driving
> enshittification of the Internet, it is empowering all kinds of spam
> and scam.
>

Many companies that cite AI as a reason for layoffs are just inventing a
justification out of ill will or ignorance. The company I work at uses
AI tools as a productivity boost, but at all levels of management
they know that AI can't replace a person; at best it boosts them by 5-10%.
The real reason for the current layoffs is budget tightening across
the industry (just a normal cycle, it should get better soon), and
management prefers not to lay off themselves. So yeah, sad world.

>
> Gentoo has always stood out as something different, something that
> worked for people for whom mainstream distros were lacking. I think
> adding "made by real people" to the list of our advantages would be
> a good thing — but we need to have policies in place, to make sure shit
> doesn't flow in.
>
> Compare with the shitstorm at:
> https://github.com/pkgxdev/pantry/issues/5358
>

Great read, really much WTF. This whole repo is just a cluster of AIs
competing against each other.

--
Arthur Zamarin
arthurzam@gentoo.org
Gentoo Linux developer (Python, pkgcore stack, Arch Teams, GURU)
Re: RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
On Wed, Feb 28, 2024 at 1:50 PM Arthur Zamarin <arthurzam@gentoo.org> wrote:
>
> I know that GitHub Copilot can be limited to licenses, and even to just
> the current repository. Even so, I'm not sure that the copyright can
> be attributed to "me" and not the "AI", so it's still a gray area.

So, AI copyright is a bit of a poorly defined area simply due to a
lack of case law. I'm not all that confident that courts won't make
an even bigger mess of it.

There are half a dozen different directions I think a court might rule
on the matter of authorship and derived works, but I think it is VERY
unlikely that a court will rule that the copyright will be attributed
to the AI itself, or that the AI itself ever was an author or held any
legal rights to the work at any point in time. An AI is not a legal
entity. The company that provides the service, its
employees/developers, the end user, and the authors and copyright
holders of works used to train the AI are all entities a court is
likely to consider as having some kind of a role.

That said, we live in a world where it isn't even clear if APIs can be
copyrighted, though in practice enforcing such a copyright might be
impossible. It could be a while before AI copyright concerns are
firmly settled. When they are, I suspect it will be done in a way
that frustrates just about everybody on every side...

IMO the main risk to an organization (especially a transparent one
like ours) from AI code isn't even whether it is copyrightable or not,
but rather getting pulled into arguments and debates and possibly
litigation over what is likely to be boilerplate code that needs a lot
of cleanup anyway. Even if you "win" in court or the court of public
opinion, the victory can be pyrrhic.

--
Rich
Re: RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
On 2/28/24 6:06 AM, Matt Jolly wrote:
>
>> But where do we draw the line? Are translation tools like DeepL
>> allowed? I don't see much of a copyright issue for these.
>
> I'd also like to jump in and play devil's advocate. There's a fair
> chance that this is because I just got back from a
> supercomputing/research conf where LLMs were the hot topic in every
> keynote.
>
> As mentioned by Sam, this RFC is performative. Any users that are going
> to abuse LLMs are going to do it _anyway_, regardless of the rules. We
> already rely on common sense to filter these out; we're always going to
> have BS/Spam PRs and bugs - I don't really think that the content being
> generated by LLM is really any worse.
>
> This doesn't mean that I think we should blanket allow poor quality LLM
> contributions. It's especially important that we take into account the
> potential for bias, factual errors, and outright plagiarism when these
> tools are used incorrectly.  We already have methods for weeding out low
> quality contributions and bad faith contributors - let's trust in these
> and see what we can do to strengthen these tools and processes.


Why is this an argument *against* a performative statement of intent?

There are too many ways for bad faith contributors to maliciously engage
with the community, and no one is proposing a need to lay down rules
that forbid such people.

It is meaningful on its own to specify good faith rules that people
should abide by in order to produce a smoother experience. And telling
people that they are not supposed to do XXX is a good way to reduce the
amount of low quality contributions that Devs need to sift through...


> A bit closer to home for me, what about using LLMs as an assistive
> technology / to reduce boilerplate? I'm recovering from RSI - I don't
> know when (if...) I'll be able to type like I used to again. If a model
> is able to infer some mostly salvageable boilerplate from its context
> window I'm going to use it and spend the effort I would writing that to
> fix something else; an outright ban on LLM use will reduce my _ability_
> to contribute to the project.


So by this appeal to emotion, you can claim anything is assistive
technology and therefore should be allowed because it's discriminatory
against the disabled if you don't allow it?

Is there some special attribute of disabled persons that means they are
exempted from copyright law?

What counts as assistive technology? Is it any technology that disabled
persons use, or technology designed to bridge the gap for the disabled?
If a disabled person uses vim because shortcuts, does that mean vim is
"assistive technology" because someone used it to "assist" them?

...

I somehow feel like I maybe heard about assistive technology existing
that assisted disabled persons in the process of dictating their
thoughts while avoiding physically stressful typing activities.

It didn't involve having the "assistive technology" provide both the
content and the typing, as that's not really *assisting*.


> In line with the above, if the concern is about code quality / potential
> for plagiarised code, what about indirect use of LLMs? Imagine a
> hypothetical situation where a contributor asks a LLM to summarise a
> topic and uses that knowledge to implement a feature. Is this now
> tainted / forbidden knowledge according to the Gentoo project?


Since your imagined hypothetical involves the use of copyrighted works
by and from a person, which cannot be said to be derivative copyrighted
works of the training data from the LLM -- for the same reason that
reading an article in a handwritten, copyrighted journal about "a topic"
to learn about that topic and then writing software based on the ideas
from the article is not a *derivative copyrighted work* -- the answer is
extremely trivially no?

The copyright issue with LLMs isn't that they ingest blogposts about how
cool ebuilds are and use that knowledge to write ebuilds. The copyright
issue with LLMs is that they ingest github repos full of non-Gentoo
ebuilds copyrighted under who knows what license and then regurgitate
those ebuilds. It is *derivative works*.

Prose summaries about generic topics are a good way to break the link
when it comes to derived works; that doesn't have anything to do with LLMs.


Nonetheless, any credible form of scholarship is going to demand that
participants be well versed in where the line is between saying
something in your own words with citation, and plagiarism.



> As a final not-so-hypothetical, what about a LLM trained on Gentoo docs
> and repos, or more likely trained on exclusively open-source
> contributions and fine-tuned on Gentoo specifics? I'm in the process of
> spinning up several models at work to get a handle on the tech / turn
> more electricity into heat - this is a real possibility (if I can ever
> find the time).


If you can state for a fact that you have done so, then clearly it's not
a copyright violation.

"exclusively open-source contributions" is NOT however a good bar. There
are lots of open-source licenses, but not all of them are compatible
with the GPL2 at all, and the ones that are compatible -- in fact,
licenses in general -- tend to require you to include copyright notices.

The LLM would have to know how to do that. Or if it is trained
exclusively on gentoo repositories it may be able to say "okay all
inputs are copyright GPL2 The Gentoo Authors".


> The cat is out of the bag when it comes to LLMs. In my real-world job I
> talk to scientists and engineers using these things (for their
> strengths) to quickly iterate on designs, to summarise experimental
> results, and even to generate testable hypotheses. We're only going to
> see increasing use of this technology going forward.


Huh? "The cat is out of the bag". What does this even mean? I'm not sure
how to read this other than:

Everyone else is breaking the law anyways so who cares. You can't stop
them, so might as well join them.

If it's something good or acceptable to do, then it is good or
acceptable without needing to be defended by "but lots of people are
doing it so you can't stop us".

That being said, here's some food for thought: if something bad happens,
and we *agree* it's bad, but every time the topic comes up people say
"well, it's bad but everyone else is doing it so what can we do, might
as well give in"...

... how do you think it became so popular to begin with? Maybe someone
before you said "the cat is out of the bag"?



--
Eli Schwartz
