Mailing List Archive

Re: RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
Hi,

> Compare with the shitstorm at:
> https://github.com/pkgxdev/pantry/issues/5358

Thank you for this, it made my day.

Though I'm only a proxy maintainer for now, I also support this initiative;
there should be some guardrails set up around LLM usage.

> 1. Copyright concerns. At this point, the copyright situation around
> generated content is still unclear. What's pretty clear is that pretty
> much all LLMs are trained on huge corpora of copyrighted material, and
> all fancy "AI" companies don't give shit about copyright violations.
> In particular, there's a good risk that these tools would yield stuff we
> can't legally use.

IANAL, but IMHO if we stop respecting copyright law, even if indirectly via
LLMs, why should we expect others to respect our licenses? It could be prudent
to wait and see where this will land.

> 2. Quality concerns. LLMs are really great at generating plausibly
> looking bullshit. I suppose they can provide good assistance if you are
> careful enough, but we can't really rely on all our contributors being
> aware of the risks.

From my personal experience using GitHub Copilot fine-tuned on a large
private code base, it works reasonably well as a smarter autocomplete for a
single line of code, but when it comes to multiple lines, even
boilerplate, it's at best a 'meh'. The problem is that while the output looks
okay-ish, it often has subtle mistakes or hallucinates some random additional
stuff not relevant to the source file in question, so one ends up having to
read and analyze the entire output of the LLM to fix problems with the code.
I found that the mental and time overhead rarely makes it worth it, especially
when a template can do a better job (e.g. this would be the case for ebuilds).

Since reviewers are supposed to read the entire contribution anyway, I'm not
sure how much difference this makes, but I can see a developer who trusts an
LLM too much ending up outsourcing the checking of the code to the reviewers,
which means we need to be extra vigilant and could lead to reduced trust in
contributions.

> 3. Ethical concerns. As pointed out above, the "AI" corporations don't
> give shit about copyright, and don't give shit about people. The AI
> bubble is causing huge energy waste. It is giving a great excuse for
> layoffs and increasing exploitation of IT workers. It is driving
> enshittification of the Internet, it is empowering all kinds of spam
> and scam.

I agree. I'm already tired of AI-generated blog spam and so forth; it's such a
waste of time and quite annoying. I'd rather not have that on our wiki pages
too. The purpose of documenting things is to explain an area to someone new to
it, or to write down the unique quirks of a setup or a system. Since LLMs
cannot write new original things, only rehash information they have seen, I'm
not sure how they could be helpful for this at all, to be honest.

Overall, my time is too valuable to sift through AI-generated BS when I'm
trying to solve a problem; I'd prefer we keep well-curated, high-quality
documentation where possible.

Zoltan
Re: RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
Matt Jolly <kangie@gentoo.org> writes:

>> But where do we draw the line? Are translation tools like DeepL
>> allowed? I don't see much of a copyright issue for these.
>
> I'd also like to jump in and play devil's advocate. There's a fair
> chance that this is because I just got back from a
> supercomputing/research conf where LLMs were the hot topic in every keynote.
>
> As mentioned by Sam, this RFC is performative. Any users that are going
> to abuse LLMs are going to do it _anyway_, regardless of the rules. We
> already rely on common sense to filter these out; we're always going to
> have BS/Spam PRs and bugs - I don't really think that the content being
> generated by LLM is really any worse.
>
> This doesn't mean that I think we should blanket allow poor quality LLM
> contributions. It's especially important that we take into account the
> potential for bias, factual errors, and outright plagiarism when these
> tools are used incorrectly. We already have methods for weeding out low
> quality contributions and bad faith contributors - let's trust in these
> and see what we can do to strengthen these tools and processes.
>
> A bit closer to home for me, what about using a LLMs as an assistive
> technology / to reduce boilerplate? I'm recovering from RSI - I don't
> know when (if...) I'll be able to type like I used to again. If a model
> is able to infer some mostly salvageable boilerplate from its context
> window I'm going to use it and spend the effort I would writing that to
> fix something else; an outright ban on LLM use will reduce my _ability_
> to contribute to the project.

Another person approached me after this RFC and asked whether tooling
restricted to the current repo would be okay. For me, that'd be mostly
acceptable, given it won't make suggestions based on copyrighted code.

I also don't have a problem with LLMs being used to help refine commit
messages as long as someone is being sensible about it (e.g. if, as in
your situation, you know what you want to say but you can't type much).

I don't know how to phrase a policy off the top of my head which allows
those two things but not the rest.

>
> What about using a LLM for code documentation? Some models can do a
> passable job of writing decent quality function documentation and, in
> production, I _have_ caught real issues in my logic this way. Why should
> I type that out (and write what I think the code does rather than what
> it actually does) if an LLM can get 'close enough' and I only need to do
> light editing?

I suppose in that sense, it's the same as blindly listening to any
linting tool or warning without understanding what it's flagging and if
it's correct.

> [...]
> As a final not-so-hypothetical, what about a LLM trained on Gentoo docs
> and repos, or more likely trained on exclusively open-source
> contributions and fine-tuned on Gentoo specifics? I'm in the process of
> spinning up several models at work to get a handle on the tech / turn
> more electricity into heat - this is a real possibility (if I can ever
> find the time).

I think that'd be interesting. It also does a good job as a rhetorical
point wrt the policy being a bit too blanket here.

See https://www.softwareheritage.org/2023/10/19/swh-statement-on-llm-for-code/
too.

>
> The cat is out of the bag when it comes to LLMs. In my real-world job I
> talk to scientists and engineers using these things (for their
> strengths) to quickly iterate on designs, to summarise experimental
> results, and even to generate testable hypotheses. We're only going to
> see increasing use of this technology going forward.
>
> TL;DR: I think this is a bad idea. We already have effective mechanisms
> for dealing with spam and bad faith contributions. Banning LLM use by
> Gentoo contributors at this point is just throwing the baby out with the
> bathwater.

The problem is that in FOSS, a lot of people are getting flooded with AI
spam and therefore have little regard for any possibly-good parts of it.

I count myself as part of that group - it's very much sludge and I feel
tired just seeing it talked about at the moment.

Is that super rational? No, but we're also volunteers and it's not
unreasonable for said volunteers to then say "well I don't want any more
of that".

I think this colours a lot of the responses here, and it doesn't
invalidate them, but it also explains why nobody is really interested
in being open to this for now. Who can blame them (me included)?

>
> As an alternative I'd be very happy with some guidelines for the use of LLMs
> and other assistive technologies like "Don't use LLM code snippets
> unless you understand them", "Don't blindly copy and paste LLM output",
> or, my personal favourite, "Don't be a jerk to our poor bug wranglers".
>
> A blanket "No completely AI/LLM generated works" might be fine, too.
>
> Let's see how the legal issues shake out before we start pre-emptively
> banning useful tools. There's a lot of ongoing action in this space - at
> the very least I'd like to see some thorough discussion of the legal
> issues separately if we're making a case for banning an entire class of
> technology.

I'm sympathetic to the arguments you've made here and I don't want to
act like this sinks your whole argument (it doesn't), but this is
typically not how legal issues are approached. People act conservatively
if there's risk to them, not the other way around ;)

> [...]

Thanks for making me think a bit more about it and considering some
use cases I hadn't really thought about.

I still don't really want ebuilds generated by LLMs, but I could live
with:
a) LLMs being used to refine commit messages;
b) LLMs being used if restricted to suggestions from a FOSS-licensed
codebase.

> Matt
>

thanks,
sam
Re: RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
(Full disclosure: I presently work for a non-FAANG cloud company
with a primary business focus in providing GPU access, for AI & other
workloads; I don't feel that is a conflict of interest, but understand
that others might not feel the same way).

Yes, we need to formally address the concerns.
However, I don't come to the same conclusion about an outright ban.

I think we need to:
1. Short-term, clearly point out why much of the present output
would violate existing policies, esp. the low-grade garbage output.
2. Short & medium-term: a time-limited policy saying "no AI-backed
works temporarily, while waiting for legal precedent", with clear
guidelines about what exactly the blocking issues are.
3. Longer-term, produce a policy that shows how AI generation can be
used for good, in a safe way.
4. Keep the human in the loop; no garbage reinforcing garbage.

Further points inline.

On Tue, Feb 27, 2024 at 03:45:17PM +0100, Michał Górny wrote:
> Hello,
>
> Given the recent spread of the "AI" bubble, I think we really need to
> look into formally addressing the related concerns. In my opinion,
> at this point the only reasonable course of action would be to safely
> ban "AI"-backed contribution entirely. In other words, explicitly
> forbid people from using ChatGPT, Bard, GitHub Copilot, and so on, to
> create ebuilds, code, documentation, messages, bug reports and so on for
> use in Gentoo.
Are there footholds where you see AI tooling would be acceptable to you
today? AI-summarization of inputs, if correct & free of hallucinations,
is likely to be of immediate value. I see this coming up in terms of
analyzing code backtraces as well as better license analysis tooling.
The best tools here include citations that should be verified as to why
the system thinks the outcome is correct: buyer-beware if you don't
verify the citations.

> Just to be clear, I'm talking about our "original" content. We can't do
> much about upstream projects using it.
>
> Rationale:
>
> 1. Copyright concerns. At this point, the copyright situation around
> generated content is still unclear. What's pretty clear is that pretty
> much all LLMs are trained on huge corpora of copyrighted material, and
> all fancy "AI" companies don't give shit about copyright violations.
> In particular, there's a good risk that these tools would yield stuff we
> can't legally use.
The Gentoo Foundation (and SPI) are both US legal entities. That means
at least abiding by US copyright law...
As of this writing, the US Copyright Office says AI-generated
works are NOT eligible for their *own* copyright registration. The
outputs are either uncopyrightable, or, if they are sufficiently
similar to existing works, the original copyright stands (with
license and authorship markings required).

That's going to be a problem if the EU, UK & other major WIPO members
come to a different conclusion, but for now, as a US-based organization,
Gentoo has the rules it must follow.

The fact that output *might* be uncopyrightable, and NOT tagged as such,
gives me as much concern as the missing attribution & license statements.
Enough untagged uncopyrightable material present MAY invalidate larger
copyrights.

Clearer definitions of the distinction between public domain and
uncopyrightable works are also needed in our Gentoo documentation (at a high
level: ineligible vs. not copyrighted vs. expired vs. laws/acts-of-government
vs. works-of-government, but there is nuance).

>
> 2. Quality concerns. LLMs are really great at generating plausibly
> looking bullshit. I suppose they can provide good assistance if you are
> careful enough, but we can't really rely on all our contributors being
> aware of the risks.
100% agree; the quality of output is the largest concern *right now*.
The consistency of output is strongly related: given similar inputs
(including best practices not changing over time), it should give
similar outputs.

How good must the output be to negate this concern?
Current-state-of-the-art can probably write ebuilds with fewer QA
violations than most contributors, esp. given automated QA checking
tools for a positive reinforcement loop.

Besides the actual output being low-quality, the larger problem is that
users submitting it don't realize that it's low-quality (or in a few
cases don't care).

Gentoo's existing policies may only need tweaks & re-iteration here.
- GLEP76 does not set out clear guidelines for uncopyrightable works.
- GLEP76 should have a clarification that asserting GCO/DCO over
AI-generated works at this time is not acceptable.

> 3. Ethical concerns. As pointed out above, the "AI" corporations don't
> give shit about copyright, and don't give shit about people. The AI
> bubble is causing huge energy waste. It is giving a great excuse for
> layoffs and increasing exploitation of IT workers. It is driving
> enshittification of the Internet, it is empowering all kinds of spam
> and scam.
Is an ethical AI entity possible? Your argument here is really an
extension of a much older maxim: "There's no ethical consumption under
capitalism". This can encompass most tech corporations, AI or not.
It's just much more readily exposed with AI than with other "big tech"
movements, because AI, and the name of AI, is being used to do immoral &
unethical things far more frequently than before.

A truly ethical AI entity should also not be the outcome of
rent-seeking behaviors (maybe profit-seeking, but that returns to the
perils of capitalism).

The energy waste argument is also one that needs to be made carefully: the
training & fine-tuning phases today are an energy waste only when compared to
the lifetime energy a human uses to learn the same things. When that gets more
efficient, the human may be the energy waste ;-) [1].

The generation/inference phases may be able to generate correct output
MUCH more efficiently than a human. If I think of how many times I run
"ebuild ... test" and "pkgcheck scan" on some packaging, trying to get it
correct: the AI will be able to do a better job than most developers in a
reasonable amount of time...

Gentoo's purpose as an organization is not to be an arbiter of ethics,
but we can stand against unethical actions. Where is that middle ground?

At the top, I noted that it will be possible in future for AI generation
to be used in a good, safe way, and we should provide some signals to
the researchers behind the AI industry on this matter.

What should it have?
- The output has correct license & copyright attributions for portions that are copyrightable.
- The output explicitly disclaims copyright for uncopyrightable portions
(yes, this is a higher bar than we set for humans today).
- The output is provably correct (QA checks, actually running tests etc)
- The output is free of non-functional/nonsense garbage.
- The output is free of hallucinations (aka don't invent dependencies that don't exist).

Can you please contribute other requirements that you feel "good" AI output should have?

[1]
Citation needed; the best estimate I have says:
https://www.eia.gov/tools/faqs/faq.php?id=85&t=1 => 76 MMBtu/person/year
https://www.wolframalpha.com/input?i=+76+MMBtu+to+MWh => 22.27 MWh/person/year
vs.
Facebook claims the entire model-development energy consumption for all 4 sizes of LLaMA was 2,638 MWh:
https://kaspergroesludvigsen.medium.com/facebook-disclose-the-carbon-footprint-of-their-new-llama-models-9629a3c5c28b

2638 / 22.27 => 118.45 people
So development energy was the same as 118 average people doing average things for a year
(not CompSci students compiling their code many times).
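The arithmetic above can be sanity-checked in a few lines (using the same cited figures; the 0.293071 MWh/MMBtu conversion factor is a standard unit constant, not from the thread):

```python
# Rough sanity check of the footnote's figures.
MMBTU_PER_PERSON_YEAR = 76          # US EIA: primary energy per capita
MWH_PER_MMBTU = 0.293071            # standard conversion: 1 MMBtu = 0.293071 MWh
LLAMA_TRAINING_MWH = 2638           # reported total for all four LLaMA model sizes

person_mwh = MMBTU_PER_PERSON_YEAR * MWH_PER_MMBTU   # ~22.27 MWh/person/year
people_equivalent = LLAMA_TRAINING_MWH / person_mwh  # ~118 person-years

print(f"{person_mwh:.2f} MWh/person/year")
print(f"training ~ {people_equivalent:.0f} person-years of energy")
```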

The outcome here: don't use AI where a human would be much more efficient,
unless you have strong reasons why it would be better to use the AI than a
human. We haven't crossed that threshold YET, but the day is coming, esp. with
costs amortized, since training is a rare event compared to inference.


--
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation President & Treasurer
E-Mail : robbat2@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136
Re: RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
On Tue, Mar 05, 2024 at 06:12:06 +0000, Robin H. Johnson wrote:
> At the top, I noted that it will be possible in future for AI generation
> to be used in a good, safe way, and we should provide some signals to
> the researchers behind the AI industry on this matter.
>
> What should it have?
> - The output has correct license & copyright attributions for portions that are copyrightable.
> - The output explicitly disclaims copyright for uncopyrightable portions
> (yes, this is a higher bar than we set for humans today).
> - The output is provably correct (QA checks, actually running tests etc)
> - The output is free of non-functional/nonsense garbage.
> - The output is free of hallucinations (aka don't invent dependencies that don't exist).
>
> Can you please contribute other requirements that you feel "good" AI output should have?
>

- The output is not overly clever even if correct.

It should resemble something a reasonable human might write. For
example, some contrived sequence of Bash parameter expansions vs using
sed.

- The output is succinct enough.

This continues the "reasonable human" theme from above. For example, it
should not first increment some value by 4, then 3, then 2, and finally 1
when incrementing by 10 right off the bat makes more sense.

- The output domain is able to be restricted in some form.

Given a problem, some things are simply outside of the space of valid
answers. For example,

sudo rm -rf --no-preserve-root /

should not be a line that can be generated in the context of ebuilds.

- Simply enumerating restrictions should be considered intractable.

While it may be trivial to create a list of forbidden words in the
context of a basic family-friendly environment, how can you effectively
guard against forbidden constructs when you might not know them all
beforehand? For example, how do you define what constitutes "malicious
output"?
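As a hypothetical illustration of that last point (the filter and both snippets are invented for this example, not taken from any real tool), a naive pattern denylist is trivially bypassed:

```python
import re

# Hypothetical output filter: a denylist of known-dangerous constructs.
FORBIDDEN = [
    re.compile(r"rm\s+-rf\s+--no-preserve-root\s+/"),
]

def looks_safe(snippet: str) -> bool:
    """Return True if no forbidden pattern is found in the snippet."""
    return not any(p.search(snippet) for p in FORBIDDEN)

# The literal form is caught...
assert not looks_safe("sudo rm -rf --no-preserve-root /")
# ...but a trivially obfuscated equivalent slips through, which is
# why enumerating forbidden constructs beforehand does not scale.
assert looks_safe('d=/; sudo rm -rf --no-preserve-root "$d"')
```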

- Oskari
Re: RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
On Tuesday, February 27th, 2024 at 3:45 PM, Michał Górny <mgorny@gentoo.org> wrote:

> Hello,
>
> Given the recent spread of the "AI" bubble, I think we really need to
> look into formally addressing the related concerns. In my opinion,
> at this point the only reasonable course of action would be to safely
> ban "AI"-backed contribution entirely. In other words, explicitly
> forbid people from using ChatGPT, Bard, GitHub Copilot, and so on, to
> create ebuilds, code, documentation, messages, bug reports and so on for
> use in Gentoo.
>
> Just to be clear, I'm talking about our "original" content. We can't do
> much about upstream projects using it.
>
>
> Rationale:
>
> 1. Copyright concerns. At this point, the copyright situation around
> generated content is still unclear. What's pretty clear is that pretty
> much all LLMs are trained on huge corpora of copyrighted material, and
> all fancy "AI" companies don't give shit about copyright violations.
> In particular, there's a good risk that these tools would yield stuff we
> can't legally use.
>
> 2. Quality concerns. LLMs are really great at generating plausibly
> looking bullshit. I suppose they can provide good assistance if you are
> careful enough, but we can't really rely on all our contributors being
> aware of the risks.
>
> 3. Ethical concerns. As pointed out above, the "AI" corporations don't
> give shit about copyright, and don't give shit about people. The AI
> bubble is causing huge energy waste. It is giving a great excuse for
> layoffs and increasing exploitation of IT workers. It is driving
> enshittification of the Internet, it is empowering all kinds of spam
> and scam.
>
>
> Gentoo has always stood out as something different, something that
> worked for people for whom mainstream distros were lacking. I think
> adding "made by real people" to the list of our advantages would be
> a good thing — but we need to have policies in place, to make sure shit
> doesn't flow in.
>
> Compare with the shitstorm at:
> https://github.com/pkgxdev/pantry/issues/5358
>
> --
> Best regards,
> Michał Górny

While I understand the concerns that may have triggered the feeling that a rule like this is needed, as someone from the field of machine learning (an AI engineer), I feel I need to add my brief opinion.

The pkgxdev thing was very artificial, and if there is a real threat to quality/integrity it will not manifest itself as obviously, which brings me to...

A rule like this is just not enforceable.

The contributor, as the one who signed off, is responsible for the quality of the contribution, whether it was written with a plain editor, a dev environment with smart plugins (LSP), or their dog.

Other organizations have had to deal with automated contributions, which can sometimes go wrong for *all different* kinds of reasons, for much longer, and their approach may be an inspiration:
[0] OpenStreetMap: automated edits - https://wiki.openstreetmap.org/wiki/Automated_Edits_code_of_conduct
[1] Wikipedia: bot policy - https://en.wikipedia.org/wiki/Wikipedia:Bot_policy
The AI we are dealing with right now is just another means of automation, after all.

As a machine learning engineer myself, I have been contemplating creating an instance of a generative model for my own use, from my own data, in which case the copyright and ethical points would absolutely not apply.
Also, there are ethically and copyright-sound language model projects, such as Project Bergamot [2], vetted by universities and the EU, and also used by Mozilla [3] (one of the prominent ethical-AI proponents).

Banning all tools just because some might not be up to moral standards puts the ones that are at a disadvantage, in our world as a whole.

[2] Project Bergamot - https://browser.mt/
[3] Mozilla blog: training translation models - https://hacks.mozilla.org/2022/06/training-efficient-neural-network-models-for-firefox-translations/

- Martin
Re: RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
Robin H. Johnson posted on Tue, 5 Mar 2024 06:12:06 +0000 as excerpted:

> The energy waste argument is also one that needs to be made carefully:

Indeed. In a Gentoo context, condemning AI for the computational energy
waste? Maybe someone could argue that effectively. That someone isn't
Gentoo. Something about people living in glass houses throwing stones...

(And overall, I just don't see the original proposal aging well; like a
regulation that all drivers must carry a buggy-whip... =:^ Absolutely,
tweak existing policies with some added AI context here or there as others
have already suggested, but let's leave it at that.)

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Re: RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
On 27/2/24 at 15:45, Michał Górny wrote:
> Hello,
>
> Given the recent spread of the "AI" bubble, I think we really need to
> look into formally addressing the related concerns. In my opinion,
> at this point the only reasonable course of action would be to safely
> ban "AI"-backed contribution entirely. In other words, explicitly
> forbid people from using ChatGPT, Bard, GitHub Copilot, and so on, to
> create ebuilds, code, documentation, messages, bug reports and so on for
> use in Gentoo.
>
> Just to be clear, I'm talking about our "original" content. We can't do
> much about upstream projects using it.
>


I think it would be a big mistake, because in the end we would only be
shooting ourselves in the foot (I'm using machine translation; it probably
doesn't come across the same way in English anyway).
In the end it is a helping tool, and there is always human
intervention to finish the job.

In the end we are going to have to live with AIs in all the environments
of our lives. The sooner we learn to manage them, the more productive
we will be.


> Rationale:
>
> 1. Copyright concerns. At this point, the copyright situation around
> generated content is still unclear. What's pretty clear is that pretty
> much all LLMs are trained on huge corpora of copyrighted material, and
> all fancy "AI" companies don't give shit about copyright violations.
> In particular, there's a good risk that these tools would yield stuff we
> can't legally use.
>
> 2. Quality concerns. LLMs are really great at generating plausibly
> looking bullshit. I suppose they can provide good assistance if you are
> careful enough, but we can't really rely on all our contributors being
> aware of the risks.
>
> 3. Ethical concerns. As pointed out above, the "AI" corporations don't
> give shit about copyright, and don't give shit about people. The AI
> bubble is causing huge energy waste. It is giving a great excuse for
> layoffs and increasing exploitation of IT workers. It is driving
> enshittification of the Internet, it is empowering all kinds of spam
> and scam.
>
>
> Gentoo has always stood out as something different, something that
> worked for people for whom mainstream distros were lacking. I think
> adding "made by real people" to the list of our advantages would be
> a good thing — but we need to have policies in place, to make sure shit
> doesn't flow in.
>
> Compare with the shitstorm at:
> https://github.com/pkgxdev/pantry/issues/5358
>
Re: RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
On Tue, 2024-02-27 at 18:04 +0000, Sam James wrote:
> I'm a bit worried this is slightly performative - which is not a dig at
> you at all - given we can't really enforce it, and it requires honesty,
> but that's also not a reason to not try ;)

I don't think it's really possible or feasible to reliably detect such
contributions, and even if it were, I don't think we want to go as far
as to actively pursue anything that looks like one. The point
of the policy is rather to make a statement that we don't want these,
and to kindly ask users not to do that.

--
Best regards,
Michał Górny
Re: RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
On Fri, 2024-03-01 at 07:06 +0000, Sam James wrote:
> Another person approached me after this RFC and asked whether tooling
> restricted to the current repo would be okay. For me, that'd be mostly
> acceptable, given it won't make suggestions based on copyrighted code.

I think an important question is: how is it restricted? Are we talking
about a tool that was clearly trained on specific code, or about a tool
that was trained on potentially copyright material, then artificially
restricted to the repository (to paper over the concerns)? Can we trust
the latter?

--
Best regards,
Michał Górny
Re: Re: RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
On Fri, 2024-03-08 at 03:59 +0000, Duncan wrote:
> Robin H. Johnson posted on Tue, 5 Mar 2024 06:12:06 +0000 as excerpted:
>
> > The energy waste argument is also one that needs to be made carefully:
>
> Indeed. In a Gentoo context, condemning AI for the computative energy
> waste? Maybe someone could argue that effectively. That someone isn't
> Gentoo. Something about people living in glass houses throwing stones...

Could you support that claim with actual numbers? Particularly,
on average energy use specifically due to use of Gentoo on machines vs.
energy use of dedicated data centers purely for training LLMs? I'm not
even talking of all the energy wasted as a result of these LLMs at work.

--
Best regards,
Michał Górny
Re: RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
Michał Górny posted on Sat, 09 Mar 2024 16:04:58 +0100 as excerpted:

> On Fri, 2024-03-08 at 03:59 +0000, Duncan wrote:
>> Robin H. Johnson posted on Tue, 5 Mar 2024 06:12:06 +0000 as excerpted:
>>
>> > The energy waste argument is also one that needs to be made
>> > carefully:
>>
>> Indeed. In a Gentoo context, condemning AI for the computational energy
>> waste? Maybe someone could argue that effectively. That someone isn't
>> Gentoo. Something about people living in glass houses throwing
>> stones...
>
> Could you support that claim with actual numbers? Particularly,
> on average energy use specifically due to use of Gentoo on machines vs.
> energy use of dedicated data centers purely for training LLMs? I'm not
> even talking of all the energy wasted as a result of these LLMs at work.

Fair question. Actual numbers? No. But...

I'm not saying don't use gentoo -- I'm a gentooer after all -- I'm saying
gentoo simply isn't in a good position to condemn AI for its energy
inefficiency. In fact, I'd claim that in the Gentoo case there are
demonstrably more energy efficient practical alternatives (can anyone
sanely argue otherwise?, there are binary distros after all), while in the
AI case, for some usage AI is providing practical solutions where there
simply /weren't/ practical solutions /at/ /all/ before. In others,
availability and scale was practically and severely cost-limiting compared
to the situation with AI. At least in those cases despite high energy
usage, AI *is* the most efficient -- arguably including energy efficient
-- practical alternative, being the _only_ practical alternative, at least
at scale. Can Gentoo _ever_ be called the _only_ practical alternative,
at scale or not?

Overall, I'd suggest that Gentoo is in as bad or worse a situation in
terms of most energy efficient practical alternative than AI, so it simply
can't credibly make the energy efficiency argument against AI. Debian/
RedHat/etc, perhaps, a case could be reasonably made at least, Gentoo, no,
not credibly.

That isn't to say that Gentoo can't credibly take an anti-AI position
based on the /other/ points discussed in-thread. But energy usage is just
not an argument that can be persuasively made by Gentoo, thereby bringing
down the credibility of the other arguments made with it that are
otherwise viable.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Re: Re: RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
On 3/9/24 4:13 PM, Duncan wrote:
> I'm not saying don't use gentoo -- I'm a gentooer after all -- I'm saying
> gentoo simply isn't in a good position to condemn AI for its energy
> inefficiency. In fact, I'd claim that in the Gentoo case there are
> demonstrably more energy efficient practical alternatives (can anyone
> sanely argue otherwise? There are binary distros, after all), while in
> the AI case, for some uses AI is providing practical solutions where
> there simply /weren't/ practical solutions /at/ /all/ before. In other
> cases, availability and scale were practically and severely
> cost-limited compared to the situation with AI. In those cases,
> despite high energy usage, AI *is* the most efficient -- arguably
> including energy-efficient -- practical alternative, being the _only_
> practical alternative, at least at scale. Can Gentoo _ever_ be called
> the _only_ practical alternative, at scale or not?
>
> Overall, I'd suggest that Gentoo is in as bad a position as AI, or
> worse, in terms of being the most energy-efficient practical
> alternative, so it simply can't credibly make the energy-efficiency
> argument against AI. For Debian/RedHat/etc., perhaps, a case could
> reasonably be made; for Gentoo, no, not credibly.


FWIW, I am not really convinced by this claim... Gentoo is not a
monoculture. I could have installed Gentoo in 2012 and was strongly
tempted, but did not because it didn't have binpkgs; being an early
adopter of https://www.gentoo.org/news/2023/12/29/Gentoo-binary.html is
the single reason I have a Gentoo system today.

There you go, Gentoo is a binary distro. (If you want it to be one.) You
are not required to waste energy in order to use Gentoo.

Leaving that aside, I think it's a bit of a red herring to claim that
one must be *as energy efficient as possible* in order to have the right
to criticize technologies that use orders of magnitude more energy and
don't come with an option to avoid spending said energy.

You also note that AI is providing practical solutions "where none
existed before, for some cases". But I really, really, REALLY don't
think this is the case for AI-backed contributions to Gentoo, which
plainly do have an exceedingly practical solution that has been in use
for a couple decades so far.

So we could perhaps agree that LLMs may not be intrinsically an
impractical energy waste, but using them to contribute to Gentoo *is*?

:)


--
Eli Schwartz
Re: RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo [ In reply to ]
On Tue, 2024-02-27 at 15:45 +0100, Michał Górny wrote:
> Given the recent spread of the "AI" bubble, I think we really need to
> look into formally addressing the related concerns. In my opinion,
> at this point the only reasonable course of action would be to safely
> ban "AI"-backed contribution entirely. In other words, explicitly
> forbid people from using ChatGPT, Bard, GitHub Copilot, and so on, to
> create ebuilds, code, documentation, messages, bug reports and so on for
> use in Gentoo.
>
> Just to be clear, I'm talking about our "original" content. We can't do
> much about upstream projects using it.

Since I've been asked to flesh out a specific motion, here's what I
propose specifically:

"""
It is expressly forbidden to contribute to Gentoo any content that has
been created with the assistance of Natural Language Processing
artificial intelligence tools. This motion can be revisited, should
a case be made for such a tool that does not pose copyright, ethical,
or quality concerns.
"""

This explicitly covers all GPTs, including ChatGPT and Copilot, which is
the category causing the most concern at the moment. At the same time,
it doesn't block more specific applications of machine learning to
problem solving.

Special thanks to Arthur Zamarin for consulting me on this.

--
Best regards,
Michał Górny
Re: RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo [ In reply to ]
Hi,


It's a good thing that
https://wiki.gentoo.org/wiki/Project:Council/AI_policy
has been voted in, and that it mentions:

> This motion can be revisited, should a case be made for such a
> tool that does not pose copyright, ethical, or quality concerns.


I wanted to provide some material for discussing improvements to the
specific phrasing "created with the assistance of Natural Language
Processing artificial intelligence tools", which may not be optimal.


First, I think we should not limit this to LLMs / NLP tools; it should
cover all algorithmically/automatically generated content, any of which
could cause a flood of time-wasting, low-quality information.


Second, I think we should define what would be acceptable use cases of
algorithmically-generated content; I'd suggest for a starting point,
the combination of:

- The algorithm generating such content is proper F/LOSS

- In the case of a machine learning algorithm, the dataset used to
train it is itself proper F/LOSS (with traceability of all of its
bits)

- The algorithm generating such content is reproducible (training
produces the exact same bits)

- The algorithm did not publish the content automatically: all the
content was reviewed and approved by a human, who bears responsibility
for their contribution, and the content has been flagged as having been
generated using $tool.


Third, I think a "developer certificate of origin" policy could be
augmented with the "bot did not publish the content automatically" bits
and should also be mandated in the context of bug reporting, so as to
have a "human gate" for issues discovered by automation / tinderboxes.
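
As one possible shape for such a gate, a pre-receive or CI check could
require both a human Signed-off-by trailer on every commit and, whenever
the content is machine-generated, an explicit trailer naming the tool.
The `Generated-by:` trailer name below is hypothetical, not an existing
Gentoo or DCO convention; this is only a sketch of the idea:

```python
import re

# Hypothetical trailer names -- not an existing Gentoo convention.
SIGNOFF_RE = re.compile(r"^Signed-off-by: .+ <.+@.+>$", re.MULTILINE)
GENERATED_RE = re.compile(r"^Generated-by: .+$", re.MULTILINE)

def check_commit_message(message: str, claims_generated: bool) -> tuple[bool, str]:
    """Return (ok, reason).

    A human must sign off on every commit (the "human gate"); any
    machine-generated content must additionally be flagged with the
    tool that produced it.
    """
    if not SIGNOFF_RE.search(message):
        return False, "missing Signed-off-by trailer (no human gate)"
    if claims_generated and not GENERATED_RE.search(message):
        return False, "generated content must carry a Generated-by trailer"
    return True, "ok"
```

A commit flagged per the fourth criterion above would then carry both
trailers, e.g. `Generated-by: $tool` followed by the reviewer's
`Signed-off-by:` line, and the check rejects anything missing either.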


Best regards,

--
Jérôme
