Mailing List Archive

[Wikimedia-l] Re: Bing-ChatGPT
In the European Union, the "Regulation Laying Down Harmonized
Rules on Artificial Intelligence" (aka the AI Act) and the "AI Liability Directive"
(AILD) are in the pipeline. The AI Act text is, AFAIK, currently in the finalization phase.
* https://www.insideprivacy.com/artificial-intelligence/eu-ai-policy-and-regulation-what-to-look-out-for-in-2023/
* https://www.dentons.com/en/insights/articles/2023/february/1/regulating-ai-in-eu-three-things-that-you-need-to-know

An interesting note: there are no Wikipedia articles about these, or even
Wikidata items.

Br,
-- Kimmo Virtanen, Zache

On Sat, Mar 18, 2023 at 4:06 AM Steven Walling <steven.walling@gmail.com>
wrote:

> On Fri, Mar 17, 2023 at 6:03 PM The Cunctator <cunctator@gmail.com> wrote:
>
>> I really feel like we're getting into pretty aggressive corporate abuse
>> of the Wikipedia copyleft.
>>
>
> I completely agree. It makes me pretty angry that Wikipedians have spent
> millions of volunteer hours creating content to educate and inform people
> as accurately as we can, and it's being used to generate convincing but
> often wildly misleading bullshit.
>
> The ground truth on the copyright status of AI-generated content and where
> authorship/ownership lies seems to be rapidly evolving.
> The U.S. Copyright Office recently refused to register copyrights for some
> AI-generated works, seemingly on the principle that they lack human
> authorship and that prompting a model is not equivalent to contracting work
> for hire from an artist or writer.
>
> IANAL of course, but to me this implies that the *egregious* lack of
> attribution in models that rely substantially on Wikipedia violates the
> attribution requirements of CC licenses. Just like the Foundation took a
> principled position in testing the legality of warrantless mass
> surveillance, I would love to see us push back on the notion that it's
> legal or moral for OpenAI or any of these other companies to take our
> content and use it to flood the Internet with machine-generated word
> diarrhea.
>
>
>> On Fri, Mar 17, 2023, 4:45 PM Adam Sobieski <adamsobieski@hotmail.com>
>> wrote:
>>
>>> Hello,
>>>
>>> I would like to indicate "Copilot" in the Edge browser as being
>>> potentially relevant to Wikipedia [1][2].
>>>
>>> It is foreseeable that end-users will be able to open sidebars in their
>>> Web browsers and chat with large language models about the contents of
>>> specific Web documents, e.g., encyclopedia articles. In Web browsers, task
>>> context can be available, including the documents or articles in users'
>>> current tabs, and potentially users' scroll positions and their selections
>>> or highlights of content.
>>>
>>> I, for one, am thinking about how Web standards, e.g., Web schema, can
>>> be of use for amplifying these features and capabilities for end-users.
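>>>
>>> As a toy illustration of the kind of "Web schema" I mean (a hypothetical
>>> sketch only; the property choices are mine, not an existing Wikipedia or
>>> Edge feature), a page could embed schema.org-style JSON-LD that a sidebar
>>> assistant could read as task context:
>>>
>>>   # Hypothetical sketch: schema.org-style JSON-LD metadata that an
>>>   # encyclopedia page could embed so a browser sidebar assistant knows
>>>   # which document the user is looking at and under which license.
>>>   import json
>>>
>>>   article_metadata = {
>>>       "@context": "https://schema.org",
>>>       "@type": "Article",
>>>       "headline": "Artificial intelligence",
>>>       "isPartOf": {"@type": "WebSite", "name": "Wikipedia"},
>>>       "license": "https://creativecommons.org/licenses/by-sa/3.0/",
>>>       "mainEntityOfPage": "https://en.wikipedia.org/wiki/Artificial_intelligence",
>>>   }
>>>   print(json.dumps(article_metadata, indent=2))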
>>>
>>>
>>> Best regards,
>>> Adam Sobieski
>>>
>>> [1]
>>> https://learn.microsoft.com/en-us/deployedge/microsoft-edge-relnote-stable-channel?ranMID=24542#version-1110166141-march-13-2023
>>> [2] https://www.engadget.com/microsoft-edge-ai-copilot-184033427.html
>>>
>>> ------------------------------
>>> *From:* Kimmo Virtanen <kimmo.virtanen@wikimedia.fi>
>>> *Sent:* Friday, March 17, 2023 8:17 AM
>>> *To:* Wikimedia Mailing List <wikimedia-l@lists.wikimedia.org>
>>> *Subject:* [Wikimedia-l] Re: Bing-ChatGPT
>>>
>>> Hi,
>>>
>>> The development of open-source large language models is moving forward.
>>> GPT-4 was released, and it reportedly passed the bar exam and tried to
>>> hire humans to solve captchas that were too complex for it. Meanwhile,
>>> development on the open-source and hacking side has been pretty fast, and
>>> it seems that all the pieces are in place for running LLMs on personal
>>> hardware (and in web browsers). The biggest missing piece is fine-tuning
>>> of open-source models such as NeoX for English; for multilingual and
>>> multimodal (for example, images+text) use, suitable models are also needed.
>>>
>>>
>>> So this is a kind of link dump of things relevant to creating an
>>> open-source LLM model and service, and also a recap of where the hacker
>>> community is now.
>>>
>>>
>>> 1.) Creation of an initial unaligned model.
>>>
>>> - Possible models (a loading sketch follows this list):
>>>   - 20B Neo(X) <https://github.com/EleutherAI/gpt-neox> by EleutherAI (Apache 2.0)
>>>   - Fairseq Dense <https://huggingface.co/KoboldAI/fairseq-dense-13B> by Facebook (MIT licence)
>>>   - LLaMA <https://ai.facebook.com/blog/large-language-model-llama-meta-ai/> by Facebook (custom licence: research use only, weights leaked)
>>>   - Bloom <https://huggingface.co/bigscience/bloom> by BigScience (custom licence <https://huggingface.co/spaces/bigscience/license>: open, non-commercial)
>>>
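>>> A loading sketch for the first of these (a minimal, untested example;
>>> assumes the transformers, torch, and accelerate packages and enough
>>> memory for a 20B-parameter model):
>>>
>>>   # Minimal sketch: load an unaligned base model (EleutherAI's GPT-NeoX-20B)
>>>   # from the Hugging Face hub and generate a short continuation.
>>>   import torch
>>>   from transformers import AutoModelForCausalLM, AutoTokenizer
>>>
>>>   model_name = "EleutherAI/gpt-neox-20b"
>>>   tokenizer = AutoTokenizer.from_pretrained(model_name)
>>>   model = AutoModelForCausalLM.from_pretrained(
>>>       model_name,
>>>       torch_dtype=torch.float16,  # half precision to roughly halve memory use
>>>       device_map="auto",          # let accelerate spread layers across devices
>>>   )
>>>
>>>   inputs = tokenizer("Wikipedia is", return_tensors="pt").to(model.device)
>>>   outputs = model.generate(**inputs, max_new_tokens=40)
>>>   print(tokenizer.decode(outputs[0], skip_special_tokens=True))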
>>>
>>> 2.) Fine-tuning or alignment
>>>
>>> - Example: Stanford Alpaca is LLaMA fine-tuned for ChatGPT-style instruction following (a LoRA sketch follows this list)
>>>   - Alpaca: A Strong, Replicable Instruction-Following Model <https://crfm.stanford.edu/2023/03/13/alpaca.html>
>>>   - Train and run Stanford Alpaca on your own machine <https://replicate.com/blog/replicate-alpaca>
>>>   - GitHub: Alpaca-LoRA: Low-Rank LLaMA Instruct-Tuning <https://github.com/tloen/alpaca-lora>
>>>
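>>> A rough sketch of what the LoRA step looks like with the Hugging Face peft
>>> library (illustrative hyperparameters and a placeholder checkpoint path,
>>> not the actual Alpaca-LoRA training script):
>>>
>>>   # LoRA sketch: wrap a causal LM so that only small low-rank adapter
>>>   # matrices are trained; this is what makes instruction tuning feasible
>>>   # on consumer hardware. Values below are illustrative.
>>>   from transformers import AutoModelForCausalLM
>>>   from peft import LoraConfig, get_peft_model
>>>
>>>   base = "path/to/llama-7b-hf"  # placeholder: a converted LLaMA checkpoint
>>>   model = AutoModelForCausalLM.from_pretrained(base)
>>>
>>>   lora_config = LoraConfig(
>>>       r=8,                                  # rank of the low-rank updates
>>>       lora_alpha=16,                        # scaling factor
>>>       target_modules=["q_proj", "v_proj"],  # attention projections to adapt
>>>       lora_dropout=0.05,
>>>       bias="none",
>>>       task_type="CAUSAL_LM",
>>>   )
>>>   model = get_peft_model(model, lora_config)
>>>   model.print_trainable_parameters()  # typically well under 1% of all weights
>>>
>>>   # From here, train with the usual transformers Trainer on an
>>>   # instruction-following dataset (e.g. the 52k Alpaca examples).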
>>>
>>> 3.) 8-, 4-, and 3-bit quantization of models for reduced hardware requirements (toy sketch after the list)
>>>
>>> - Running LLaMA 7B and 13B on a 64GB M2 MacBook Pro with llama.cpp <https://til.simonwillison.net/llms/llama-7b-m2>
>>> - GitHub: bloomz.cpp <https://github.com/NouamaneTazi/bloomz.cpp> & llama.cpp <https://github.com/ggerganov/llama.cpp> (C++-only versions)
>>> - Int-4 LLaMa is not enough - Int-3 and beyond <https://nolanoorg.substack.com/p/int-4-llama-is-not-enough-int-3-and>
>>> - How is LLaMa.cpp possible? <https://finbarrtimbers.substack.com/p/how-is-llamacpp-possible>
>>>
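>>> To illustrate why fewer bits help, here is a toy round-to-nearest
>>> quantizer in numpy (a concept demo only; llama.cpp and the posts above
>>> use more careful group-wise schemes):
>>>
>>>   # Toy sketch: symmetric round-to-nearest quantization of a weight matrix,
>>>   # with one float scale per row. Real 8/4/3-bit schemes quantize in small
>>>   # groups and handle outliers more carefully, but the memory math is the
>>>   # same: fewer bits per weight means a smaller model in RAM.
>>>   import numpy as np
>>>
>>>   def quantize(weights, bits):
>>>       qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for 4-bit
>>>       scale = np.abs(weights).max(axis=1, keepdims=True) / qmax
>>>       q = np.clip(np.round(weights / scale), -qmax, qmax).astype(np.int8)
>>>       return q, scale
>>>
>>>   def dequantize(q, scale):
>>>       return q.astype(np.float32) * scale
>>>
>>>   w = np.random.default_rng(0).normal(size=(4096, 4096)).astype(np.float32)
>>>   q4, s4 = quantize(w, bits=4)
>>>   print("mean abs error:", np.abs(w - dequantize(q4, s4)).mean())
>>>   print("fp32 MiB:", w.nbytes / 2**20,
>>>         "int4 MiB (packed 2 per byte):", q4.size / 2 / 2**20)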
>>>
>>> 4.) Easy-to-use interfaces
>>>
>>> - Transformers.js <https://xenova.github.io/transformers.js/> (WebAssembly libraries for running LLMs in the browser)
>>> - Dalai <https://github.com/cocktailpeanut/dalai> (run LLaMA and Alpaca on your own computer as a Node.js web service)
>>> - web-stable-diffusion <https://github.com/mlc-ai/web-stable-diffusion> (Stable Diffusion image generation in the browser)
>>>
>>>
>>> Br,
>>> -- Kimmo Virtanen
>>>
>>>
>>> On Mon, Mar 6, 2023 at 6:50 AM Steven Walling <steven.walling@gmail.com>
>>> wrote:
>>>
>>>
>>>
>>> On Sun, Mar 5, 2023 at 8:39 PM Luis (lu.is) <luis@lu.is> wrote:
>>>
>>> On Feb 22, 2023 at 9:28 AM -0800, Sage Ross <
>>> ragesoss+wikipedia@gmail.com>, wrote:
>>>
>>> Luis,
>>>
>>> OpenAI researchers have released some info about data sources that
>>> trained GPT-3 (and hence ChatGPT): https://arxiv.org/abs/2005.14165
>>>
>>> See section 2.2, starting on page 8 of the PDF.
>>>
>>> The full text of English Wikipedia is one of five sources, the others
>>> being CommonCrawl, a smaller subset of scraped websites based on
>>> upvoted reddit links, and two unrevealed datasets of scanned books.
>>> (I've read speculation that one of these datasets is basically the
>>> Library Genesis archive.) Wikipedia is much smaller than the other
>>> datasets, although they did weight it somewhat more heavily than any
>>> other dataset. With the extra weighting, they say Wikipedia accounts
>>> for 3% of the total training.
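>>>
>>> To make the weighting concrete, here is a back-of-the-envelope calculation
>>> (the 3% figure is from above; the corpus sizes are my approximate reading
>>> of Table 2.2 of that paper, so treat them as ballpark numbers):
>>>
>>>   # Rough numbers: how heavily Wikipedia is over-sampled in GPT-3 training.
>>>   wikipedia_tokens = 3e9        # ~3 billion tokens of English Wikipedia
>>>   total_corpus_tokens = 499e9   # ~499 billion tokens across all five sources
>>>   training_share = 0.03         # Wikipedia's share of sampled training data
>>>   tokens_trained_on = 300e9     # total tokens GPT-3 was trained for
>>>
>>>   raw_share = wikipedia_tokens / total_corpus_tokens
>>>   print(f"raw share of corpus:  {raw_share:.2%}")                    # ~0.6%
>>>   print(f"over-sampling factor: {training_share / raw_share:.1f}x")  # ~5x
>>>   print(f"effective epochs:     "
>>>         f"{training_share * tokens_trained_on / wikipedia_tokens:.1f}")  # ~3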
>>>
>>>
>>> Thanks, Sage. Facebook's recently released LLaMA also discloses some of
>>> its training sources, it turns out, with similar weighting for Wikipedia:
>>> only 4.5% of the training text, but more heavily weighted than most other
>>> sources:
>>>
>>> https://twitter.com/GuillaumeLample/status/1629151234597740550
>>>
>>>
>>> Those stats are undercounts, since the top source (CommonCrawl) itself
>>> includes Wikipedia as its third-largest source.
>>>
>>> https://commoncrawl.github.io/cc-crawl-statistics/plots/domains
>>>