Mailing List Archive: [Wikimedia-l] Re: Bing-ChatGPT

I really feel like we're getting into pretty aggressive corporate abuse of
the Wikipedia copyleft.

On Fri, Mar 17, 2023, 4:45 PM Adam Sobieski <adamsobieski@hotmail.com>
wrote:

> Hello,
>
> I would like to indicate "Copilot" in the Edge browser as being
> potentially relevant to Wikipedia [1][2].
>
> It is foreseeable that end-users will be able to open sidebars in their
> Web browsers and subsequently chat with large language models about the
> contents of specific Web documents, e.g., encyclopedia articles. Using Web
> browsers, there can be task contexts available, including the documents or
> articles in users' current tabs, potentially including users' scroll
> positions, potentially including users' selections or highlightings of
> content.
>
> I, for one, am thinking about how Web standards, e.g., Web schema, can be
> of use for amplifying these features and capabilities for end-users.
>
>
> Best regards,
> Adam Sobieski
>
> [1]
> https://learn.microsoft.com/en-us/deployedge/microsoft-edge-relnote-stable-channel?ranMID=24542#version-1110166141-march-13-2023
> [2] https://www.engadget.com/microsoft-edge-ai-copilot-184033427.html
>
> ------------------------------
> *From:* Kimmo Virtanen <kimmo.virtanen@wikimedia.fi>
> *Sent:* Friday, March 17, 2023 8:17 AM
> *To:* Wikimedia Mailing List <wikimedia-l@lists.wikimedia.org>
> *Subject:* [Wikimedia-l] Re: Bing-ChatGPT
>
> Hi,
>
> The development of open-source large language models is going forward. The
> GPT-4 was released and it seems that it passed the Bar exam and tried to
> hire humans to solve catchpas which were too complex. However, the
> development in the open source and hacking side has been pretty fast and it
> seems that there are all the pieces for running LLM models in personal
> hardware (and in web browsers). Biggest missing piece is fine tuning of
> open source models such as Neox for the English language. For multilingual
> and multimodal (for example images+text) the model is also needed.
>
>
> So this is kind of a link dump for relevant things for creation of open
> source LLM model and service and also recap where the hacker community is
> now.
>
>
> 1.) Creation of an initial unaligned model.
>
> - Possible models
> - 20b Neo(X) <https://github.com/EleutherAI/gpt-neox> by EleutherAI
> (Apache 2.0)
> - Fairseq Dense <https://huggingface.co/KoboldAI/fairseq-dense-13B> by
> Facebook (MIT-licence)
> - LLaMa
> <https://ai.facebook.com/blog/large-language-model-llama-meta-ai/> by
> Facebook (custom license, leaked research use only)
> - Bloom <https://huggingface.co/bigscience/bloom> by Bigscience (custom
> license <https://huggingface.co/spaces/bigscience/license>. open,
> non-commercial)
>
>
> 2.) Fine-tuning or align
>
> - Example: Standford Alpaca is ChatGPT fine-tuned LLaMa
> - Alpaca: A Strong, Replicable Instruction-Following Model
> <https://crfm.stanford.edu/2023/03/13/alpaca.html>
> - Train and run Stanford Alpaca on your own machine
> <https://replicate.com/blog/replicate-alpaca>
> - Github: Alpaca-LoRA: Low-Rank LLaMA Instruct-Tuning
> <https://github.com/tloen/alpaca-lora>
>
>
> 3.) 8,4,3 bit-quantization of model for reduced hardware requirements
>
> - Running LLaMA 7B and 13B on a 64GB M2 MacBook Pro with llama.cpp
> <https://til.simonwillison.net/llms/llama-7b-m2>
> - Github: bloomz.cpp <https://github.com/NouamaneTazi/bloomz.cpp> &
> llama.cpp <https://github.com/ggerganov/llama.cpp> (C++ only versions)
> - Int-4 LLaMa is not enough - Int-3 and beyond
> <https://nolanoorg.substack.com/p/int-4-llama-is-not-enough-int-3-and>
> - How is LLaMa.cpp possible?
> <https://finbarrtimbers.substack.com/p/how-is-llamacpp-possible>
>
>
> 4.) Easy-to-use interfaces
>
> - Transformer.js <https://xenova.github.io/transformers.js/> (WebAssembly
> libraries to run LLM models in the browser)
> - Dalai <https://github.com/cocktailpeanut/dalai> ( run LLaMA and
> Alpaca in own computer as Node.js web service)
> - web-stable-diffusion <https://github.com/mlc-ai/web-stable-diffusion> (stable
> diffusion image generation in browser)
>
>
> Br,
> -- Kimmo Virtanen
>
> On Fri, Mar 17, 2023 at 1:53?PM Kimmo Virtanen <kimmo.virtanen@gmail.com>
> wrote:
>
> Hi,
>
> The development of open-source large language models is going forward. The
> GPT-4 was released and it seems that it passed the Bar exam and tried to
> hire humans to solve catchpas which were too complex to it. However, the
> development in open source and hacking side has been pretty fast and it
> seems that there is all the pieces for running LLM models in personal
> hardware (and in web browser). Biggest missing piece is fine tuning of
> open source model such as Neox for english language. For multilingual and
> multimodal (for example images+text) the model is also needed.
>
>
> So this is kind of link dump for relevant things for creation of open
> source LLM model and service and also recap where hacker community is now.
>
>
> 1.) Creation of an initial unaligned model.
>
> - Possible models
> - 20b Neo(X) <https://github.com/EleutherAI/gpt-neox> by EleutherAI
> (Apache 2.0)
> - Fairseq Dense <https://huggingface.co/KoboldAI/fairseq-dense-13B> by
> Facebook (MIT-licence)
> - LLaMa
> <https://ai.facebook.com/blog/large-language-model-llama-meta-ai/> by
> Facebook (custom license, leaked research use only)
> - Bloom <https://huggingface.co/bigscience/bloom> by Bigscience (custom
> license <https://huggingface.co/spaces/bigscience/license>. open,
> non-commercial)
>
>
> 2.) Fine-tuning or align
>
> - Example: Standford Alpaca is ChatGPT fine-tuned LLaMa
> - Alpaca: A Strong, Replicable Instruction-Following Model
> <https://crfm.stanford.edu/2023/03/13/alpaca.html>
> - Train and run Stanford Alpaca on your own machine
> <https://replicate.com/blog/replicate-alpaca>
> - Github: Alpaca-LoRA: Low-Rank LLaMA Instruct-Tuning
> <https://github.com/tloen/alpaca-lora>
>
>
> 3.) 8,4,3 bit-quantization of model for reduced hardware requirements
>
> - Running LLaMA 7B and 13B on a 64GB M2 MacBook Pro with llama.cpp
> <https://til.simonwillison.net/llms/llama-7b-m2>
> - Github: bloomz.cpp <https://github.com/NouamaneTazi/bloomz.cpp> &
> llama.cpp <https://github.com/ggerganov/llama.cpp> (C++ only versions)
> - Int-4 LLaMa is not enough - Int-3 and beyond
> <https://nolanoorg.substack.com/p/int-4-llama-is-not-enough-int-3-and>
> - How is LLaMa.cpp possible?
> <https://finbarrtimbers.substack.com/p/how-is-llamacpp-possible>
>
>
> 4.) Easy-to-use interfaces
>
> - Transformer.js <https://xenova.github.io/transformers.js/> (WebAssembly
> libraries to run LLM models in the browser)
> - Dalai <https://github.com/cocktailpeanut/dalai> ( run LLaMA and
> Alpaca in own computer as Node.js web service)
> - web-stable-diffusion <https://github.com/mlc-ai/web-stable-diffusion> (stable
> diffusion image generation in browser)
>
>
> Br,
> -- Kimmo Virtanen
>
> On Mon, Mar 6, 2023 at 6:50?AM Steven Walling <steven.walling@gmail.com>
> wrote:
>
>
>
> On Sun, Mar 5, 2023 at 8:39 PM Luis (lu.is) <luis@lu.is> wrote:
>
> On Feb 22, 2023 at 9:28 AM -0800, Sage Ross <ragesoss+wikipedia@gmail.com>,
> wrote:
>
> Luis,
>
> OpenAI researchers have released some info about data sources that
> trained GPT-3 (and hence ChatGPT): https://arxiv.org/abs/2005.14165
>
> See section 2.2, starting on page 8 of the PDF.
>
> The full text of English Wikipedia is one of five sources, the others
> being CommonCrawl, a smaller subset of scraped websites based on
> upvoted reddit links, and two unrevealed datasets of scanned books.
> (I've read speculation that one of these datasets is basically the
> Library Genesis archive.) Wikipedia is much smaller than the other
> datasets, although they did weight it somewhat more heavily than any
> other dataset. With the extra weighting, they say Wikipedia accounts
> for 3% of the total training.
>
>
> Thanks, Sage. Facebook’s recently-released LLaMa also shares some of their
> training sources, it turns out, with similar weighting for Wikipedia - only
> 4.5% of training text, but more heavily weighted than most other sources:
>
> https://twitter.com/GuillaumeLample/status/1629151234597740550
>
>
> Those stats are undercounting, since the top source (CommonCrawl) also
> itself includes Wikipedia as its third largest source.
>
> https://commoncrawl.github.io/cc-crawl-statistics/plots/domains
>
> <https://twitter.com/GuillaumeLample/status/1629151234597740550>
> _______________________________________________
> Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org, guidelines
> at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and
> https://meta.wikimedia.org/wiki/Wikimedia-l
> Public archives at
> https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/message/W3HAFQIMQWBZDTZL6EYZKFG3D2KL7XDL/
> To unsubscribe send an email to wikimedia-l-leave@lists.wikimedia.org
>
> _______________________________________________
> Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org, guidelines
> at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and
> https://meta.wikimedia.org/wiki/Wikimedia-l
> Public archives at
> https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/message/6UKCJWOUR2KVTS7QZYKPMKQGONXZ72QR/
> To unsubscribe send an email to wikimedia-l-leave@lists.wikimedia.org
>
> _______________________________________________
> Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org, guidelines
> at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and
> https://meta.wikimedia.org/wiki/Wikimedia-l
> Public archives at
> https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/message/AWUNEC7JCIHFPE3LS5M2MDZTMVG25V3H/
> To unsubscribe send an email to wikimedia-l-leave@lists.wikimedia.org