Mailing List Archive

[Wikimedia-l] Re: Bing-ChatGPT
This is an important development for editors to be aware of - we're going
to have to be increasingly on the lookout for sources using ML-generated
bullshit. Here are two instances I'm aware of this week:

https://www.thenation.com/article/culture/internet-archive-publishers-lawsuit-chatbot/
> In late February, Tyler Cowen, a libertarian economics professor at George
> Mason University, published a blog post titled
> <https://web.archive.org/web/20230305055906/https:/marginalrevolution.com/marginalrevolution/2023/02/who-was-the-most-important-critic-of-the-printing-press-in-the-17th-century.html>,
> “Who was the most important critic of the printing press in the 17th
> century?” Cowen’s post contended that the polymath and statesman Francis
> Bacon was an “important” critic of the printing press; unfortunately, the
> post contains long, fake quotes attributed to Bacon’s *The Advancement of
> Learning* (1605), complete with false chapter and section numbers.
> Tech writer Mathew Ingram drew attention to the fabrications a few days
> later
> <https://newsletter.mathewingram.com/tyler-cowen-francis-bacon-and-the-chatgpt-engine/>,
> noting that Cowen has been writing approvingly about the AI chatbot
> ChatGPT
> <https://marginalrevolution.com/marginalrevolution/2023/02/how-should-you-talk-to-chatgpt-a-users-guide.html> for
> some time now; several commenters on Cowen’s post assumed the fake quotes
> must be the handiwork of ChatGPT. (Cowen did not reply to e-mailed
> questions regarding the post by press time, and later removed the post
> entirely, with no explanation whatsoever. However, a copy remains at the
> Internet Archive’s Wayback Machine).

https://www.vice.com/en/article/3akz8y/ai-injected-misinformation-into-article-claiming-misinformation-in-navalny-doc
> An article claiming to identify misinformation in an Oscar-winning
> documentary about imprisoned Russian dissident Alexei Navalny is itself
> full of misinformation, thanks to the author using AI.
> Investigative news outlet *The Grayzone* recently published an article
> <https://thegrayzone.com/2023/03/13/oscar-navalny-documentary-misinformation/>
> that included AI-generated text as a source for its information. The
> piece
> <http://web.archive.org/web/20230314131551/https://thegrayzone.com/2023/03/13/oscar-navalny-documentary-misinformation/>,
> “Oscar-winning ‘Navalny’ documentary is packed with misinformation” by Lucy
> Komisar, included hyperlinks to PDFs
> <http://web.archive.org/web/20230314121144/https://www.thekomisarscoop.com/wp-content/uploads/2023/02/Many-contributors-have-backgrounds-that-suggest-they-are-biased-in-favor-of-western-governments-and-against-its-enemies.pdf>
> uploaded to the author’s personal website that appear to be screenshots
> of conversations she had with ChatSonic, a free generative AI chatbot that
> advertises itself as a ChatGPT alternative that can “write factual trending
> content” using Google search results.

That said, I don't think this is anything to be too stressed about; the
Grayzone is already a deprecated source and blogs like Marginal Revolution
are treated with caution, though Cowen has sufficient credentials to be
treated as a reliable expert.

On Fri, Mar 17, 2023 at 11:23 AM Kimmo Virtanen <kimmo.virtanen@wikimedia.fi>
wrote:

> Hi,
>
> The development of open-source large language models is moving forward.
> GPT-4 was released, and it reportedly passed the bar exam and hired a human
> to solve CAPTCHAs that were too complex for it. Meanwhile, development on
> the open-source and hacking side has been fast, and it seems all the pieces
> are now in place for running LLM models on personal hardware (and in web
> browsers). The biggest missing piece is fine-tuning of open-source models
> such as NeoX for English; for multilingual and multimodal use (for example,
> images + text), suitable models are also needed.
>
>
> So this is kind of a link dump of things relevant to building an open-source
> LLM model and service, and also a recap of where the hacker community
> stands now.
>
>
> 1.) Creation of an initial unaligned model.
>
> - Possible models
> - 20b Neo(X) <https://github.com/EleutherAI/gpt-neox> by EleutherAI
> (Apache 2.0)
> - Fairseq Dense <https://huggingface.co/KoboldAI/fairseq-dense-13B> by
> Facebook (MIT license)
> - LLaMA
> <https://ai.facebook.com/blog/large-language-model-llama-meta-ai/> by
> Facebook (custom research-only license; weights have leaked)
> - Bloom <https://huggingface.co/bigscience/bloom> by BigScience (custom
> RAIL license <https://huggingface.co/spaces/bigscience/license>, open
> with use restrictions)
>
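For anyone who wants to poke at one of these base models locally, here is a rough
Python sketch, not a recipe from the links above. It assumes the Hugging Face
transformers and accelerate packages; the NeoX checkpoint is the EleutherAI one
linked above, but it needs roughly 40 GB of memory, so swap in a smaller model
just to test the plumbing.

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "EleutherAI/gpt-neox-20b"  # ~40 GB of weights; use a smaller model for a quick test
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # device_map="auto" (via accelerate) spreads the weights across available GPUs/CPU RAM
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

    prompt = "Wikipedia is"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=40)
    print(tokenizer.decode(output[0], skip_special_tokens=True))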
>
> 2.) Fine-tuning or alignment
>
> - Example: Stanford Alpaca is LLaMA instruction-tuned on data generated
> with an OpenAI model
> - Alpaca: A Strong, Replicable Instruction-Following Model
> <https://crfm.stanford.edu/2023/03/13/alpaca.html>
> - Train and run Stanford Alpaca on your own machine
> <https://replicate.com/blog/replicate-alpaca>
> - Github: Alpaca-LoRA: Low-Rank LLaMA Instruct-Tuning
> <https://github.com/tloen/alpaca-lora>
>
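Alpaca-LoRA above is the actual recipe; purely as an illustration of what the
LoRA step looks like in Python, here is a sketch assuming the Hugging Face peft
and transformers packages, with a small placeholder model standing in for LLaMA.

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    # Placeholder base model; the real Alpaca-LoRA recipe starts from LLaMA 7B weights.
    base_model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125m")

    # Low-rank adapters are attached to the attention projections; only they are trained.
    lora_config = LoraConfig(
        r=8,
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(base_model, lora_config)
    model.print_trainable_parameters()  # typically well under 1% of the full model
    # ...then train with the usual Trainer / training loop on instruction data.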
>
> 3.) 8-, 4-, and 3-bit quantization of models for reduced hardware requirements
>
> - Running LLaMA 7B and 13B on a 64GB M2 MacBook Pro with llama.cpp
> <https://til.simonwillison.net/llms/llama-7b-m2>
> - Github: bloomz.cpp <https://github.com/NouamaneTazi/bloomz.cpp> &
> llama.cpp <https://github.com/ggerganov/llama.cpp> (C++ only versions)
> - Int-4 LLaMa is not enough - Int-3 and beyond
> <https://nolanoorg.substack.com/p/int-4-llama-is-not-enough-int-3-and>
> - How is LLaMa.cpp possible?
> <https://finbarrtimbers.substack.com/p/how-is-llamacpp-possible>
>
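The links above use llama.cpp's own quantized format; the same memory-saving
idea is also reachable from Python via bitsandbytes 8-bit loading. A sketch,
assuming transformers, accelerate and bitsandbytes are installed and a CUDA GPU
is available (the model ID is only an example):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "bigscience/bloomz-7b1"  # example model; any causal LM on the Hub works
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # load_in_8bit quantizes the weights via bitsandbytes at load time,
    # roughly halving memory compared to fp16.
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",
        load_in_8bit=True,
    )

    inputs = tokenizer("Summarize LoRA in one sentence:", return_tensors="pt").to(model.device)
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=40)[0], skip_special_tokens=True))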
>
> 4.) Easy-to-use interfaces
>
> - Transformers.js <https://xenova.github.io/transformers.js/> (a WebAssembly
> library for running LLM models in the browser)
> - Dalai <https://github.com/cocktailpeanut/dalai> (run LLaMA and
> Alpaca on your own computer as a Node.js web service)
> - web-stable-diffusion <https://github.com/mlc-ai/web-stable-diffusion> (Stable
> Diffusion image generation in the browser)
>
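Dalai itself is a Node.js service; purely to illustrate the "local model behind
a small web API" idea in Python, here is a minimal Flask sketch with a tiny
placeholder model (flask and transformers assumed installed):

    from flask import Flask, jsonify, request
    from transformers import pipeline

    app = Flask(__name__)
    generator = pipeline("text-generation", model="distilgpt2")  # tiny placeholder model

    @app.post("/generate")
    def generate():
        prompt = request.get_json(force=True).get("prompt", "")
        result = generator(prompt, max_new_tokens=60)[0]["generated_text"]
        return jsonify({"text": result})

    if __name__ == "__main__":
        # e.g. curl -X POST localhost:8000/generate -H 'Content-Type: application/json' -d '{"prompt":"Hello"}'
        app.run(port=8000)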
>
> Br,
> -- Kimmo Virtanen
>
> On Fri, Mar 17, 2023 at 1:53 PM Kimmo Virtanen <kimmo.virtanen@gmail.com>
> wrote:
>
>>
>> On Mon, Mar 6, 2023 at 6:50 AM Steven Walling <steven.walling@gmail.com>
>> wrote:
>>
>>>
>>>
>>> On Sun, Mar 5, 2023 at 8:39 PM Luis (lu.is) <luis@lu.is> wrote:
>>>
>>>> On Feb 22, 2023 at 9:28 AM -0800, Sage Ross <
>>>> ragesoss+wikipedia@gmail.com> wrote:
>>>>
>>>> Luis,
>>>>
>>>> OpenAI researchers have released some info about data sources that
>>>> trained GPT-3 (and hence ChatGPT): https://arxiv.org/abs/2005.14165
>>>>
>>>> See section 2.2, starting on page 8 of the PDF.
>>>>
>>>> The full text of English Wikipedia is one of five sources, the others
>>>> being CommonCrawl, a smaller subset of scraped websites based on
>>>> upvoted reddit links, and two unrevealed datasets of scanned books.
>>>> (I've read speculation that one of these datasets is basically the
>>>> Library Genesis archive.) Wikipedia is much smaller than the other
>>>> datasets, although they did weight it somewhat more heavily than any
>>>> other dataset. With the extra weighting, they say Wikipedia accounts
>>>> for 3% of the total training.
>>>>
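To make the weighting point concrete, here is a small back-of-the-envelope
sketch in Python. The token counts and mix shares are my approximate reading of
Table 2.2 in that paper, so check the PDF for the exact figures.

    # Approximate GPT-3 training mix (see Table 2.2 of https://arxiv.org/abs/2005.14165).
    # name: (tokens in billions, share of the sampled training mix)
    sources = {
        "Common Crawl": (410, 0.60),
        "WebText2":     (19,  0.22),
        "Books1":       (12,  0.08),
        "Books2":       (55,  0.08),
        "Wikipedia":    (3,   0.03),
    }

    total_tokens = sum(tokens for tokens, _ in sources.values())
    for name, (tokens, mix_share) in sources.items():
        raw_share = tokens / total_tokens
        # Up-weighting means a small, high-quality corpus is sampled more often per token.
        print(f"{name:12s} {raw_share:6.1%} of raw tokens -> {mix_share:5.1%} of training mix")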
>>>>
>>>> Thanks, Sage. Facebook’s recently-released LLaMa also shares some of
>>>> their training sources, it turns out, with similar weighting for Wikipedia
>>>> - only 4.5% of training text, but more heavily weighted than most other
>>>> sources:
>>>>
>>>> https://twitter.com/GuillaumeLample/status/1629151234597740550
>>>>
>>>
>>> Those stats are an undercount, since the top source (CommonCrawl) itself
>>> includes Wikipedia as its third-largest source.
>>>
>>> https://commoncrawl.github.io/cc-crawl-statistics/plots/domains
>>>