Mailing List Archive

[Wikimedia-l] Re: Bing-ChatGPT
On Sun, Mar 5, 2023 at 8:39 PM Luis (lu.is) <luis@lu.is> wrote:

> On Feb 22, 2023 at 9:28 AM -0800, Sage Ross <ragesoss+wikipedia@gmail.com>,
> wrote:
>
> Luis,
>
> OpenAI researchers have released some info about data sources that
> trained GPT-3 (and hence ChatGPT): https://arxiv.org/abs/2005.14165
>
> See section 2.2, starting on page 8 of the PDF.
>
> The full text of English Wikipedia is one of five sources, the others
> being CommonCrawl, a smaller set of websites scraped from
> upvoted reddit links, and two undisclosed datasets of scanned books.
> (I've read speculation that one of these book datasets is basically the
> Library Genesis archive.) Wikipedia is much smaller than the other
> sources, although it was weighted more heavily than any of them.
> Even with that extra weighting, they say Wikipedia accounts
> for 3% of the total training mix.
>
>
> Thanks, Sage. It turns out Facebook's recently released LLaMA also
> discloses some of its training sources, with similar weighting for Wikipedia:
> only 4.5% of training text, but weighted more heavily than most other sources:
>
> https://twitter.com/GuillaumeLample/status/1629151234597740550
>

Those stats are undercounting, since the top source (CommonCrawl) itself
includes Wikipedia as its third-largest domain:

https://commoncrawl.github.io/cc-crawl-statistics/plots/domains
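To make the undercounting point concrete, here is a back-of-envelope sketch: Wikipedia's effective share of the training mix is its direct weight plus CommonCrawl's weight times whatever fraction of CommonCrawl text is itself Wikipedia. The 3% and 60% weights are the GPT-3 paper's stated mix; the Wikipedia-within-CommonCrawl fraction below is a made-up illustrative value, not a measured one.

```python
# Illustrative arithmetic only -- the 1% figure is hypothetical.
direct_wikipedia_weight = 0.03   # Wikipedia's stated share of the GPT-3 mix
commoncrawl_weight = 0.60        # CommonCrawl's stated share of the GPT-3 mix

# Hypothetical: suppose 1% of sampled CommonCrawl tokens are Wikipedia pages
wikipedia_fraction_of_cc = 0.01

effective_share = direct_wikipedia_weight + commoncrawl_weight * wikipedia_fraction_of_cc
print(f"Effective Wikipedia share: {effective_share:.1%}")
```

Even a small Wikipedia fraction inside CommonCrawl nudges the headline 3% figure upward, which is the sense in which the published per-dataset stats undercount.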

> _______________________________________________
> Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org, guidelines
> at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and
> https://meta.wikimedia.org/wiki/Wikimedia-l
> Public archives at
> https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/message/W3HAFQIMQWBZDTZL6EYZKFG3D2KL7XDL/
> To unsubscribe send an email to wikimedia-l-leave@lists.wikimedia.org