Mailing List Archive

[Wikimedia-l] Re: Bing-ChatGPT
On Feb 22, 2023 at 9:28 AM -0800, Sage Ross wrote:
> Luis,
> OpenAI researchers have released some info about data sources that
> trained GPT-3 (and hence ChatGPT):
> See section 2.2, starting on page 8 of the PDF.
> The full text of English Wikipedia is one of five sources, the others
> being CommonCrawl, a smaller subset of scraped websites based on
> upvoted reddit links, and two unrevealed datasets of scanned books.
> (I've read speculation that one of these datasets is basically the
> Library Genesis archive.) Wikipedia is much smaller than the other
> datasets, although they did weight it somewhat more heavily than any
> other dataset. With the extra weighting, they say Wikipedia accounts
> for 3% of the total training.

Thanks, Sage. It turns out that Facebook’s recently released LLaMA also discloses some of its training sources, with similar weighting for Wikipedia. It makes up only 4.5% of the training text, but is weighted more heavily than most other sources: