Mailing List Archive

[Wikimedia-l] Re: Accessing wikipedia metadata
Hi Cristina,

I'd recommend Toolforge, which I use to run the regular queries that power
some of my bot tools. For an example of a Python script I run there to
query info and ftp the results somewhere I can easily access, see:
https://bitbucket.org/mikepeel/wikicode/src/master/query_enwp_articles_no_wikidata.py
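
Roughly, the pattern looks like this (just a minimal sketch, assuming the
'toolforge' Python package on a Toolforge account; the query and output
file are illustrative, not the ones from my script):

import csv

import toolforge  # reads replica credentials from the tool's replica.my.cnf

# Connect to the enwiki Wiki Replica (a read-only copy of the database).
conn = toolforge.connect("enwiki")
try:
    with conn.cursor() as cur:
        # Illustrative query: the ten most recently created main-namespace pages.
        cur.execute(
            """
            SELECT page_id, page_title
            FROM page
            WHERE page_namespace = 0
            ORDER BY page_id DESC
            LIMIT 10
            """
        )
        rows = cur.fetchall()
finally:
    conn.close()

# Save the results somewhere easy to fetch; titles come back as bytes.
with open("results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["page_id", "page_title"])
    for page_id, title in rows:
        writer.writerow([page_id, title.decode("utf-8")])

You can then schedule a script like that to run regularly on Toolforge and
copy the output wherever is convenient.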

Thanks,
Mike

On 16/9/21 16:42:31, Gava, Cristina via Wikimedia-l wrote:
> Hello everyone,
>
> This is my first time posting to this mailing list, so I will be happy
> to receive feedback on how best to interact with the community :)
>
> I am trying to access Wikipedia metadata in a streaming and
> time/resource-sustainable manner. By metadata I mean many of the items
> that can be found in the statistics of a wiki article, such as edits,
> the list of editors, page views, etc.
>
> I would like to do this for an online-classifier type of structure:
> retrieve the data from a large number of wiki pages at regular intervals
> and use it as input for predictions.
>
> I tried using the Wikipedia API; however, it is expensive in time and
> resources, both for me and for Wikipedia.
>
> My preferred option now would be to query the specific tables in the
> Wikipedia database, in the same way this is done through the Quarry
> tool. The problem is that I would like to build a standalone script,
> without depending on a user interface like Quarry. Do you think that
> this is possible? I am still fairly new to all of this and I don't know
> exactly which direction is best.
>
> I saw [1] that I could access the wiki replicas through both Toolforge
> and PAWS; however, I didn't understand which one would serve me better.
> Could I ask you for some feedback?
>
> Also, as far as I understood [2], directly accessing the DB through
> Hive is too technical for what I need, right? Especially since it seems
> that I would need an account with production shell access, and I
> honestly don't think I would be granted one. In any case, I am not
> interested in accessing sensitive or private data.
>
> A last resort would be parsing the analytics dumps; however, this seems
> a less organic way of retrieving and cleaning the data. It would also
> be strongly decentralised and dependent on my own machine, unless I
> uploaded the cleaned data online every time.
>
> Sorry for the long message, but I thought it was better to give you a
> clearer picture (I hope this is clear enough). Even a hint would be
> highly appreciated.
>
> Best,
>
> Cristina
>
> [1] https://meta.wikimedia.org/wiki/Research:Data
>
> [2] https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake
_______________________________________________
Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org, guidelines at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and https://meta.wikimedia.org/wiki/Wikimedia-l
Public archives at https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/message/B3TS4PSMBHQXXGR3XRB2LUOYQXAX62IQ/
To unsubscribe send an email to wikimedia-l-leave@lists.wikimedia.org
[Wikimedia-l] Re: Accessing wikipedia metadata
Mike's suggestion is good. You would likely get better responses by asking
this question of the Wikimedia developers, so I am forwarding it to that list.

Risker

On Thu, 16 Sept 2021 at 14:04, Gava, Cristina via Wikimedia-l <
wikimedia-l@lists.wikimedia.org> wrote:

> [quoted message trimmed]
[Wikimedia-l] Re: Accessing wikipedia metadata
Hi Mike,

Thank you very much for the reply and for the sample material; I'll look into it now.

Cristina
[Wikimedia-l] Re: Accessing wikipedia metadata
Dear Cristina,

You are likely to find more researchers and people who regularly work with
our metadata on the research mailing list.


Send Wiki-research-l mailing list submissions to
wiki-research-l@lists.wikimedia.org

To subscribe or unsubscribe, please visit

https://lists.wikimedia.org/postorius/lists/wiki-research-l.lists.wikimedia.org/

Regards

WSC


[Wikimedia-l] Re: Accessing wikipedia metadata
Hi Risker,

Thank you kindly for redirecting me to a more appropriate forum :)

Cristina