Stream of recent changes diffs
Dear all,

we have developed a tool that is (in some cases) capable of checking if
formulae in <math/>-tags in the context of a wikitext fragment are likely
to be correct or not. We would like to test the tool on the recent changes.
From

https://www.mediawiki.org/wiki/API:Recent_changes_stream

we can get the stream of recent changes. However, I did not find a way to
get the diff (either in HTML or wikitext) to see how the content was
changed. The only option I see is to request the revision text separately
for each change. This would cause quite a few unnecessary requests, since
most changes do not touch <math/>-tags. I assume that others, e.g., ORES

https://www.mediawiki.org/wiki/ORES,

compute the diffs anyhow, and I wonder if there is an easier way to get the
diffs from the recent changes stream without additional requests.

All the best
Physikerwelt (Moritz Schubotz)
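
A minimal sketch of the first step described above, assuming the public
EventStreams endpoint and the `recentchange` event fields `type`, `wiki`,
`title`, `revision.old` and `revision.new`. The helper names are hypothetical.
It does not open the stream; it only shows how one line of the stream could
be decoded and how page edits could be selected before any revision text is
requested.

```python
import json

# Public EventStreams endpoint for recent changes (Server-Sent Events).
STREAM_URL = "https://stream.wikimedia.org/v2/stream/recentchange"


def parse_sse_data(line):
    """Return the decoded JSON payload of an SSE 'data:' line, or None."""
    if not line.startswith("data:"):
        return None
    try:
        return json.loads(line[len("data:"):].strip())
    except json.JSONDecodeError:
        return None


def edit_revisions(event, wiki="enwiki"):
    """Return (title, old_rev, new_rev) for page edits on `wiki`, else None."""
    if not event or event.get("type") != "edit" or event.get("wiki") != wiki:
        return None
    revision = event.get("revision") or {}
    if "old" not in revision or "new" not in revision:
        return None
    return event.get("title"), revision["old"], revision["new"]


# Hand-written sample line, not captured from the real stream.
sample = (
    'data: {"type": "edit", "wiki": "enwiki", "title": "Pi", '
    '"revision": {"old": 10, "new": 11}}'
)
print(edit_revisions(parse_sse_data(sample)))
```

Every tuple returned this way still needs at least one follow-up request to
find out whether a formula was touched, which is the overhead the question
is about.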
Re: Stream of recent changes diffs [ In reply to ]
This isn't helpful now, but your use case is relevant to something I hope
to pursue in the future: comprehensive mediawiki change events, including
content. I don't have a great place yet for collecting these use cases, so
I added it to the Modern Event Platform parent ticket
<https://phabricator.wikimedia.org/T185233> so I don't forget. :)



Re: Stream of recent changes diffs [ In reply to ]
I’m no expert, but I believe the only way to get a diff via the API is through https://www.mediawiki.org/wiki/API:Compare. I haven’t worked with it to any great degree, though, so I’m afraid I can’t help beyond pointing you in that direction.
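
A small sketch of what such a request could look like, assuming the
`action=compare` module with its `fromrev`, `torev` and `prop` parameters.
The function name and the default endpoint are illustrative choices, and the
revision IDs are those of the sandbox diff linked elsewhere in this thread.

```python
from urllib.parse import urlencode


def compare_url(old_rev, new_rev, api="https://en.wikipedia.org/w/api.php"):
    """Build an action=compare request URL for two revision IDs."""
    params = {
        "action": "compare",
        "fromrev": old_rev,
        "torev": new_rev,
        "prop": "diff",
        "format": "json",
    }
    return api + "?" + urlencode(params)


url = compare_url(1031413445, 1031413484)
print(url)
```

This is still one extra request per change, so it does not remove the
overhead mentioned in the original question.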

Re: Stream of recent changes diffs [ In reply to ]
I'm not sure diffs are going to be useful here. For example, this diff <https://en.wikipedia.org/w/index.php?title=User:RoySmith/sandbox&diff=1031413484&oldid=1031413445&diffmode=source> ostensibly introduces an error in the math markup, but due to the way I've formatted the wikitext, it's not obvious from the diff that this is within <math>...</math> tags.

You might end up having to do this using the database dumps <https://meta.wikimedia.org/wiki/Data_dumps>, which is going to entail looking at a lot more data (extreme understatement) than the recent changes stream.
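
To illustrate the problem with a toy example (invented wikitext, not the
content of the sandbox page above): when a formula spans several lines, the
changed line itself carries no hint that it sits inside a math tag. Real
diffs include some context lines, but a long enough formula pushes the tag
outside of them, which `n=0` simulates here.

```python
import difflib

old = ["Some text <math>", "a^2 + b^2", "= c^2", "</math> more text"]
new = ["Some text <math>", "a^2 + b^2", "= c^{2", "</math> more text"]

# n=0 drops context lines, leaving only headers, hunk markers and changes.
changed = [
    line
    for line in difflib.unified_diff(old, new, lineterm="", n=0)
    if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
]
print(changed)

# None of the changed lines mentions the tag, so the diff alone cannot
# tell us that the edit happened inside a formula.
tag_visible = any("<math" in line for line in changed)
print(tag_visible)
```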

Re: Stream of recent changes diffs [ In reply to ]
Note on ORES as one of its maintainers:
ORES doesn't use recent changes for getting content and scoring edits. It
hits the API.

HTH
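
For reference, fetching revision text through the Action API could look
roughly like the sketch below. This is not ORES's actual code. It assumes
the `action=query` module with `prop=revisions`, `rvslots=main` and
`formatversion=2`; the function names and the sample response are made up
for illustration.

```python
from urllib.parse import urlencode


def revisions_url(rev_ids, api="https://en.wikipedia.org/w/api.php"):
    """Build a query URL that asks for the wikitext of the given revisions."""
    params = {
        "action": "query",
        "prop": "revisions",
        "revids": "|".join(str(r) for r in rev_ids),
        "rvprop": "ids|content",
        "rvslots": "main",
        "format": "json",
        "formatversion": "2",
    }
    return api + "?" + urlencode(params)


def wikitext_by_revid(response):
    """Map revision ID to wikitext from a formatversion=2 response."""
    out = {}
    for page in response.get("query", {}).get("pages", []):
        for rev in page.get("revisions", []):
            out[rev["revid"]] = rev["slots"]["main"]["content"]
    return out


# Hand-written stand-in for a decoded JSON response.
sample_response = {
    "query": {
        "pages": [
            {
                "title": "Pi",
                "revisions": [
                    {"revid": 10, "slots": {"main": {"content": "<math>a</math>"}}},
                    {"revid": 11, "slots": {"main": {"content": "<math>b</math>"}}},
                ],
            }
        ]
    }
}
print(revisions_url([10, 11]))
print(wikitext_by_revid(sample_response))
```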




--
Amir (he/him)
Re: Stream of recent changes diffs [ In reply to ]
Hi all,

thank you for your feedback.

@Roy this is a good point, I honestly did not think about it before.
However, in the back of my mind, I remembered this problem had been
solved before. There is a beta feature called visual diffs.

If you enable this feature and navigate to
https://en.wikipedia.org/w/index.php?title=User:RoySmith/sandbox&type=revision&diff=1031413484&oldid=1031413445&diffmode=visual

you see the text "math formula changed". This is exactly what I was
looking for. Unfortunately, I have not yet been able to figure out if there
is an API to get the visual diffs. I had expected it to be in
RESTBase, but found nothing there.

If I could get to that API, the problem would be simple enough to start
with the implementation.

All the best
Moritz

Re: Stream of recent changes diffs [ In reply to ]
I'm afraid that the visual differ isn't helpfully set up for this. Its
approach is to fetch the parsoid HTML for the revisions to compare, and
then generate the comparison output client-side. It's not terribly reusable
outside of the VisualEditor context -- all of the describing of changes
that it does leans heavily on interrogating VisualEditor's data model for
information about the things that changed.

The best I can say about this for your purposes is that using the parsoid
HTML *would* relieve you of having to parse wikitext to work out whether
the contents of a math tag were what changed.
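
A rough sketch of that idea, under the assumption that Parsoid HTML marks
math nodes with `typeof="mw:Extension/math"` and stores the TeX source in
the `data-mw` attribute under `body.extsrc`. The class, the helpers and the
sample markup are hypothetical stand-ins; real Parsoid output should be
checked before relying on this.

```python
import json
from html.parser import HTMLParser


class MathSourceCollector(HTMLParser):
    """Collect the TeX source of every math extension node."""

    def __init__(self):
        super().__init__()
        self.formulas = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "mw:Extension/math" not in (attrs.get("typeof") or "").split():
            return
        data_mw = json.loads(attrs.get("data-mw") or "{}")
        self.formulas.append(data_mw.get("body", {}).get("extsrc", ""))


def math_sources(html):
    """Return the list of formula sources found in an HTML string."""
    collector = MathSourceCollector()
    collector.feed(html)
    return collector.formulas


def sample_html(tex):
    """Hand-made stand-in for a fragment of Parsoid HTML."""
    data_mw = json.dumps({"name": "math", "attrs": {}, "body": {"extsrc": tex}})
    return f"<p><span typeof='mw:Extension/math' data-mw='{data_mw}'></span></p>"


old_formulas = math_sources(sample_html("E=mc^2"))
new_formulas = math_sources(sample_html("E=mc^{2"))
print(old_formulas != new_formulas)
```

Comparing the two lists answers "did a formula change?" without any
wikitext parsing, at the cost of fetching the HTML of both revisions.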

If you do want to dig into this further, check out:
https://doc.wikimedia.org/VisualEditor/master/js/source/ve.init.mw.DiffLoader.html

~David

Re: Stream of recent changes diffs [ In reply to ]
On that topic, I'll share some of my experience.

First, parsing wikitext is way more difficult than you probably imagine. People are often tempted to do a poor-man's job of it with regular expressions and the like. Down that path lies madness. Don't go there.

There are only two rational ways I know of to parse wikitext.

Parsoid is one. It's complicated to get your head around, but it is the one true officially supported way.

The other is mwparserfromhell <https://github.com/earwig/mwparserfromhell>. It has the advantage of being much simpler to use. It has the disadvantage of not getting every possible edge case correct. It also is only usable in Python, which is fine if you're using Python and a problem otherwise.

In either case, once you've got parsed versions of two revisions, you'll then be faced with the problem of diffing them. That's going to be non-trivial.
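
For the narrow case of formulas, one possible shortcut is to reduce each
parsed revision to an ordered list of formula sources and diff those lists.
The sketch below uses Python's standard `difflib` for that; the function
name and the sample formulas are made up, and the lists are assumed to come
from whichever parser was used.

```python
from difflib import SequenceMatcher


def formula_changes(old_formulas, new_formulas):
    """Return (operation, old_items, new_items) for each non-equal block."""
    matcher = SequenceMatcher(a=old_formulas, b=new_formulas, autojunk=False)
    return [
        (op, old_formulas[i1:i2], new_formulas[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]


old = ["a^2+b^2=c^2", "E=mc^2"]
new = ["a^2+b^2=c^2", "E=mc^{2", "x=1"]
changes = formula_changes(old, new)
print(changes)
```

This sidesteps a general tree diff, but it only says which formulas were
added, removed or replaced, not where in the page they are.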


> On Jul 8, 2021, at 7:01 PM, David Lynch <dlynch@wikimedia.org> wrote:
>
> The best I can say about this for your purposes is that using the parsoid HTML would relieve you of having to parse wikitext to work out whether the contents of a math tag were what changed.
Re: Stream of recent changes diffs [ In reply to ]
On Thu, Jul 1, 2021 at 3:10 PM Andrew Otto <otto@wikimedia.org> wrote:

> This isn't helpful now, but your use case is relevant to something I hope
> to pursue in the future: comprehensive mediawiki change events, including
> content. I don't have a great place yet for collecting these use cases, so
> I added it to Modern Event Platform parent ticket
> <https://phabricator.wikimedia.org/T185233> so I don't forget. :)
>
>
I don't think this is the use-case at all. As someone else already pointed
out, diffs don't always give you the context and might be unparsable
wikitext. So what you can do is either:
1) Always send the full content of the changed page in the stream, along
with the diff. This is IMHO extremely wasteful, but it's also easy to
implement.
2) Find a way to analyze the edits and emit specialized event tags that
define what has changed. This is the correct way to go forward, IMHO, but
it requires much more engineering time.

I don't think there is really a big value in adding the full content of the
page to every edit event. I'd rather suggest that people fetch the parsoid
HTML from the API, and ensure we do good edge-side caching.


Cheers,

Giuseppe
P.S. Please note that I'm only referring to streams offered to tools and in
general to the public internet. Internally to the production cluster the
use of content in events might (or might not) prove directly useful in some
cases.


--
Giuseppe Lavagetto
Principal Site Reliability Engineer, Wikimedia Foundation
Re: Stream of recent changes diffs [ In reply to ]
Parsoid has a linter extension
<https://www.mediawiki.org/wiki/Help:Extension:Linter> which is well
suited for something like this and was effectively developed with
something like this in mind. It is currently enabled on *all* parses,
but in the future, depending on how expensive lints become, we may find
alternate ways of running this.

As it turns out, Parsoid's extension API exposes a lintHandler entry
point
<https://www.mediawiki.org/wiki/Parsoid/Extension_API#ExtensionTagHandler_abstract_class>
for extensions to run their lints. So, of course, you can only support
this in the context of Parsoid's parses. The other caveat is that we
haven't fully thought through all the details, but if you are
interested, this would be a good use case for exploring it. For an
example, see the implementation for Cite's <ref> tag:
<https://github.com/wikimedia/parsoid/blob/20a4384b1f81366ccd72b1153c8a342f46a0318f/src/Ext/Cite/Ref.php#L59-L81>
That implementation effectively calls Parsoid's default handlers on the
wikitext encountered in the <ref> tag, but extensions that deal with
their own wikitext might do other processing here without calling the
default handler.

This is my recommended approach, rather than messing with change streams,
diffs, etc.

Subbu.
