Mailing List Archive

Re: Frequency of Seeing Bad Versions - now with traffic data
Anthony wrote:
> On Thu, Aug 27, 2009 at 2:58 PM, Anthony <wikimail@inbox.org> wrote:
>
>> On Thu, Aug 27, 2009 at 2:50 PM, Thomas Dalton <thomas.dalton@gmail.com> wrote:
>>
>>> I would put money on a significant majority of reverts being
>>> reverts of vandalism rather than BRD reverts; it may not be an
>>> overwhelming majority, though.
>>
>> I don't know about that, though I won't take the other end of the bet.
>> Have you done much editing while not logged in? If so, I think you have to
>> admit that it's quite common to find yourself reverted for things which are
>> not properly classified as vandalism.
>>
>
> Just going through recent changes looking for "rv" (which is not the only
> thing detected by Robert's software, and is probably the most likely to be
> actual vandalism)...
>

Most vandalism reversion on enwiki (I believe) is done with automated
tools and/or rollback rather than manual reversion.

They typically leave more detailed summaries:
"Reverted N edits by X identified as vandalism to last revision by Y"
"Reverted edits by X (talk) to last version by Y"
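
As a rough illustration of how those summaries could be told apart from a
bare "rv", here is a minimal Python sketch. The two regular expressions are
modelled only on the example summaries quoted above; real tools may word
their summaries differently, and the function name and category labels are
hypothetical.

```python
import re

# Patterns modelled on the two example summaries above (an assumption;
# actual tool output may vary in wording).
TOOL_PATTERNS = [
    re.compile(r"^Reverted \d+ edits? by .+ identified as vandalism "
               r"to last revision by .+"),
    re.compile(r"^Reverted edits by .+ \(talk\) to last version by .+"),
]

# A hand-typed "rv" or "rvv" at the start of the summary.
MANUAL_RV = re.compile(r"^\s*rvv?\b", re.IGNORECASE)


def classify_summary(summary):
    """Return 'tool_revert', 'manual_rv', or 'other' for an edit summary."""
    if any(p.search(summary) for p in TOOL_PATTERNS):
        return "tool_revert"
    if MANUAL_RV.search(summary):
        return "manual_rv"
    return "other"
```

Counting the two revert categories separately over a sample of recent
changes would give a first idea of how much each contributes.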

--
Alex (wikipedia:en:User:Mr.Z-man)

_______________________________________________
foundation-l mailing list
foundation-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l
Re: Frequency of Seeing Bad Versions - now with traffic data
On 8/27/09 9:39 PM, Thomas Dalton wrote:
> 2009/8/28 Gregory Maxwell <gmaxwell@gmail.com>:
>> If the results of this kind of study have good agreement with
>> mechanical proxy metrics (such as machine-detected vandalism) our
>> confidence in those proxies will increase; if they disagree, it will
>> provide an opportunity to improve the proxies.
>
> This kind of intensive study on a few small samples, with a more
> automated method run on the same samples for comparison, would be more
> achievable. If the automated method gets similar results, we can use
> that method for larger samples.

I would certainly be interested in seeing such a result.

Generally speaking we can expect a strong correlation between vandalism
and machine-identifiable reverts -- it's a totally reasonable assumption
for a first-order approximation -- and it would be valuable to confirm
this and see how much divergence there might be between this count and
other markers.
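
One way such a comparison might be scored is sketched below in Python,
assuming each revision in the sample has both a hand-assigned label and a
machine-detected one. The function name and the dict-based input format are
hypothetical, not taken from Robert's software.

```python
def compare_proxy(manual, proxy):
    """Compare hand labels with an automated proxy.

    manual, proxy: dicts mapping revision id -> bool (True = vandalism).
    Only revisions present in both dicts are compared.
    """
    ids = manual.keys() & proxy.keys()
    tp = sum(1 for r in ids if manual[r] and proxy[r])      # both say vandalism
    fp = sum(1 for r in ids if not manual[r] and proxy[r])  # proxy only
    fn = sum(1 for r in ids if manual[r] and not proxy[r])  # hand label only
    tn = len(ids) - tp - fp - fn                            # both say clean
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        # Share of machine-flagged revisions that really were vandalism.
        "precision": tp / (tp + fp) if tp + fp else None,
        # Share of real vandalism that the proxy caught.
        "recall": tp / (tp + fn) if tp + fn else None,
        # Share of revisions where both methods agree.
        "agreement": (tp + tn) / len(ids) if ids else None,
    }
```

Low precision would suggest many reverts are not vandalism reverts; low
recall would suggest vandalism that is removed without a detectable revert.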

Most interesting following this would be to take into account the effects
of flagged revisions and how this could affect initially-displayed vs
edited revisions. Has there been similar work targeting German-language
Wikipedia already?

Robert, is it possible to share the source for generating the
revert-based stats with other folks who may be interested in pursuing
further work on the subject? Thanks!

-- brion vibber (brion @ wikimedia.org)
CTO & Senior Software Architect, Wikimedia Foundation

_______________________________________________
Re: Frequency of Seeing Bad Versions - now with traffic data
On Fri, Aug 28, 2009 at 12:43 AM, Brion Vibber <brion@wikimedia.org> wrote:

> On 8/27/09 9:39 PM, Thomas Dalton wrote:
> > 2009/8/28 Gregory Maxwell <gmaxwell@gmail.com>:
> >> If the results of this kind of study have good agreement with
> >> mechanical proxy metrics (such as machine-detected vandalism) our
> >> confidence in those proxies will increase; if they disagree, it will
> >> provide an opportunity to improve the proxies.
> >
> > This kind of intensive study on a few small samples, with a more
> > automated method run on the same samples for comparison, would be more
> > achievable. If the automated method gets similar results, we can use
> > that method for larger samples.
>
> I would certainly be interested in seeing such a result.


Can you get us 5000 random article views from the http log made during the
first half of 2009? All we need is URL/date/time. Everything else can be
blanked for anonymizing. It can be from a 1/10th log or whatever. The list
should consist solely of *views*, not edits, and only of articles.
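
For illustration, a sample like this could be drawn in one pass with
reservoir sampling, sketched below in Python. The log line format
("timestamp url"), the URL shapes, and the namespace prefix list are all
assumptions made for the example, not a description of the real http logs.

```python
import random
from urllib.parse import urlsplit, unquote

# Assumed (incomplete) list of non-article namespace prefixes.
NON_ARTICLE_PREFIXES = (
    "Talk:", "User:", "User_talk:", "Wikipedia:", "Wikipedia_talk:",
    "File:", "Template:", "Category:", "Special:", "Help:", "Portal:",
    "MediaWiki:",
)


def is_article_view(url):
    """True for plain /wiki/Title views of articles (assumed URL shape)."""
    parts = urlsplit(url)
    # Edits and other actions go through index.php or carry a query string.
    if not parts.path.startswith("/wiki/") or parts.query:
        return False
    title = unquote(parts.path[len("/wiki/"):])
    return bool(title) and not title.startswith(NON_ARTICLE_PREFIXES)


def sample_views(lines, k, seed=None):
    """Reservoir-sample up to k (timestamp, url) pairs from log lines."""
    rng = random.Random(seed)
    sample = []
    seen = 0  # number of qualifying article views so far
    for line in lines:
        fields = line.split()
        if len(fields) != 2 or not is_article_view(fields[1]):
            continue
        seen += 1
        if len(sample) < k:
            sample.append((fields[0], fields[1]))
        else:
            # Keep the new view with probability k / seen.
            j = rng.randrange(seen)
            if j < k:
                sample[j] = (fields[0], fields[1])
    return sample
```

Because every qualifying view ends up in the sample with equal probability,
heavily viewed pages are represented in proportion to their traffic, and the
log never has to be held in memory.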

All the rest of the data is out there, unless we happen to hit on a
deleted/oversighted revision. But using http://dammit.lt/wikistats/ to
estimate the hits is less accurate. Many popular pages get popular
suddenly, and then quickly fade away. There is most likely a strong
correlation between the amount of vandalism that takes place while they
are popular and the amount of vandalism that takes place while they are
not popular, so I'd much prefer a sample from the actual http log.
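
As a sketch of how the rest could be filled in from public data: given a
view's title and timestamp plus the page's revision history, the revision
that was live at that moment can be found with a binary search. The Python
below assumes timestamps are UTC strings in one fixed ISO 8601 format (so
string order is chronological), ignores flagged revisions, and uses
hypothetical function names and input shapes.

```python
from bisect import bisect_right


def revision_at(history, view_time):
    """Return the rev id live at view_time, or None if none existed yet.

    history: list of (timestamp, rev_id) sorted by timestamp ascending.
    """
    times = [t for t, _ in history]
    i = bisect_right(times, view_time)  # revisions saved at or before the view
    return history[i - 1][1] if i else None


def fraction_bad(views, histories, bad_revs):
    """Fraction of resolvable views that landed on a revision in bad_revs.

    views: iterable of (title, view_time).
    histories: dict mapping title -> history list as in revision_at().
    bad_revs: set of rev ids judged to be vandalized versions.
    """
    resolved = bad = 0
    for title, view_time in views:
        rev = revision_at(histories.get(title, []), view_time)
        if rev is None:
            continue  # e.g. deleted page or history not available
        resolved += 1
        if rev in bad_revs:
            bad += 1
    return bad / resolved if resolved else None
```

Which revisions go into `bad_revs` is exactly the part that depends on the
agreed definition of vandalism.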

If we can't get the real thing, I'll start downloading from
http://dammit.lt/wikistats/ and generate an estimated one, though.

Once we have the list, anyone is free to examine it any way they want, and
show their results. But we're talking about probably less than 200
instances of vandalism here, so it'll be quite easy (and fun) to lambaste
anyone whose methods produce false positives.

If you're going to do it, maybe we should work on a rough-consensus
objective definition of "vandalism" before you release the file, though...
_______________________________________________
Re: Frequency of Seeing Bad Versions - now with traffic data
On Thu, Aug 27, 2009 at 9:43 PM, Brion Vibber <brion@wikimedia.org> wrote:
<snip>

> Robert, is it possible to share the source for generating the
> revert-based stats with other folks who may be interested in pursuing
> further work on the subject? Thanks!

Not as a complete stand-alone entity. The analysis framework I
threw together for this has closed-source dependencies. I may help
with partial code or pseudocode though.

-Robert

_______________________________________________
Re: Frequency of Seeing Bad Versions - now with traffic data
On 8/28/09 2:49 PM, Robert Rohde wrote:
> On Thu, Aug 27, 2009 at 9:43 PM, Brion Vibber <brion@wikimedia.org> wrote:
> <snip>
>
>> Robert, is it possible to share the source for generating the
>> revert-based stats with other folks who may be interested in pursuing
>> further work on the subject? Thanks!
>
> Not as a complete stand-alone entity. The analysis framework I
> threw together for this has closed-source dependencies. I may help
> with partial code or pseudocode though.

Spiffy. :)

-- brion

_______________________________________________