Mailing List Archive

Research on Wikimedia Production Errors
Hi all,

is there any research on common causes of Wikimedia production errors?

Based on recent examples, I plan to analyze and discuss how production
errors could be avoided. I am considering submitting a short paper on
that to the Wikidata workshop, whose deadline is Thursday, 20 July 2023.
Website: https://wikidataworkshop.github.io/2023/
However, there might be more suitable venues.

I am also open to collaboration on this effort. If you are interested
in a joint paper, drop me an email by the end of this week.

All the best
Moritz
_______________________________________________
Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org
To unsubscribe send an email to wikitech-l-leave@lists.wikimedia.org
https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/
Re: Research on Wikimedia Production Errors
I'm in no way an expert in this area, but from what I have seen over
the past few years I think I can identify two recurring patterns:

1. Minor programming mistakes in unrelated code. This often happens
when we add stricter types to existing code, or make it throw
exceptions when it is called in a way it should never have been
called, e.g. when a method that expects a string is called with null.
Tests can rarely catch such "unthinkable" edge cases beforehand. They
bubble up in production, where codebases work together in ways that
have never been part of any automated or manual test setup. Luckily,
this kind of error is often easy to fix or safe to ignore.
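To make the pattern concrete, here is a minimal Python sketch (all names
hypothetical, not MediaWiki code) of how tightening a signature surfaces a
latent caller elsewhere in the codebase:

```python
def normalize_title(title: str) -> str:
    """Strict version: rejects the 'unthinkable' input outright.

    Before the type was tightened, a None argument was silently coerced
    and the mistake went unnoticed; now the bad call site fails loudly.
    """
    if not isinstance(title, str):
        raise TypeError(f"title must be a string, got {type(title).__name__}")
    return title.strip().replace(" ", "_")


def legacy_caller(row: dict) -> str:
    # An old call site elsewhere in the codebase: the "title" key can be
    # missing for ancient rows, so .get() quietly yields None.
    return normalize_title(row.get("title"))
```

A unit test for `normalize_title` alone would never think to pass it None;
the TypeError only shows up in production when `legacy_caller` meets an old
row, which is exactly why such errors are noisy but usually easy to fix.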

2. Database hiccups. Errors that appear to be "random" and are really
hard, if not impossible, to reproduce. Sometimes it turns out the
cause is a really, really old database row that was created with very
different constraints in mind. More recent code might have a different
idea of how a particular database table works nowadays and fail when
faced with incompatible data. Or we find that the database schema on
certain replica machines is not what it should be, for example foreign
keys to tables that should have ceased to exist 18 years ago, but
somehow still do. ;-) https://phabricator.wikimedia.org/T299387

Let's say I'm interested, but have no research at hand. :-)

Best
Thiemo
Re: Research on Wikimedia Production Errors
We (Release Engineering) file production-error tasks as part of the weekly
train and collect some data in the "train-stats" repo on GitLab[0].
Additionally, Timo Tijhof's "production excellence" blog posts and emails
to this list may be of interest to you[1].

The "train-stats" repo collects data for "software defect prediction" based
on the use of "FixCaches" or "BugCaches."[2] Each week, we record changes
that fix bugs (i.e., the change uses the git trailer `Bug: TXXX` and gets
backported to a currently deployed branch). The theory (per the paper
linked above) is that the more often a file needs a fix, the more likely it
is to cause future bugs. I have an extremely convoluted query to show the
list of commonly backported files[3].
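For illustration only (this is not the actual train-stats query or its
schema), the core FixCache signal can be sketched as a per-file count of
bug-fix commits, e.g. over pairs parsed from `git log --name-only` output:

```python
from collections import Counter

def fix_counts(commits):
    """Count bug-fix commits per file.

    `commits` is an iterable of (trailers, files) pairs; a commit counts
    as a fix when it carries a `Bug: TXXX` git trailer, the backport
    convention mentioned above.
    """
    counts = Counter()
    for trailers, files in commits:
        if any(t.startswith("Bug: T") for t in trailers):
            counts.update(files)
    return counts

# Hypothetical history: two fixes touch composer.json, one commit is not a fix.
history = [
    (["Bug: T100"], ["composer.json", "src/Parser.php"]),
    (["Bug: T101"], ["composer.json"]),
    ([], ["README.md"]),
]
```

Ranking files by this count is, per the theory, a prediction of where
future bugs are most likely to appear.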

Problems with this data:
- Many of these files are simply frequently touched rather than error-prone
(e.g., "composer.json")
- Looking at the count of backports for each file means newer files are
less likely to be represented
- "Lower level" files may be overrepresented (although that's probably to
be expected)
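One conceivable mitigation for the first problem, sketched here as an
assumption rather than anything train-stats actually does: normalize each
file's fix count by how often it is touched at all, so frequently-edited
files stop dominating the ranking.

```python
def fix_rate(fixes, touches, min_touches=5):
    """Relative fix rate per file: bug-fix commits / all commits.

    Files with very few commits overall are skipped, since a ratio over
    a handful of changes is mostly noise.
    """
    return {
        f: fixes.get(f, 0) / n
        for f, n in touches.items()
        if n >= min_touches
    }

# Hypothetical counts: composer.json is fixed often but touched far more often.
fixes = {"composer.json": 10, "src/Parser.php": 8}
touches = {"composer.json": 200, "src/Parser.php": 20, "new_file.php": 2}
```

Under these made-up numbers, src/Parser.php ranks as far more error-prone
(0.4) than composer.json (0.05), despite having fewer absolute fixes.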

In 2013, a case study used data like this inside Google and found it to be
fairly accurate at predicting future bugs[4].

Also, in the case study, whenever a developer edited a file that was
present in their FixCache, researchers added a bot-generated note to the
patch in their code review tool. Their developers found this note unhelpful:
developers already knew these files were problematic; the warning just
caused confusion.

Based on that, in March 2020, we created the "Risky Change Template"[5]. My
thinking was: if developers already know what's risky, then they can flag
it in the train task for the week[6]. At the time, I hoped this would
reduce the total version deployment time (although I have no data on that).

I hope some of this helps!

– Tyler

[0]: <https://gitlab.wikimedia.org/repos/releng/train-stats>
[1]: <https://phabricator.wikimedia.org/phame/post/view/296/production_excellence_46_july_august_2022/>
[2]: <https://people.csail.mit.edu/hunkim/images/3/37/Papers_kim_2007_bugcache.pdf>
[3]: <https://data.releng.team/train?sql=select%0D%0A++filename%2C%0D%0A++project%2C%0D%0A++count%28*%29+as+bug_count%0D%0Afrom%0D%0A++bug+b%0D%0A++join+bug_bug_patch+bbp+on+bbp.bug_id+%3D+b.id%0D%0A++join+bug_patch+bp+on+bp.id+%3D+bbp.bug_patch_id%0D%0A++join+bug_file+bf+on+bp.id+%3D+bf.bug_patch_id%0D%0Agroup+by%0D%0A++project%2C+filename%0D%0Aorder+by%0D%0A++bug_count+desc%3B>
[4]: <https://doi.org/10.1109/ICSE.2013.6606583>
[5]: <https://wikitech.wikimedia.org/wiki/Deployments/Risky_change_template>
[6]: <https://train-blockers.toolforge.org/> (here's an example this week: <https://phabricator.wikimedia.org/T337526#8901982>)


Re: Research on Wikimedia Production Errors
Thank you for your feedback. I think we have a very large and open
corpus of documented incidents. As mentioned, the Wikidata workshop is
not the best target venue. Following the reference

[4]: <https://doi.org/10.1109/ICSE.2013.6606583>

I think a paper on that would better fit
https://conf.researchr.org/track/icse-2024/icse-2024-software-engineering-in-practice

I will continue updating the paper on Overleaf:

https://www.overleaf.com/read/swswtbdyyhmg



All the best
Moritz
