Production Excellence #39: December 2021
How’d we do in our pursuit of operational excellence last month? Read on to find out!

Incidents

One documented incident last month (Incident graphs <https://codepen.io/Krinkle/full/wbYMZK>).

2021-12-03 mx <https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-12-03_mx>
Impact: A portion of outgoing email from wikimedia.org was delivered with a delay of up to 24 hours. This affected staff Gmail and Znuny/Phabricator notifications. No mail was lost; all of it was eventually delivered.

Incident follow-up

Remember to review and schedule Incident Follow-up work <https://phabricator.wikimedia.org/project/view/4758/> in Phabricator. These are preventive measures and tech debt mitigations written down after an incident. Read about past incidents at Incident status <https://wikitech.wikimedia.org/wiki/Incident_status> on Wikitech.
Recently resolved incident follow-up:

Create paging alert for high MX queues <https://phabricator.wikimedia.org/T297144>.
Filed in December after the mail delivery incident, resolved later that month by Keith (Herron).

Limit db execution time of expensive MW special pages <https://phabricator.wikimedia.org/T297708>.
Filed in December after various incidents due to high DB/appserver load, carried out by Amir (Ladsgroup).

Trends

In December we reported 22 new errors <https://phabricator.wikimedia.org/maniphest/query/DhZaBJ5PI1NA/#R>, of which 5 have since been resolved and 17 remain open and have carried over to January. Of the 298 issues previously carried over, we also resolved 17, so the workboard still adds up to 298 in total.
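As a quick sanity check, the bookkeeping behind those workboard figures can be sketched as follows. The variable names are hypothetical labels chosen for illustration; only the numbers come from the report.

```python
# Workboard bookkeeping for December 2021, using the figures quoted above.
# Variable names are illustrative assumptions, not Phabricator fields.
carried_over = 298   # open issues carried over from previous months
new_reported = 22    # new errors reported in December
new_resolved = 5     # of the new errors, resolved since
old_resolved = 17    # of the carried-over issues, resolved in December

# New errors that remain open and carry over to January.
new_still_open = new_reported - new_resolved

# Open total: old issues minus those resolved, plus new ones still open.
open_total = carried_over - old_resolved + new_still_open

print(new_still_open)  # 17
print(open_total)      # 298
```

The total stays flat because the 17 older issues that were resolved are exactly offset by the 17 new errors that remain open.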

In previous editions, we sometimes looked at the breakdown of tasks that remained unresolved. This time, I'd like to draw attention to the throughput and distribution of tasks that did get resolved.

Production errors resolved in the month of December, by team and component (query <https://phabricator.wikimedia.org/maniphest/query/vIEXYsei8lwE/#R>):

* Community-Tech (2): GlobalPreferences (1), CodeMirror (1).
* DBA: DjVuHandler (1).
* Editing-team: DiscussionTools (1).
* Fundraising Tech: CentralNotice (1).
* Growth-Team (8): GrowthExperiments (6), Image-Suggestions (1), StructuredDiscussions (1).
* Language-Team: UniversalLanguageSelector (1).
* Parsoid (1).
* Product-Infrastructure: TemplateStyles (1).
* Readers-Web (2).
* Structured-Data (2).
* Wikidata team: Wikidata-Page-Banner (1).
* Missing steward (1): MediaWiki-Logevents (T289806: Thanks Umherirrender!).

For the month-over-month numbers, refer to the spreadsheet data <https://docs.google.com/spreadsheets/d/e/2PACX-1vTrUCAI10hIroYDU-i5_8s7pony8M71ATXrFRiXXV7t5-tITZYrTRLGch-3iJbmeG41ZMcj1vGfzZ70/pubhtml>.

Outstanding errors

Oldest unresolved errors:

* (June 2020) WikibaseClient: RuntimeException in wblistentityusage API. T254334 <https://phabricator.wikimedia.org/T254334>
* (June 2020) WikibaseClient: Deadlock in EntityUsageTable::addUsages method. T255706 <https://phabricator.wikimedia.org/T255706>

Take a look at the workboard and look for tasks that could use your help.
https://phabricator.wikimedia.org/tag/wikimedia-production-error/


Thanks!

Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production.

Until next time,

– Timo Tijhof


Share or read later via https://phabricator.wikimedia.org/phame/post/view/265/