Production Excellence #34: July 2021
How’d we do in our pursuit of operational excellence last month? Read on to
find out!

3 documented incidents last month. That's at the median for the past twelve
months, and slightly below the median of 4 over the past five years (Incident
stats graphs).

- 2021-07-14 eventgate latency spike
  - Impact: For ~ 10min MediaWiki API clients experienced request
- 2021-07-16 codfw-a2 network
  - Impact: For ~ 1 hour Restbase clients received errors, affecting
    mobile apps and ContentTranslation.
- 2021-07-26 ruwikinews DynamicPageList
  - Impact: For 30min, 15% of requests from contributors on all wikis
    failed. There were also brief moments during which no readers could load
    recently modified or uncached pages.

Learn about past incidents at Incident status on Wikitech. Remember
to review and schedule Incident Follow-up work in Phabricator,
which are preventive measures and other action items filed after an
incident.

Last month the workboard held 154 non-old unresolved error reports. Over
the past thirty days, the collective efforts of our volunteers and
engineering teams have closed 14 of those.

In the month of July we've also introduced or discovered thirty-one new
error reports (that's an average of one production regression every day!).
Of those new error reports, 15 were resolved and 16 remain unresolved.
The workboard now tallies up to 156 tasks.
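
The tally above can be double-checked with a few lines of arithmetic. This is only a sketch of the bookkeeping, not how the workboard itself is computed; the function and parameter names are made up for illustration, and the numbers are the ones quoted in this issue.

```python
def workboard_tally(previous_open, closed_old, new_reports, new_resolved):
    """Return (new reports still open, total open now).

    Hypothetical helper: the names are illustrative, not part of any
    Phabricator or Wikimedia tooling.
    """
    # New reports that were not resolved within the month.
    new_remaining = new_reports - new_resolved
    # Previously open reports, minus those closed, plus surviving new ones.
    current_open = previous_open - closed_old + new_remaining
    return new_remaining, current_open


# Figures from this issue: 154 open last month, 14 of those closed,
# 31 new reports in July, 15 of which were resolved.
new_remaining, current_open = workboard_tally(154, 14, 31, 15)
print(new_remaining, current_open)
```

With the July figures this gives 16 surviving new reports and 156 open tasks, matching the numbers reported above.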

Over on the backlog, we're continuing to ploddingly present progress on
production problems from phantoms of Christmases past.

Figure 1, Figure 2: Unresolved error reports stacked by month.

For more month-over-month numbers refer to the spreadsheet data.

Outstanding errors

Take a look at the workboard and
look for tasks that could use your help.

Below are various older issues that may have fallen by the wayside, taken
from somewhat-random stab-in-the-dark queries.

Oldest unresolved errors that are still reproducible (Phab query):

- Reported in 2015: Unable to view history of protected Flow board
(StructuredDiscussions, Growth team), T118502
- Reported in 2016: Error when deleting a heading next to a table
(VisualEditor, Editing team), T140871

Stalled error reports (Phab query):

- Stalled Mar 2021: Constraints check for Q142 France times out
(Wikidata, WMDE), T212282.

Oldest error with a patch for review (Phab query):

- Reported in 2016: Maps broken during 2nd live preview (Maps, Product
Infra), T151524.
- Reported in 2018: Corrupt connection for cross-wiki db query (Platform
team), T193565.

- Jan 2021 (3 of 50 issues left): *Unchanged. Have a look-see!*
- Feb 2021 (6 of 20 issues left): *Unchanged. Take a gander!*
- Mar 2021 (13 of 48 issues left): *Unchanged. Check it out!*
- Apr 2021 (18 of 42 issues left): -1
- May 2021 (22 of 54 issues left): -3
- June 2021 (11 of 26 issues left): -4
- July 2021 (16 of 31 issues left): +31; -15

154 issues open, as of Excellence #33 (June 2021).
-14 issues closed, of the previous 154 open issues.
+16 new issues that survived July 2021.
156 issues open, as of today.

Thank you to everyone who helped by reporting, investigating, or resolving
problems in Wikimedia production. Thanks!

Until next time,

– Timo Tijhof

