Mailing List Archive

Performance regression expected with Debian Buster upgrade
Hi,

We're currently in the process of upgrading the MediaWiki servers to
Debian Buster and expect a performance regression to come with it.

The cause appears to be better Spectre[1] mitigations in the Buster 4.19
kernel, which we can't disable. Most of the effect is seen in code that
ends up making syscalls, for example through PHP functions like
filemtime and file_get_contents.
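
To get a rough feel for per-call syscall cost on a given host, a small micro-benchmark can be run on both a Stretch and a Buster machine. This is only a sketch and not something used in the investigation: it is written in Python rather than PHP, the function names are made up for illustration, and os.stat() is only roughly analogous to PHP's filemtime().

```python
import os
import tempfile
import time


def time_stat_calls(path, iterations=10_000):
    """Return the average wall-clock seconds per os.stat() call on path."""
    start = time.perf_counter()
    for _ in range(iterations):
        os.stat(path)  # one stat-family syscall per call
    return (time.perf_counter() - start) / iterations


def time_file_reads(path, iterations=10_000):
    """Return the average seconds per open/read/close cycle on path."""
    start = time.perf_counter()
    for _ in range(iterations):
        with open(path, "rb") as handle:
            handle.read()
    return (time.perf_counter() - start) / iterations


if __name__ == "__main__":
    # Hypothetical 1 KiB scratch file; real code paths read many files.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"x" * 1024)
    try:
        print(f"stat: {time_stat_calls(tmp.name) * 1e6:.2f} us/call")
        print(f"read: {time_file_reads(tmp.name) * 1e6:.2f} us/call")
    finally:
        os.unlink(tmp.name)
```

Numbers from a loop like this only indicate relative cost between hosts; they say nothing about how often real requests hit these paths.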

I posted some numbers and charts on the Phabricator investigation
ticket[2]. For normal requests it looks like ~5% worse for p50/p75 and
~13% worse for p95/p99. API requests look much worse, at 10% for p50
and 22% for p75.
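
For anyone who wants to compute this kind of comparison from their own latency samples, here is a minimal sketch. It assumes two plain lists of request durations (one per distro) and uses the nearest-rank percentile definition; the numbers on the ticket may well have been produced differently, and the function names are hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def regression_by_percentile(baseline, candidate, pcts=(50, 75, 95, 99)):
    """Map each percentile to the relative slowdown of candidate, in percent."""
    result = {}
    for pct in pcts:
        before = percentile(baseline, pct)
        after = percentile(candidate, pct)
        result[pct] = (after - before) / before * 100
    return result
```

A positive value means the candidate (e.g. Buster) is slower than the baseline (e.g. Stretch) at that percentile.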

What now? We're going to continue with the upgrade as planned, but we
also need help to try and make some performance improvements to reduce
the impact of the regression.

The PHP profiling flamegraphs[3] are a great tool to use to identify
potentially slow spots. We now also have flamegraphs that only contain
Buster requests. I created a set of differential flamegraphs[4] that
compare Stretch vs Buster so you can see what specific areas slowed down.
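
The idea behind a differential comparison can be sketched in a few lines. This assumes the profiles are available in the common "folded stacks" text format (one `frame;frame;frame COUNT` line per stack); whether the data behind [4] was processed this way is an assumption, and the function names below are made up for illustration.

```python
from collections import Counter


def parse_folded(lines):
    """Parse 'frame_a;frame_b;frame_c COUNT' lines into a Counter of stacks."""
    stacks = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        stack, _, count = line.rpartition(" ")
        stacks[stack] += int(count)
    return stacks


def share_by_frame(stacks):
    """Return each frame's share of all samples (inclusive of its callees)."""
    total = sum(stacks.values())
    frames = Counter()
    for stack, count in stacks.items():
        for frame in set(stack.split(";")):  # count recursion once per stack
            frames[frame] += count
    return {frame: count / total for frame, count in frames.items()}


def biggest_shifts(before, after, top=5):
    """List the frames whose sample share grew the most from before to after."""
    old, new = share_by_frame(before), share_by_frame(after)
    deltas = {f: new.get(f, 0.0) - old.get(f, 0.0) for f in set(old) | set(new)}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)[:top]
```

Comparing shares rather than raw sample counts keeps the result meaningful when the two profiles contain different numbers of samples.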

You can also use WikimediaDebug/XHGui[5] to profile a specific request.
mwdebug1001/mwdebug1002 are Stretch and mwdebug1003 is Buster.

If you have questions or suggestions please ask or let us know. Thanks
to everyone who helped with the investigation and those who've started
working on improvements already.

[1] https://en.wikipedia.org/wiki/Spectre_(security_vulnerability)
[2] https://phabricator.wikimedia.org/T273312#6802330
[3] https://performance.wikimedia.org/php-profiling/
[4] https://people.wikimedia.org/~legoktm/T273312/data/clean/images/flamegraphs/
[5] https://wikitech.wikimedia.org/wiki/WikimediaDebug#Request_profiling

-- Kunal

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: Performance regression expected with Debian Buster upgrade
Anyone who would like to follow the upgrade status, or wants to know
exactly which server is currently on which distro version, is welcome
to do so at:

https://docs.google.com/spreadsheets/d/1Ris18-joRFfd3OHjGJIraVUk-bpmIRORsPoms9D7BcM/edit?usp=sharing

It also tells you which servers have the special roles of scap proxy,
mcrouter proxy, canary, and which are VMs (just mwdebug).

There is currently one debug server on buster (mwdebug1003) but we are
going to provide the full set soon
(https://phabricator.wikimedia.org/T274023).

For canary servers we are aiming to have both for the transitional
period and the situation is currently as follows:

mw1261.eqiad.wmnet stretch
mw1262.eqiad.wmnet stretch
mw1263.eqiad.wmnet buster
mw1264.eqiad.wmnet buster
mw1265.eqiad.wmnet buster

Additionally one appserver (mw1403) and one API server (mw1402) on new
hardware have been designated to stay on stretch until the end to
allow for comparisons.

We appreciate reports of any issues that show up only on buster servers
of any type (app, API, jobrunner/videoscaler).

Re: Performance regression expected with Debian Buster upgrade
Hi,

On 2/3/21 5:35 PM, Kunal Mehta wrote:
> What now? We're going to continue with the upgrade as planned, but we
> also need help to try and make some performance improvements to reduce
> the impact of the regression.

A week later I'd like to highlight and recognize some of the performance
improvements that have been made:

* Upgrading utfnormal to use native mbstring functions instead of PHP
implementations <https://phabricator.wikimedia.org/T273338> (MaxSem,
James F, Reedy and myself)
* Optimizations to ApiResult
<https://gerrit.wikimedia.org/r/q/hashtag:%2522faster-apiresult%2522>
(Daimona, Thiemo, Krinkle and James F)
* Using PCRE for faster UTF-8 validation in Parsoid
<https://gerrit.wikimedia.org/r/656596> (Skizzerz and cscott)
* Reducing the size of the ExtensionRegistry cache in APCU
<https://gerrit.wikimedia.org/r/q/hashtag:%2522smaller-extension-cache%2522>
(Krinkle and myself)
* Reduce impact of HookContainer loading 500+ interfaces
<https://phabricator.wikimedia.org/T274041> (Skizzerz, myself, Tim
Starling and Ori)
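
Several of these changes share one idea: hand per-byte work to a native routine instead of looping in interpreted code. As a loose illustration only (Python, not the actual PHP patches; the pattern and function names are my own), here is UTF-8 validation done once with a compiled regular expression and once with the built-in decoder:

```python
import re

# Byte-level pattern for well-formed UTF-8: rejects overlong encodings,
# UTF-16 surrogates (U+D800..U+DFFF) and code points above U+10FFFF.
_UTF8_PATTERN = re.compile(
    rb"""(?:
        [\x00-\x7F]
      | [\xC2-\xDF][\x80-\xBF]
      | \xE0[\xA0-\xBF][\x80-\xBF]
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
      | \xED[\x80-\x9F][\x80-\xBF]
      | \xF0[\x90-\xBF][\x80-\xBF]{2}
      | [\xF1-\xF3][\x80-\xBF]{3}
      | \xF4[\x80-\x8F][\x80-\xBF]{2}
    )*""",
    re.VERBOSE,
)


def is_valid_utf8_regex(data: bytes) -> bool:
    """Validate UTF-8 by matching the whole input against the pattern."""
    return _UTF8_PATTERN.fullmatch(data) is not None


def is_valid_utf8_native(data: bytes) -> bool:
    """Validate UTF-8 by letting the built-in strict decoder do the work."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Both functions should agree on any input; which one is faster depends on the runtime, so it is worth measuring rather than assuming.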

If I missed any other improvements people have been working on, my
apologies, please share them! I've been using the Gerrit hashtag
"faster-mw-plz" <https://gerrit.wikimedia.org/r/q/hashtag:faster-mw-plz>
to try and track these.

-- Kunal

P.S. reimaging to Buster is 70% complete now.

Re: Performance regression expected with Debian Buster upgrade
These are amazing, thanks for sharing. /me bookmarks patches for bedtime
reading

On Fri, Feb 12, 2021 at 04:25 Kunal Mehta <legoktm@member.fsf.org> wrote:

> Hi,
>
> On 2/3/21 5:35 PM, Kunal Mehta wrote:
> > What now? We're going to continue with the upgrade as planned, but we
> > also need help to try and make some performance improvements to reduce
> > the impact of the regression.
>
> A week later I'd like to highlight and recognize some of the performance
> improvements that have been made:
>
> * Upgrading utfnormal to use native mbstring functions instead of PHP
> implementations <https://phabricator.wikimedia.org/T273338> (MaxSem,
> James F, Reedy and myself)
> * Optimizations to ApiResult
> <https://gerrit.wikimedia.org/r/q/hashtag:%2522faster-apiresult%2522>
> (Daimona, Thiemo, Krinkle and James F)
> * Using PCRE for faster UTF-8 validation in Parsoid
> <https://gerrit.wikimedia.org/r/656596> (Skizzerz and cscott)
> * Reducing the size of the ExtensionRegistry cache in APCU
> <https://gerrit.wikimedia.org/r/q/hashtag:%2522smaller-extension-cache%2522>
> (Krinkle and myself)
> * Reduce impact of HookContainer loading 500+ interfaces
> <https://phabricator.wikimedia.org/T274041> (Skizzerz, myself, Tim
> Starling and Ori)
>
> If I missed any other improvements people have been working on, my
> apologies, please share them! I've been using the Gerrit hashtag
> "faster-mw-plz" <https://gerrit.wikimedia.org/r/q/hashtag:faster-mw-plz>
> to try and track these.
>
> -- Kunal
>
> P.S. reimaging to Buster is 70% complete now.
Re: Performance regression expected with Debian Buster upgrade
Hi all, one final follow-up,

99% of appservers have been on buster for a while now, but we had
still kept one special case in each role on stretch so that people
could make stretch vs. buster comparisons. Some people had asked for
that.

They are: mw1307 jobrunner/videoscaler, mw1402 API server,
mw1403 appserver.

I'm now planning to finally upgrade them to buster as well tomorrow,
to make that 99% a 100%.

Please stop me if you still see a reason for having any stretch appserver.

Re: Performance regression expected with Debian Buster upgrade
Additionally, I would also delete mwdebug1003, the ganeti VM on
stretch that was also there just for the special stretch/buster
comparison use case. Would anyone miss it?

mwdebug1001/1002 have been on buster all this time and won't be changing.

On Thu, Apr 15, 2021 at 2:58 PM Daniel Zahn <dzahn@wikimedia.org> wrote:
>
> Hi all, one final follow-up,
>
> It's been a while since 99% of appservers are on buster but we had
> still kept 1 special case in each role on stretch,
> so that people could make stretch vs. buster comparisons. Some people
> had asked for that.
>
> They are: mw1307 jobrunner/videoscaler, mw1402 API server,
> mw1403 appserver.
>
> Now planning to finally upgrade them to buster as well tomorrow to
> make that 99% a 100%.
>
> Please stop me if you still see a reason for having any stretch appserver.



--
Daniel Zahn <dzahn@wikimedia.org>
Operations Engineer
