Mailing List Archive

Update on DC switchover
Hi,

Today we switched over most services and traffic caches from the eqiad
(Virginia) datacenter to codfw (Texas) as part of improving our
reliability. The goal is to have this procedure working and regularly
tested in case of an emergency when we actually need it.

We're only aware of one user-facing impact: for a short time, WDQS lag
detection was broken, affecting Wikidata bots that check it. This is
tracked as <https://phabricator.wikimedia.org/T285710>.

Users will experience a bit of a latency increase for now as most user
traffic will need to talk to both eqiad and codfw datacenters. This will
go away tomorrow once MediaWiki is switched over (keep reading).

Also, we were a bit delayed in starting today because of an issue
causing appservers to get stuck:
<https://phabricator.wikimedia.org/T285634>.

== Services ==
Started at 14:29 UTC, officially finished at 15:09.

The main issues we ran into were:
* the helm-charts service is unique and doesn't have a service IP,
causing the automatic switchover verification to break. This required us
to manually check the other services that come after it in the list, and
then re-run the cookbook while excluding it. Tracked as
<https://phabricator.wikimedia.org/T285707>.
* the restbase-async service has some special handling. We debated
whether to follow it and opted not to special-case it. Figuring out
what to do long-term is tracked as
<https://phabricator.wikimedia.org/T285711>.
* the WDQS issue mentioned earlier.
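
A rough sketch of the "exclude it and re-run" idea from the first item
above. This is not the actual cookbook code: switch_services, verify and
fake_verify are hypothetical names, and "service-a"/"service-b" are
placeholders. The point is only that a service which cannot be probed
(for example, one with no service IP) can be recorded and skipped instead
of stopping verification for everything listed after it.

```python
from typing import Callable, Dict, FrozenSet, Iterable


def switch_services(
    services: Iterable[str],
    verify: Callable[[str], bool],
    exclude: FrozenSet[str] = frozenset(),
) -> Dict[str, str]:
    """Return a per-service status: "ok", "failed" or "skipped"."""
    results: Dict[str, str] = {}
    for name in services:
        if name in exclude:
            # Excluded services are left for a manual check.
            results[name] = "skipped"
            continue
        try:
            results[name] = "ok" if verify(name) else "failed"
        except LookupError:
            # Nothing to probe (e.g. no service IP): record it and
            # carry on with the rest of the list.
            results[name] = "failed"
    return results


def fake_verify(name: str) -> bool:
    """Stand-in check (hypothetical): helm-charts has nothing to probe."""
    if name == "helm-charts":
        raise LookupError("no service IP")
    return True


# Re-run over the same list while excluding the problematic service.
rerun = switch_services(
    ["service-a", "helm-charts", "service-b"],
    fake_verify,
    exclude=frozenset({"helm-charts"}),
)
```

With the exclusion in place, the services listed after helm-charts are
still verified automatically rather than by hand.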

== Traffic ==
Started at 15:43, finished at 15:45.

It took until ~16:25 for eqiad to mostly depool. There's not much else
to report; it went very smoothly.
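
For reference, the durations implied by the timestamps in the two
sections above can be worked out with a few lines of Python. This is
just arithmetic on the times quoted in this mail, assuming they are all
UTC and on the same day; minutes_between is a hypothetical helper name.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day HH:MM times."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


services_minutes = minutes_between("14:29", "15:09")  # services switchover
traffic_minutes = minutes_between("15:43", "15:45")   # traffic switchover
depool_minutes = minutes_between("15:45", "16:25")    # until eqiad mostly depooled
```

That gives 40 minutes for services, 2 minutes for traffic, and roughly
another 40 minutes until eqiad was mostly depooled.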

== Tomorrow's MediaWiki switchover ==
Scheduled for 14:00 UTC <https://zonestamp.toolforge.org/1624888854>.

It is our goal to minimize the read-only time and make this a non-event
from a user perspective.

All of the coordination will take place in the #wikimedia-operations IRC
channel on Libera Chat. You're more than welcome to follow along, but if
you have questions, please ask them in #wikimedia-tech so it doesn't get
disruptive. The procedure that we'll be following is documented at
<https://wikitech.wikimedia.org/wiki/Switch_Datacenter#MediaWiki>.

I'm planning to do one more "live test" later today, and will announce
it on IRC when it gets started.

-- Kunal
_______________________________________________
Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org
To unsubscribe send an email to wikitech-l-leave@lists.wikimedia.org
https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/
Re: Update on DC switchover
Just wanted to emphasize that this is a great effort, and a huge step
towards improving the current reliability of our services.
We should do more of this, broader and more exhaustive.

Kudos!

On 06/28 12:33, Kunal Mehta wrote:
> [...]

--
David Caro
SRE - Cloud Services
Wikimedia Foundation <https://wikimediafoundation.org/>
PGP Signature: 7180 83A2 AC8B 314F B4CE 1171 4071 C7E1 D262 69C3

"Imagine a world in which every single human being can freely share in the
sum of all knowledge. That's our commitment."
Re: Update on DC switchover
Hi again,

Today we switched MediaWiki from our eqiad datacenter to codfw. In total
there was 1 minute 57 seconds of read-only time, which is basically what
we were aiming for.

We really only had one user-facing issue: tr.wikivoyage.org was
inaccessible for a few minutes because of a typo.
<https://phabricator.wikimedia.org/T260297> tracks making sure it
doesn't happen again.

Other than that, there's not much to report; it went pretty smoothly.
The rest of the bugs/issues filed as a result of today's switchover are
at <https://phabricator.wikimedia.org/T281515#7185775>; most are related
to improving the automation around the switch itself.

We've noticed that MediaWiki in codfw is slightly faster, most likely
because of newer hardware. Now that eqiad isn't serving traffic, we plan
on installing new hardware there too:
<https://phabricator.wikimedia.org/T279309>.

Thanks to everyone who participated in today's switchover and for all
the efforts and work ahead of time to make today so smooth.

We will be switching back to eqiad sometime in August; more details to
come as we get closer.

-- Kunal