Mailing List Archive

Flexential PDX (was Re: Dreamhost MySQL outage?)
On 11/5/23 3:24 AM, Chapman, Brad (NBCUniversal) wrote:
> Counter to best practices, Flexential did not inform Cloudflare that they had failed over to generator power.
>
> Off to a good start, then...

Isn't it proper to prefix text that's being quoted with a "> "? It's super
confusing to read your reply inline, trying to separate your content from
Cloudflare's, and then there's the top-posting too. Adding [EXTERNAL] to the
subject doesn't make sense either.


> It is also unusual that Flexential ran both the one remaining utility feed and the generators at the same time... we haven't gotten a clear answer why they ran utility power and generator power.
>
> Yeah, there's a reason the power company tells homeowners to not improvise by backfeeding their house from a generator using a "suicide cord" when the linemen are working outside. You're supposed to install a cutover switch, or at least turn off your house main circuit breaker.

It would appear that with DSG there is a bit more to it than a "suicide cord".
When you own your own substation, as a datacenter does, and have a common bus
between them, you have all the gear to match phase, detect faults, and transfer
at full load. It would appear that something in the other substation didn't
clear the fault in time and allowed it to come over on the secondary side and
fault the generators.

I'd probably say DSG is a good idea, but only when there's not a failure on
the utility side. And I'm no expert, so take my comments with a grain of salt.
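For illustration only (and with the same grain of salt): paralleling a generator onto a live bus is normally gated by a synchronism check, where voltage, frequency, and phase angle must all be inside a tight window before the breaker is allowed to close. The sketch below shows the idea; the thresholds are made-up ballpark values, not real relay settings from PGE or Flexential.

```python
# Illustrative synchronism check before closing a paralleling breaker.
# Thresholds are ballpark examples only, not actual relay settings.
MAX_VOLT_DIFF_PCT = 5.0    # % difference between bus and generator voltage
MAX_FREQ_DIFF_HZ = 0.2     # slip frequency limit
MAX_PHASE_DIFF_DEG = 10.0  # phase-angle window

def ok_to_parallel(bus_v, gen_v, bus_hz, gen_hz, phase_deg):
    """Return True only if all three quantities are inside the window."""
    volt_diff_pct = abs(bus_v - gen_v) / bus_v * 100.0
    freq_diff = abs(bus_hz - gen_hz)
    # Normalize the phase difference into [-180, 180] before checking.
    phase_diff = (phase_deg + 180.0) % 360.0 - 180.0
    return (volt_diff_pct <= MAX_VOLT_DIFF_PCT
            and freq_diff <= MAX_FREQ_DIFF_HZ
            and abs(phase_diff) <= MAX_PHASE_DIFF_DEG)

# Generator synced and nearly in phase with the utility bus: close permitted.
print(ok_to_parallel(12470, 12400, 60.00, 59.95, 3.0))   # True
# Generator 90 degrees out of phase: closing here is what wrecks gear.
print(ok_to_parallel(12470, 12400, 60.00, 59.95, 90.0))  # False
```

The point is just that the gear does this continuously at full load; the failure mode here was a fault arriving from the utility side faster than anything downstream could clear it.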


> One possible reason they may have left the utility line running is because Flexential was part of a program with PGE called DSG ... [which] allows the local utility to run a data center's generators to help supply additional power to the grid. In exchange, the power company helps maintain the generators and supplies fuel. We have been unable to locate any record of Flexential informing us about the DSG program. We've asked if DSG was active at the time and have not received an answer.
>
> You can't ask what you don't know, but it seems like power generation is one of those important things that should be told to your single largest customer who is leasing 10% of your entire facility

Here's a link to the DSG FAQ. Looks like it's a nice way to exercise your
gensets.

> https://assets.ctfassets.net/416ywc1laqmd/6xPDM0LVfZrHyuzUbQAMeD/04f107a741a0107a3a51bc821f62891e/dispatchable-standby-generation-faq.pdf

> At approximately 11:40 UTC, there was a ground fault on a PGE transformer at PDX-04... [and] ground faults with high voltage (12,470 volt) power lines are very bad.
>
> That's underselling it a bit.
>
> Fortunately ... PDX-04 also contains a bank of UPS batteries... [that] are supposedly sufficient to power the facility for approximately 10 minutes... In reality, the batteries started to fail after only 4 minutes ... and it took Flexential far longer than 10 minutes to get the generators restored.
>
> Correct me if I'm wrong, but aren't UPS batteries supposed to be exercised with deep-cycling on a regular basis? It sounds like they were extremely worn out when they were needed most.
>
> While we haven't gotten official confirmation, we have been told by employees that [the generators] needed to be physically accessed and manually restarted because of the way the ground fault had tripped circuits. Second, Flexential's access control system was not powered by the battery backups, so it was offline.
>
> That sounds objectively dumber than what happened at the Meta/Facebook datacenter outage a while ago, where the doors and badge readers were still online, but the badges couldn't be evaluated via the network due to the BGP crash, and the credentials weren't cached locally either.

It's almost impossible to buy off-the-shelf access control solutions that
don't suck. Windows 10 Home edition is a common server platform...

Even if the system failed open and you have to go reset breakers on the gensets
manually, it's a good run to get there, and then time to put on PPE (440 V at
thousands of amps is where arc flash becomes real) before you can even reset
anything. This assumes you were able to correctly diagnose the reason for the
fault. 4-5 minutes is not long enough here, and I'd argue even 15 is too
little. All this assumes the right people are on site, too.
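To put rough numbers on the UPS side of this: at constant load, runtime scales roughly with remaining battery capacity, so a string rated for 10 minutes that has faded to around 40% of nameplate only buys about 4 minutes, which lines up with what was reported. A back-of-envelope sketch (the 40% figure is my assumption, not anything Flexential published):

```python
def estimated_runtime_min(rated_runtime_min, capacity_fraction):
    """Crude estimate: assumes runtime scales linearly with remaining
    capacity at constant load (ignores Peukert and temperature effects)."""
    return rated_runtime_min * capacity_fraction

# Nameplate: ~10 minutes. Assume the strings had faded to 40% capacity.
runtime = estimated_runtime_min(10.0, 0.40)
print(f"{runtime:.0f} min of ride-through")  # 4 min

# Compare against a plausible manual-restart window (diagnosis, PPE, the
# run out to the gensets): even 15 minutes leaves the site dark for a while.
manual_restart_min = 15.0
print(f"dark for ~{manual_restart_min - runtime:.0f} min")  # ~11 min
```

Either way you slice it, the battery window and the realistic human-response window don't overlap.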


> And third, the overnight staffing at the site did not include an experienced operations or electrical expert — the overnight shift consisted of security and an unaccompanied technician who had only been on the job for a week.
>
> :picard-facepalm:

The overnight team isn't going to be the senior people, anywhere.
--
Bryan Fields

727-409-1194 - Voice
http://bryanfields.net
_______________________________________________
Outages mailing list
Outages@outages.org
https://puck.nether.net/mailman/listinfo/outages
Re: Flexential PDX (was Re: Dreamhost MySQL outage?)
----- Original Message -----
> From: "Bryan Fields via Outages" <outages@outages.org>

> On 11/5/23 3:24 AM, Chapman, Brad (NBCUniversal) wrote:
>> Counter to best practices, Flexential did not inform Cloudflare that they had
>> failed over to generator power.
>>
>> Off to a good start, then...
>
> Isn't it proper to prefix text that's being quoted with a "> "? It's super
> confusing to read your reply inline, trying to separate your content from
> Cloudflare's, and then there's the top-posting too. Adding [EXTERNAL] to the
> subject doesn't make sense either.

But *lots* of mail clients don't quote that way anymore -- including the one
Brad uses, clearly -- and he *did* mark the quotes; your editor just flattened
the indent/italics he used for quoting.

I don't like it either, but I was never able to do anything about the kidnapping
of the word 'hacker' either...

<admin>
When possible, though, we do recommend that members of the lists configure their
mail clients to treat the list as flat-ASCII or ISO-8859(-1), with no styling,
and use visible quote marking.
</admin>

>> It is also unusual that Flexential ran both the one remaining utility feed and
>> the generators at the same time... we haven't gotten a clear answer why they
>> ran utility power and generator power.
>>
>> Yeah, there's a reason the power company tells homeowners to not improvise by
>> backfeeding their house from a generator using a "suicide cord" when the
>> linemen are working outside. You're supposed to install a cutover switch, or
>> at least turn off your house main circuit breaker.
>
> It would appear that with DSG there is a bit more to it than a "suicide cord".
> When you own your own substation, as a datacenter does, and have a common bus
> between them, you have all the gear to match phase, detect faults, and transfer
> at full load. It would appear that something in the other substation didn't
> clear the fault in time and allowed it to come over on the secondary side and
> fault the generators.
>
> I'd probably say DSG is a good idea, but only when there's not a failure on
> the utility side. And I'm no expert, so take my comments with a grain of salt.

I believe this is more commonly called 'cogeneration', and indeed, the controls
the utility permits you to use for it are pretty smart: able to tell if the
utility side drops, and to cut it out when it does.

>> You can't ask what you don't know, but it seems like power generation is one of
>> those important things that should be told to your single largest customer who
>> is leasing 10% of your entire facility
>
> Here's a link to the DSG faq. Looks like it's a nice way to exercise your
> gensets.
>
>> https://assets.ctfassets.net/416ywc1laqmd/6xPDM0LVfZrHyuzUbQAMeD/04f107a741a0107a3a51bc821f62891e/dispatchable-standby-generation-faq.pdf

A smaller case of cogen, yeah.

On point, though: if I were 10% of the entire leasing base, I too would expect
a bit more information; my guy should be on the NOC's call list.

>> While we haven't gotten official confirmation, we have been told by employees
>> that [the generators] needed to be physically accessed and manually restarted
>> because of the way the ground fault had tripped circuits. Second, Flexential's
>> access control system was not powered by the battery backups, so it was
>> offline.
>>
>> That sounds objectively dumber than what happened at the Meta/Facebook
>> datacenter outage a while ago, where the doors and badge readers were still
>> online, but the badges couldn't be evaluated via the network due to the BGP
>> crash, and the credentials weren't cached locally either.
>
> It's almost impossible to buy off the shelf access control solutions that
> don't suck. Windows 10 home edition is a common server platform...

Yeesh...

> Even if the system failed open and you have to go reset breakers on the gensets
> manually, it's a good run to get there, and then time to put on PPE (440 V at
> thousands of amps is where arc flash becomes real) before you can even reset
> anything. This assumes you were able to correctly diagnose the reason for the
> fault. 4-5 minutes is not long enough here, and I'd argue even 15 is too
> little. All this assumes the right people are on site, too.

Sounds about right. I wouldn't plan for less than an hour, myself, from
either side of the poker table.

>> And third, the overnight staffing at the site did not include an experienced
>> operations or electrical expert — the overnight shift consisted of security and
>> an unaccompanied technician who had only been on the job for a week.
>>
>> :picard-facepalm:
>
> The overnight team isn't going to be the senior people, anywhere.

You bet. :-)

Cheers,
-- jra
--
Jay R. Ashworth                  Baylink                      jra@baylink.com
Designer                     The Things I Think                      RFC 2100
Ashworth & Associates       http://www.bcp38.info         2000 Land Rover DII
St Petersburg FL USA      BCP38: Ask For It By Name!          +1 727 647 1274