Data-centre power outage


On Sunday 10/05/09 at 4:25AM we had a 30-minute unscheduled power outage, affecting our 100% power uptime SLA.

The data-centre lost power from UPS2 yesterday morning at 04:25.  Three of the data-centre's technical team attended the site and restored power to the cabinets at 04:50.  A manual check of all affected cabinets was then performed to identify servers which had not automatically restarted following the power failure; some power bars in cabinets were also briefly overloaded by the start-up current of all the servers booting at the same time.

This incident is as unacceptable to us as it will be to you; please be assured that we're treating it very seriously.


A detailed technical explanation by data-centre technicians:

"We have an issue with customers who have multiple cabinets with us, who're (usually) meant to load 10A on each 16A feed, but instead put, for example, 12A in one cabinet and 8A in another, i.e. within their total power allocation, but slightly overloading certain phases and/or UPSs, whilst other phases/UPSs have free capacity.  We try to manage this as best we can but it can be difficult to get power distribution balanced equally when customers are responsible for their own racks.

One particular phase (of the three-phase power) on one of our UPS devices (UPS2) was hovering between 98% and 99% of its kVA capacity.  In itself this doesn't pose an immediate risk, but it did need to be reduced to give us headroom.  We were working with customers on the affected phase to reduce their power usage and/or move their feed to another UPS in a controlled manner.

Our UPS devices harmonise the load between phases as seen by the utility company, by means of an inline AC-DC-AC conversion.  This means that if one particular phase is running towards capacity, the load is balanced across the other two phases, so that the feed from the utility company has the load balanced/harmonised between the three phases.

What appears to have happened is that the UPS (UPS2) went into bypass (i.e. onto raw mains) because of a large demand spike coming from a cabinet on this phase, taking it above 100% capacity (which it should be equipped to handle for a short time, so why it actually went into bypass is not yet clear).

Because of this, the load was switched to raw mains, and the more heavily loaded phase then tripped the breaker that feeds UPS2, hence the power loss.

Upon arriving on-site, we switched the main breaker back on and went through the start-up procedure with the UPS, restoring power to most cabinets immediately.

We also took the opportunity, as the cabinets were down anyway, to move some cabinets to a less heavily loaded UPS.  This means that the risk has been mitigated and the affected phase on UPS2 is now running at ~90% capacity.  Again, it's not a case of us not having enough UPS capacity, but simply that the load is not always distributed as evenly as it should be.

You should consider the power to be stable, but we will be investigating with APC why the UPS switched to bypass when it is equipped to deal with a temporary overload situation.

Moving the cabinets this morning has solved the immediate problem, but this week we will perform a full audit of power usage across the data-centre to ensure that a similar situation cannot recur on other UPS devices."
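
To illustrate the load-balancing issue the technicians describe, here is a minimal sketch in Python using invented figures (not actual readings from the data-centre): two cabinets that stay within their combined 20A allocation can still push one 16A feed, and the UPS phase behind it, above its 10A share.

    # Invented example figures, not actual cabinet readings.
    cabinets = [
        {"name": "cab-A", "phase": "L1", "amps": 12.0},  # 2A over the 10A guideline
        {"name": "cab-B", "phase": "L2", "amps": 8.0},   # 2A under it
    ]

    GUIDELINE_AMPS = 10.0  # requested draw per 16A feed

    # Total the draw on each phase and flag any cabinet exceeding the guideline.
    per_phase = {}
    for cab in cabinets:
        per_phase[cab["phase"]] = per_phase.get(cab["phase"], 0.0) + cab["amps"]
        if cab["amps"] > GUIDELINE_AMPS:
            print(f'{cab["name"]} is {cab["amps"] - GUIDELINE_AMPS:.1f}A over its guideline')

    print("Per-phase totals (A):", per_phase)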
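
The phase-harmonising behaviour is worth a quick worked example too. The figures below are invented, but they show the principle: because a double-conversion (AC-DC-AC) UPS rectifies everything to DC and re-inverts it, the utility side sees roughly the total load divided evenly across the three input phases, however unevenly the output side is loaded.

    # Invented output-side loads (kVA) on the UPS's three phases.
    output_kva = {"L1": 20.0, "L2": 15.0, "L3": 10.0}

    total_kva = sum(output_kva.values())

    # With AC-DC-AC conversion the utility feed is drawn roughly evenly,
    # ignoring conversion losses for the sake of the illustration.
    utility_per_phase_kva = total_kva / 3

    print("Output split (kVA):", output_kva)
    print(f"Approximate utility draw per phase: {utility_per_phase_kva:.1f} kVA")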
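
Finally, the planned power audit can be thought of as a utilisation report per UPS phase. A rough sketch of that kind of check, again with hypothetical capacities and loads rather than real ones:

    # Hypothetical per-phase capacities and measured loads (kVA), for illustration only.
    ups_phases = {
        ("UPS1", "L1"): {"capacity": 32.0, "load": 21.0},
        ("UPS2", "L1"): {"capacity": 32.0, "load": 28.8},  # ~90% after the cabinet move
        ("UPS2", "L2"): {"capacity": 32.0, "load": 19.5},
    }

    HEADROOM_THRESHOLD = 0.90  # flag anything above 90% utilisation

    for (ups, phase), reading in sorted(ups_phases.items()):
        utilisation = reading["load"] / reading["capacity"]
        flag = "  <-- needs rebalancing" if utilisation > HEADROOM_THRESHOLD else ""
        print(f"{ups} {phase}: {utilisation:.0%} of capacity{flag}")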
