Incident in Maidenhead
MAJOR Closed Maidenhead Colocation
STATUS
Closed
AFFECTED
Maidenhead Colocation
STARTED
Mar 17, 10:00 AM (8 years ago)
CLOSED
Mar 18, 11:54 AM (8 years ago)
REFERENCE
885 / AA885
INFORMATION
  • INITIAL
    8 years ago

    We have lost comms with Maidenhead and we have an engineer going to site now. We are not sure what the issue is, but it may be power related.

     

    Email, VoIP and some other services will be affected.

     

    This is also affecting Ethernet customers and hosted servers in Maidenhead.

     

    There appears to have been a fire alarm activation and the data centre has been evacuated. There is no evidence of a fire, but power is down.

  • UPDATE
    8 years ago

    Staff are just approaching the data centre now.

  • UPDATE
    8 years ago

    Power is being restored now

  • UPDATE
    8 years ago

    Our engineers are on site and power has been restored. Our servers are coming back online; further updates will be posted when we get them.

  • UPDATE
    8 years ago

    Not all power has been restored yet. Some services (control pages, VoIP, web) are still down. They should be restored shortly.

  • UPDATE
    8 years ago

    VoIP and control pages are back. Email and web should be back soon.

  • UPDATE
    8 years ago

    The A VoIP server is still down.

  • UPDATE
    8 years ago

    Email servers are mostly back, and web services are back. We've still got some VoIP problems and are working on them.

  • UPDATE
    8 years ago

    The A VoIP server has a database problem and won't let customers register.

  • UPDATE
    8 years ago

    There is now a database problem on the C SIP server too. Investigating.

  • UPDATE
    8 years ago

    Database fixed on C.

  • UPDATE
    8 years ago

    Database problems fixed on A and C servers.

  • UPDATE
    8 years ago

    Most services are back up now; we have had a number of hardware failures as part of the power outage incident.

    Currently the main problem is our email ticketing server - this is affecting emails to support/sales/accounts etc - and so is causing a delay in email replies.

    There are also problems with:

    • The online ordering system
    • ADSL usage reporting
    • ADSL line status on Clueless

    Some servers still have problems which we are working through, but others are managing with the load (many services have multiple servers).

  • UPDATE
    8 years ago

    The odd effect with lines not showing as online properly on Clueless is fixed, and lines will clear properly overnight as a result. PPP restarts of lines are needed, but this is done automatically in stages to minimise disruption.

  • UPDATE
    8 years ago

    Online ordering was restored a little while ago.

  • UPDATE
    8 years ago

    I would just like to say that I am very pleased with how my staff have handled this today - tackling the issues in a sensible priority and updating status pages. This is a major issue with not just a power outage, but issues with access to the building, and possibly even a power surge, as several pieces of equipment have failed totally. The backup arrangements for critical systems have worked as expected, as has the maintenance of broadband internet access, DNS, and RADIUS authentication. Well done everyone. We'll try and get a more detailed explanation from the data centre in due course. Staff are working on the last of the issues now.

  • UPDATE
    8 years ago

    thankless (ticketing) is still down and is being rebuilt now.

  • UPDATE
    8 years ago

    We have now got our email ticketing system back online - we do apologise for the time this has taken, and the delay this has caused to email to support, sales and accounts.

  • UPDATE
    8 years ago

    We'll close this incident for now - but will add the official response from BlueSquare when they have let us know.

  • UPDATE
    8 years ago

    This is the official report from BlueSquare (our racks are in the building called BS2):

     

    This is a Reason for Outage Report with details regarding the power supply in BS2/3 with BlueSquare Data Services Ltd.

     

    At 10:06 on Thursday 17th March one of the six UPS modules located in BlueSquare 2/3 suffered a critical component failure which resulted in a dead short on the output side (critical load side) of the UPS. This failure also caused an amount of smoke to be released by the failed UPS system which resulted in the fire alarm activating and the fire service attending. Once the fire service was happy with the situation we were able to restore power to the site via the generators with the UPS system bypassed whilst we investigated the fault further.

     

    Because the short circuit occurred on the output side of the UPS, the other UPSs immediately went into an overload condition, which switched all modules into bypass mode, as per the design of the system. This overload then transferred to the raw mains and tripped the main incomer to the site. This caused the overload condition to cease and power was lost to the site. The UPS manufacturers then worked to check that the same component in all the remaining UPS modules was within specification, and to fully test each UPS system, replacing some components where necessary. No further faults were found on the remaining UPS modules; load was then switched back to full UPS protection at approx 02:15, and building load was transferred back from the generators to utility mains at approx 02:25.

     

    Due to the size of the failure we have commissioned an independent organisation to forensically examine the failed UPS module. This work is scheduled to be completed next week and we will provide further details once we receive their report. This was an extremely unusual type of failure and the manufacturers have not experienced such a problem before, despite over 3,000 similar UPS units being deployed. This suggests there isn't an inherent design problem in the units, but we will not reach any conclusions until the forensic examination is complete.

     

    The failed UPS module will be replaced within the next 4 weeks and until that time we will remain on ‘N’ redundancy level at BlueSquare 2 & 3. Further updates will be provided before this replacement work takes place.

     

    A number of customers have asked why this failure could occur when we operate an N+1 UPS architecture. The reason is that all six UPS modules in BlueSquare 2/3 are paralleled together as one large UPS system. BlueSquare 2/3 only requires 5 modules to hold the critical load of the site; the additional unit provides redundancy in the event of a UPS module failure. However, as this failure was on the common critical load side of the UPS (the same output that feeds the distribution boards, which in turn feed the racks), and all the UPS systems are paralleled together, it had the effect of taking all UPS modules down.

     

    As an example, in an N+N configuration, such as at our Tier IV Milton Keynes site, a failure of this nature would not be possible, as two independent banks of UPS systems operate, providing true A&B feeds to each rack.
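The topology difference BlueSquare describes can be sketched as a small model. This is purely an illustration of the failure-domain logic (shared output bus vs independent A/B feeds), not a representation of the actual switchgear or its behaviour:

```python
# Hypothetical sketch: why a common-output short drops a paralleled
# N+1 UPS bank, while an N+N design with independent feeds survives.

def parallel_n_plus_1(working_modules: int, short_on_output: bool) -> bool:
    """All modules share one output bus. A dead short on that bus
    overloads every module at once, so redundancy in module count
    does not help; return True if the load stays powered."""
    if short_on_output:
        return False                 # shared bus faulted -> all load lost
    return working_modules >= 5      # 5 of 6 modules can hold the site load

def n_plus_n(a_feed_ok: bool, b_feed_ok: bool) -> bool:
    """Two independent UPS banks provide separate A and B feeds to each
    rack; a fault confined to one bank leaves the other feed energised."""
    return a_feed_ok or b_feed_ok

# A short on the common output side (as in this incident) takes the
# whole paralleled bank down, even with all six modules healthy:
assert parallel_n_plus_1(working_modules=6, short_on_output=True) is False
# The same fault confined to one bank of an N+N design leaves power up:
assert n_plus_n(a_feed_ok=False, b_feed_ok=True) is True
```

The key point the model captures is that redundancy only protects against faults outside the shared failure domain: extra modules cover a module failure, but nothing covers a fault on the one bus they all feed.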

  • Closed