Incident in Maidenhead
MAJOR Closed Maidenhead Colocation
STATUS
Closed
AFFECTED
Maidenhead Colocation
STARTED
Mar 17, 10:00 AM (8 years ago)
CLOSED
Mar 18, 11:54 AM (8 years ago)
REFERENCE
885 / AA885
INFORMATION
  • INITIAL
    8 years ago

    We have lost comms with Maidenhead and we have an engineer going to site now. We are not sure what the issue is, but it may be power related.

     

    Email, VoIP and some other services will be affected.

     

    This is also affecting Ethernet customers and hosted servers in Maidenhead.

     

    There appears to have been a fire alarm activation and the data centre has been evacuated. There is no evidence of a fire, but power is down.

  • UPDATE
    8 years ago

    Staff are just approaching the data centre now.

  • UPDATE
    8 years ago

    Power is being restored now

  • UPDATE
    8 years ago

    Our engineers are on site and power has been restored. Our servers are coming back online; further updates will be posted when we get them.

  • UPDATE
    8 years ago

    Not all power has been restored yet. Some services (control pages, VoIP, web) are still down. They should be restored shortly.

  • UPDATE
    8 years ago

    VoIP and control pages are back. Email and web should be back soon.

  • UPDATE
    8 years ago

    The A VoIP server is still down.

  • UPDATE
    8 years ago

    Email servers are mostly back, and web services are back. We've still got some VoIP problems and are working on them.

  • UPDATE
    8 years ago

    The A VoIP server has a database problem and won't let customers register.

  • UPDATE
    8 years ago

    There is now a database problem on the C SIP server too. Investigating.

  • UPDATE
    8 years ago

    Database fixed on C.

  • UPDATE
    8 years ago

    Database problems fixed on A and C servers.

  • UPDATE
    8 years ago

    Most services are back up now; we have had a number of hardware failures as part of the power outage incident.

    Currently the main problem is our email ticketing server - this is affecting emails to support/sales/accounts etc - and so is causing a delay in email replies.

    There are also problems with:

    • The online ordering system
    • ADSL usage reporting
    • ADSL line status on Clueless

    Some servers still have problems which we are working through, but others are managing with the load (many services have multiple servers).

  • UPDATE
    8 years ago

    The odd effect with lines not showing as online properly on Clueless is fixed, and lines will clear properly overnight as a result. PPP restarts of lines are needed, but this is done automatically in stages to minimise disruption.

  • UPDATE
    8 years ago

    Online ordering was restored a little while ago.

  • UPDATE
    8 years ago

    I would just like to say that I am very pleased with how my staff have handled this today - tackling the issues in a sensible priority and updating status pages. This is a major issue with not just a power outage, but issues with access to the building, and possibly even a power surge, as several pieces of equipment have failed totally. The backup arrangements for critical systems have worked as expected, as has the maintenance of broadband internet access, DNS, and RADIUS authentication. Well done everyone. We'll try and get a more detailed explanation from the data centre in due course. Staff are working on the last of the issues now.

  • UPDATE
    8 years ago

    thankless (ticketing) is still down and is being rebuilt now.

  • UPDATE
    8 years ago

    We have now got our email ticketing system back online - we do apologise for the time this has taken, and the delay this has caused to email to support, sales and accounts.

  • UPDATE
    8 years ago

    We'll close this incident for now - but will add the official response from BlueSquare when they have let us know.

  • UPDATE
    8 years ago

    This is the official report from BlueSquare (our racks are in the building called BS2):

     

    This is a Reason for Outage Report with details regarding the power supply in BS2/3 with BlueSquare Data Services Ltd.

     

    At 10:06 on Thursday 17th March one of the six UPS modules located in BlueSquare 2/3 suffered a critical component failure which resulted in a dead short on the output side (critical load side) of the UPS. This failure also caused an amount of smoke to be released by the failed UPS system which resulted in the fire alarm activating and the fire service attending. Once the fire service was happy with the situation we were able to restore power to the site via the generators with the UPS system bypassed whilst we investigated the fault further.

     

    Because the short circuit occurred on the output side of the UPS, the other UPSs immediately went into an overload condition, which switched all modules into bypass mode, as per the design of the system. This overload then transferred to the raw mains and tripped the main incomer to the site. This caused the overload condition to cease and power was lost to the site. The UPS manufacturers then worked to check that the same component in all the remaining UPS modules was within specification, and to fully test each UPS system, replacing some components where necessary. No further faults were found on the remaining UPS modules; load was then switched back to full UPS protection at approx 02:15, and building load was transferred back from the generators to utility mains at approx 02:25.

     

    Due to the size of the failure we have commissioned an independent organisation to forensically examine the failed UPS module. This work is scheduled to be completed next week and we will provide further details once we receive their report. This was an extremely unusual type of failure and the manufacturers have not experienced such a problem before, despite over 3,000 similar UPS units being deployed. This suggests there isn't an inherent design problem in the units, but we will not reach any conclusions until the forensic examination is complete.

     

    The failed UPS module will be replaced within the next 4 weeks and until that time we will remain on ‘N’ redundancy level at BlueSquare 2 & 3. Further updates will be provided before this replacement work takes place.

     

    A number of customers have asked why this failure could occur when we operate an N+1 UPS architecture. The reason is that all six UPS modules in BlueSquare 2/3 are paralleled together as one large UPS system. BlueSquare 2/3 only requires 5 modules to hold the critical load of the site; the additional unit provides redundancy in the event of a UPS module failure. However, as this failure was on the common critical load side of the UPS (the same output that feeds the distribution boards, which in turn feed the racks), and all the UPS systems are paralleled together, it had the effect of taking all UPS modules down.

     

    As an example, in an N+N configuration, such as at our Tier IV Milton Keynes site, a failure of this nature would not be possible, as two independent banks of UPS systems operate, providing true A&B feeds to each rack.
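The topology difference BlueSquare describes can be sketched as a small model. This is purely an illustration of the failure-domain logic (shared output bus vs independent A/B feeds), not a representation of the actual switchgear or its behaviour:

```python
# Hypothetical sketch: why a common-output short drops a paralleled
# N+1 UPS bank, while an N+N design with independent feeds survives.

def parallel_n_plus_1(working_modules: int, short_on_output: bool) -> bool:
    """All modules share one output bus. A dead short on that bus
    overloads every module at once, so redundancy in module count
    does not help; return True if the load stays powered."""
    if short_on_output:
        return False                 # shared bus faulted -> all load lost
    return working_modules >= 5      # 5 of 6 modules can hold the site load

def n_plus_n(a_feed_ok: bool, b_feed_ok: bool) -> bool:
    """Two independent UPS banks provide separate A and B feeds to each
    rack; a fault confined to one bank leaves the other feed energised."""
    return a_feed_ok or b_feed_ok

# A short on the common output side (as in this incident) takes the
# whole paralleled bank down, even with all six modules healthy:
assert parallel_n_plus_1(working_modules=6, short_on_output=True) is False
# The same fault confined to one bank of an N+N design leaves power up:
assert n_plus_n(a_feed_ok=False, b_feed_ok=True) is True
```

The key point the model captures is that redundancy only protects against faults outside the shared failure domain: extra modules cover a module failure, but nothing covers a fault on the one bus they all feed.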

  • Closed