Our carrier reports: Following on from our earlier issues we are now confident that we have isolated the problem but the underlying cause is still being investigated. Our initial findings are, as you can expect, that the failover systems did not meet our expectations in this event and that itself will be taken on as a project by our team to ensure greater resilience going forward.
The work outlined below will start from Saturday 23rd November.
Background:
Our FireBrick team has been working on the 'hang' problem that we faced with the LNSs earlier in the year.
The problem is extremely difficult to reproduce, which has made investigating it very time consuming. However, we do believe that a plausible cause has been identified, and code changes have been made to mitigate the problem.
We have been testing this new code, both in our test lab and on a few select A&A routers, for over two months. During this time the new code has not caused the hardware to hang, where older versions of the code did.
Our next step is to run the new code on our LNSs, the ones our customers connect to for their broadband connections.
We plan to do this slowly, out of hours and in a couple of phases.
We believe the cause of the hang is related to how memory is initially allocated for the tasks the FireBrick will be performing. This means that if the hardware is going to hang, it will most likely do so within the first couple of days (or even the first couple of hours) after a restart.
Stage one:
We plan to upgrade only one of our LNSs at first. We will move broadband connections on to it in the early hours of the morning and then move them back off a few hours later. This means that during the day, customers will be on the normal set of LNSs.
Then, each night over the course of two weeks, the LNS will be power cycled and we will move an increasing number of connections over, until it is taking twice the number of connections that we'd normally run on an LNS. (We normally run LNSs at around 40% capacity, so twice the number of connections is not a problem.)
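As a rough check on the capacity figures above, the arithmetic can be sketched in a few lines of Python. The 40% figure and the "twice normal" target come from this post; the function names and the even nightly ramp over 14 nights are assumptions made purely for illustration, not a description of the actual schedule.

```python
NORMAL_LOAD = 0.40  # fraction of capacity an LNS normally runs at (figure from this post)


def load_fraction(multiple_of_normal: float, normal_load: float = NORMAL_LOAD) -> float:
    """Fraction of LNS capacity used when carrying a multiple of the normal connection count."""
    return multiple_of_normal * normal_load


def nightly_targets(nights: int, final_multiple: float = 2.0) -> list[float]:
    """Hypothetical even ramp: the multiple of a normal load to move over on each night."""
    return [final_multiple * (night + 1) / nights for night in range(nights)]


# Twice the normal number of connections on an LNS that normally runs at 40%.
peak = load_fraction(2.0)
print(f"Peak load at twice normal: {peak:.0%}")

# Assumed 14-night ramp; every night stays below full capacity.
for night, multiple in enumerate(nightly_targets(14), start=1):
    print(f"night {night:2d}: {multiple:.2f}x normal, {load_fraction(multiple):.0%} of capacity")
```

Even on the final night the LNS would be at roughly 80% of capacity, which is why doubling the connection count is not expected to be a problem.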
Stage two:
Once we have confirmed that the hang is not happening, the second stage will be to run customer connections on the upgraded LNS for a few days at a time.
We will go through a cycle of: move connections off, reboot the LNS, move connections on, wait a few days, and repeat. We will do this with an increasing number of connections until it is taking a normal number of connections.
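The stage two cycle can be written out as a simple plan generator. This is only a sketch: the number of cycles, the three-day wait and the even ramp towards a normal load are all assumed values, and the function name is hypothetical rather than part of any real tooling.

```python
def stage_two_plan(cycles: int, soak_days: int = 3) -> list[str]:
    """List the steps of a hypothetical stage two run, ramping up to a normal load."""
    steps = []
    for cycle in range(1, cycles + 1):
        share = cycle / cycles  # fraction of a normal LNS load; the even ramp is an assumption
        steps.append("move connections off")
        steps.append("reboot the LNS")
        steps.append(f"move connections on ({share:.0%} of a normal load)")
        steps.append(f"wait {soak_days} days")
    return steps


for step in stage_two_plan(cycles=4):
    print(step)
```

Each pass starts from a freshly rebooted LNS, so every multi-day run begins with the initial memory allocation that is believed to be related to the hang.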
More information and to opt out:
To minimise the impact on customers, the work of moving connections off and on will happen overnight between 1AM and 5AM.
As mentioned, this phase involves upgrading only one LNS, the one named 'i.gormless'. The connections that will be moved on to 'i.gormless' will be those currently on the LNS named 'h.gormless'. If you are currently on 'h.gormless' (as shown at the top left of your line quality graph) and want to opt out, then please email support.
Once this phase has been completed, we will review and plan the next stages.