Over the past few weeks we have been carrying out planned work to 'shuffle' customers between some of the 'Witless' LNS routers on our side. This work was carried out to apply both a software update and a configuration change. From a customer's point of view, this work caused a line drop overnight, which is usually a very short outage in the early hours of the morning and not usually much of an inconvenience. However, in addition to this planned work we have had a few crashes over the past week (10th at 8:30AM, 13th at 1:40PM and 14th at 2:50AM). These have caused a few minutes' interruption to some customers during the daytime, which is very inconvenient, and we do apologise for this.
The cause of these restarts is known to us. It relates to a low-level processor issue that requires a complicated workaround, and we are working on a longer-term software fix. However, one of the configuration changes we made has exacerbated this bug and made it more prevalent, so we plan to revert that configuration change as soon as possible. The longer-term software fix will then be applied at a later date once it is ready.
Reverting the configuration change does require us to reboot our routers, and so this will involve carrying out another round of moving customers between LNSs. This work will happen in the early hours of the morning, starting on Wednesday 15th November.
Customers on "Y.Witless" will be moved during the early hours of Wednesday 15th November. Y.Witless will then have its configuration change applied and be rebooted the following day.
Tuesday evening (21:03) Sadly, X.Witless crashed this evening, causing disruption for those customers connected to it. X.Witless had been scheduled to have the mentioned config change and reboot tomorrow night - that won't be needed now, as the restart has already applied the config change.
All our 'Witless' LNS routers have now had the configuration change applied, which should hopefully reduce the chance of further crashes. We'll keep this post open and will post further updates regarding the software upgrade which we hope will happen early next week.
Some more changes and testing of the software are still required before we update our routers.
A software upgrade is being applied this week: https://aastatus.net/42582
Newer software was applied to Z.Witless and some lines were moved across to it. However, there was still a crash. Work is ongoing to resolve this, and newer software is being tested at the moment.
Z.Witless crashed again at around 10:30, causing an outage for about 200 customers.
We do apologise for these recent disconnects, and we are fully aware of how frustrating and disruptive these problems have been for those customers affected. Our developers have been working constantly on the problem these past few weeks, and we are discussing what next steps to take.
Z.Witless has crashed again, dropping connections for around 160 customers. We're taking Z.Witless out of service, and customers will see their connections routed by X.Witless or Y.Witless.
We've been testing newer software this week in the test lab.
[As of 13th December] It's been 12 days since the last incident; we are running slightly older software which is more stable. Unless the situation changes we will not make any changes to our live LNSs until the new year. Meanwhile our efforts are focused on the root cause investigation, and we are performing continuous tests in the lab.
Good progress has been made in the investigation and fix for this problem, including weeks of testing in the FireBrick test lab. We will start to load new software on to our routers. A separate Planned Work post has been created for this work: https://aastatus.net/42593
Update as of January 11th. The good news is that the fix for the original hardware lockup has been applied to two of our three 'Witless' LNSs and we've not seen the same lockup in either our test lab or the upgraded Witless LNSs. However, Z.Witless has had a couple of additional crashes which have not been seen on our other units. As a result of this we will be replacing Z.Witless with new hardware as a matter of urgency.
An update of where we are (Friday 12th January).
Some customers have had interruption to their service this week as we have seen a number of crashes on both Z.Witless and Y.Witless.
Today we replaced the hardware of Z.Witless.
Our developers have been investigating each crash we have. We have been saying in recent updates that progress had been made on the crashes we have seen, and this week we applied the software update to two of our three 'Witless' LNSs. In our test lab we never saw this updated software crash during 3 weeks of testing. However, we have had crashes this week since applying the updated software.
Usually with a crash, our developers are sent a crashlog with details specifying exactly where in the code the crash happened. However, the crashes that have been affecting us are different in that the hardware locks up and restarts - with this type of crash we have less forensic information to work with, which is making getting to the bottom of the problem that much harder.
We are still working hard to resolve this. We have various avenues of investigation to pursue, and during the next week we will be planning more overnight work as well as datacentre trips.
We know how disruptive this has been for those customers affected, and we are doing all we can to work towards a stable service for everyone.
Y.Witless has very few customers connected to it, and will be rebooted at 3AM on Sat 13 Jan.
This incident will be closed as we have posted a new update/summary regarding this problem: https://aastatus.net/42608