Last 10 posts, sorted by last updated

MAINTENANCE Assumed Completed Broadband
AFFECTING
Broadband
STARTED
Jan 19, 03:50 PM (2¼ months ago)
DESCRIPTION

This is a summary and update regarding the problems we've been having with our network, causing line drops for some customers, interrupting their Internet connections for a few minutes at a time. It carries on from the earlier, now out of date, post: https://aastatus.net/42577

We are not only an Internet Service Provider.

We also design and build our own routers under the FireBrick brand. This equipment is what we predominantly use in our own network to provide Internet services to customers. These routers are installed between our wholesale carriers (e.g. BT, CityFibre and TalkTalk) and the A&A core IP network. The type of router is called an "LNS", which stands for L2TP Network Server.
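To illustrate the encapsulation an LNS terminates, here is a minimal sketch in Python of an L2TPv2 data-message header as defined in RFC 2661. This is a generic protocol illustration, not FireBrick code, and the tunnel and session IDs are invented values.

```python
import struct

def l2tp_data_header(tunnel_id: int, session_id: int) -> bytes:
    """Build a minimal L2TPv2 data-message header (RFC 2661).

    With none of the optional Length/Sequence/Offset fields present,
    the header is just: flags/version (2 bytes), Tunnel ID (2 bytes),
    Session ID (2 bytes) -- 6 bytes preceding the PPP payload.
    """
    FLAGS_VERSION = 0x0002  # T=0 (data message), Ver=2 (L2TPv2)
    return struct.pack(">HHH", FLAGS_VERSION, tunnel_id, session_id)

# Hypothetical tunnel 9 from a carrier's LAC, session 1234 for one line
header = l2tp_data_header(9, 1234)
print(header.hex())  # 0002000904d2
```

The LNS strips this header and the carrier-side framing, leaving the customer's PPP session to be routed onto the core IP network.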

FireBricks are also deployed elsewhere in the core; providing our L2TP and Ethernet services, as well as facing the rest of the Internet as BGP routers to multiple Transit feeds, Internet Exchanges and CDNs.

Throughout the entire existence of A&A as an ISP, we have been running various models of FireBrick in our network.

Our newest model is the FB9000. We have been running a mix of prototype, pre-production and production variants of the FB9000 within our network since early 2022.

As can sometimes happen with a new product, at a certain point we started to experience some strange behaviour; essentially the hardware would lock-up and "watchdog" (and reboot) unpredictably.

Compared to a software 'crash', a hardware lock-up is very hard to diagnose, as little information is obtainable when it happens. If the FireBrick software ever crashes, a 'core dump' is posted with specific information about where the software problem happened. This makes it a lot easier to find and fix.

After intensive work by our developers, the cause was identified as (unexpectedly) something to do with the NVMe socket on the motherboard. At design time, we had included an NVMe socket connected to the PCIe pins on the CPU, for possible future uses. We did not populate the NVMe socket, though. The hanging issue completely cleared up once an NVMe drive was installed, even though it was not used for anything at all.

As a second approach, the software was then modified to force the PCIe interface to be switched off, so that we would not need to install NVMe drives in all the units.

This certainly did solve the problem in our test rig (which comprises multiple FB9000s, PCs to generate traffic, switches, etc.). For several weeks, FireBricks which had formerly been hanging often under "artificially worsened" test conditions stopped hanging altogether, becoming extremely stable.

So, we thought the problem was resolved. And, indeed, in our test rig we still have not seen a hang. Not even once, across multiple FB9000s.

However...

We did then start seeing hangs in our live prototype units in production (causing dropouts for our broadband customers).

At the same time, the FB9000s we have elsewhere in our network, not running as LNS routers, are stable.

We are still working on pinpointing the cause of this, which we think is highly likely to be related to the original (now, solved) problem.

Further work...

Over the next 1-2 weeks we will be installing several extra FB9000 LNS routers. We are installing these with additional low-level monitoring capabilities, in the form of JTAG connections from the main PCB, so that in the event of a hardware lock-up we can gather more information directly.

The enlarged pool of LNSs will also reduce the number of customers affected if there is a lock-up of one LNS.
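As rough arithmetic (with illustrative numbers only, not actual A&A figures): if sessions are balanced evenly across the pool, the share of customers hit by a single LNS lock-up falls in proportion to the pool size.

```python
def affected_fraction(num_lns: int) -> float:
    """Fraction of broadband sessions dropped if one LNS in an
    evenly balanced pool of num_lns routers locks up."""
    return 1 / num_lns

# Illustrative only: growing a pool from 4 to 6 LNSs shrinks the
# blast radius of one lock-up from 25% of sessions to about 16.7%.
print(affected_fraction(4))  # 0.25
print(affected_fraction(6))
```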

We obviously do apologise for the blips customers have been seeing. We do take this very seriously, and are not happy when customers are inconvenienced.

We can imagine some customers might also be wondering why we bother to make our own routers, and not just do what almost all other ISPs do, and simply buy them from a major manufacturer. This is a fair question. At times like this, it is a question we ask ourselves!

Ultimately, we do still firmly believe the benefits of having the FireBrick technology under our complete control outweigh the disadvantages. CQM graphs are still almost unique to us, and these would simply not be possible without FireBrick. There have also been numerous individual cases where our direct control over the firmware has enabled us to implement individual improvements and changes that have benefitted one or many customers.

Many times over the years we have been able to diagnose problems with our carrier partners, which they themselves could not see or investigate. This level of monitoring is facilitated by having FireBricks.

But in order to have finished FireBricks, we have to develop them. And development involves testing, and testing can sometimes reveal problems, which then affect customers.

We do not feel we were premature in introducing prototype FireBricks into our network, having had them under test, not routing live customer traffic, for an appropriate period beforehand.

But some problems can only reveal themselves once a "real world" level and nature of traffic is being passed. This is unavoidable, and whilst we do try hard to minimise disruption, we still feel the long-term benefits of having FireBricks more than offset the short-term problems in the late stages of development. We hope our detailed view on this is informative, and even persuasive.


MAJOR Closed Broadband
AFFECTED
Broadband
STARTED
Feb 27, 12:00 PM (29¾ days ago)
CLOSED
Feb 27, 12:31 PM (29¾ days ago)
DESCRIPTION
A number of lines dropped and reconnected at around noon. We're investigating.
Resolution:

The X.Witless LNS hung and restarted, which caused customers to disconnect and reconnect.

This incident is related to https://aastatus.net/42608. X.Witless had been running without incident for 104 days. However, it is not fitted with an NVMe drive and was running software that pre-dates our NVMe drive fixes. We suspect the hang was caused by these two factors.

Further work on our LNSs is being planned and updates will be posted to the status page in due course.


MINOR Closed Broadband
AFFECTED
Broadband
STARTED
Jan 23, 11:29 AM (2 months ago)
CLOSED
Jan 23, 11:58 AM (2 months ago)
DESCRIPTION
At around 11:29 a small number of customers dropped their sessions and reconnected a few minutes later. This was due to our planned work in the datacentre today, though these drops were not expected.
Resolution: The drop for these customers was caused by human error - I do apologise.

MINOR Closed Broadband
AFFECTED
Broadband
STARTED
Jan 10, 02:10 PM (2½ months ago)
CLOSED
Jan 10, 02:30 PM (2½ months ago)
DESCRIPTION
Investigations underway
Resolution: A crash of the Z.Witless LNS caused a number of lines to drop and reconnect. The cause has been investigated, and is related to an extremely rare race condition. (This is separate from the previous problems we had in December, which have since also been fixed.)

MAINTENANCE Completed Broadband
AFFECTING
Broadband
STARTED
Dec 06, 07:15 AM (3¾ months ago)
CLOSED
Dec 06, 08:06 AM (3¾ months ago)
DESCRIPTION
We are carrying out routine work on a pair of core switches on the morning of Wednesday the 6th of December. We do not expect this work to be service affecting.
Resolution: We have backed out of this work due to the overlap with this issue: "Some BT Circuits Dropping or high packetloss - https://aastatus.net/42586". We're not yet sure whether this work was related to the outage, and we're still investigating. Further updates will be posted to https://aastatus.net/42586

MINOR Closed Broadband
AFFECTED
Broadband
STARTED
Nov 17, 10:23 AM (4¼ months ago)
CLOSED
Nov 17, 05:18 PM (4¼ months ago)
DESCRIPTION
Update from supplier: We've identified an issue impacting customer connectivity. Our initial analysis shows the issue is impacting service in and around London regions. We're currently working through the impact and issue and will look to provide more detail as soon as we have it. We'll keep you informed and updated on developments. Our priority is to restore service to normal operations with minimal disruption as soon as possible.
Resolution: This was caused by a fibre break in the London area - traffic was rerouted, and has remained stable.

MINOR Closed Broadband
AFFECTED
Broadband
STARTED
Jul 01, 12:29 PM (8¾ months ago)
CLOSED
Jul 01, 07:10 PM (8¾ months ago)
DESCRIPTION
We're aware of a problem affecting some TalkTalk based lines. We are investigating with TalkTalk.
Resolution: Since the reconnects at 19:09, the packet loss problem has gone away, and TalkTalk lines look to be working as normal. Interestingly, a few lines actually started to show small amounts of packet loss on Thursday at 2:20 AM, which gradually increased. We'll post more information about the case as and when we get it from TalkTalk.

TalkTalk's conclusion is: We have moved traffic from a faulty card; the card has been shut down, and this has resolved the issue.


MINOR Closed Broadband
AFFECTED
Broadband
STARTED
Jun 27, 05:44 PM (9 months ago)
CLOSED
Jun 28, 12:24 PM (9 months ago)
DESCRIPTION
We've noticed a number of TalkTalk lines drop starting around 17:27. Most have since reconnected. We have opened a ticket with TT to investigate.
Resolution: Update from TalkTalk: Some customers got re-routed to alternate core network devices following two short-duration fibre disturbances within our core network. This has been escalated to an on-call engineer for further investigation. TT have now closed the incident.

MAINTENANCE Completed Broadband
AFFECTING
Broadband
STARTED
Jun 21, 01:00 AM (9¼ months ago)
CLOSED
Jun 22, 09:00 AM (9 months ago)
DESCRIPTION
Due to the restart of the LNS earlier today, we are taking y.witless out of service. This will mean moving the lines that are currently connected to it on to another LNS. This will happen overnight tonight and will mean a short drop of service as each line reconnects. The time will be your usual 'LNS switch over time', which defaults to 1AM. This only affects a small percentage of our customers.
Resolution: This was completed - though, some lines were moved in the early hours of 22nd June.

MINOR Closed Broadband
AFFECTED
Broadband
STARTED
Jun 20, 11:06 AM (9¼ months ago)
CLOSED
Jun 20, 11:05 AM (9¼ months ago)
DESCRIPTION
We're investigating a spike of reconnecting circuits
Resolution: The LNS was reset by its auxiliary "watchdog" processor, probably as a result of over-sensitive tolerances on the power rail monitoring. The LNS is running older firmware on its auxiliary processor, and the tolerances have been changed in later revisions. We will schedule an upgrade and will take this LNS out of duty in the meantime.