Recent posts
Timeline view of events on our network and systems

Events from the AAISP network over the last few months, shown on a scrollable timeline. Mouse over an event for brief details, or click an incident to view the full post.

MINOR Closed Data SIMs
AFFECTED
Data SIMs
STARTED
Mar 04, 10:45 PM (5½ hours ago)
CLOSED
Mar 04, 10:53 PM (5¼ hours ago)
DESCRIPTION
We saw Data SIMs drop and reconnect at around 22:45 this evening. This looks to be caused by something upstream of us at the carrier.
Resolution: Services are back online. A ticket has been opened with our carrier for more information.

MAINTENANCE Completed LNS
AFFECTING
LNS
STARTED
Mar 01, 03:00 AM (4 days ago)
CLOSED
Mar 01, 04:44 AM (3¾ days ago)
DESCRIPTION

We have work planned for the early hours of Friday morning that entails upgrading software on our LNSs and moving CityFibre and higher-speed BT/TalkTalk services on to separate pools of routers (LNSs) at our side.

In practice this will mean that most customers with speeds of 80Mb/s and above will experience a few PPP drops and reconnects between 3AM and 5AM as we carry out the work.

This is related to the hardware hangs we've been experiencing: https://aastatus.net/42608, and it will help us further investigate this ongoing issue.


Resolution: This work has been completed.

MINOR Closed BT DSL
AFFECTED
BT DSL
STARTED
Mar 01, 01:00 AM (4 days ago)
CLOSED
Mar 01, 02:15 AM (4 days ago)
DESCRIPTION
BT carried out planned work that affected one of our 4 links to them at 1AM and 2AM. This caused lines to drop and reconnect. BT failed to inform us of this work (again). Had BT informed us, we would have cleanly moved traffic off the affected link.
Resolution: A formal complaint has been raised with BT. (Drops between 3AM and 5AM were A&A planned work)

MINOR Closed VoIP and SIMs
AFFECTED
VoIP and SIMs
STARTED
Feb 29, 02:17 PM (4½ days ago)
CLOSED
Feb 29, 06:25 PM (4¼ days ago)
DESCRIPTION
Our SIP2SIM carrier is investigating a problem with voice and SMS; this is affecting the O2 profile only. We'll update this post as soon as we have further information.
Resolution: Our service provider has advised that the issue was resolved at 18:25.

MAJOR Closed Broadband
AFFECTED
Broadband
STARTED
Feb 27, 12:00 PM (6½ days ago)
CLOSED
Feb 27, 12:31 PM (6½ days ago)
DESCRIPTION
A number of lines dropped and reconnected at around noon. We're investigating.
Resolution:

The X.Witless LNS hung and restarted, which caused customers to disconnect and reconnect.

This incident is related to https://aastatus.net/42608. X.Witless had been running without incident for 104 days. However, it is not fitted with an NVMe drive and was running software that pre-dates our NVMe drive fixes. We suspect the hang was caused by these two factors.

Further work on our LNSs is being planned and updates will be posted to the status page in due course.


MINOR Closed DATA SIMs
AFFECTED
DATA SIMs
STARTED
Feb 26, 12:35 PM (7½ days ago)
CLOSED
Feb 26, 01:00 PM (7½ days ago)
DESCRIPTION
We're seeing a large number of Data SIMs disconnect and reconnect. This looks to be upstream of us, possibly in the mobile network.
Resolution:

MINOR Closed SMS
AFFECTED
SMS
STARTED
Feb 19, 12:36 AM (15 days ago)
CLOSED
Feb 19, 03:00 PM (14½ days ago)
DESCRIPTION
We're investigating problems with inbound/outbound SMS. (Outbound SMS will work, but our SMS API will take a while to respond due to timeouts with our main carrier before it fails over to the secondary carrier)
Resolution: Service was restored at around 3PM. We'll post further updates from our carrier as we get them. During this time outgoing SMSs were working via our secondary carrier, but our API would have taken longer than usual to accept messages. Incoming messages sent today would have been delayed and received after 3PM.
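
As an aside, here is a minimal sketch (in Python, with entirely hypothetical helper functions and timeout values; the real API and carrier interfaces are not shown on this page) of the kind of primary/secondary failover described above:

def send_sms(message, submit_primary, submit_secondary, primary_timeout=10):
    # Try the primary carrier first; fall back to the secondary on a timeout.
    # While the primary is down every message still goes out via the secondary,
    # but only after the primary's timeout has elapsed - which is why the API
    # was slower than usual to accept messages during this incident.
    try:
        return submit_primary(message, timeout=primary_timeout)
    except TimeoutError:
        return submit_secondary(message)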

MAINTENANCE Completed LNS
AFFECTING
LNS
STARTED
Feb 17, 01:00 AM (17 days ago)
CLOSED
Feb 19, 01:52 AM (15 days ago)
DESCRIPTION

During the early hours of Saturday 16th and Sunday 17th we will be performing some software upgrades and rebalancing of customers that are currently on the A, B, C, G, Y, Z LNSs.

The aim here is to upgrade the software on some of them and to spread the customers a bit more evenly over them. Customers on X will remain as is.
Resolution: This work has been completed - Some lines were moved to different LNSs in the early hours of Saturday and Monday.

MAINTENANCE Completed LNS
AFFECTING
LNS
STARTED
Feb 16, 12:30 PM (17½ days ago)
CLOSED
Feb 16, 12:40 PM (17½ days ago)
DESCRIPTION
The very small number of customers on G.gormless had their service drop and reconnect, caused by the LNS locking up.
Resolution:

MINOR Closed LNS
AFFECTED
LNS
STARTED
Feb 14, 08:50 PM (19¼ days ago)
CLOSED
Feb 14, 09:37 PM (19¼ days ago)
DESCRIPTION
The small number of customers on a.gormless had their service drop and reconnect, caused by the LNS locking up.
Resolution:

MINOR Closed Hetzner
AFFECTED
Hetzner
STARTED
Feb 13, 07:00 PM (20¼ days ago)
CLOSED
Mar 01, 04:45 AM (3¾ days ago)
DESCRIPTION
Hetzner is a German-based server hosting provider. We have seen intermittent problems in routing traffic to them in recent days.
Resolution: No further update or incident. We'll close this for now.

MINOR Closed Data SIMs
AFFECTED
Data SIMs
STARTED
Feb 12, 11:40 AM (21½ days ago)
CLOSED
Feb 12, 03:50 PM (21½ days ago)
DESCRIPTION
We're seeing Data SIMs dropping and reconnecting. We're investigating.
Resolution: Our carrier says: "We can see that there has been a power service disruption within the Three network." The service is now stable. Also see: https://www.ispreview.co.uk/index.php/2024/02/three-uk-suffers-third-major-mobile-outage-in-four-days.html

MAINTENANCE Completed LNS
AFFECTING
LNS
STARTED
Feb 11, 04:00 AM (23 days ago)
CLOSED
Feb 11, 05:00 AM (22¾ days ago)
DESCRIPTION
In the early hours of Sunday morning (4AM and 4:30AM) we will be performing a software upgrade on G.Gormless and Y.Witless. This will cause a drop for the small number of customers on these LNSs, but they will reconnect quickly. This software adds a bit more logging to aid our diagnostics.
Resolution: This has been completed.

MAINTENANCE Completed LNS
AFFECTING
LNS
STARTED
Feb 10, 02:00 AM (24 days ago)
CLOSED
Feb 10, 10:00 AM (23¾ days ago)
DESCRIPTION

This week we have installed three additional FB9000 LNSs: A.Gormless, B.Gormless and C.Gormless. These are in addition to the existing G.Gormless, X.Witless, Y.Witless and Z.Witless, which are all used for customers on faster circuits (80M and above).

As of Friday 9th customers (with 80M and above connections) are mostly connected to G, X, and Y.

During the early hours of Saturday 10th and Sunday 11th we will rebalance customers that are currently on G and Y over to A, B, C, G, Y and Z. The aim here is to spread the load so fewer customers are on each LNS, which means fewer customers will be affected if an LNS locks up. Customers on X will remain as is.


Resolution: This work has been completed.

MINOR Closed BT
AFFECTED
BT
STARTED
Feb 07, 12:40 AM (27 days ago)
CLOSED
Feb 07, 12:45 AM (27 days ago)
DESCRIPTION
At around 00:40 we saw a number of BT circuits drop and reconnect. From the pattern of the lines which dropped, it looks like something within the BT network broke and re-routed traffic. We'll be in contact with BT for more information regarding this.
Resolution: BT have confirmed that this drop was caused by them performing emergency planned work - however, they also said that they failed to let us, and other ISPs, know about this work. They are investigating why they didn't inform us, and we have also asked them why they didn't seamlessly move traffic off the link before their work.

MINOR Closed LNS
AFFECTED
LNS
STARTED
Feb 05, 05:20 PM (28¼ days ago)
CLOSED
Feb 05, 06:19 PM (28¼ days ago)
DESCRIPTION
At around 17:20 the Z.Witless LNS hung, causing customers on it to drop and reconnect.
Resolution: The lock-up of Z.Witless was unfortunate as it did cause a disruption to some of our customers this evening. However, Z.Witless is currently out of service and in its locked state, where our developers can connect to its CPUs/memory/etc. and see if they can gain more information. This is the same as the issue with Y.Witless on Saturday afternoon; data from that has been downloaded and has been analysed today.

MAINTENANCE Completed LNS
AFFECTING
LNS
STARTED
Feb 03, 01:00 AM (1 month ago)
CLOSED
Feb 05, 06:20 PM (28¼ days ago)
DESCRIPTION
We are making plans to replace a few more of the "Gormless" LNSs with FireBrick FB9000 models. In order to free up these LNSs we will be moving the small number of customers that are currently on A, B & C Gormless to other LNSs. This will happen in the early hours of Saturday and Sunday morning.
Resolution: A, B and C Gormless are now free of customers. We plan to swap these over to FB9000s on Wednesday.

MAINTENANCE Assumed Completed Broadband
AFFECTING
Broadband
STARTED
Jan 19, 03:50 PM (1½ months ago)
DESCRIPTION

This is a summary and update regarding the problems we've been having with our network, causing line drops for some customers, interrupting their Internet connections for a few minutes at a time. It carries on from the earlier, now out of date, post: https://aastatus.net/42577

We are not only an Internet Service Provider.

We also design and build our own routers under the FireBrick brand. This equipment is what we predominantly use in our own network to provide Internet services to customers. These routers are installed between our wholesale carriers (e.g. BT, CityFibre and TalkTalk) and the A&A core IP network. The type of router is called an "LNS", which stands for L2TP Network Server.

FireBricks are also deployed elsewhere in the core; providing our L2TP and Ethernet services, as well as facing the rest of the Internet as BGP routers to multiple Transit feeds, Internet Exchanges and CDNs.

Throughout the entire existence of A&A as an ISP, we have been running various models of FireBrick in our network.

Our newest model is the FB9000. We have been running a mix of prototype, pre-production and production variants of the FB9000 within our network since early 2022.

As can sometimes happen with a new product, at a certain point we started to experience some strange behaviour; essentially the hardware would lock up and "watchdog" (and reboot) unpredictably.

Compared to a software 'crash', a hardware lock-up is very hard to diagnose, as little information is obtainable when it happens. If the FireBrick software ever crashes, a 'core dump' is posted with specific information about where the software problem happened. This makes it a lot easier to find and fix.

After intensive work by our developers, the cause was identified as (unexpectedly) something to do with the NVMe socket on the motherboard. At design time, we had included an NVMe socket connected to the PCIe pins on the CPU, for undecided possible future uses. We did not populate the NVMe socket, though. The hanging issue completely cleared up once an NVMe was installed, even though it was not used for anything at all.

As a second approach, the software was then modified to force the PCIe to be switched off such that we would not need to install NVMes in all the units.
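
(A loose illustration only: the FireBrick firmware is bespoke and its internals are not published, but on a general-purpose Linux system the equivalent "switch off the unused device in software" step could be done through sysfs, roughly as below. The PCIe address is hypothetical.)

from pathlib import Path

def remove_pci_device(address="0000:01:00.0"):  # hypothetical address; find yours with lspci
    # Writing "1" to the device's sysfs "remove" node detaches it from the
    # running kernel, so nothing drives that PCIe link afterwards (needs root).
    (Path("/sys/bus/pci/devices") / address / "remove").write_text("1")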

This certainly did solve the problem in our test rig (which is multiple FB9000s, PCs to generate traffic, switches, etc.). For several weeks, FireBricks which had formerly been hanging often under "artificially worsened" test conditions stopped hanging altogether, becoming extremely stable.

So, we thought the problem was resolved. And, indeed, in our test rig we still have not seen a hang. Not even once, across multiple FB9000s.

However...

We did then start seeing hangs in our Live prototype units in production (causing dropouts to our broadband customers).

At the same time, the FB9000s we have elsewhere in our network, not running as LNS routers, are stable.

We are still working on pinpointing the cause of this, which we think is highly likely to be related to the original (now solved) problem.

Further work...

Over the next 1-2 weeks we will be installing several extra FB9000 LNS routers. We are installing these with additional low-level monitoring capabilities in the form of JTAG connections from the main PCB so that in the event of a hardware lock-up we can directly gather more information.

The enlarged pool of LNSs will also reduce the number of customers affected if there is a lock-up of one LNS.
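
(Back-of-the-envelope only, with a made-up line count, to show how a bigger pool shrinks the impact of any single lock-up:)

lines = 12000  # made-up figure, not an actual A&A line count
for pool_size in (4, 7):  # e.g. G/X/Y/Z alone vs. with A/B/C added
    print(f"{pool_size} LNSs -> roughly {lines // pool_size} lines hit if one locks up")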

We obviously do apologise for the blips customers have been seeing. We do take this very seriously, and are not happy when customers are inconvenienced.

We can imagine some customers might also be wondering why we bother to make our own routers, and not just do what almost all other ISPs do, and simply buy them from a major manufacturer. This is a fair question. At times like this, it is a question we ask ourselves!

Ultimately, we do still firmly believe the benefits of having the FireBrick technology under our complete control outweigh the disadvantages. CQM graphs are still almost unique to us, and these would simply not be possible without FireBrick. There have also been numerous cases where our direct control over the firmware has enabled us to implement improvements and changes that have benefited one or many customers.

Many times over the years we have been able to diagnose problems with our carrier partners, which they themselves could not see or investigate. This level of monitoring is facilitated by having FireBricks.

But in order to have finished FireBricks, we have to develop them. And development involves testing, and testing can sometimes reveal problems, which then affect customers.

We do not feel we were irrationally premature in introducing prototype FireBricks into our network, having had them under test (not routing live customer traffic) for an appropriate period beforehand.

But some problems can only reveal themselves once a "real world" level and nature of traffic is being passed. This is unavoidable, and whilst we do try hard to minimise disruption, we still feel the long-term benefits of having FireBricks more than offset the short-term problems in the late stages of development. We hope our detailed view on this is informative, and even persuasive.