Recent posts
Timeline view of events on our network and systems

Events from the AAISP network from the last few months.

MINOR Open VoIP and SIMs
AFFECTING
VoIP and SIMs
STARTED
Jun 05, 03:20 PM (7¾ days ago)
DESCRIPTION
We are again seeing an issue where calls from ONSIM SIMs are not immediately connecting, and may appear to the recipient as several call attempts. We are chasing ONSIM on this.

MINOR Closed BT
AFFECTED
BT
STARTED
Jun 05, 10:45 AM (8 days ago)
CLOSED
Jun 10, 12:30 PM (2¾ days ago)
DESCRIPTION
At around 10:45 we had a small number of BT lines drop and reconnect. This was caused by one of our interconnects to BT going down; the same thing also happened overnight at midnight.
Resolution:

MAINTENANCE Completed Control Pages
AFFECTING
Control Pages
STARTED
Jun 04, 08:00 PM (8½ days ago)
CLOSED
Jun 04, 08:26 PM (8½ days ago)
DESCRIPTION
On Tuesday evening we'll be doing some maintenance on the Control Pages, which will mean they will be out of action for a little while as this work is carried out.
Resolution:

MINOR Closed DSL
AFFECTED
DSL
STARTED
Jun 03, 01:05 PM (9¾ days ago)
CLOSED
Jun 03, 02:00 PM (9¾ days ago)
DESCRIPTION
Some lines dropped and reconnected at around 13:05. We're investigating the cause.
Resolution: P.Gormless suffered a hardware problem and restarted. Customers on P.Gormless dropped their connection and reconnected moments after. This LNS has been taken out of service.

MAINTENANCE Assumed Completed authoritative DNS
AFFECTING
authoritative DNS
STARTED
Jun 03, 10:35 AM (10 days ago)
DESCRIPTION

This is only relevant to customers who run their own authoritative DNS servers and use our secondary-dns.co.uk as an additional nameserver.

Overview: We run a "secondary" DNS service for customers: they run the master DNS server and we act as secondary (slave) servers. We have a project underway to migrate all our authoritative DNS services to a new platform. As part of this we need to disable some of the automation we use for adding and updating the customer's master IP address automatically.

The change: From June 17th, if you run your own master DNS server for your domain(s) and secondary-dns.co.uk is a slave, and you change the IP address of your master, you will need to contact support@aa.net.uk to ask us to update our side.

We have more information about our Authoritative DNS project on our Support Site: https://support.aa.net.uk/New_Authoritive_DNS
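
As a rough illustration only (this is not official AAISP guidance), the sketch below uses the dnspython library to compare the SOA serial served by your own master with the serial served by secondary-dns.co.uk, which is one way to confirm the secondary is still transferring your zone after a change. The domain name and master IP address are placeholders.

# Minimal sketch, assuming dnspython is installed.
# "example.com" and 192.0.2.1 are placeholders for your own domain and
# master server; replace them with your real values.
import dns.message
import dns.query
import dns.resolver

DOMAIN = "example.com"      # your domain (placeholder)
MASTER = "192.0.2.1"        # your master DNS server's IP (placeholder)
SECONDARY = "secondary-dns.co.uk"

def soa_serial(server_ip, domain):
    # Query the given server directly for the zone's SOA record and
    # return the serial number from the answer.
    query = dns.message.make_query(domain, "SOA")
    response = dns.query.udp(query, server_ip, timeout=5)
    for rrset in response.answer:
        for rr in rrset:
            return rr.serial
    raise RuntimeError("No SOA answer from %s for %s" % (server_ip, domain))

# Resolve the secondary's address, then compare the two serials.
secondary_ip = dns.resolver.resolve(SECONDARY, "A")[0].address
master_serial = soa_serial(MASTER, DOMAIN)
secondary_serial = soa_serial(secondary_ip, DOMAIN)

print("master serial:   ", master_serial)
print("secondary serial:", secondary_serial)
print("in sync" if master_serial == secondary_serial else "secondary is behind")

If the serials match, the secondary picked up your latest zone change; if the secondary is behind (or the query fails) after you have moved your master to a new IP address, that is the situation where you would need to contact support as described above.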


MINOR Closed Routing
AFFECTED
Routing
STARTED
Jun 01, 08:00 PM (11½ days ago)
CLOSED
Jun 01, 10:00 PM (11½ days ago)
DESCRIPTION
Since 8pm on Saturday evening we were seeing high latency and packet loss to some IPv6 addresses hosted by he.net (Hurricane Electric). At around 10pm we changed our outbound routing and latency reduced to normal levels.
Resolution:

MINOR Closed TalkTalk
AFFECTED
TalkTalk
STARTED
May 31, 05:50 PM (12½ days ago)
CLOSED
May 31, 06:10 PM (12½ days ago)
DESCRIPTION
At 17:50 we had a large number of TalkTalk lines drop and reconnect... investigating.
Resolution: TalkTalk had a line card problem on a router in Telehouse. Customers routing through it at the time dropped and reconnected via alternative routes in to our network.

MINOR Closed Data SIMs
AFFECTED
Data SIMs
STARTED
May 31, 02:00 PM (12¾ days ago)
CLOSED
May 31, 02:15 PM (12¾ days ago)
DESCRIPTION
At 2pm we saw a fair number of our Data SIMs drop their connection to us and reconnect moments after. We suspect a problem upstream, possibly within the mobile network. Services have reconnected.
Resolution:

MINOR Closed BT
AFFECTED
BT
STARTED
May 31, 10:30 AM (13 days ago)
CLOSED
May 31, 12:00 PM (12¾ days ago)
DESCRIPTION
We have seen two blips today (10:30 and 11:48), these have affected only a small number of BT circuits but we have raised this with BT. We suspect some sort of fibre break in a specific area of the country has caused lines to be re-routed.
Resolution: A faulty network card within BT's core network caused routing flaps. The card is due to be replaced today.

MINOR Closed BT
AFFECTED
BT
STARTED
May 30, 11:39 PM (13¼ days ago)
CLOSED
May 31, 12:15 PM (12¾ days ago)
DESCRIPTION
We saw a large number of BT lines drop and reconnect at 23:39, they had mostly all recovered by midnight. We have a case open with BT regarding this.
Resolution: A faulty network card within BT's core network caused routing flaps. The card is due to be replaced today.

MINOR Closed VoIP and SIMs
AFFECTED
VoIP and SIMs
STARTED
May 29, 12:00 AM (15¼ days ago)
CLOSED
May 29, 03:18 PM (14¾ days ago)
DESCRIPTION
Making calls from mobile SIMs is not working cleanly right now. We're working with ONSIM on this. Calls do connect if you are patient, but the called party may see several call attempts over several seconds before they get one that works.
Resolution: Profiles on all SIMs have been changed to work around this for now. It still needs resolving, but for now it should not be impacting any customers.

MAINTENANCE Completed BT
AFFECTING
BT
STARTED
May 24, 12:01 AM (20¼ days ago)
CLOSED
May 24, 01:00 AM (20¼ days ago)
DESCRIPTION
BT have planned work on their side of one of our hostlinks, on 24th May between midnight and 6AM. We will move traffic away from this hostlink beforehand so as to minimise the impact on customers. We don't expect this to impact customers.
Resolution: The work was carried out without affecting customer traffic.

MINOR Closed TalkTalk
AFFECTED
TalkTalk
STARTED
May 21, 09:01 PM (22½ days ago)
CLOSED
May 21, 09:08 PM (22½ days ago)
DESCRIPTION
We've seen a small number of TalkTalk lines drop their connection at around 9pm. These are in specific areas, such as Andover, Nottingham, Ripley and a couple of others. We suspect a fault within TalkTalk's network. Usually with a fault like this the lines reconnect via alternative routes almost immediately, but so far they have not. Updates to follow.
Resolution:

MINOR Closed BT
AFFECTED
BT
STARTED
May 15, 12:46 PM (28¾ days ago)
CLOSED
May 15, 01:00 PM (28¾ days ago)
DESCRIPTION
At 12:46 we saw a small number of BT circuits drop and reconnect. We suspect a problem within BT's network, but we're awaiting more information from them. Lines reconnected right away.
Resolution: Lines reconnected right away, BT unable to identify the cause at this point in time.

MINOR Closed Graphs
AFFECTED
Graphs
STARTED
May 05, 05:00 PM (1¼ months ago)
CLOSED
May 07, 04:00 PM (1 month ago)
DESCRIPTION
Some customers are missing CQM graphs from over the weekend. This is due to an upset disk array.
Resolution:

MAINTENANCE Completed VoIP and SIMs
AFFECTING
VoIP and SIMs
STARTED
May 01, 06:00 PM (1¼ months ago)
CLOSED
May 06, 05:00 PM (1 month ago)
DESCRIPTION
Following issues this week with calls to SIP2SIM phones not working reliably, the carrier has finally updated their systems to have fixed firewall rules. This solves one of the key issues we have seen with unreliability: their firewall rules were not being generated correctly from the DNS records for some reason. We are now able to finish updating our routing logic over the next few days. This should not impact services, but there is a risk. Thankfully the risk is quite low and, importantly, likely short-lived, as it seems their main call handling servers are reasonably quick to pick up DNS changes (unlike their firewall rules). The aim is to make most of the changes over the weekend.
Resolution:

MAINTENANCE Completed TalkTalk
AFFECTING
TalkTalk
STARTED
Apr 21, 09:00 PM (1¾ months ago)
CLOSED
May 16, 06:00 AM (28 days ago)
DESCRIPTION

We have multiple interlinks to TalkTalk that carry our broadband traffic. TalkTalk have scheduled planned work on both of these links during a four week period from Tuesday 23rd April until 16th May (specifically midnight to 6AM on 23rd and 25th April and 1st, 2nd, 9th and 16th May).

Due to the work being carried out (Software updates of their "LTSs") we are unable to move traffic seamlessly between our interlinks and so TalkTalk customers will see their connections drop and reconnect on these early mornings.


Resolution: TalkTalk's PEW window is now over.

MAINTENANCE Completed Broadband
AFFECTING
Broadband
STARTED
Jan 19, 03:50 PM (4¾ months ago)
CLOSED
Jun 12, 04:00 PM (19½ hours ago)
DESCRIPTION

This is a summary and update regarding the problems we've been having with our network, causing line drops for some customers, interrupting their Internet connections for a few minutes at a time. It carries on from the earlier, now out of date, post: https://aastatus.net/42577

We are not only an Internet Service Provider.

We also design and build our own routers under the FireBrick brand. This equipment is what we predominantly use in our own network to provide Internet services to customers. These routers are installed between our wholesale carriers (e.g. BT, CityFibre and TalkTalk) and the A&A core IP network. The type of router is called an "LNS", which stands for L2TP Network Server.

FireBricks are also deployed elsewhere in the core; providing our L2TP and Ethernet services, as well as facing the rest of the Internet as BGP routers to multiple Transit feeds, Internet Exchanges and CDNs.

Throughout the entire existence of A&A as an ISP, we have been running various models of FireBrick in our network.

Our newest model is the FB9000. We have been running a mix of prototype, pre-production and production variants of the FB9000 within our network since early 2022.

As can sometimes happen with a new product, at a certain point we started to experience some strange behaviour; essentially the hardware would lock-up and "watchdog" (and reboot) unpredictably.

Compared to a software 'crash', a hardware lock-up is very hard to diagnose, as little information is obtainable when it happens. If the FireBrick software ever crashes, a 'core dump' is posted with specific information about where the software problem happened. This makes it a lot easier to find and fix.

After intensive work by our developers, the cause was identified as (unexpectedly) something to do with the NVMe socket on the motherboard. At design time, we had included an NVMe socket connected to the PCIe pins on the CPU, for undecided possible future uses. We did not populate the NVMe socket, though. The hanging issue completely cleared up once an NVMe was installed, even though it was not used for anything at all.

As a second approach, the software was then modified to force the PCIe to be switched off such that we would not need to install NVMes in all the units.

This certainly did solve the problem in our test rig (which comprises multiple FB9000s, PCs to generate traffic, switches, etc.). For several weeks, FireBricks which had formerly been hanging often under "artificially worsened" test conditions literally stopped hanging altogether, becoming extremely stable.

So, we thought the problem was resolved. And, indeed, in our test rig we still have not seen a hang. Not even once, across multiple FB9000s.

However...

We did then start seeing hangs in our Live prototype units in production (causing dropouts to our broadband customers).

At the same time, the FB9000s we have elsewhere in our network, not running as LNS routers, are stable.

We are still working on pinpointing the cause of this, which we think is highly likely to be related to the original (now, solved) problem.

Further work...

Over the next 1-2 weeks we will be installing several extra FB9000 LNS routers. We are installing these with additional low-level monitoring capabilities in the form of JTAG connections from the main PCB so that in the event of a hardware lock-up we can directly gather more information.

The enlarged pool of LNSs will also reduce the number of customers affected if there is a lock-up of one LNS.

We obviously do apologise for the blips customers have been seeing. We do take this very seriously, and are not happy when customers are inconvenienced.

We can imagine some customers might also be wondering why we bother to make our own routers, and not just do what almost all other ISPs do, and simply buy them from a major manufacturer. This is a fair question. At times like this, it is a question we ask ourselves!

Ultimately, we do still firmly believe the benefits of having the FireBrick technology under our complete control outweigh the disadvantages. CQM graphs are still almost unique to us, and these would simply not be possible without FireBrick. There have also been numerous individual cases where our direct control over the firmware has enabled us to implement individual improvements and changes that have benefitted one or many customers.

Many times over the years we have been able to diagnose problems with our carrier partners, which they themselves could not see or investigate. This level of monitoring is facilitated by having FireBricks.

But in order to have finished FireBricks, we have to develop them. And development involves testing, and testing can sometimes reveal problems, which then affect customers.

We do not feel we were irrationally premature in introducing prototype FireBricks into our network, having had them under test, not routing live customer traffic, for an appropriate period beforehand.

But some problems can only reveal themselves once a "real world" level and nature of traffic is being passed. This is unavoidable, and whilst we do try hard to minimise disruption, we still feel the long term benefits of having FireBricks more than offset the short term problems in the late stages of development. We hope our detailed view on this is informative, and even persuasive.


Resolution: We are still running the 'Factory' release software on our production LNSs and we consider them stable.

Work is still being done away from our LNSs regarding the cause of the hangs, but due to the nature of the problem it is a time consuming process.

Moving forward: Over the coming months we are planning to migrate our FB6000 LNS pool to FB9000 (running the stable, factory software). Most of our non-LNS routers (e.g. those used for BGP, L2TP and Ethernet services) have already been migrated over to FB9000 hardware and have been running, in some cases, for nearly 2 years.

We will create new Status Posts regarding the work to migrate our FB6000s to FB9000.