Last 10 posts, sorted by last updated

MAINTENANCE Planned BT
AFFECTING
BT
STARTING
May 24, 12:01 AM (in 1 day)
DESCRIPTION
BT have planned work on their side of one of our hostlinks on 24th May, between midnight and 6AM. We will move traffic away from this hostlink beforehand, so we do not expect any impact on customers.

MAINTENANCE Assumed Completed SMS
AFFECTING
SMS
STARTED
Apr 08, 02:13 PM (1½ months ago)
DESCRIPTION
This work has started, but we did not post a planned-work notice as we expected it to be seamless. Sadly that was not quite the case today, so here is more detail on what we are planning over the next few weeks. The main thing is: if you see any problems, please tell us right away.
  • Some cosmetic improvements (nicer formatting of phone numbers) in emailed or tooted SMS (done)
  • Additional options (such as forcing the email/toots to use E.123 "+" format numbers) (done)
  • Additional options for posting JSON to http/https (TODO)
  • Allowing SMS to be relayed (chargeable) to other numbers (done)
  • We already allow multiple targets for a number for SMS (done)
  • Some improvements for 8-bit SMS, which are rare; we previously treated them as Latin-1, which is not correct (TODO)
  • Some new features for trialling a new SIP2SIM platform (TODO)
  • Improve "visible" format for content in email/toot when special characters are used (e.g. NULL as ␀) (TODO)
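As an illustration of the E.123 "+" formatting mentioned above, here is a minimal sketch. It assumes UK national input with a leading trunk "0"; the helper name is hypothetical and A&A's actual implementation will of course handle far more cases:

```python
# Hypothetical helper (not A&A's code): convert a UK national number
# such as "07700 900123" to E.123 international "+" notation.
def uk_to_e123_plus(national: str) -> str:
    digits = "".join(c for c in national if c.isdigit())
    if not digits.startswith("0"):
        raise ValueError("expected UK national format starting with 0")
    # Drop the national trunk prefix "0" and prepend the +44 country code.
    return "+44 " + digits[1:]

print(uk_to_e123_plus("07700 900123"))  # -> +44 7700900123
```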
The 8-bit data format changes are likely to be the least "backwards compatible" changes, but should not impact anyone, as they are not generally encountered. That is, incoming SMS will rarely (if ever) be 8-bit coded, and when they were, we would get special characters wrong. Similarly, sending 8-bit SMS would only show the expected characters on some older phones, and would be wrong on many others, as the specification does not say which character set to use. We will, however, handle NULLs much better, which are relevant for some special use cases.
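The "visible" rendering of special characters described above can be sketched as follows. This is an assumption about the approach, not A&A's actual code: map the C0 control characters to the Unicode Control Pictures block, so that NUL shows as ␀ rather than disappearing:

```python
# Sketch: make control characters visible in emailed/tooted SMS by
# mapping C0 controls (U+0000..U+001F) to Unicode Control Pictures
# (U+2400..U+241F), and DEL (U+007F) to U+2421.
def visible(text: str) -> str:
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x20:
            out.append(chr(0x2400 + cp))   # e.g. NUL -> U+2400 "␀"
        elif cp == 0x7F:
            out.append("\u2421")           # DEL -> "␡"
        else:
            out.append(ch)
    return "".join(out)

print(visible("PIN\x001234"))  # -> PIN␀1234
```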

MAINTENANCE Assumed Completed Broadband
AFFECTING
Broadband
STARTED
Jan 19, 03:50 PM (4 months ago)
DESCRIPTION

This is a summary and update regarding the problems we've been having with our network, causing line drops for some customers, interrupting their Internet connections for a few minutes at a time. It carries on from the earlier, now out of date, post: https://aastatus.net/42577

We are not only an Internet Service Provider.

We also design and build our own routers under the FireBrick brand. This equipment is what we predominantly use in our own network to provide Internet services to customers. These routers are installed between our wholesale carriers (e.g. BT, CityFibre and TalkTalk) and the A&A core IP network. The type of router is called an "LNS", which stands for L2TP Network Server.

FireBricks are also deployed elsewhere in the core, providing our L2TP and Ethernet services, as well as facing the rest of the Internet as BGP routers to multiple transit feeds, Internet exchanges and CDNs.

Throughout the entire existence of A&A as an ISP, we have been running various models of FireBrick in our network.

Our newest model is the FB9000. We have been running a mix of prototype, pre-production and production variants of the FB9000 within our network since early 2022.

As can sometimes happen with a new product, at a certain point we started to experience some strange behaviour: essentially, the hardware would lock up, "watchdog" and reboot unpredictably.

Compared to a software crash, a hardware lock-up is very hard to diagnose, as little information is obtainable when it happens. If the FireBrick software ever crashes, a 'core dump' is posted with specific information about where the software problem happened, which makes it a lot easier to find and fix.

After intensive work by our developers, the cause was identified as, unexpectedly, something to do with the NVMe socket on the motherboard. At design time we had included an NVMe socket connected to the PCIe pins on the CPU, for possible future uses, but we did not populate it. The hanging issue completely cleared up once an NVMe drive was installed, even though it was not used for anything at all.

As a second approach, the software was then modified to force the PCIe interface off, so that we would not need to install NVMe drives in all the units.

This did solve the problem in our test rig (multiple FB9000s, PCs to generate traffic, switches, etc.). FireBricks which had formerly hung often under "artificially worsened" test conditions stopped hanging altogether for several weeks, becoming extremely stable.

So, we thought the problem was resolved. And, indeed, in our test rig we still have not seen a hang. Not even once, across multiple FB9000s.

However...

We did then start seeing hangs in our live prototype units in production (causing dropouts for our broadband customers).

At the same time, the FB9000s we have elsewhere in our network, not running as LNS routers, are stable.

We are still working on pinpointing the cause of this, which we think is highly likely to be related to the original (now solved) problem.

Further work...

Over the next 1-2 weeks we will be installing several extra FB9000 LNS routers. We are installing these with additional low-level monitoring capabilities in the form of JTAG connections from the main PCB so that in the event of a hardware lock-up we can directly gather more information.

The enlarged pool of LNSs will also reduce the number of customers affected if there is a lock-up of one LNS.

We obviously do apologise for the blips customers have been seeing. We do take this very seriously, and are not happy when customers are inconvenienced.

We can imagine some customers might also be wondering why we bother to make our own routers, and not just do what almost all other ISPs do, and simply buy them from a major manufacturer. This is a fair question. At times like this, it is a question we ask ourselves!

Ultimately, we do still firmly believe the benefits of having the FireBrick technology under our complete control outweigh the disadvantages. CQM graphs are still almost unique to us, and these would simply not be possible without FireBrick. There have also been numerous individual cases where our direct control over the firmware has enabled us to implement individual improvements and changes that have benefitted one or many customers.

Many times over the years we have been able to diagnose problems with our carrier partners, which they themselves could not see or investigate. This level of monitoring is facilitated by having FireBricks.

But in order to have finished FireBricks, we have to develop them. And development involves testing, and testing can sometimes reveal problems, which then affect customers.

We do not feel we were irrationally premature in introducing prototype FireBricks into our network, having had them under test, not routing live customer traffic, for an appropriate period beforehand.

But some problems can only reveal themselves once a "real world" level and nature of traffic is being passed. This is unavoidable, and whilst we do try hard to minimise disruption, we still feel the long-term benefits of having FireBricks more than offset the short-term problems in the late stages of development. We hope our detailed view on this is informative, and even persuasive.


MINOR Closed TalkTalk
AFFECTED
TalkTalk
STARTED
May 21, 09:01 PM (23¾ hours ago)
CLOSED
May 21, 09:08 PM (23¾ hours ago)
DESCRIPTION
We've seen a small number of TalkTalk lines drop their connection at around 9pm. These are in specific areas, such as Andover, Nottingham, Ripley and a couple of others. We suspect a fault within TalkTalk's network. Usually with a fault like this the lines will reconnect via alternative routes almost immediately, but so far they have not. Updates to follow.
Resolution:

MINOR Closed BT
AFFECTED
BT
STARTED
May 15, 12:46 PM (7¼ days ago)
CLOSED
May 15, 01:00 PM (7¼ days ago)
DESCRIPTION
At 12:46 we saw a small number of BT circuits drop and reconnect. We suspect a problem within BT's network, but we're awaiting more information from them. Lines reconnected right away.
Resolution: Lines reconnected right away; BT were unable to identify the cause at this point.

MINOR Closed Graphs
AFFECTED
Graphs
STARTED
May 05, 05:00 PM (17 days ago)
CLOSED
May 07, 04:00 PM (15 days ago)
DESCRIPTION
Some customers are missing CQM graphs from over the weekend. This is due to an upset disk array.
Resolution:

NEWS Info VoIP and SIMs
AFFECTING
VoIP and SIMs
STARTED
May 04, 02:52 PM (18¼ days ago)
DESCRIPTION
We are still working on a number of the minor details, but we now have the main ordering in place for the new SIP2SIM service. https://order.aa.net.uk/simorder.cgi?sim=ONSIM Please let the Trial team (trial@aa.net.uk) know of any issues ordering or using the new service. We have physical SIMs to ship (with nano-SIM knock-out) and instant eSIMs now. We expect data allowances soon, and more flexibility with numbers linked to SIMs.

MINOR Closed SIP2SIM
AFFECTED
SIP2SIM
STARTED
May 02, 11:32 AM (20¼ days ago)
CLOSED
May 03, 08:53 AM (19½ days ago)
DESCRIPTION
We have had reports of some issues with calls to SIP2SIM mobiles. This is being investigated. Note that SMS and calls from mobiles are unaffected.
Resolution: This looks like it is resolved now, with proper fixed firewall rules at the carrier. I have added a free day off the next bill, thank you for your patience.

MINOR Closed BT
AFFECTED
BT
STARTED
May 02, 01:30 PM (20¼ days ago)
CLOSED
May 02, 02:30 PM (20¼ days ago)
DESCRIPTION
At around 13:30 we saw a small number of BT lines drop and reconnect. Customers are back online, we're investigating the cause.
Resolution:

MINOR Closed VoIP and SIMs
AFFECTED
VoIP and SIMs
STARTED
May 02, 11:57 AM (20¼ days ago)
CLOSED
May 02, 12:05 PM (20¼ days ago)
DESCRIPTION
One of our upstream carriers has call issues. We've routed calls away from them as much as possible, but this may still affect some inbound calls. The symptoms are unexpected call rejections or audio issues. They're aware and are working to fix it as soon as possible.
Resolution: One of our upstream carriers reported: at 11:04 hrs BST the call handling units in one of our nodes started to behave unexpectedly, causing some audio issues and call failures. We are investigating some unusual traffic received at that time. The units were restarted starting at 11:22 hrs BST at which point traffic quickly restored to normal. The remaining units were fully restarted by 11:44 hrs BST when we were back at full redundancy.