Our Network:

Outages:
  Major Outages: 0
  Minor Outages: 1

Maintenance:
  Happening Now: 2
  Future Planned: 3
Open Events:
STATUS
Open
CREATED
Feb 14, 09:18 AM (1 month ago)
AFFECTING
Hetzner
STARTED
Feb 13, 07:00 PM (1 month ago)
REFERENCE
42623 / AA42623
INFORMATION
  • INITIAL
    1 month ago by Andrew

    Hetzner is a German-based server hosting provider. We have seen intermittent problems in routing traffic to them in recent days.

  • UPDATE
    1 month ago by Andrew

    The problem we see is that we send traffic to Hetzner's routers via the LINX Internet Exchange, and traffic goes no further. We have opened a ticket with LINX and with Hetzner regarding this.

  • UPDATE
    1 month ago by Andrew

    LINX say they have no reported problems with Hetzner.

  • UPDATE
    1 month ago by Andrew

    We have heard that another UK ISP may be seeing similar problems with routing to Hetzner, but we've not been able to verify this ourselves.

  • UPDATE
    27¾ days ago by Andrew

    The problem hasn't reoccurred, but we'll keep this post open for a week or so in case it comes back.

  • UPDATE
    12½ days ago by Andrew

    Issue reopened - we see routing problems again today (March 6th).

  • UPDATE
    12½ days ago by Andrew

    We've contacted Hetzner to further investigate this. It seems traffic leaving our network for Hetzner via either of our two LINX routers goes no further. Traffic leaving Hetzner for our network also goes via LINX and likewise goes no further.

  • UPDATE
    12½ days ago by Andrew

    Traffic is working again, at the moment...

  • UPDATE
    10½ days ago by Andrew

    There have been three occasions where we've had routing problems to/from Hetzner: Feb 9th, Feb 13th/14th, March 6th.

    Whilst the latest routing problem resolved itself after about 30 minutes we are still in conversation with LINX and Hetzner regarding this.

  • NEXT UPDATE...

    Due 7½ days ago (overdue)
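The symptom described in the updates above (traffic reaching the exchange and going no further) is the classic pattern of a run of timed-out hops at the end of a traceroute. As a minimal illustrative sketch only (not A&A's actual tooling, with made-up hostnames and example addresses), finding the last hop that still responded:

```python
# Illustrative sketch: given traceroute-style output, report the last hop
# that responded. Trailing "* * *" hops after an exchange-facing router
# are consistent with traffic being black-holed beyond that point.

SAMPLE = """\
 1  gw.example.net (192.0.2.1)  0.4 ms
 2  linx-peer.example.net (195.66.224.10)  1.2 ms
 3  * * *
 4  * * *
"""

def last_responding_hop(trace: str):
    """Return the text of the last hop line that is not all timeouts."""
    last = None
    for line in trace.splitlines():
        fields = line.split()
        # A hop of only "*" markers means all probes timed out.
        if fields and set(fields[1:]) != {"*"}:
            last = line.strip()
    return last

print(last_responding_hop(SAMPLE))  # the hop at the exchange, hop 2
```

Running the same check from both ends (as the update describes, traffic dies beyond LINX in both directions) is what points at the peering session rather than either network's core.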

STATUS
Open
CREATED
Mar 11, 05:01 PM (7½ days ago)
AFFECTING
Router Upgrades
STARTED
Mar 12, 03:00 AM (7 days ago)
REFERENCE
42637 / AA42637
INFORMATION
  • INITIAL
    7 days ago by Andrew

    We will be performing software upgrades on our BGP routers. These are scheduled for between 3AM and 4:30AM, Tuesday-Saturday this week. This is to bring all our routers up to the same software level, which introduces a few minor feature updates. This work is not expected to impact customers.

  • UPDATE
    6¾ days ago by Andrew

    Two routers were upgraded successfully early on Tuesday morning; upgrades to the other routers are scheduled for the rest of this week (3AM-4:30AM).

  • UPDATE
    17¼ hours ago by Andrew

    We've a few more routers to upgrade, so this work will continue between 3AM and 4:30AM in the week beginning 18 March.

  • NEXT UPDATE...

    Due in 2 days

EXPECTED CLOSE
Mar 21, 04:30 AM ( In 1¾ days )
STATUS
Open
CREATED
Jan 19, 03:55 PM (2 months ago)
AFFECTING
Broadband
STARTED
Jan 19, 03:50 PM (2 months ago)
REFERENCE
42608 / AA42608
INFORMATION
  • INITIAL
    2 months ago by Andrew

    This is a summary and update regarding the problems we've been having with our network, causing line drops for some customers, interrupting their Internet connections for a few minutes at a time. It carries on from the earlier, now out of date, post: https://aastatus.net/42577

    We are not only an Internet Service Provider.

    We also design and build our own routers under the FireBrick brand. This equipment is what we predominantly use in our own network to provide Internet services to customers. These routers are installed between our wholesale carriers (e.g. BT, CityFibre and TalkTalk) and the A&A core IP network. The type of router is called an "LNS", which stands for L2TP Network Server.

    FireBricks are also deployed elsewhere in the core, providing our L2TP and Ethernet services, as well as facing the rest of the Internet as BGP routers to multiple transit feeds, Internet Exchanges and CDNs.

    Throughout the entire existence of A&A as an ISP, we have been running various models of FireBrick in our network.

    Our newest model is the FB9000. We have been running a mix of prototype, pre-production and production variants of the FB9000 within our network since early 2022.

    As can sometimes happen with a new product, at a certain point we started to experience some strange behaviour; essentially the hardware would lock up and "watchdog" (and reboot) unpredictably.

    Compared to a software 'crash', a hardware lock-up is very hard to diagnose, as little information is obtainable when it happens. If the FireBrick software ever crashes, a 'core dump' is posted with specific information about where the software problem happened, which makes it a lot easier to find and fix.

    After intensive work by our developers, the cause was identified as (unexpectedly) something to do with the NVMe socket on the motherboard. At design time, we had included an NVMe socket connected to the PCIe pins on the CPU, for possible future uses. We did not populate the NVMe socket, though. The hanging issue completely cleared up once an NVMe drive was installed, even though it was not used for anything at all.

    As a second approach, the software was then modified to force the PCIe interface off, so that we would not need to install NVMe drives in all the units.

    This certainly did solve the problem in our test rig (which comprises multiple FB9000s, PCs to generate traffic, switches, etc.). For several weeks, FireBricks which had formerly been hanging often under "artificially worsened" test conditions stopped hanging altogether, becoming extremely stable.

    So, we thought the problem was resolved. And, indeed, in our test rig we still have not seen a hang. Not even once, across multiple FB9000s.

    However...

    We did then start seeing hangs in our live prototype units in production (causing dropouts for our broadband customers).

    At the same time, the FB9000s we have elsewhere in our network, not running as LNS routers, are stable.

    We are still working on pinpointing the cause of this, which we think is highly likely to be related to the original (now, solved) problem.

    Further work...

    Over the next 1-2 weeks we will be installing several extra FB9000 LNS routers. We are installing these with additional low-level monitoring capabilities in the form of JTAG connections from the main PCB so that in the event of a hardware lock-up we can directly gather more information.

    The enlarged pool of LNSs will also reduce the number of customers affected if there is a lock-up of one LNS.

    We obviously do apologise for the blips customers have been seeing. We do take this very seriously, and are not happy when customers are inconvenienced.

    We can imagine some customers might also be wondering why we bother to make our own routers, and not just do what almost all other ISPs do, and simply buy them from a major manufacturer. This is a fair question. At times like this, it is a question we ask ourselves!

    Ultimately, we do still firmly believe the benefits of having the FireBrick technology under our complete control outweigh the disadvantages. CQM graphs are still almost unique to us, and these would simply not be possible without FireBrick. There have also been numerous individual cases where our direct control over the firmware has enabled us to implement individual improvements and changes that have benefitted one or many customers.

    Many times over the years we have been able to diagnose problems with our carrier partners, which they themselves could not see or investigate. This level of monitoring is facilitated by having FireBricks.

    But in order to have finished FireBricks, we have to develop them. And development involves testing, and testing can sometimes reveal problems, which then affect customers.

    We do not feel we were irrationally premature in introducing prototype FireBricks into our network, having had them under test, not routing live customer traffic, for an appropriate period beforehand.

    But some problems can only reveal themselves once a "real world" level and nature of traffic is being passed. This is unavoidable, and whilst we do try hard to minimise disruption, we still feel the long-term benefits of having FireBricks more than offset the short-term problems in the late stages of development. We hope our detailed view on this is informative, and even persuasive.

  • UPDATE
    1¼ months ago by Andrew

    5th Feb: Both Z and Y have hung in recent days (Saturday 3rd and Monday 5th) - we are currently analysing the data from the various cache and memory systems that we were able to retrieve from the hardware whilst it was in its hung state.

  • UPDATE
    1¼ months ago by Andrew

    Latest summary, as of 9th February: We now have a larger pool of FB9000 LNSs. Six of the seven have been fitted with NVMe drives and JTAG debugging capabilities, so if/when they have a hardware lock-up we'll be able to gain a bit more of an insight into the cause. The seventh LNS has not, but it has been stable with an uptime of 86 days.

  • UPDATE
    18 days ago by Andrew

    Work being carried out:

  • NEXT UPDATE...

    Due 14½ days ago (overdue)

Broadband blip graph

The graph shows the last few hours of logins and logouts of ADSL, VDSL, SIMs and L2TP circuits.

The current time is on the left. Green is login, red is logout.

If there are spikes, then this shows a large number of logouts, which may indicate an outage or planned work happening.

You can click on a spike to search for incidents or maintenance that were open around that time.
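The "spike" reading described above (a burst of logouts well above the background churn) can be sketched in a few lines. This is an illustrative example only, with made-up counts and thresholds, not the code behind this page:

```python
# Illustrative sketch: flag samples where the logout count spikes well
# above the recent average, the same idea as a visible red spike on the
# blip graph. Thresholds here are arbitrary example values.

def spike_minutes(logouts, window=5, factor=4.0, floor=10):
    """Return indices whose logout count exceeds `factor` times the
    average of the previous `window` samples, and an absolute floor."""
    spikes = []
    for i in range(window, len(logouts)):
        baseline = sum(logouts[i - window:i]) / window
        if logouts[i] >= floor and logouts[i] > factor * max(baseline, 1.0):
            spikes.append(i)
    return spikes

# Mostly-quiet line churn with one mass-logout event at index 8.
counts = [2, 1, 3, 2, 2, 1, 2, 3, 250, 4]
print(spike_minutes(counts))  # the spike at index 8
```

The absolute floor avoids flagging ordinary low-level churn, since a handful of lines logging out is normal at any time of day.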

About our status page

This is the status page of Andrews & Arnold Ltd.

Our status page shows outages (problems) and maintenance (planned work) affecting our own network and systems, as well as those of our suppliers' networks and systems. We try to ensure this site is updated as soon as possible as incidents happen. Live discussion of issues is usually available on IRC.

The last update was Yesterday 14:49:44

Contacting us
Our support number is 033 33 400 999, or you can email support@aa.net.uk or text 01344 400 999 to raise a support ticket.

Spotted a Major Service Outage? (MSO)
A Major Service Outage disrupts the service of multiple customers simultaneously. If you believe that a problem affects multiple customers, and is not mentioned here already, text the number above. Begin the text with "MSO". This alerts multiple staff immediately, waking them if necessary. False alarms (i.e. raising MSO for a single line being down) may result in your number being prevented from raising MSO alerts in future. More info.

Regular Maintenance
Thursday evenings, from 10pm, are designated as a general maintenance window where we will perform non-service affecting updates.