Andrews & Arnold Incidents Feedhttps://aastatus.net/atom.cgi2024-03-19T12:59:30ZAndrews & Arnoldsupport@aaisp.net.uk[PEW] TalkTalk: TalkTalk planned work on one of our interlinks (Open)https://aastatus.net/426452024-03-19T12:59:30Z2024-03-19T12:57:52Z
<b>Started: 2024-04-09 03:00:00</b><br />
<p>We have multiple interlinks to TalkTalk that carry our broadband traffic. TalkTalk have scheduled planned work on our links in our Equinix LD8 datacentre for 11th April between 1AM and 6AM. </p><p>So as to minimise the impact on our customers, we will move traffic off these links on 9th April at 3AM. This should be seamless, but there is a risk of some customers having a brief interruption to their service.</p>
[PEW] TalkTalk: TalkTalk planned work on one of our interlinks (Open)https://aastatus.net/426412024-03-19T12:54:44Z2024-03-14T14:40:22Z
<b>Started: 2024-03-20 03:00:00</b><br />
<p>We have multiple interlinks to TalkTalk that carry our broadband traffic. TalkTalk have scheduled planned work on our links in our Telehouse datacentre for 21st March between 1AM and 6AM. </p><p>So as to minimise the impact on our customers, we will move traffic off these links on 20th March at 3AM. This should be seamless, but there is a risk of some customers having a brief interruption to their service.</p>
<b>Update expected: 2024-03-20 03:30:00</b>
[PEW] Router Upgrades: Overnight router upgrades (Open)https://aastatus.net/426372024-03-18T12:09:08Z2024-03-11T17:01:59Z
<b>Started: 2024-03-12 03:00:00</b><br />
We will be performing software upgrades on our BGP routers. These will be scheduled for between 3AM and 4:30AM Tuesday-Saturday this week. This is to bring all our routers up to the same level with software that introduces a few minor feature updates. This work is not expected to impact customers.
<br /><br /><b>Update 12 Mar 2024 07:27:35</b> Two routers were upgraded successfully early on Tuesday morning; upgrades to the other routers are scheduled for the rest of this week (3AM-4:30AM).<br /><br /><b>Update 18 Mar 2024 12:08:55</b> We've a few more routers to upgrade, so this will continue between 3AM and 4:30AM during the week beginning 18 March.
<b>Update expected: 2024-03-21 10:00:00</b>
[Minor] VoIP: VoIP Call problemshttps://aastatus.net/426442024-03-15T12:15:52Z2024-03-15T11:13:11Z
<b>Started: 2024-03-15 11:10:00</b><br />
We're investigating reports of call problems with one of our voice servers.
<br /><br /><b>Update 15 Mar 2024 11:23:09</b> One of our call servers (b.voiceless) hit 100% CPU due to a huge number of calls that did not clear down properly, which affected the service. These calls have now cleared and investigations are underway to find the cause of the calls.
[Minor] LNS: Lines on x.witless reconnectedhttps://aastatus.net/426432024-03-15T07:56:07Z2024-03-15T07:39:40Z
<b>Started: 2024-03-15 07:00:00</b><br />
At 7:30AM, the X.Witless LNS restarted, causing customers on it to drop and reconnect.
<br /><br /><b>Update 15 Mar 2024 07:40:26</b> Lines reconnected by 7:33
<br /><br /><b>Resolution</b> This was related to https://aastatus.net/42608. This LNS is now out of service and will be analysed by our developers.
[PEW] LNS: LNS Move - Z.Witlesshttps://aastatus.net/426422024-03-15T07:37:50Z2024-03-14T15:32:10Z
<b>Started: 2024-03-15 04:00:00</b><br />
We will be moving lines off the Z.Witless LNS at 4AM. They will reconnect to a different LNS.
<br /><br /><b>Resolution</b> This work was completed.
[PEW] BT: BT planned work on one of our interlinks (Open)https://aastatus.net/426402024-03-14T14:37:44Z2024-03-14T14:37:44Z
<b>Started: 2024-03-25 03:00:00</b><br />
<p>We have multiple interlinks to BT that carry our broadband traffic. BT have scheduled planned work on our links in our Harbour Exchange Square datacentre for 27th March between midnight and 6AM. </p><p>So as to minimise the impact on our customers, we will move traffic off these links on 25th March at 3AM. This should be seamless, but the last time we attempted this, BT had a misconfiguration which caused some customers to drop their connection!</p>
<b>Update expected: 2024-03-25 03:30:00</b>
[PEW] L2TP: L2TP Router Upgradehttps://aastatus.net/426382024-03-14T14:09:41Z2024-03-12T16:12:55Z
<b>Started: 2024-03-14 03:00:00</b><br />
We will be performing software upgrades on our L2TP routers - l2tp.aa.net.uk. These will be scheduled for between 3AM and 4:30AM on Thursday this week. L2TP customers will see their connection drop and reconnect twice during this period.
<br /><br /><b>Resolution</b> This was completed at 03:10
[PEW] LNS: V.Gormless & W.Gormlesshttps://aastatus.net/426392024-03-13T10:21:15Z2024-03-12T16:30:44Z
<b>Started: 2024-03-13 03:00:00</b><br />
We will be performing overnight upgrades of V.Gormless and W.Gormless on 13th and 14th March between 3AM and 4:30AM. Customers on these will see their connection drop and reconnect a few seconds later.
<br /><br /><b>Resolution</b> This has been completed.
[Minor] LNS: Lines on x.witless reconnectedhttps://aastatus.net/426362024-03-09T12:11:12Z2024-03-09T12:04:47Z
<b>Started: 2024-03-09 11:35:00</b><br />
Customers on the X.Witless LNS dropped and reconnected at 11:35 today.
<br /><br /><b>Resolution</b> This is related to the ongoing LNS hangs we've been seeing: https://aastatus.net/42608. We do apologise to customers affected by this. This incident does help towards diagnosing and investigating the root cause.
[Minor] Hetzner: Intermittent routing problems to Hetzner (Open)https://aastatus.net/426232024-03-09T12:09:42Z2024-02-14T09:18:18Z
<b>Started: 2024-02-13 19:00:00</b><br />
Hetzner is a German-based server hosting provider. We have seen intermittent problems in routing traffic to them in recent days.
<br /><br /><b>Update 14 Feb 2024 09:19:42</b> The problem we see is that we send traffic to Hetzner's routers via the LINX Internet Exchange, and traffic goes no further. We have opened a ticket with LINX and with Hetzner regarding this.<br /><br /><b>Update 14 Feb 2024 09:20:52</b> LINX say they have no reported problems with Hetzner. <br /><br /><b>Update 14 Feb 2024 09:20:57</b> We have heard that another UK ISP may be seeing similar problems with routing to Hetzner, but we've not been able to verify this ourselves.<br /><br /><b>Update 20 Feb 2024 10:39:52</b> The problem hasn't reoccurred, but we'll keep this post open for a week or so in case it comes back.<br /><br /><b>Update 6 Mar 2024 13:21:09</b> Issue reopened - we see routing problems again today (March 6th). <br /><br /><b>Update 6 Mar 2024 13:23:00</b> We've contacted Hetzner to further investigate this. It seems traffic leaving our network for Hetzner via either of our two LINX routers goes no further. Traffic leaving Hetzner for our network also goes via LINX and goes no further.<br /><br /><b>Update 6 Mar 2024 13:36:03</b> Traffic is working again, at the moment... <br /><br /><b>Update 8 Mar 2024 12:07:32</b> <p>There have been three occasions where we've had routing problems to/from Hetzner: Feb 9th, Feb 13th/14th, March 6th.</p>
<p>Whilst the latest routing problem resolved itself after about 30 minutes, we are still in conversation with LINX and Hetzner regarding this. </p>
<b>Update expected: 2024-03-11 14:00:00</b>
[Minor] SMS: PEW affecting SMS via Stour Marinehttps://aastatus.net/426342024-03-08T09:38:21Z2024-03-06T16:01:28Z
<b>Started: 2024-03-07 22:00:00</b><br />
SMSC nodes' firewall upgrade to improve QoS and High Availability. This activity is performed together with the vendor Engineers. Expected Impact: 4 hours.
[PEW] CityFibre: CityFibre Planned work - Multiple areas (Open)https://aastatus.net/426352024-03-07T09:27:27Z2024-03-07T09:27:27Z
<b>Started: 2024-03-27 00:01:00</b><br />
<p>CityFibre are carrying out work that will affect CityFibre connections in Maidenhead, Luton, Leicester, Kettering, Gloucester, Coventry, Glasgow, Bournemouth, Milton Keynes, Newcastle Upon Tyne, Northampton, Norwich, Peterborough, Plymouth, Poole, Reading, Rugby, Solihull, Swindon, Wakefield and Wolverhampton.</p>
<p>Customers may experience a momentary loss of service ranging from a couple of seconds up to a maximum of 30 seconds several times during the maintenance window.</p>
<b>Update expected: 2024-03-27 10:00:00</b>
[Minor] Data SIMs: Some Data SIM dropshttps://aastatus.net/426332024-03-05T16:55:39Z2024-03-04T22:50:47Z
<b>Started: 2024-03-04 22:45:00</b><br />
We saw Data SIMs drop and reconnect at around 22:45 this evening. This looks to be caused by something upstream of us at the carrier.
<br /><br /><b>Update 4 Mar 2024 22:51:15</b> The majority of SIMs reconnected within a minute or two.
<br /><br /><b>Resolution</b> Services are back online. This was planned work by the upstream carrier.
[Minor] BT DSL: Some BT line drops 1AM 2AMhttps://aastatus.net/426322024-03-01T08:04:20Z2024-03-01T08:02:49Z
<b>Started: 2024-03-01 01:00:00</b><br />
BT carried out planned work that affected one of our 4 links to them at 1AM and 2AM. This caused lines to drop and reconnect. BT failed to inform us of this work (again). Had BT informed us, we would have cleanly moved traffic off the affected link.
<br /><br /><b>Resolution</b> A formal complaint has been raised with BT. (Drops between 3AM and 5AM were A&A planned work)
[PEW] Broadband: Work to help resolve recent LNS problems (Updated 9th Feb) (Open)https://aastatus.net/426082024-03-01T04:46:54Z2024-01-19T15:55:13Z
<b>Started: 2024-01-19 15:50:00</b><br />
<p>
This is a summary and update regarding the problems we've been having with our network, causing line drops for some customers, interrupting their Internet connections for a few minutes at a time. It carries on from the earlier, now out of date, post: https://aastatus.net/42577
</p><p>
We are not only an Internet Service Provider.
</p><p>
We also design and build our own routers under the FireBrick brand. This equipment is what we predominantly use in our own network to provide Internet services to customers. These routers are installed between our wholesale carriers (e.g. BT, CityFibre and TalkTalk) and the A&A core IP network. The type of router is called an "LNS", which stands for L2TP Network Server.
</p><p>
FireBricks are also deployed elsewhere in the core; providing our L2TP and Ethernet services, as well as facing the rest of the Internet as BGP routers to multiple Transit feeds, Internet Exchanges and CDNs.
</p><p>
Throughout the entire existence of A&A as an ISP, we have been running various models of FireBrick in our network.
</p><p>
Our newest model is the FB9000. We have been running a mix of prototype, pre-production and production variants of the FB9000 within our network since early 2022.
</p><p>
As can sometimes happen with a new product, at a certain point we started to experience some strange behaviour; essentially the hardware would lock-up and "watchdog" (and reboot) unpredictably.
</p><p>
Compared to a software 'crash' a hardware lock-up is very hard to diagnose, as little information is obtainable when this happens. If the FireBrick software ever crashes, a 'core dump' is posted with specific information about where the software problem happened. This makes it a lot easier to find and fix.
</p><p>
After intensive work by our developers, the cause was identified as (unexpectedly) something to do with the NVMe socket on the motherboard. At design time, we had included an NVMe socket connected to the PCIe pins on the CPU, for undecided possible future uses. We did not populate the NVMe socket, though. The hanging issue completely cleared up once an NVMe drive was installed, even though it was not used for anything at all.
</p><p>
As a second approach, the software was then modified to force the PCIe to be switched off such that we would not need to install NVMes in all the units.
</p><p>
This certainly did solve the problem in our test rig (which is multiple FB9000s, PCs to generate traffic, switches etc). For several weeks, FireBricks which had formerly been hanging often in "artificially worsened" test conditions literally stopped hanging altogether, becoming extremely stable.
</p><p>
So, we thought the problem was resolved. And, indeed, in our test rig we still have not seen a hang. Not even once, across multiple FB9000s.
</p><p>
However...
</p><p>
We did then start seeing hangs in our Live prototype units in production (causing dropouts to our broadband customers).
</p><p>
At the same time, the FB9000s we have elsewhere in our network, not running as LNS routers, are stable.
</p><p>
We are still working on pinpointing the cause of this, which we think is highly likely to be related to the original (now solved) problem.
</p><p>
Further work...
</p><p>
Over the next 1-2 weeks we will be installing several extra FB9000 LNS routers. We are installing these with additional low-level monitoring capabilities in the form of JTAG connections from the main PCB so that in the event of a hardware lock-up we can directly gather more information.
</p><p>
The enlarged pool of LNSs will also reduce the number of customers affected if there is a lock-up of one LNS.
</p><p>
We obviously do apologise for the blips customers have been seeing. We do take this very seriously, and are not happy when customers are inconvenienced.
</p><p>
We can imagine some customers might also be wondering why we bother to make our own routers, and not just do what almost all other ISPs do, and simply buy them from a major manufacturer. This is a fair question. At times like this, it is a question we ask ourselves!
</p><p>
Ultimately, we do still firmly believe the benefits of having the FireBrick technology under our complete control outweigh the disadvantages. CQM graphs are still almost unique to us, and these would simply not be possible without FireBrick. There have also been numerous individual cases where our direct control over the firmware has enabled us to implement individual improvements and changes that have benefitted one or many customers.
</p><p>
Many times over the years we have been able to diagnose problems with our carrier partners, which they themselves could not see or investigate. This level of monitoring is facilitated by having FireBricks.
</p><p>
But in order to have finished FireBricks, we have to develop them. And development involves testing, and testing can sometimes reveal problems, which then affect customers.
</p><p>
We do not feel we were irrationally premature in introducing prototype FireBricks into our network, having had them under test not routing live customer traffic for an appropriate period beforehand.
</p><p>
But some problems can only reveal themselves once a "real world" level and nature of traffic is being passed. This is unavoidable, and whilst we do try hard to minimise disruption, we still feel the long-term benefits of having FireBricks more than offset the short-term problems in the late stages of development. We hope our detailed view on this is informative, and even persuasive.
</p>
<br /><br /><b>Update 1 Mar 2024 04:45:00</b> <b>Work being carried out:</b>
<ul>
<li><s>Action point: Replacement of three LNSs on Tuesday 23rd January:</s> https://aastatus.net/42609 <b>Completed</b> </li>
<li><s>Action point: Work on Z.Witless from Sunday 28th January:</s> https://aastatus.net/42614 <b>Completed</b></li>
<li><s>Action point: Install a new LNS in to the pool. 2nd February:</s> https://aastatus.net/42615 <b>Completed</b></li>
<li><s>Action point: Upgrade A, B, C Gormless to FB9000: Work starting from Saturday 3rd February:</s> https://aastatus.net/42616 <b>Completed</b></li>
<li><s>Action point: Spread customer connections over the new LNSs: 10th and 11th February:</s> https://aastatus.net/42620 <b>Completed</b></li>
<li><s>Action point: Software upgrades and spread customer connections over the LNSs: 17th and 18th February:</s> https://aastatus.net/42626</li>
<li><s>Action point: Software upgrades and separate CityFibre and BT/TalkTalk connections: 1st March:</s> https://aastatus.net/42630 <b>Completed</b></li>
</ul><br /><br /><b>Update 9 Feb 2024 16:50:00</b> <b>Latest Summary, as of 9th February:</b> We now have a larger pool of FB9000 LNSs. Six out of seven of them have been fitted with NVMe drives and JTAG debugging capabilities. If/when they have a hardware lock-up we'll be able to gain a bit more of an insight in to the cause. The seventh LNS has not, but it has been stable with an uptime of 86 days.<br /><br /><b>Update 5 Feb 2024 20:30:10</b> <b>5th Feb:</b> Both Z and Y have hung in recent days (Saturday 3rd and Monday 5th) - we are currently analysing the data from the various cache and memory systems that we were able to retrieve from the hardware whilst it was in its hung state.
<b>Update expected: 2024-03-04 13:00:00</b>
[PEW] LNS: Overnight LNS software upgrades and shufflinghttps://aastatus.net/426302024-03-01T04:44:53Z2024-02-29T11:32:19Z
<b>Started: 2024-03-01 03:00:00</b><br />
<p>We have work planned for the early hours of Friday morning that entails upgrading software on our LNSs and moving CityFibre and higher-speed BT/TalkTalk services on to separate pools of routers (LNSs) at our side.</p>
<p>In practice this will mean that most customers with speeds of 80Mb/s and above will experience a few PPP drops and reconnects between 3AM and 5AM as we carry out the work.</p>
<p>This is related to the hardware hangs we've been experiencing: https://aastatus.net/42608 and it will help us further investigate this ongoing issue.</p>
<br /><br /><b>Update 1 Mar 2024 02:59:50</b> This work is about to start.
<br /><br /><b>Resolution</b> This work has been completed.
[Minor] VoIP and SIMs: SIP2SIM Voice/SMS problemshttps://aastatus.net/426312024-03-01T02:59:12Z2024-02-29T14:20:21Z
<b>Started: 2024-02-29 14:17:00</b><br />
Our SIP2SIM carrier are investigating a problem with voice and SMS; this is affecting the O2 profile only.
We'll update this post as soon as we have further information.
<br /><br /><b>Update 29 Feb 2024 15:54:23</b> This also appears to be affecting Manx and VFNL profiles
<br /><br /><b>Resolution</b> Our service provider has advised that the issue was resolved at 18:25.
[MSO] Broadband: Some line drops at noonhttps://aastatus.net/426292024-02-27T12:41:52Z2024-02-27T12:05:54Z
<b>Started: 2024-02-27 12:00:00</b><br />
A number of lines dropped and reconnected at around noon. We're investigating.
<br /><br /><b>Update 27 Feb 2024 12:10:04</b> Lines on the X.Witless LNS were affected, sessions are recovering.<br /><br /><b>Update 27 Feb 2024 12:18:54</b> The majority of lines are now back online.
<br /><br /><b>Resolution</b> <p>The X.Witless LNS hung and restarted, which caused customers to disconnect and reconnect. </p> <p>This incident is related to https://aastatus.net/42608. X.Witless had been running without incident for 104 days. However, it is not fitted with an NVMe drive and was running software that pre-dates our NVMe drive fixes. We suspect the hang was caused by these two factors. </p> <p>Further work on our LNSs is being planned and updates will be posted to the status page in due course.</p>
[Minor] DATA SIMs: Some Data SIM dropshttps://aastatus.net/426282024-02-27T12:13:28Z2024-02-26T12:44:01Z
<b>Started: 2024-02-26 12:35:00</b><br />
We're seeing a large number of Data SIMs disconnect and reconnect. This looks to be upstream of us, possibly in the mobile network.
<br /><br /><b>Update 26 Feb 2024 13:51:37</b> The service has been stable since 12:50
<b>Update expected: 2024-02-26 13:00:00</b>
[Minor] SMS: Problems with SMShttps://aastatus.net/426272024-02-19T16:19:41Z2024-02-19T09:11:38Z
<b>Started: 2024-02-19 00:36:00</b><br />
We're investigating problems with inbound/outbound SMS. (Outbound SMS will work, but our SMS API will take a while to respond due to timeouts with our main carrier before it fails over to the secondary carrier)
<br /><br /><b>Update 19 Feb 2024 09:13:21</b> One of our SMS carriers is experiencing connectivity problems within AWS (Amazon Web Services), which is causing this problem.<br /><br /><b>Update 19 Feb 2024 09:19:00</b> The connectivity problems started at 00:36 and were escalated within the hosting provider at 07:06. This is still being worked on by our carrier and their hosting provider (AWS).<br /><br /><b>Update 19 Feb 2024 10:31:30</b> From our carrier: This issue is still being worked on by our vendor's Sr. Network and IT experts to recover the service located in the AWS. As soon as we receive further updates, we will inform you accordingly.<br /><br /><b>Update 19 Feb 2024 11:30:30</b> From our carrier: This connectivity issue is still being worked on and we are focused on providing a solution as soon as possible. Our vendor support together with their upper management teams are actively working on restoring the service. <br /><br /><b>Update 19 Feb 2024 12:39:54</b> From our carrier: We are currently waiting for our vendor's teams to implement the fix for this connectivity issue.
The issue on the AWS infrastructure has now been fixed and our vendor's Sr. Engineering team is working on bringing back live the FortiGate firewall controlling all traffic.<br /><br /><b>Update 19 Feb 2024 13:51:56</b> From our carrier: We have positive feedback from our vendor's Sr. teams along with the FortiGate team regarding the Inbound traffic towards HTTP accounts which now appears to be delivering.
Our teams are still working on bringing back the entirety of the service and as soon as further updates are available, we will let you know.
<br /><br /><b>Update 19 Feb 2024 15:00:04</b> <b>Inbound SMSs are now being received</b> - including ones that were sent earlier in the day.<br /><br /><b>Update 19 Feb 2024 15:02:54</b> From our carrier: Our vendor's teams along with FortiGate teams are still working on restoring all SMSC connectivities back up and running.
<br /><br /><b>Resolution</b> Service was restored at around 3PM. We'll post further updates from our carrier as we get them. During this time outgoing SMSs were working via our secondary carrier, but our API would have taken a longer than usual amount of time to accept messages. Incoming messages sent today would have been delayed and received after 3PM.