
14 Jul 16:48:39
Details
Posted: 14 Jul 10:33:01
We've just seen BT and TT lines drop. We are investigating.
Update
14 Jul 10:57:12
This is similar to yesterday's problem. We have lost BGP connectivity to both our carriers. Over the past few minutes the BGP sessions have been going up and down, meaning customers are logging in and then out again. Updates to follow shortly.
Update
14 Jul 11:16:46
Sessions are looking a bit more stable now... customer lines are still reconnecting
Update
14 Jul 11:56:08
We have about half of the DSL lines logged in, but the remaining half are struggling due to what looks like a Layer 2 issue on our network.
Update
14 Jul 12:23:19
More DSL lines are now back up. If customers are still off, a reboot may help.
Update
14 Jul 12:35:51
A number of TT and BT lines just dropped. They are starting to reconnect now though.
Update
14 Jul 12:52:31
This is still ongoing: most DSL lines are back, but some have been dropping and the network is still unstable. We are continuing to investigate this.
Update
14 Jul 12:55:50
We managed to get a lot of lines up after around an hour, during which there was a lot of flapping. The majority of the rest (some of the TalkTalk backhaul lines) came up at 12:00, flapped a bit, and came back properly around 12:40. However, we are still trying to understand the issue and still have some problems, and we suspect there may be more outages. The problem appears to be something at layer 2 that is impacting several sites at once.
Update
14 Jul 14:41:23
We believe these outages are due to a partial failure of one of our core switches. We've moved most services away from this switch and are running diagnostics on it at the moment. We are not expecting these diagnostics to affect other services.
Update
14 Jul 17:00:40
The network has been stable for a good few hours now. One of our 40G Telehouse-to-Harbour Exchange interconnects has been taken down and some devices have been moved off the suspect switch. We have further work to do in investigating the root cause of this and deciding what we plan to do to try to stop this from happening again. We do apologise to our customers for the disruption these two outages have caused, and we are working on trying to prevent this from happening again.
Started 14 Jul 10:30:00
Closed 14 Jul 16:48:39

13 Jul 21:38:01
Details
Posted: 13 Jul 18:56:13
Multiple circuits have disconnected and reconnected, staff are investigating
Update
13 Jul 19:00:34
Sessions seem to be repeatedly flapping rather than reconnecting - staff are investigating.
Update
13 Jul 20:05:47
We are still working on this. It's a rather nasty outage, I'm afraid, and is proving difficult to track down.
Update
13 Jul 20:17:42
Lines are re-connecting now...
Update
13 Jul 20:19:16
Apologies for the loss of graphs on a.gormless. We usually assume the worst, that our kit is the cause, and so we tried a reboot of a FireBrick LNS. It did not help, but it did clarify the real cause, which was Cisco switches. Sorry for the time it took to track this one down.
Update
13 Jul 20:19:27
Some lines are being forced to reconnect so as to move them to the correct LNS. This will cause a logout/login for some customers...
Update
13 Jul 20:25:52
Lines are still connecting, not all are back, but the number of connected lines is increasing.
Update
13 Jul 20:29:55
We're doing some work which may cause some lines to go offline. We expect lines to start reconnecting in 10 minutes' time.
Update
13 Jul 20:34:35
We are rebooting stuff to try and find the issue. This is very unusual.
Update
13 Jul 20:42:56
Things are still not stable and lines are still dropping. We need to reboot some core network switches as part of our investigations, and this is happening at the moment.
Update
13 Jul 20:52:29
Lines are reconnecting once more
Update
13 Jul 21:17:36
Looking stable.
Update
13 Jul 21:23:48
Most lines are back online now. If customers are still not online, a reboot of the router or modem may be required, as the session may have got stuck inside the backhaul network.
Update
13 Jul 21:41:38

We'll close this incident as lines have been stable for an hour. We'll update the post with further information as to the cause and any action we will be taking to help stop this type of incident from happening again.

We would like to thank our customers for their patience and support this evening. We had many customers in our IRC channel who were in good spirits and supportive to our staff whilst they worked on this incident.

Update
14 Jul 14:42:20
A similar problem occurred on Friday morning; this is covered in the following post: https://aastatus.net/2411
Closed 13 Jul 21:38:01

27 Mar 09:30:00
Details
Posted: 19 Feb 18:35:15
We have seen some cases with degraded performance on some TT lines, and we are investigating. Not a lot to go on yet, but be assured we are working on this and engaging the engineers within TT to address this.
Update
21 Feb 10:13:20

We have completed further tests and we are seeing congestion manifesting itself as slow throughput at peak times (evenings and weekends) on VDSL (FTTC) lines that connect to us through a certain TalkTalk LAC.

This has been reported to senior TalkTalk staff.

To explain further: VDSL circuits are routed from TalkTalk to us via two LACs. We are seeing slow throughput at peak times on one LAC and not the other.

Update
27 Feb 11:08:58
Very often with congestion it is easy to find the network port or system that is overloaded but so far, sadly, we've not found the cause. A&A staff and customers and TalkTalk network engineers have done a lot of checks and tests on various bits of the backhaul network but we are finding it difficult to locate the cause of the slow throughput. We are all still working on this and will update again tomorrow.
Update
27 Feb 13:31:39
We've been in discussions with other TalkTalk wholesalers who have also reported the same problem to TalkTalk. There does seem to be more of a general problem within the TalkTalk network.
Update
27 Feb 13:32:12
We have had an update from TalkTalk saying that, based on multiple reports from ISPs, they are investigating further.
Update
27 Feb 23:21:21
Further tests this evening by A&A staff show that the slow throughput is not related to a specific LAC, but that it looks like something in TalkTalk is limiting single TCP sessions to 7-9M max during peak times. Running a single iperf test results in 7-9M, but running ten at the same time can fill a 70M circuit. We've passed these findings on to TalkTalk.
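As a rough illustration of the reasoning above, the comparison can be sketched in Python. This is not the tooling used for these tests: the function name, the 0.9 and 0.5 thresholds, and the 8M example figure (picked from the 7-9M range above) are assumptions made for the sketch.

```python
def single_stream_capped(single_mbps, parallel_mbps, line_mbps,
                         full_fraction=0.9, slow_fraction=0.5):
    """Return True when parallel streams fill the line but one stream cannot.

    The two fractions are arbitrary assumptions for this sketch,
    not measured or recommended values.
    """
    if line_mbps <= 0:
        raise ValueError("line_mbps must be positive")
    # Several streams together reach (nearly) the full line rate...
    parallel_fills_line = parallel_mbps >= full_fraction * line_mbps
    # ...while a single stream stays well below it.
    single_is_slow = single_mbps <= slow_fraction * line_mbps
    return parallel_fills_line and single_is_slow


# One stream at about 8M, ten streams filling a 70M circuit.
print(single_stream_capped(single_mbps=8, parallel_mbps=70, line_mbps=70))
```

When the line can be filled by several streams but not by one, the line itself is unlikely to be the bottleneck, which is why the finding points at something limiting individual TCP sessions.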
Update
28 Feb 09:29:56
As expected the same iperf throughput tests are working fine this morning. TT are shaping at peak times. We are pursuing this with senior TalkTalk staff.
Update
28 Feb 11:27:45
TalkTalk are investigating. They have stated that circuits should not be rate limited and that they are not intentionally rate limiting. They are still investigating the cause.
Update
28 Feb 13:14:52
Update from TalkTalk: Investigations are currently underway with our NOC team who are liaising with Juniper to determine the root cause of this incident.
Update
1 Mar 16:38:54
TalkTalk are able to reproduce the throughput problem and investigations are still ongoing.
Update
2 Mar 16:51:12
Some customers did see better throughput on Wednesday evening, but not everyone. We've done some further testing with TalkTalk today and they continue to work on this.
Update
2 Mar 22:42:27
We've been in touch with the TalkTalk Network team this evening and have been performing further tests (see https://aastatus.net/2363 ). Investigations are still ongoing, but the work this evening has given a slight clue.
Update
3 Mar 14:24:48
During tests yesterday evening we saw slow throughput when using the Telehouse interconnect and fast (normal) throughput over the Harbour Exchange interconnect. Therefore, this morning, we disabled our Telehouse North interconnect. We will carry on running tests over the weekend and we welcome customers to do the same. We are expecting throughput to be fast for everyone. We will then liaise with TalkTalk engineers regarding this on Monday.
Update
6 Mar 15:39:33

Tests over the weekend suggest that speeds are good when we only use our Harbour Exchange interconnect.

TalkTalk are moving the interconnect we have at Telehouse to a different port at their side so as to rule out a possible hardware fault.

Update
6 Mar 16:38:28
TalkTalk have moved our THN port and we will be re-testing this evening. This may cause some TalkTalk customers to experience slow (single thread) downloads this evening. See: https://aastatus.net/2364 for the planned work notice.
Update
6 Mar 21:39:55
The testing has been completed, and sadly we still see slow speeds when using the THN interconnect. We are now back to using the Harbour Exchange interconnect where we are seeing fast speeds as usual.
Update
8 Mar 12:30:25
Further testing happening today: Thursday evening https://aastatus.net/2366 This is to try and help narrow down where the problem is occurring.
Update
9 Mar 23:23:13
We've been testing this evening, this time with some more customers, so thank you to those who have been assisting. (We'd welcome more customers to be involved - you just need to run an iperf server on IPv4 or IPv6 and let one of our IPs through your firewall - contact Andrew if you're interested). We'll be passing the results on to TalkTalk, and the investigation continues.
Update
10 Mar 15:13:43
Last night we saw some lines slow and some lines fast, so having extra lines to test against should help in figuring out why this is the case. Quite a few customers have set up iperf servers for us and we are now testing 20+ lines. (Still happy to add more). Speed tests are being run three times an hour; we'll collate the results after the weekend and report the findings back to TalkTalk.
Update
11 Mar 20:10:21
Update
13 Mar 15:22:43

We now have samples of lines which are affected by the slow throughput and those that are not.

Since 9pm Sunday we are using the Harbour Exchange interconnect in to TalkTalk and so all customers should be seeing fast speeds.

This is still being investigated by us and TalkTalk staff. We may do some more testing in the evenings this week and we are continuing to run iperf tests against the customers who have contacted us.
Update
14 Mar 15:59:18

TalkTalk are doing some work this evening and will be reporting back to us tomorrow. We are also going to be carrying out some tests ourselves this evening too.

Our tests will require us to move traffic over to the Telehouse interconnect, which may mean some customers will see slow (single thread) download speeds at times. This will be between 9pm and 11pm.

Update
14 Mar 16:45:49
This is from the weekend:

Update
17 Mar 10:42:28
We've stopped the iperf testing for the time being. We will start it back up again once we or TalkTalk have made changes that require testing to see if things are better or not, but at the moment there is no need for the testing as all customers should be seeing fast speeds due to the Telehouse interconnect not being in use. Customers who would like quota top-ups, please do email in.
Update
17 Mar 18:10:41
To help with the investigations, we're also asking for customers with BT connected FTTC/VDSL lines to run iperf so we can test against them too - details on https://support.aa.net.uk/TTiperf Thank you!
Update
20 Mar 12:54:02
Thanks to those who have set up iperf for us to test against. We ran some tests over the weekend whilst swapping back to the Telehouse interconnect, and tested BT and TT circuits for comparison. Results are that around half the TT lines slowed down but the BT circuits were unaffected.

TalkTalk are arranging some further tests to be done with us which will happen Monday or Tuesday evening this week.

Update
22 Mar 09:37:30
We have scheduled testing of our Telehouse interlink with TalkTalk staff for this Thursday evening. This will not affect customers in any way.
Update
22 Mar 09:44:09
In addition to the interconnect testing on Thursday mentioned above, TalkTalk have also asked us to retest DSL circuits to see if they are still slow. We will perform these tests tonight, Wednesday evening.

TT have confirmed that they have made a configuration change on the switch at their end in Telehouse - this is the reason for the speed testing this evening.

Update
22 Mar 12:06:50
We'll be running iperf3 tests against our TT and BT volunteers this evening, every 15 minutes from 4pm through to midnight.
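For illustration only, the test times described above can be listed with a short Python sketch. The function name is hypothetical, counting midnight itself as the last slot is an assumption, and the year 2017 is inferred from the "w/b 2017-04-03" date mentioned in a later update.

```python
from datetime import date, datetime, time, timedelta


def iperf_slots(day, start_hour=16, duration_hours=8, interval_minutes=15):
    """List test start times from start_hour for duration_hours, both ends included."""
    first = datetime.combine(day, time(start_hour, 0))
    count = duration_hours * 60 // interval_minutes
    # count + 1 so that the final slot (midnight by default) is included.
    return [first + timedelta(minutes=interval_minutes * i) for i in range(count + 1)]


slots = iperf_slots(date(2017, 3, 22))  # year is an inference, see the note above
print(len(slots), slots[0], slots[-1])
```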
Update
22 Mar 17:40:20
We'll be changing over to the Telehouse interconnect between 8pm and 9pm this evening for testing.
Update
23 Mar 10:36:06

Here are the results from last night:

And BT Circuits:

Some of the results are rather up and down, but these lines are in use by customers so we would expect some fluctuations, but it's clear that a number of lines are unaffected and a number are affected.

Here's the interesting part. Since this problem started we have rolled out some extra logging on to our LNSs, this has taken some time as we only update one a day. However, we are now logging the IP address used at our side of L2TP tunnels from TalkTalk. We have eight live LNSs and each one has 16 IP addresses that are used. With this logging we've identified that circuits connecting over tunnels on 'odd' IPs are fast, whilst those on tunnels on 'even' IPs are slow. This points to a LAG issue within TalkTalk, which is what we have suspected from the start but this data should hopefully help TalkTalk with their investigations.
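The odd/even comparison described above can be sketched in Python as follows. This is a minimal illustration, not the LNS logging itself: the function name and the sample figures are invented, and the addresses come from the 192.0.2.0/24 documentation range rather than from any real tunnel.

```python
import ipaddress
from collections import defaultdict
from statistics import median


def median_speed_by_parity(samples):
    """Group (tunnel_ip, mbps) samples by whether the tunnel IP is odd or even.

    Returns a dict such as {"odd": ..., "even": ...} holding the median
    speed of each group. Works for IPv4 and IPv6 address strings.
    """
    groups = defaultdict(list)
    for tunnel_ip, mbps in samples:
        # The parity of the address as an integer equals the parity of its last octet.
        parity = "odd" if int(ipaddress.ip_address(tunnel_ip)) % 2 else "even"
        groups[parity].append(mbps)
    return {parity: median(values) for parity, values in groups.items()}


# Made-up sample data for illustration only.
samples = [
    ("192.0.2.1", 70), ("192.0.2.3", 68), ("192.0.2.5", 72),
    ("192.0.2.2", 8), ("192.0.2.4", 7), ("192.0.2.6", 9),
]
print(median_speed_by_parity(samples))
```

A clean split by address parity is the kind of pattern one might expect if traffic is spread over aggregated links according to address bits and one member link is misbehaving, though the exact hashing TalkTalk use is not known here.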

Update
23 Mar 16:27:28
As mentioned above, we have scheduled testing of our Telehouse interlink with TalkTalk staff for this evening. This will not affect customers in any way.
Update
23 Mar 22:28:53

We have been testing the Telehouse interconnect this evening with TalkTalk engineers. This involved a ~80 minute conference call and setting up a very simple test of a server our side plugged in to the switch which is connected to our 10G interconnect, and running iperf3 tests against a laptop on the TalkTalk side.

The test has highlighted a problem at the TalkTalk end with the connection between two of their switches. When their laptop was plugged in to the second switch we got about 300Mbit/s, but when it was in the switch directly connected to our interconnect we got near full speed, around 900Mbit/s.

This has hopefully given them a big clue and they will now involve the switch vendor for further investigations.

Update
23 Mar 23:02:34
TalkTalk have just called us back and have asked us to retest speeds on broadband circuits. We're moving traffic over to the Telehouse interconnect and will test....
Update
23 Mar 23:07:31
Initial reports show that speeds are back to normal! Hooray! We've asked TalkTalk for more details and if this is a temporary or permanent fix.
Update
24 Mar 09:22:13

Results from last night when we changed over to test the Telehouse interlink:

This shows that unlike the previous times, when we changed over to use the Telehouse interconnect at 11PM speeds did not drop.

We will perform hourly iperf tests over the weekend to be sure that this has been fixed.

We're still awaiting details from TalkTalk as to what the fix was and if it is a temporary or permanent fix.

Update
24 Mar 16:40:24
We are running on the Telehouse interconnect and are running hourly iperf3 tests against a number of our customers over the weekend. This will tell us if the speed issues are fixed.
Update
27 Mar 09:37:12

Speed tests against customers over the weekend do not show the peak-time slowdowns. This confirms that what TalkTalk did on Thursday night has fixed the problem. We are still awaiting the report from TalkTalk regarding this incident.

The graph above shows iperf3 speed test results taken once an hour over the weekend against nearly 30 customers. Although some are a bit spiky, we are no longer seeing the drastic reduction in speeds at peak time. The spikiness is due to the lines being used as normal by the customers and so is expected.
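One way to check hourly results like these for a peak-time slowdown is sketched below in Python. It is an assumption-laden illustration: the function name, the choice of 18:00 to 22:59 as "peak" hours and the sample figures are all invented for the example, and medians are used so that the expected spikiness has less influence.

```python
from statistics import median

PEAK_HOURS = range(18, 23)  # assumption: treat 18:00-22:59 as peak time


def peak_ratio(samples, peak_hours=PEAK_HOURS):
    """Return median peak-time speed divided by median off-peak speed.

    samples is an iterable of (hour_of_day, mbps) pairs for one line.
    A ratio near 1.0 means no peak-time slowdown; a ratio well below 1.0
    means the line is much slower at peak time.
    """
    peak = [mbps for hour, mbps in samples if hour in peak_hours]
    off_peak = [mbps for hour, mbps in samples if hour not in peak_hours]
    if not peak or not off_peak:
        raise ValueError("need both peak and off-peak samples")
    return median(peak) / median(off_peak)


# Made-up figures: the first line slows at peak time, the second does not.
affected = [(3, 70), (10, 68), (14, 72), (19, 8), (20, 9), (21, 7)]
healthy = [(3, 70), (10, 68), (14, 72), (19, 66), (20, 69), (21, 71)]
print(peak_ratio(affected), peak_ratio(healthy))
```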

Update
28 Mar 10:52:25
We're expecting the report from TalkTalk at the end of this week or early next week (w/b 2017-04-03).
Update
10 Apr 16:43:03
We've not yet had the report from TalkTalk, but we do expect it soon...
Update
4 May 09:16:33
We've had an update saying: "The trigger & root cause of this problem is still un-explained; however investigations are continuing between our IP Operation engineers and vendor".

This testing is planned for 16th May.

Resolution From TT: Planned work took place on the 16th May which appears to have been a success. IP Ops engineers swapped the FPC 5 and a 10 gig module on the ldn-vc1.thn device. They also performed a full reload to the entire virtual chassis (as planned). This appears to have resolved the slow speed issues seen by the iperf testing onsite. Prior to this IP ops were seeing consistent slow speeds with egress traffic sourced from FPC5 to any other FPC; therefore they are confident that this has now been fixed. IP Ops have moved A&A's port back to FPC 5 on LDN-vc1.thn.
Started 18 Feb
Closed 27 Mar 09:30:00
Cause TT

14 Mar 21:10:00
Details
Posted: 14 Mar 21:05:28
Looks like we just had some sort of blip affecting broadband customers. We're investigating.
Resolution This was an LNS crash, and so affected customers on the "i" LNS. The cause is being investigated, but preliminary investigations show that it's probably a problem that is fixed in software that is scheduled to be loaded on to this LNS in a couple of days' time as part of the rolling software update that we're performing at the moment.
Broadband Users Affected 12%
Started 14 Mar 21:00:57
Closed 14 Mar 21:10:00