As you will be aware, there have been a few glitches with BT recently, causing all lines to drop and reconnect - a process that can take from seconds to several minutes.
At 20:37:28 today all BT connected lines lost routing within BT for just over 10 seconds.
Previously, our 10 second timeout would have taken out all lines. Now, because we have changed the timeouts, only 15% of BT lines went off, and they were able to reconnect very quickly. These were multi-line customers. We recently changed timeouts to 10 seconds for multi-line customers and 60 seconds for single-line customers.
The new system worked. Now that we have timed the issue at just over 10 seconds, we are changing the timeout for multi-line connections to 15 seconds, effective from the next connection.
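To illustrate why the timeout values matter, here is a minimal sketch in Python. It assumes a deliberately simplified model in which a session drops only if the routing loss lasts at least as long as its timeout. The function names and the 11 second figure (standing in for "just over 10 seconds") are illustrative assumptions, not our actual LNS configuration or code.

```python
# Simplified model (an assumption): a session survives a routing loss
# only if the loss is shorter than the session's timeout.

def timeout_seconds(line_count, multi_line_timeout=15.0, single_line_timeout=60.0):
    """Pick a timeout by customer type: multi-line or single-line."""
    return multi_line_timeout if line_count > 1 else single_line_timeout


def session_survives(outage_seconds, timeout):
    """True if the outage ends before the timeout would drop the session."""
    return outage_seconds < timeout


outage = 11.0  # illustrative value for "just over 10 seconds"

# Original 10 second timeout: the session drops and must reconnect.
print(session_survives(outage, 10.0))

# Revised 15 second multi-line timeout: the session rides out the glitch.
print(session_survives(outage, timeout_seconds(line_count=2)))

# 60 second single-line timeout: the session also stays up.
print(session_survives(outage, timeout_seconds(line_count=1)))
```

Under this model, an outage of just over 10 seconds drops any session with a 10 second timeout, but not one with a 15 or 60 second timeout.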
Now we have to try and track down why BT lost routing for just over 10 seconds.
A second glitch happened at 21:45:05, also causing a 10 second loss in routing.
Further investigation this morning suggests that this was indeed logged as an 11 second outage by all of our monitoring. The changes to timeouts should mean that, if this happens again, the 11 seconds is the entirety of the outage rather than the trigger for a PPP restart, significantly reducing the impact.
However, the cause may not be entirely BT's fault!
Logs show that the LNS detected an issue with the PHY (Ethernet physical interface) at the same time, but this was reset and recovered in 1.6 seconds. That is a typical time for a PHY to restart, but obviously it should not have happened in the first place.
However, the outage was a further 10 seconds, and we suspect this is something at the BT end of the link - possibly LACP or some such, which causes a link which is up at the PHY level to not start operating for some seconds.
So, we are investigating how the PHY issue has happened, which could simply be a part that needs replacing, and we are also checking with BT if they have anything that could cause the extra 10 second delay.