We have identified an issue that appears to be affecting some customers with FTTC modems. The issue is stupidly complex, and we are still trying to pin down the exact details. The symptoms appear to be that some packets are not passing correctly, some of the time. Unfortunately one of the types of packet that refuses to pass correctly are FireBrick FB105 tunnel packets. This means customers relying on FB105 tunnels over FTTC are seeing issues. The work around is to remove the ethernet lead to the modem and then reconnect it. This seems to fix the issue, at least until the next PPP restart. If you have remote access to a FireBrick, e.g. via WAN IP, and need to do this you can change the Ethernet port settings to force it to re-negotiate, and this has the same effect - this only works if directly connected to the FTTC modem as the fix does need the modem Ethernet to restart. We are asking BT about this, and we are currently assuming this is a firmware issue on the BT FTTC modems. We have confirmed that modems re-flashed with non-BT firmware do not have the same problem, though we don't usually recommend doing this as it is a BT modem and part of the service.
We have been working on getting more specific information regarding this, we hope to post an update tomorrow.
We have reproduced this problem by sending UDP packets using 'Scapy'. We are doing further testing today, and hope to write up a more detailed report about what we are seeing and what we have tested.
We have some quite good demonstrations of the problem now, and it looks like it will mess up most VPNs based on UDP. We can show how a whole range of UDP ports can be blacklisted by the modem somehow on the next PPP restart. It is crazy. We hope to post a little video of our testing shortly.
Here is an update/overview of the situation. (from http://revk.www.me.uk/2013/11/bt-huawei-fttc-modem-bug-breaking-vpns.html )
We have confirmed that the latest code in the BT FTTC modems appears to have a serious bug that is affecting almost anyone running any sort of VPN over FTTC.
Existing modems seem to be upgrading, presumably due to a roll out of new code in BT. An older modem that has not been on-line a while is fine. A re-flashed modem with non-BT firmware is fine. A working modem on the line for a while suddenly stopped working, presumably upgraded.
The bug appears to be that the modem manages to "blacklist" some UDP packets after a PPP restart.
If we send a number of UDP packets, using various UDP ports, then cause PPP to drop and reconnect, we then find that around 254 combinations of UDP IP/ports are now blacklisted. I.e. they no longer get sent on the line. Other packets are fine.
Sending 500 different packets, around 254 of them will not work again after the PPP restart. It is not actually the first or last 254 packets, some in the middle, but it seems to be 254 combinations. They work as much as you like before the PPP restart, and then never work after it.
We can send a batch of packets, wait 5 minutes, PPP restart, and still find that packets are now blacklisted. We have tried a wide range of ports, high and low, different src and dst ports, and so on - they are all affected.
The only way to "fix" it, is to disconnect the Ethernet port on the modem and reconnect. This does not even have to be long enough to drop PPP. Then it is fine until the next PPP restart. And yes, we have been running a load of scripts to systematically test this and reproduce the fault.
The problem is that a lot of VPNs use UDP and use the same set of ports for all of the packets, so if that combination is blacklisted by the modem the VPN stops after a PPP restart. The only way to fix it is manual intervention.
The modem is meant to be an Ethernet bridge. It should not know anything about PPP restarting or UDP packets and ports. It makes no sense that it would do this. We have tested swapping working and broken modems back and forth. We have tested with a variety of different equipment doing PPPoE and IP behind the modem.
BT are working on this, but it is a serious concern that this is being rolled out.
Work on this in still ongoing... We have tested this on a standard BT retail FTTC 'Infinity' line, and the problem cannot be reproduced. We suspect this is because when the PPP re-establishes a different IP address is allocated each time, and whatever is session tracking does not match the new connection.
Here is an update with some a more specific explanation as to what the problem we are seeing is:
On WBC FTTC, we can send a UDP packet inside the PPP and then drop the PPP a few seconds later. After the PPP re-establishes, UDP packets with the same source and destination IP and ports won't pass; they do not reach the LNS at the ISP.
Further to that, it's not just one src+dst IP and port tuple which is affected. We can send 254 UDP packets using different src+dest ports before we drop the PPP. After it comes back up, all 254 port combinations will fail. It is worth noting here that this cannot be reproduced on an FTTC service which allocates a dynamic IP which changes each time PPP re-established.
If we send more than 254 packets, only 254 will be broken and the others will work. It's not always the first 254 or last 254, the broken ones move around between tests.
So it sounds like the modem (or, less likely, something in the cab or exchange) is creating state table entries for packets it is passing which tie them to a particular PPP session, and then failing to flush the table when the PPP goes down.
This is a little crazy in the first place. It's a modem. It shouldn't even be aware that it's passing PPPoE frames, let along looking inside them to see that they are UDP.
This only happens when using an Openreach Huawei HG612 modem that we suspect has been recently remotely and automatically upgraded by Openreach in the past couple of months. Further - a HG612 modem with the 'unlocked' firmware does not have this problem. A HG612 modem that has probably not been automatically/remotely upgraded does not have this problem.Side note: One theory is that the brokenness is actually happening in the street cab and not the modem. And that the new firmware in the modem which is triggering it has enabled 'link-state forwarding' on the modem's Ethernet interface.
This post has been a little quiet, but we are still working with BT/Openreach regarding this issue. We hope to have some more information to post in the next day or two.
We have also had reports from someone outside of AAISP reproducing this problem.
We have spent the morning with some nice chaps from Openreach and Huawei. We have demonstrated the problem and they were able to do traffic captures at various points on their side. Huawei HQ can now reproduce the problem and will investigate the problem further.
Adrian has posted about this on his blog: http://revk.www.me.uk/2013/11/bt-huawei-working-with-us.html
We are still chasing this with BT.
We have seen this affect SIP registrations (which use 5060 as the source and target)... Customers can contact us and we'll arrange a modem swap.
BT are in the process of testing an updated firmware for the modems with customers. Any customers affected by this can contact us and we can arrange a new modem to be sent out.
Just a side note on this, we're seeing the same problem on the ZyXEL VMG1312 router which we are teting out and which uses the same chipset: info and updates here: https://support.aa.net.uk/VMG1312-Trial
BT are testing a fix in the lab and will deploy in due course, but this could take months. However, if any customers are adversely affected by this bug, please let us know and we can arrange for BT to send a replacement ECI modem instead of the Huawei modem. Thank you all for your patience. --Update-- BT do have a new firmware that they are rolling out to the modems. So far it does seem to have fixed the fault and we have not heard of any other issues as of yet. If you do still have the issue, please reboot your modem, if the problem remains, please contact email@example.com and we will try and get the firmware rolled out to you.