Thank you for your patience with this - the problem would have affected SIP messages some of the time, causing either a few seconds of delay when connecting a call, or a call timing out and failing altogether.
The problem was to do with round trip times between our datacentres. Late last year one of our layer 2 links between Maidenhead and London was cut off without warning - we believe due to the supplier having financial problems and the datacentre unplugging them completely. We have two links between London and Maidenhead, so this wasn't disruptive, but it lowered our resilience. We connected up a new layer 2 link to replace the one that was cut off.
The replacement link has slightly higher latency. Our VoIP system uses backend SQL and Redis servers, and as part of a SIP registration or SIP call a number of SQL and Redis queries are made to gather the data needed. We had reached a tipping point where the number of lookups required to process a SIP message, combined with the increased latency between sites, built up to enough to cause SIP timeouts and retransmissions (which generally happen at 500ms), as sketched below.
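To put rough numbers on it, here is a minimal back-of-the-envelope model in Python. The query count, per-query round trip times and fixed processing cost are illustrative assumptions rather than measured figures from our systems; the point is simply that many small lookups, each paying the inter-site round trip, add up past the 500ms SIP retransmission timer.

```python
# Illustrative model only - the figures below are assumptions, not measurements.
INTRA_SITE_RTT_MS = 0.2    # round trip to a backend in the same datacentre (assumed)
INTER_SITE_RTT_MS = 5.0    # round trip over the replacement inter-site link (assumed)
LOOKUPS_PER_MESSAGE = 100  # sequential SQL/Redis lookups per SIP message (assumed)
SIP_T1_MS = 500            # SIP retransmission timer T1 default (RFC 3261)

def processing_time_ms(rtt_ms, lookups=LOOKUPS_PER_MESSAGE, fixed_work_ms=20):
    """Total handling time: fixed local work plus one round trip per lookup."""
    return fixed_work_ms + lookups * rtt_ms

local = processing_time_ms(INTRA_SITE_RTT_MS)
remote = processing_time_ms(INTER_SITE_RTT_MS)
print(f"local backend:  {local:.0f} ms")                      # ~40 ms
print(f"remote backend: {remote:.0f} ms (T1 = {SIP_T1_MS} ms)")  # ~520 ms
print("exceeds T1 - retransmissions" if remote > SIP_T1_MS else "within T1")
```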
The fix we made today was simply to make the master Redis server the one local to our VoIP servers, so those queries no longer need to traverse between datacentres.
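For anyone curious what that kind of change looks like, here is a hypothetical sketch using redis-py. The host names are invented for illustration, and in practice a switch-over like this may be handled via Redis Sentinel or configuration management rather than ad-hoc commands.

```python
# Hypothetical sketch of promoting the datacentre-local Redis instance to master.
# Host names are made up; this is not our actual procedure.
import redis

LOCAL_HOST, LOCAL_PORT = "redis-local.example.net", 6379    # same site as the VoIP servers (assumed)
REMOTE_HOST, REMOTE_PORT = "redis-remote.example.net", 6379  # old master across the inter-site link (assumed)

local = redis.Redis(host=LOCAL_HOST, port=LOCAL_PORT)
remote = redis.Redis(host=REMOTE_HOST, port=REMOTE_PORT)

# Promote the local replica to master (sends SLAVEOF NO ONE)...
local.slaveof()

# ...and have the remote instance replicate from it, so writes made by the
# VoIP servers stay within the local datacentre.
remote.slaveof(LOCAL_HOST, LOCAL_PORT)
```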
This took us a little while to figure out - our initial investigations showed the SQL and Redis queries themselves being processed quickly. It wasn't until we dug deeper into which nodes were passing traffic between themselves that we realised the inter-site latency, combined with the quantity of queries required, was the cause of the delay. Our processing time for SIP messages is now back to a healthy < 50ms.