VoIP Call problems
MINOR Closed VoIP
STATUS
Closed
CREATED
Jan 06, 03:41 PM (1¾ days ago)
AFFECTED
VoIP
STARTED
Jan 06, 03:40 PM (1¾ days ago)
CLOSED
Jan 07, 10:00 AM (1 day ago)
REFERENCE
42820 / AA42820
INFORMATION
  • INITIAL
    1¾ days ago by Andrew

    We need to restart our VoIP servers as they are having problems.

  • UPDATE
    1¾ days ago by Andrew

    Our VoIP servers had got into a state where they were not clearing old calls and were unable to accept new calls.

  • UPDATE
    1¾ days ago by Andrew

    Our A voice server has been restarted and is back in service - load has reduced across the board and we are monitoring.

  • UPDATE
    1¾ days ago by Andrew

    Our call levels look to be back to normal.

  • UPDATE
    1 day ago by David

    We are still seeing delays with our servers processing some SIP requests. This is under review.

  • UPDATE
    1 day ago by Andrew

    We've identified the cause of our problem and have implemented a fix. We're currently monitoring.

  • RESOLUTION
    1 day ago by Andrew

    Thank you for your patience with this - the problem would have affected SIP messages some of the time, and caused either a few seconds of delay in connecting a call or a call timing out and failing.

    The problem was to do with round-trip times between our datacentres. Late last year one of our layer 2 links between Maidenhead and London was cut off without warning. We believe this was due to the supplier having financial problems, and the datacentre unplugged them completely. We have 2 links between London and Maidenhead, so this wasn't disruptive, but it lowered our resilience. We connected up a new layer 2 link to replace the one that was cut off.

    The replacement link had slightly higher latency. Our VoIP system uses backend SQL and Redis servers, and as part of each SIP registration or SIP call a number of SQL and Redis queries are made to gather the data needed. We had reached a tipping point where the quantity of lookups required to process SIP messages, combined with the increase in latency between sites, added up to enough delay to cause SIP timeouts and retries (which generally happen at 500ms).

    The fix we made today was simply to make the master Redis server the one local to our VoIP servers, so that queries no longer need to traverse between datacentres.

    This has taken us a little while to figure out. Our investigations initially showed the SQL and Redis queries themselves being processed quickly. It wasn't until deeper investigation into which nodes were passing traffic between themselves that we realised the inter-site latency, along with the quantity of queries required, was the cause of the delays.

    Our processing time for SIP messages is now back to a healthy < 50ms.

  • Closed
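
The tipping point described in the resolution above can be sketched as simple arithmetic. This is an illustrative model only, not our actual tooling: it assumes the lookups for one SIP message run one after another, and the query count and round-trip times used in the usage note are hypothetical numbers, not measurements from this incident. Only the 500ms retry threshold comes from the report.

```python
# Illustrative latency-budget model (hypothetical, not production code).
# Assumption: lookups for one SIP message are made sequentially, so each
# lookup pays one full round trip between the VoIP server and the backend.

SIP_RETRY_MS = 500.0  # point at which SIP retries generally happen


def backend_time_ms(queries: int, rtt_ms: float, server_ms: float = 0.0) -> float:
    """Total time spent on backend lookups for one SIP message.

    queries   -- number of sequential SQL/Redis lookups
    rtt_ms    -- network round-trip time per lookup
    server_ms -- time the backend itself spends on each lookup
    """
    return queries * (rtt_ms + server_ms)


def exceeds_retry_threshold(queries: int, rtt_ms: float, server_ms: float = 0.0,
                            budget_ms: float = SIP_RETRY_MS) -> bool:
    """True if the lookups alone take longer than the retry threshold."""
    return backend_time_ms(queries, rtt_ms, server_ms) > budget_ms


def max_queries_within_budget(rtt_ms: float, server_ms: float = 0.0,
                              budget_ms: float = SIP_RETRY_MS) -> int:
    """Largest number of sequential lookups that still fits in the budget."""
    return int(budget_ms // (rtt_ms + server_ms))
```

With made-up figures, `backend_time_ms(100, 4.0)` gives 400ms and stays under the threshold, while the same 100 lookups at a 6ms round trip give 600ms and trigger retries. Each individual query still looks fast on the server, which is why the per-query timings did not reveal the problem at first.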