VoIP Call problems -- AAISP's status page

VoIP Call problems

MINOR Closed VoIP

STATUS

Closed

CREATED

Jan 06, 03:41 PM (5¼ months ago)

AFFECTED

VoIP

STARTED

Jan 06, 03:40 PM (5¼ months ago)

CLOSED

Jan 07, 10:00 AM (5¼ months ago)

REFERENCE

42820 / AA42820

MASTODON

@aastatus

PERMALINK

https://aastatus.net/42820

INFORMATION

INITIAL
5¼ months ago by Andrew
We need to restart our VoIP servers because they are having problems.
UPDATE
5¼ months ago by Andrew
Our VoIP servers had got into a state where they were not clearing old calls and were unable to accept new calls.
UPDATE
5¼ months ago by Andrew
Our A voice server has been restarted and is back in service - load has reduced across the board and we are monitoring.
UPDATE
5¼ months ago by Andrew
Our call levels are looking back to normal.
UPDATE
5¼ months ago by David
We are still seeing delays with our servers processing some SIP requests. This is under review.
UPDATE
5¼ months ago by Andrew
We've identified the cause of our problem and have implemented a fix. We're currently monitoring.
RESOLUTION
5¼ months ago by Andrew
Thank you for your patience with this - the problem would have affected SIP messages some of the time, and caused either a few seconds of delay in connecting a call or a call timing out and failing.
The problem was to do with round trip times between our datacentres. Late last year one of our layer 2 links between Maidenhead and London was cut off without warning - we believe because the supplier had financial problems and the datacentre unplugged them completely. We have 2 links between London and Maidenhead, so this wasn't disruptive, but it lowered our resilience. We connected up a new layer 2 link to replace the one that was cut off.

The replacement link had slightly higher latency. Our VoIP system uses backend SQL and Redis servers, and as part of a SIP registration or SIP call a number of SQL and Redis queries are made to gather the data needed. We had reached a tipping point where the quantity of lookups required to process SIP messages, combined with the increased latency between sites, added up to enough to cause SIP timeouts and retries (which generally happen at 500ms).
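To illustrate the tipping point described above, here is a minimal sketch of the arithmetic. It assumes the lookups for one SIP message are made one after another, so each pays a full round trip. The lookup count, round trip times and server processing time are hypothetical figures chosen for illustration, not measurements from our network. Only the 500ms retry point comes from the description above.

```python
# Hypothetical latency budget for one SIP message.
# All times are integer microseconds to keep the arithmetic exact.

SIP_RETRY_TIMEOUT_US = 500_000  # 500ms, the retry point mentioned above


def processing_time_us(lookups: int, rtt_us: int, server_time_us: int) -> int:
    """Total time when each lookup waits for the previous one to finish."""
    return lookups * (rtt_us + server_time_us)


def max_sequential_lookups(rtt_us: int, server_time_us: int,
                           timeout_us: int = SIP_RETRY_TIMEOUT_US) -> int:
    """Largest number of sequential lookups that still fits in the timeout."""
    return timeout_us // (rtt_us + server_time_us)


if __name__ == "__main__":
    lookups = 200          # assumed lookups per SIP message
    server_time_us = 100   # assumed time the backend spends on each query

    local = processing_time_us(lookups, rtt_us=200, server_time_us=server_time_us)
    remote = processing_time_us(lookups, rtt_us=2_500, server_time_us=server_time_us)

    print(f"local backend:      {local / 1000:.0f}ms")
    print(f"inter-site backend: {remote / 1000:.0f}ms")
    print("exceeds SIP retry timeout:", remote > SIP_RETRY_TIMEOUT_US)
```

With these assumed figures each query is fast on its own, yet a small rise in round trip time multiplied by the number of lookups is enough to cross the retry threshold.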

The fix we made today was simply to make the master Redis server the one local to our VoIP servers, so that those queries no longer need to traverse between datacentres.

This took us a little while to figure out. Our initial investigations showed the SQL and Redis queries themselves being processed quickly. It wasn't until we looked more deeply into which nodes were passing traffic between themselves that we realised the inter-site latency, combined with the quantity of queries required, was the cause of the delays.
Our processing time for SIP messages is now back to a healthy < 50ms.
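The reason the queries looked fast during the early investigation can be sketched as follows: the time a backend reports for executing a query does not include the time the request and reply spend crossing the network. This is a hypothetical illustration only. The `QuerySample` type and the sample figures are assumptions, not our tooling or our data.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class QuerySample:
    """One query as seen from both ends (hypothetical measurement record)."""
    client_elapsed_us: int  # wall-clock time the VoIP server waited
    server_exec_us: int     # execution time the backend itself reported


def split_time(samples: Iterable[QuerySample]) -> Tuple[int, int]:
    """Return (time spent in the backend, time spent outside it)."""
    samples = list(samples)
    total_client = sum(s.client_elapsed_us for s in samples)
    total_server = sum(s.server_exec_us for s in samples)
    return total_server, total_client - total_server


if __name__ == "__main__":
    # 200 assumed queries, each quick on the backend but slow end to end.
    samples = [QuerySample(client_elapsed_us=2_600, server_exec_us=100)] * 200
    in_backend, outside = split_time(samples)
    print(f"in backend: {in_backend / 1000:.0f}ms, in transit: {outside / 1000:.0f}ms")
```

Looking only at the backend figure would suggest everything is healthy, which is why comparing it against the time observed by the caller is what exposes inter-site latency.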
Closed

Last updated: 5¼ months ago
