A number of customers connecting via NPE002.THN may be experiencing a total loss of voice and data service.::WO0000000042962
MINOR Closed TTB-Outages
STATUS
Closed
CREATED
Mar 03, 04:00 AM (5 years ago)
TYPE
TalkTalk Outage
STARTED
Mar 03, 03:55 AM (5 years ago)
CLOSED
Mar 04, 04:20 PM (5 years ago)
REFERENCE
33780 / INC13285523
INFORMATION
  • INITIAL
    5 years ago

    Summary
    Following network monitoring, it has been identified that some customers connecting via NPE002.THN will experience a total loss of voice and data services. The affected users will have interrupted voice and data services for the duration of this incident.

  • UPDATE
    5 years ago

    Latest Update Preliminary investigations are underway with our engineers to determine the root cause of this network incident. At this stage we are unable to issue an ERT until our engineers have completed further diagnostics.  

  • UPDATE
    5 years ago

    Latest Update Following communications from NOC we have been advised the issue is in relation to a planned CRQ. We have requested tests to be carried out on affected services. Please await further updates.

  • UPDATE
    5 years ago

    Latest Update Following a retest we have been advised that circuits are still down. NOC have been re-engaged and are carrying out further investigations. Please await further updates.

  • UPDATE
    5 years ago

    Latest Update This incident has now been upgraded to a P1 due to impact to resellers. Further updates will follow upon receipt of more information.  

  • UPDATE
    5 years ago

    Latest Update Network Support have been investigating this issue and are looking at a possible encaps-related problem. They are seeing an encaps invalid error on the impacted circuits. A change was made at 07:03 on port 2/4 to amend the encaps, and since this change some of the impacted circuits appear to be working. Network Support have advised they are seeing a high number of circuits on the NPE with the encaps invalid issue. The TTB ops manager and Network Support are currently going through some of the impacted circuit configs to see if we can correlate the impacted circuits with the non-impacted ones. NOC have been asked to review the root cause and confirm what has caused the outage. TTB are liaising directly with customers to confirm if any of the circuits are back up. TTB are completing an impact assessment and will provide a full list of impacted customers.

    Next Actions: Network Support to continue investigating the encaps issue with TTB Operations. TTB to provide a full list of impacted customers and their circuit details. TTB are contacting impacted customers to confirm if they are still seeing issues.

  • UPDATE
    5 years ago

    Latest Update Network Support have checked the config backup on NPE002.THN to see why rebuilding the circuits did not resolve the issue. They confirmed that the backup config already had the corrupted invalid encaps version saved (this was done as per process), meaning any saved config would be wrong. Network Support have suggested that a way to fix this would be to re-add the correct configuration for all the circuits with the incorrect configuration (over 6.6k) and then complete an LDP process restart. This would result in all 8.5k layer 2 circuits on the box failing. As we only have around 100 – 130 circuits reported as impacted, this would be extremely intrusive to the other services on the NPE. Head of Networks has asked Network Support to set up a cross-technology support call with IP Operations and Access to look at other, less intrusive ways of restoring service. As the stored configuration for the impacted circuits is wrong, TTB ops and Access are currently stripping a handful of impacted circuits fully off the network and then rebuilding them from scratch to see if this resolves the corrupted configuration problem.

    Next Actions: A technical bridge between IP Operations/Network Support and Access Control is being set up to review the causes of this issue. We can see that the configuration on the circuits has changed, however it is not clear how or when this happened. If they cannot identify a better way of restoring service we will defer to rebuilding the circuits and restarting the LDP process on NPE002.THN. A restart of the LDP process would impact all 8000+ layer 2 circuits which traverse NPE002.THN; for this reason the technical bridge has been convened to ensure there are no other options. TTB Operations and Access are currently fully stripping several of the reported circuits and completing a full network rebuild to see if this removes the invalid configuration we are seeing on the circuits. The next MIM bridge will be at 11:00.

  • UPDATE
    5 years ago

    Latest Update The technical bridge between Network Support/Access and IP Operations has concluded. They agreed that the invalid encaps configuration will need to be rectified, however they are not sure why the impact is only being seen by such a small number of circuits on the NPE (there are 130 customers reporting problems, 6.6k with invalid encaps and over 8.5k customers connected to the NPE). They have confirmed we will need to re-add the correct configuration to the NPE and then complete an LDP process restart in order for it to be saved. This can be done out of hours if we can mitigate the impact by rebuilding the impacted customer services. Fully rebuilding the circuits that have been reported as impacted is working; once committed, the circuits come up. It takes between 5 and 10 minutes to rebuild each circuit and we have around 130 reported (so far we have rebuilt 7). As a plan to mitigate impact, TTB Operations have collated a full list of impacted circuits. This has been shared with Access and they will work with 24/7 to rebuild the remaining circuits asap. It's expected to take at least 2-3 hours to complete all of the rebuilds. MIM engaged CGI OSS and has tasked them with checking the HPSA server logs to see if there is anything which would allude to problems between 01:00 – 04:00 on 03/03.

    Next Actions: TTB Operations and Access are working through all of the impacted circuits (around 120 remaining), fully stripping and rebuilding them. It takes between 5 and 10 minutes to build each circuit, so we have allocated resource from 24/7, TTB Escalations and Access to hasten the progress. Network Support want to await the rebuild of circuits before progressing the next steps. They have confirmed they still see the invalid encaps setting on 6.6k+ circuits on the NPE and this will need correcting. If we continue to mitigate the impact to customers by rebuilding circuits, any corrective works can be done under a planned outage. CGI OSS have been engaged to review the HPSA logs and identify if there was any interaction between the HPSA servers and NPE002.THN.

  • UPDATE
    5 years ago

    Latest Update Access Engineering and TTB continue to work through the list of circuits which have been reported as impacted. The rebuild of the original list of 110 circuits will be completed by 14:10. TTB continue to receive reports of circuits which are down; if they are found to have been caused by this incident, the circuits are added to the rebuild list. If customers are still reporting that circuits are down, please ask them to log the details with the TTB teams so we can investigate whether they are being caused by this incident. Network Support continue to work on root cause investigations.

  • UPDATE
    5 years ago

    Latest Update Access Engineering and TTB continue to work through the list of circuits which have been reported as impacted. The rebuild of the original list of 110 circuits will be completed by 14:10. TTB continue to receive reports of circuits which are down; if they are found to have been caused by this incident, the circuits are added to the rebuild list. If customers are still reporting that circuits are down, please ask them to log the details with the TTB teams so we can investigate whether they are being caused by this incident. Network Support continue to work on root cause investigations. Please send a list of affected circuits to the b2booh@talktalkbusiness.co.uk email account and add the incident reference to the subject field. Thank you

  • UPDATE
    5 years ago

    Latest Update TTB and Access continue rebuilding circuits; the number of reported cases has increased to 259. In order to ensure we fix all of the circuits as fast as possible, Access are looking at a way to bulk strip and rebuild multiple circuits at one time. We are testing 100 of the reported circuits in this way to see if we can script the process. If successful, Access are going to develop a plan to strip and rebuild all 6.6k circuits which have the incorrect configuration. In order to do this an outage window of around 6 hours will be required to complete all the impacted circuits (individual circuits should not be impacted for longer than 15 minutes). Approval for an outage is being reviewed and Network Support are raising an ECRQ. To ensure any bulk updates are successful, Network Support are setting up real-time monitoring on the NPE so they can see customer traffic increasing and confirm whether the bulk uploads have been successful without requiring customer feedback. Logs from HPSA have been shared; initial checks don’t indicate any problems with the automated update it ran last night, but Network Support will review these further with Business Applications.

    Next Actions: Access and TTB continue to work through the list of impacted customers, stripping and rebuilding the circuits. Access to build a script which will allow for batch rebuilding of the circuits. Network Support to set up real-time monitoring on NPE002.THN to ensure we are seeing successful traffic. Update to TTB confirming an outage window. A restart is on hold due to the risk around not restoring; this will remain a last resort. Check point call at 17:00.

    Please send a list of affected circuits to the b2booh@talktalkbusiness.co.uk email account and add the incident reference to the subject field. Thank you

  • UPDATE
    5 years ago

    Latest Update Of the reported 270 circuits impacted we have rebuilt over 220. Access have created a script which will allow bulk updates of circuits, so we can strip and rebuild circuits in batches, which will speed up the restoration time. A test of 27 circuits is currently being set up to ensure that the process will work. An update will be provided on the 18:00 go/no-go call. Network Support to set up real-time monitoring on NPE002.THN to ensure we are seeing successful traffic during the outage window. GNG call at 18:00.

    Please send a list of affected circuits to the b2booh@talktalkbusiness.co.uk email account and add the incident reference to the subject field. Thank you

  • UPDATE
    5 years ago

    Latest Update We have rebuilt all of the 279 reported circuits that have been impacted. 27 of the circuits were rebuilt using the script created by Access Engineering. This test was successful and will allow for batches of 50 – 100 to be run at a time without causing excessive load. CRQ000000054946 has been raised and approved for a window between 00:00 – 06:00 on the 4th of March, at which point Access Engineering will execute the script a batch at a time to make sure we are seeing the circuits restore. Network Support and TTB are on hand to confirm the circuits are restoring. If this is successful MMI will be updated upon completion of all 6.6k impacted circuits (expected to take 2-3 hours). Should any issues be seen, technical support is on standby and field resource has been acquired should any hardware be needed.

    Next Actions: Access Engineering need to pull the configurations for the impacted circuits and prep them ready to deploy in batches of 50 – 100, depending on load, during the change window. It's expected to take around 3 hours to pull all the information and prepare the script for deployment. We have rebuilt all reported circuits and they have been confirmed to be working. TTB will continue to do this until 23:00, at which point we will stop to allow prep for the change to be completed. A check point call has been set up at 23:00 to make sure all the prep works have been completed.

    Please send a list of affected circuits to the b2booh@talktalkbusiness.co.uk email account and add the incident reference to the subject field. Thank you

  • UPDATE
    5 years ago

    Latest Update Prep work for CRQ000000054946 is going through its final stages (expected completion 23:30). The support teams for this evening's change are primed and briefed on their requirements. Approval to start the change at 00:00 has been confirmed. MIM will send out an update post change to confirm its success. A check point call has been set up at 09:00 on the 4th of March.

    Next Actions: Access Engineering will start the change at 00:00, slowly adding batches of 50 – 100 circuits back onto the NPE, with TTB assurance and Network Support monitoring to ensure the circuits are coming back up. If all is successful they will contact the MIM team once completed so we can send out comms. Should any issues be seen during the change, the MIM team will open an incident bridge. Network Support have set up monitoring for NPE002.THN during the outage window and will support Access Engineering during the change window. Check point call at 09:00 on the 4th of March.

  • UPDATE
    5 years ago

    Latest Update Work on CRQ000000054946 is going through its final stages (expected completion 05:00 – 05:30). The support teams for this evening's change are working methodically to ensure we do not push too many cases through at one time. MIM will send out an update post change to confirm its success. A check point call has been set up at 09:00 on the 4th of March.

    Next Actions: Access Engineering continue to work through the change, slowly adding batches of 50 – 100 circuits back onto the NPE, with TTB assurance and Network Support monitoring to ensure the circuits are coming back up. The works are expected to be completed before the 06:00 change window deadline. If all is successful they will contact the MIM team once completed so we can send out comms. Should any issues be seen during the change, the MIM team will open an incident bridge. Network Support have set up monitoring for NPE002.THN during the outage window and will support Access Engineering during the change window. Check point call at 09:00 on the 4th of March.

  • UPDATE
    5 years ago

    Latest Update Work on CRQ000000054946 is going through its final stages (expected completion 06:30). Access Engineering continue to work through the change, slowly adding batches of 50 – 100 circuits back onto the NPE, with TTB assurance and Network Support monitoring to ensure the circuits are coming back up. MIM has approved the extension to the outage window to allow for the time required to complete all of the impacted circuits. A check point call has been set up at 09:00 on the 4th of March.

    Next Actions: Network Support have set up monitoring for NPE002.THN during the outage window and will support Access Engineering during the change window. Check point call at 09:00 on the 4th of March.

  • UPDATE
    5 years ago

    Latest Update Work on CRQ000000054946 is complete. The last batch of circuits recovered at 05:57. The circuits will now be monitored to ensure stability. TTB will update customers and confirm the issue is now resolved. A check point call has been set up at 09:00 on the 4th of March.

    Next Actions: Monitoring will remain in place to ensure the circuits remain up and stable. We will await customer feedback before taking a decision on standing the incident down. Check point call at 09:00 on the 4th of March.

  • UPDATE
    5 years ago

    Latest Update Work on CRQ000000054946 is complete. The last batch of circuits recovered at 05:57. The circuits will now be monitored to ensure stability. Initial checks have confirmed all circuits are up and passing traffic. We have not received any reports from customers so far advising of issues post the change. This will continue to be monitored throughout the day. A check point call has been set up at 16:00.

    Next Actions: Monitoring will remain in place to ensure the circuits remain up and stable. TTB will be in contact with customers throughout the day and will report any problems that are seen. Check point call at 16:00.

  • RESOLUTION
    5 years ago

    Technical / Suspected Root Cause: The root cause has not yet been identified. A post-incident review call has been set up for the 5th of March to discuss and provide further details.

  • Closed