15 Oct 17:14:18
6 Oct 14:22:50
For the next week or so we're considering 5am-7am to be a PEW window for some very low disruption work (a few seconds of "blip"). We're still trying very hard to improve our network configuration and router code to create a much more stable network. It seems, from recent experience, that this sort of window will be least disruptive to customers. It is a time when issues can be resolved by staff if needed (which is harder at times like 3am) and when we get more feedback from end users. As before, we expect this work to have no impact in most cases, and maybe a couple of seconds of routing issues if it does not quite go to plan. Sadly, all of our efforts to create the same test scenarios "on the bench" have not worked well. At this stage we are reviewing code to understand Sunday morning's work better, and this may take some time before we start. We'll update here and on irc before work is done. Thank you for your patience.
7 Oct 09:06:41
We did do work around 6:15 to 6:30 today - I thought I had posted an update here before I started but somehow it did not show. If we do any more, I'll try and make it a little earlier.
8 Oct 05:43:11
Doing work a little earlier today. We don't believe we caused any blips with today's testing.
9 Oct 05:47:53
Another early start and went very well.
10 Oct 08:22:53
We updated the remaining core routers this morning, and it seemed to go very well. Indeed, pings we ran had zero loss when upgrading routers in Telecity. However, we did lose TalkTalk broadband lines in the process. These all reconnected straight away, but we are now reviewing how this happens to try and avoid it in future.
Resolution Closing this PEW from last week. We may need to do more work at some point, but we are getting quite good at this now.
Started 7 Oct 06:00:00
Closed 15 Oct 17:14:18
Previously expected 14 Oct 07:00:00

5 Oct 07:26:50
3 Oct 10:41:59
We do plan to upgrade routers again over the weekend, probably early Saturday morning (before 9am). I'll post on irc at the time and update this notice.

The work this week means we expect this to be totally seamless, but the only way to actually be sure is to try it.

If we still see any issues we'll do more on Sunday.

4 Oct 06:54:19
Upgrades starting shortly.
4 Oct 07:24:47
Almost perfect!

We loaded four routers, each at a different point in the network. We ran a ping that went through all four routers whilst doing this. For three of them we did see the ping drop a packet. For the fourth we did not see a drop at all.

This may sound good, but it should be better - we should not lose a single packet doing this. We're looking at the logs to work out why, and may try again Sunday morning.
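For anyone wanting to run the same sort of check, here is a minimal sketch of how dropped probes can be counted from ping sequence numbers. The function name and the sample data are made up for illustration; they are not taken from our test tooling.

```python
def count_lost(sent_count, received_seqs):
    """Return the sequence numbers of probes that never got a reply.

    Assumes probes were numbered 0 .. sent_count - 1.
    """
    received = set(received_seqs)
    return [seq for seq in range(sent_count) if seq not in received]


# Hypothetical run: 10 probes sent while a router reloads,
# and the reply to probe 4 never arrives.
replies = [0, 1, 2, 3, 5, 6, 7, 8, 9]
lost = count_lost(10, replies)
print(lost)                          # [4]
print(100.0 * len(lost) / 10)        # 10.0 (percent loss)
```

The target for a seamless upgrade is an empty list here for every router loaded.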

Thank you for your patience.

4 Oct 07:53:52
Plan for tomorrow is to pick one of the routers that did drop a ping, and shut it down and hold it without restarting - at that point we can investigate what is still routing via it and why. This should help us explain the dropped ping. Assuming that provides the clues we need we may load or reconfigure routers later on Sunday to fix it.
5 Oct 06:57:39
We are starting work shortly.
5 Oct 07:11:00
We are doing the upgrades as planned, but are not able to do the level of additional diagnostics we wanted. We may look into that next weekend.
Resolution Only 3 routers were upgraded, the 3rd having several seconds of issues. We will investigate the logs and schedule another planned work window. It seems that early morning work like this is less disruptive to customers.
Started 4 Oct
Closed 5 Oct 07:26:50
Previously expected 6 Oct

1 Oct 17:49:32
30 Sep 18:04:06
Having been very successful with the router upgrade tonight, we are looking to move to the next router on Wednesday. Signs so far are that this should be equally seamless. We are, however, taking this slowly, one step at a time, to be sure.
Resolution We loaded 4 routers in all. Some were almost seamless, and some had a few seconds of outage. It was not perfect, but way better than previously. We are now going to look into the logs in detail and try to understand what we do next.

Our goal here is zero packet loss for maintenance.

I'd like to thank all those on irc for their useful feedback during these tests.

Started 1 Oct 17:00:00
Closed 1 Oct 17:49:32
Previously expected 1 Oct 18:00:00

30 Sep 18:02:25
29 Sep 21:57:11
We are going to spend much of tomorrow trying to track down why things did not go smoothly tonight, and hope to have a solution by tomorrow (Tuesday) evening.

This time I hope to make a test load before the peak period at 6pm, so between 5pm and 6pm when there is a bit of a lull between business and home use.

If all goes to plan there will be NO impact at all, and that is what we hope. If so we will update three routers with increasing risk of impact, and abort if there are any issues.
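As a rough illustration of that "increasing risk, abort on any issue" approach, here is a minimal sketch. The router names and the `upgrade` and `healthy` callables are hypothetical placeholders, not our actual tooling.

```python
def staged_upgrade(routers, upgrade, healthy):
    """Upgrade routers in the order given, stopping at the first one with issues.

    Returns (routers upgraded cleanly, router that triggered the abort or None).
    """
    done = []
    for router in routers:
        upgrade(router)
        if not healthy(router):
            # Abort: routers later in the list are left untouched.
            return done, router
        done.append(router)
    return done, None


# Hypothetical run, ordered from lowest to highest risk of impact.
order = ["low-risk", "medium-risk", "high-risk"]
loaded = []
result = staged_upgrade(order, loaded.append, lambda r: r != "medium-risk")
print(result)   # (['low-risk'], 'medium-risk')
print(loaded)   # ['low-risk', 'medium-risk']
```

Ordering the list by risk means any problem shows up on the router where it does least harm.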

Please follow things on irc tomorrow.

If this works as planned we will finally have all routers under "seamless upgrade" processes.

30 Sep 08:29:42
Tests on our internal systems this morning confirm we understand what went wrong last night, and as such the upgrade tonight should be seamless.

For the technically minded, we had an issue with VRRP becoming master too soon, i.e. before all routes are installed. The routing logic is now linked to VRRP to avoid this scenario, regardless of how long routing takes.
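To illustrate the idea (and only the idea), here is a toy model of holding a router back from VRRP master until its routes are in place. It is not the router code: the class, the field names and the route counts are assumptions, and real VRRP has each router decide for itself from advertisements rather than using a central election function like this one.

```python
from dataclasses import dataclass


@dataclass
class Router:
    name: str
    priority: int            # VRRP priority; higher wins
    routes_expected: int     # hypothetical: routes held once fully converged
    routes_installed: int = 0

    def routes_ready(self) -> bool:
        return self.routes_installed >= self.routes_expected


def elect_master(routers):
    """Pick the highest-priority router among those whose routes are ready."""
    eligible = [r for r in routers if r.routes_ready()]
    if not eligible:
        return None
    return max(eligible, key=lambda r: r.priority)


# Router "a" has just reloaded and holds only part of its routes.
a = Router("a", priority=200, routes_expected=1000, routes_installed=100)
b = Router("b", priority=100, routes_expected=1000, routes_installed=1000)
print(elect_master([a, b]).name)   # b

# Once "a" has installed everything, it may take over as master.
a.routes_installed = 1000
print(elect_master([a, b]).name)   # a
```

The gate is on route readiness rather than on a fixed timer, so it does not matter how long routing takes to settle.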

Resolution The upgrade went very nearly perfectly on the first router - we believe the only noticeable impact was the link to our office, which we think we understand now. However, we did only do the one router this time.
Started 30 Sep 17:00:00
Closed 30 Sep 18:02:25
Previously expected 30 Sep 18:00:00

29 Sep 19:29:19
29 Sep 14:06:12
We expect to reload a router this evening, which is likely to cause a few seconds of routing issues. This is part of trying to address the blips caused by router upgrades, which are meant to be seamless.
29 Sep 18:48:37
The reload is expected shortly, and will be on two boxes at least. We are monitoring the effect of the changes we have made. They should be a big improvement.
Resolution Upgrade was tested only on one router (Maidenhead) and caused some nasty impact on routing to call servers and control systems - general DSL was unaffected. Changes are backed out now, and back to drawing board. Further PEW will be announced as necessary.
Started 29 Sep 17:00:00
Closed 29 Sep 19:29:19
Previously expected 29 Sep 23:00:00

29 Sep 16:57:23
2 Sep 17:15:50
We had a blip on one of the LNSs yesterday, so we are looking to roll out some updates over this week which should help address this, and some of the other issues last month. As usual, LNS upgrades would be overnight. We'll be rolling out to some of the other routers first, which may mean a few seconds of routing changes.
7 Sep 09:43:40
Upgrades are going well, but we are taking this slowly, and have not touched the LNSs yet. Addressing stability issues is always tricky as it can be weeks or months before we know we have actually fixed the problems. So far we have managed to identify some specific issues that we have been able to fix. We obviously have to be very careful to ensure these "fixes" do not impact normal service in any way. As such I have extended this PEW another week.
13 Sep 11:07:13
We are making significant progress on this. Two upgrades are expected today (Saturday 13th) which should not have any impact. We are also working on ways to make upgrades properly seamless (which is often the case, but not always).
14 Sep 17:21:35
Over the weekend we have done a number of tests, and we have managed to identify specific issues and put fixes in place on some of the routers on the network to see how they go.

This did lead to some blips (around 9am and 5pm on Sunday for example). We think we have a clearer idea on what happened with these too, and so we expect that we will load some new code early tomorrow or late tonight which may mean another brief blip. This should allow us to be much more seamless in future.

Later in the week we expect to roll out code to more routers.

16 Sep 16:57:07
We really think we have this sussed now - including reloads that have near zero impact on customers. We have a couple more loads to do this week (including one at 5pm today), and some overnight rolling LNS updates.
17 Sep 12:23:59
The new release is now out, and we are planning upgrades this evening (from 5pm) and one of the LNSs overnight. This should be pretty seamless now. At the end of the month we'll upgrade the second half of the core routers, assuming all goes well. Thank you for your patience.
18 Sep 17:15:27
FYI, there were a couple of issues with core routers today, at least one of which would have impacted internet routing for some destinations for several seconds. These issues were on the routers which have not yet been upgraded, which is rather encouraging. We are, of course, monitoring the situation carefully. The plan is still to upgrade the second half of the routers at the end of the month.
19 Sep 12:12:42
One of our LNSs (d.gormless) did restart unexpectedly this morning - this router is scheduled to be upgraded tonight.
28 Sep 13:25:10
The new release has been very stable for the last week and is being loaded on the remaining routers during Sunday.
Resolution Stable releases loaded at weekend
Started 2 Sep 18:00:00
Closed 29 Sep 16:57:23
Previously expected 19 Sep