25 Jun 06:28:18
Details
25 Jun 05:39:05
The minor change made last night is still causing issues, with some servers being slow, and this is currently impacting VoIP. This is still being worked on.
Update
25 Jun 06:07:24
VoIP services starting to work properly again, still not 100% right.
Update
25 Jun 06:15:04
VoIP looking a lot healthier now. Still working on tracking down the poor performance issues.
Update
25 Jun 08:35:02
For more details on what happened, please see https://www.facebook.com/AAISP/posts/673421826086051
Resolution We'll be monitoring this closely during the day.
Started 25 Jun 05:38:24
Closed 25 Jun 06:28:18

22 Feb 13:00:00
Details
22 Feb 08:03:44
A disk server has failed, and it is impacting email and all web sites we host. Engineers are working on this now.
Update
22 Feb 10:08:38
There is a major issue with one of the disk servers, and we are planning to switch to a backup, but that is likely to involve an engineer visit to the data centre.
Update
22 Feb 10:16:44
Engineer is on his way to the data centre now.
Update
22 Feb 11:15:40
This is looking more complex than expected - we have switched the secondary controller, but there are issues with one of the disk arrays as well. Engineer still on site.
Update
22 Feb 11:17:48
Disk array is rebuilding now. We should have email working shortly and then web pages once the disk array rebuilds.
Update
22 Feb 11:57:48
Web space up, and mail servers being reconnected to disk array now.
Update
22 Feb 12:11:56
Issues with web pages again, investigating.
Update
22 Feb 12:14:17
The secondary disk server is now showing problems too. We are working on it.
Update
22 Feb 12:32:03
This is proving to be quite a serious issue - we appear to have issues with two separate disk controllers and with some of the RAID disks and with the file system on one of the disks. This is a very odd multiple failure, especially given that all of this is monitored constantly and was not showing any issues yesterday. We do have daily backups, so if all else fails there are ways to get service restored with backups and some loss of recent emails or changes. At this stage we are working to repair the failed file systems before considering that move.
Update
22 Feb 12:35:22
It looks like we have the mail file store repaired and mail should be back on line shortly.
Update
22 Feb 12:38:27
Web pages back.
Update
22 Feb 12:40:20
Incoming email should now be working again.
Update
22 Feb 12:44:44
We are checking all mail and web servers now to confirm all is well again.
Resolution Obviously this sort of multiple failure is somewhat unexpected. We do have plans for new disk servers anyway, and this type of failure will be considered as part of that system design.
Started 22 Feb 00:39:00
Closed 22 Feb 13:00:00
Previously expected 22 Feb 13:00:00

25 Sep 2013 16:20:00
Details
25 Sep 2013 16:09:44
We're currently investigating a problem with the disk storage server that runs our email and web space.
Update
25 Sep 2013 16:18:26
The server is being restarted at the moment.
Update
25 Sep 2013 16:20:41
The server is back on line; we are doing post-boot checks etc. to bring the services back online.
Update
25 Sep 2013 16:20:58
Web services are now back online.
Started 25 Sep 2013 16:00:00
Closed 25 Sep 2013 16:20:00

17 Feb 2013 07:57:04
Details
17 Feb 2013 06:51:33

Our web site and hosted customer web sites are not responding. We are looking into this.

Update
17 Feb 2013 07:32:53

This issue is the same as the one affecting email, and is related to a disk server.

Update
17 Feb 2013 07:43:06

We are working on the disk server now.

Resolution

Web sites now working.

Started 17 Feb 2013 06:22:30
Closed 17 Feb 2013 07:57:04

11 Feb 2013 12:53:12
Details
11 Feb 2013 12:40:11

We've just had a blip causing ADSL lines to drop off and reconnect; this also affected routing for about a minute.

More info to follow shortly.

Update
11 Feb 2013 12:53:12

The issue affected both BGP links and LNS traffic, causing lines to reconnect. Not all lines and services were affected. The lines have reconnected quickly, and the problem seems to have cleared without any need for intervention from staff.

We are investigating the cause - this looks to be something external that has triggered this. We are trying to find exactly how this has happened so we can avoid the issue in future.

General Users Affected 33%
Closed 11 Feb 2013 12:53:12

26 Jan 2013 04:59:58
Details
26 Jan 2013 04:56:46

We have lost access to everything in HEX 6/7 which means L2TP only customers are off line. This also affects some of our general servers. This is being investigated now.

Update
26 Jan 2013 04:57:41

The issue is a result of a port/link failure earlier. We are re-configuring access via an alternative route now.

Resolution

Access has been restored

Started 26 Jan 2013 04:31:00
Closed 26 Jan 2013 04:59:58

09 Apr 2012 17:00:43
Details
09 Apr 2012 16:23:00

Our main web pages and customer web pages are currently inaccessible or very slow due to a denial of service attack. We are taking steps to address this. Sorry for any inconvenience.

Update
09 Apr 2012 16:26:34

This appears to be an attack on one web site, which is being moved.

Update
09 Apr 2012 16:31:17

Load is starting to move to the backup server now.

Update
09 Apr 2012 16:44:32

We have moved a lot of the load over, and adjusted TCP settings to try to ensure web sites are working, albeit slightly slowly.

Update
09 Apr 2012 16:59:12

Web access is much more responsive now.

Started 09 Apr 2012 14:36:19
Closed 09 Apr 2012 17:00:43

14 Feb 2012 13:06:19
Details
13 Feb 2012 12:37:17
We currently have a DNS problem with our main domains aa.net.uk and aaisp.net.uk - this will cause a problem with various services that we run.
Update
13 Feb 2012 12:45:36

This was a simple planned change. However, it seems we have been caught out by BIND refusing to reload if any one zone has a syntax error. One of our customer zones had a typo in it which our tools had not managed to catch, and that caused BIND to refuse to load.

Unfortunately it took quite a few minutes to find what was wrong as this was nothing to do with the changes we actually made.
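
As an illustration only, a pre-reload check of the kind that would catch this could look something like the sketch below. It is not the tooling actually in use: the zone names, file paths and function names are hypothetical, and it assumes BIND's `named-checkzone` utility is available on the PATH. The checker is passed in as a parameter so the logic can be exercised without BIND installed.

```python
import subprocess
from typing import Callable, Dict, List


def check_zone_with_named_checkzone(zone: str, path: str) -> bool:
    """Return True if named-checkzone exits with status 0 for this zone file."""
    result = subprocess.run(
        ["named-checkzone", zone, path],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


def find_bad_zones(
    zones: Dict[str, str],
    checker: Callable[[str, str], bool] = check_zone_with_named_checkzone,
) -> List[str]:
    """Return the sorted names of all zones whose files fail the check."""
    return sorted(zone for zone, path in zones.items() if not checker(zone, path))


def safe_to_reload(
    zones: Dict[str, str],
    checker: Callable[[str, str], bool] = check_zone_with_named_checkzone,
) -> bool:
    """Only allow a reload when every zone, not just the edited ones, passes."""
    return not find_bad_zones(zones, checker)
```

The point of checking every zone rather than only the ones just edited is that a single unrelated typo can otherwise block the reload.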

Update
13 Feb 2012 12:55:05

Most things are OK.

Update
13 Feb 2012 13:13:38

Both authoritative servers are serving the aaisp.net.uk and aa.net.uk zones correctly.

Update
13 Feb 2012 13:39:09

There are still problems with the aaisp.net.uk domain - we're working on this.

VoIP is currently affected too, this is being worked on now.

Update
13 Feb 2012 13:56:00
We are running into slightly unexpected errors as well, and working through them. Some things that we have not touched are not working, which kind of makes no sense.
Update
13 Feb 2012 14:43:27

Whilst most things are working, at least from customer lines, there was an issue which caught us out, but we were way too far into the process to sensibly back out. The top level delegation for aaisp.net.uk was going to a special DNS server, which meant that when we moved everything it stopped working properly.

Update
13 Feb 2012 14:46:17

Some incoming VoIP is not working, we are still working on this.

Update
13 Feb 2012 15:09:53

Still lots of progress mopping up things. Mostly non customer affecting.

Just to clarify - you should be able to use aaisp.net.uk or aa.net.uk as they are interchangeable. However, right now, some places are having some issues seeing some of the aaisp.net.uk sub domains.

The preferred version is now aa.net.uk and we should be quoting that everywhere now.

Using aa.net.uk is also a work around for the issues some people are seeing right now, which are down to DNS caches.

Update
13 Feb 2012 16:18:42

At the moment we still have some services affected. These are: some incoming VoIP, and accessing our services from outside our network (i.e. customers using DNS resolvers other than ours).

Update
13 Feb 2012 17:38:56

We really think this should be sorted now - but monitoring carefully.

Update
13 Feb 2012 18:25:30

More details on http://aa.net.uk/news-2012-02-dns.html

Update
14 Feb 2012 13:06:34

We think all is OK now, so closing this incident.

Started 13 Feb 2012 12:15:49
Closed 14 Feb 2012 13:06:19

22 Jan 2012 08:32:57
Details
22 Jan 2012 07:57:48

It looks like we have a major issue with mail and web services.

This is being investigated now.

Update
22 Jan 2012 08:04:01

This looks like an issue with the main storage array used by web and email. Being worked on now.

Update
22 Jan 2012 08:31:56

Looks like we have it working again - web pages are fine - email being a bit sluggish catching up.

Resolution

The disk server has been restarted to clear the problem. The underlying cause of the problem is being investigated.

Started 22 Jan 2012 06:45:00
Closed 22 Jan 2012 08:32:57

11 Nov 2010 11:59:24
Details
11 Nov 2010 11:53:30

Routing blipped briefly - we're investigating.

Update
11 Nov 2010 12:00:39

Routing is back now.

Total downtime was from about 11:45 to 11:53

We're investigating the cause of this problem.

Started 11 Nov 2010 11:50:28
Closed 11 Nov 2010 11:59:24

26 Oct 2010 23:06:25
Details
26 Oct 2010 20:57:35

We are seeing problems with equipment located in our HEX datacentre, most likely power related.

This will be affecting A&A webspace, customer L2TP connections, beta tester data SIM connections and some customer equipment which is hosted there.

We have been in contact with the data centre staff, who are investigating it.

Update
26 Oct 2010 21:32:34

The problem doesn't look like power; the core routers we have there appear to be rebooting.

Power cycling has restored the service to one of the routers, so access to servers hosted there is working again now. However the second router remains offline and is still causing problems with data SIM testers.

We are working to resolve that problem too.

Update
26 Oct 2010 21:33:04

The main issues have been resolved now - it does not look like power, but we are trying to find why two separate routers developed problems at the same time in HEX. Data SIMs will be affected still.

Update
26 Oct 2010 22:09:33

We have re-routed the data SIMs and set up backup for future use anyway.

Resolution

Both routers working now.

Started 26 Oct 2010 20:38:00
Closed 26 Oct 2010 23:06:25

22 Sep 2010 09:30:00
Details
22 Sep 2010 08:50:08

Investigating now - if this is the same as we had at the weekend we should be able to sort it quite quickly.

Update
22 Sep 2010 08:58:53

We hope to have this sorted in a few minutes.

Update
22 Sep 2010 09:03:14

This is impacting some VoIP services but not all.

Update
22 Sep 2010 09:11:09

There will be a slight blip on broadband while we sort this.

Update
22 Sep 2010 09:13:04

This looks like some issue with routing through LINX. We may take down the route collector peering until we are happy we have identified the cause.

Update
22 Sep 2010 09:16:25

Still seeing some issues.

Update
22 Sep 2010 09:21:03

Equipment reboot worked briefly and then the problem recurred. It seems clear this is a routing issue with a peer that is causing a black hole. We do not understand exactly where or how yet and this is being addressed.

Update
22 Sep 2010 09:35:38

We have taken down LINX route server peering and things are looking a lot better - checking things now.

Update
22 Sep 2010 09:46:39

It may be worth explaining this a little. We have dual redundant equipment to allow for failures. If something fails completely, or can be turned off, then the systems re-route to use other equipment. Depending on where such issues are, this can mean no outage, a few seconds or a few minutes.

However, if there is a partial failure, such as a single black-hole route for the link to Maidenhead, then this is not an equipment failure. The other routers get that route and expect it to be valid. This can create complex problems that are hard to diagnose, and can also mean we have to use various alternative means to access systems, which causes delays.
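
The difference between the two cases can be shown with a toy model. This is a simplified sketch and not how any real router or BGP implementation works: the `Route` fields, the preference values and the router names are all hypothetical. A route that is withdrawn drops out of selection, so traffic fails over; a route that is still advertised but silently drops packets keeps being chosen.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Route:
    next_hop: str
    preference: int   # higher value wins (hypothetical metric)
    advertised: bool  # still being announced to the other routers
    forwards: bool    # actually delivers packets sent via it


def select_route(routes: List[Route]) -> Optional[Route]:
    """Pick the most preferred route among those still advertised."""
    candidates = [r for r in routes if r.advertised]
    return max(candidates, key=lambda r: r.preference, default=None)


def delivers(routes: List[Route]) -> bool:
    """True if traffic sent via the selected route actually arrives."""
    best = select_route(routes)
    return best is not None and best.forwards


backup = Route("router-b", preference=50, advertised=True, forwards=True)

# Complete failure: the primary route is withdrawn, so the backup is selected.
complete_failure = [Route("router-a", 100, advertised=False, forwards=False), backup]

# Partial failure: the primary is still advertised but drops traffic (a black hole).
black_hole = [Route("router-a", 100, advertised=True, forwards=False), backup]
```

In this model `delivers(complete_failure)` is true because selection falls through to the backup, while `delivers(black_hole)` is false even though a working backup exists.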

Resolution

I would stress that just because taking down the LINX route server peering seems to have addressed the issue, it does not mean there is an issue with LINX. This could be something odd with our routers, or the LINX route server, or a peer via that route server feeding something odd to us as a route. We're trying to identify what has happened, but for now we'll leave the route server peering shut down until we know.

Started 22 Sep 2010 08:37:00
Closed 22 Sep 2010 09:30:00

21 Jul 2010 15:06:47
Details
21 Jul 2010 14:26:21

There is a general issue with packet loss at the moment.

It looks like it might be related to transit, and we're investigating.

Update
21 Jul 2010 14:42:46

Still investigating. Looks like a problem with peering.

Something's recovering - traffic levels are returning to normal.

Update
21 Jul 2010 14:43:52

Looks like a power failure in Telehouse North, which would have affected a lot of peering, and some transit.

Update
21 Jul 2010 15:07:17

It looks like everything is back to normal for us.

Started 21 Jul 2010 14:23:47
Closed 21 Jul 2010 15:06:47

27 May 2010 20:20:26
Details
27 May 2010 19:29:03

We have been trying to resolve this properly and have run into a further snag, which means again we have no routing to servers in HEX. Broadband is unaffected as before, but many services like control pages, hosted servers, our web site, accounts, and so on are all off-line while we resolve this.

Update
27 May 2010 19:44:13

We are waiting for someone on-site at present.

Update
27 May 2010 20:20:26

Ok, working again.

Resolution

Trying to find the exact problem for a permanent fix.

Started 27 May 2010 19:27:00
Closed 27 May 2010 20:20:26

27 May 2010 17:26:33
Details
27 May 2010 17:13:39

We have a major issue with a firewall/router in HEX at present which will affect access to web pages, control pages, accounts system, and various systems. We hope to have this resolved ASAP.

Resolution

Routing has been reworked - some smaller issues remain.

Started 27 May 2010 17:16:37
Closed 27 May 2010 17:26:33

08 Apr 2010 08:30:52
Details
08 Apr 2010 01:35:23

We seem to have lost access to machines in Harbour Exchange Square.

This includes one of our key machines for RADIUS authentication for broadband lines (clueless) and our accounts server (priceless), customer web servers (Limitless) and a few hosted machines.

We are trying to identify the cause of the problem now.

Update
08 Apr 2010 01:46:48

We have managed to confirm it is not a power issue

Update
08 Apr 2010 01:56:38

We are getting someone to check the rack in HEX now.

The backup RADIUS and DNS servers are working as they should.

Update
08 Apr 2010 02:05:02

This appears to be an issue with our main firewall in HEX. It is firewalling rather too well all of a sudden.

Update
08 Apr 2010 02:12:13

We are just waiting on a reboot now. We believe we have found the cause of the issue though so can take some preventative measures once the reboot is complete.

Update
08 Apr 2010 02:37:29

Reboot complete - all working

Update
08 Apr 2010 08:30:52

One slight side effect - session tracking timed out as the RADIUS server could not see the LNS. This meant PPP restarts for most lines at some point during the night, and a few may do so during the day (on the hour). Until then usage is not being metered for those lines.

Resolution

We will be applying an update to the firewall shortly

Started 08 Apr 2010 00:59:10
Closed 08 Apr 2010 08:30:52