Main website down 13th January 2013


This topic contains 2 replies, has 0 voices, and was last updated by  Hywel Phillips 11 years, 3 months ago.

Viewing 3 posts - 1 through 3 (of 3 total)
  • #10256

    Hi All,

    Bugger. Main website is currently down. No idea why: the server seems to be running, but Apache isn’t, so pages aren’t getting served. Tech support is on it. Hopefully normal service will resume ASAP.

    The current level of downtime is clearly unacceptable. I am demanding an explanation from our web hosts and will ensure the situation improves. The first step is to install better monitoring, so that if the web server software or MySQL crashes, it is spotted automatically and gets fixed even if it happens in the middle of the night my time. This one seems to have done exactly that, which is why it was 6 hours before I got onto it. The second step is to move to a new server, which has been on the cards for a while. We put that off because we installed more disk space and our web hosts said they were getting some new machines in the new year, i.e. now. I will see if we can move that forward, as a more powerful machine serving the site may be more robust.

    Apologies. I will resolve this situation and get us back to the 99.9%+ uptime we expect and require.

    I have also just initiated external monitoring of the site, with alerts to me, so I can monitor the situation more closely and respond more quickly in future.

    Hywel.
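
As a rough illustration of the "better monitoring" idea in the post above, the sketch below checks whether the web server and MySQL are still accepting TCP connections. It is not the hosts' actual monitoring setup: the function names, the `SERVICES` table and the ports (80 and 3306, the usual defaults) are assumptions made for the example.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_services(host, services, probe=port_is_open):
    """Return the names of services whose port does not accept connections.

    `services` maps a label to a port number. `probe` is injectable so the
    logic can be exercised without touching the network.
    """
    return [name for name, port in services.items() if not probe(host, port)]


# Assumed default ports; a real server may use different ones.
SERVICES = {"apache": 80, "mysql": 3306}

# Example use (run from cron or similar on the server itself):
#   down = check_services("127.0.0.1", SERVICES)
#   if down:
#       print("Not responding:", ", ".join(down))
```

A port check like this only shows that a process is listening, not that pages are being served correctly, so it complements rather than replaces an HTTP-level check.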

    #16979

    The site should now be back up.

    This outage and the one before Xmas constitute an unacceptable level of downtime. We are investigating with the help of tech support to figure out not only why the outages happened (last time it was due to the MySQL server process crashing; this time it looks like it might have been the Apache process crashing) but, more importantly, why this was not automatically flagged by the monitoring at the data centre.

    We clearly need to improve this. So in addition to investigating what happened and improving the internal monitoring, I have just signed up for an external monitoring service which checks the site every minute and sends me alerts by text if the site goes down. I will monitor events very closely for the next few weeks.

    I apologise for the outage. This time we were off air for nearly 12 hours, which is not acceptable. I am working to make sure it doesn’t happen again.

    Hywel.
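
The external monitoring described above (a check every minute, with an alert when the site goes down) can be sketched roughly as follows. This is a minimal illustration and not the service that was actually signed up for: `site_is_up`, `monitor` and the `alert` callback are hypothetical names, and sending a text message is left to whatever `alert` is wired to.

```python
import time
import urllib.error
import urllib.request


def site_is_up(url, timeout=10.0):
    """Return True if the URL answers with a 2xx/3xx status within timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        # HTTP error statuses, refused connections and timeouts all count as down.
        return False


def monitor(url, alert, check=site_is_up, interval=60,
            max_checks=None, sleep=time.sleep):
    """Poll `url` every `interval` seconds and call `alert` on state changes.

    Alerts fire only when the site goes from up to down or back again, so a
    long outage produces one message rather than one per minute.
    """
    was_up = True
    checks_done = 0
    while max_checks is None or checks_done < max_checks:
        up = check(url)
        if was_up and not up:
            alert(f"DOWN: {url}")
        elif not was_up and up:
            alert(f"RECOVERED: {url}")
        was_up = up
        checks_done += 1
        if max_checks is None or checks_done < max_checks:
            sleep(interval)
```

Running the check from a machine outside the data centre matters here: a monitor hosted on the same server would go down along with the site.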

    #16980

    It looks like support have identified the problem: a bug in some of the monitoring and stats programs. We’ve been running with that process turned off for the last few days and the site has had 100% uptime so far.

    We may need to run some more tests and see if we can install patched versions of the rogue process, but it looks like the issue is at least identified and a workaround is in place, so the site should be back to 99.9%+ uptime.

    Cheers, Hywel.
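
To put the 99.9%+ uptime target in context, the arithmetic can be sketched as below. The helper names and the choice of a 365-day window are assumptions for illustration; the thread does not say over what period uptime is measured.

```python
def allowed_downtime_hours(uptime_percent, days=365):
    """Hours of downtime permitted over `days` at the given uptime target."""
    return (1 - uptime_percent / 100) * days * 24


def uptime_percent(downtime_hours, days=365):
    """Uptime percentage over `days` given total hours of downtime."""
    return 100 * (1 - downtime_hours / (days * 24))


# A 99.9% target over a year leaves a budget of roughly 8.76 hours.
budget = allowed_downtime_hours(99.9)

# A single outage of about 12 hours already exceeds that yearly budget.
after_outage = uptime_percent(12)
print(f"budget: {budget:.2f} h, uptime after a 12 h outage: {after_outage:.3f}%")
```

On these assumptions, the outage of nearly 12 hours mentioned earlier in the thread would by itself use up more than a full year's allowance at 99.9%.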

