10-03-2013, 10:17 AM | #1
Chairman & Administrator
Join Date: Dec 2004
Location: 1975
Posts: 107,070
Good morning
Firstly, let me apologise for the extended server outage that we have experienced this morning, but it's the result of a comedy of errors.

By way of background, a ticket was originally raised with our ISP to upgrade the RAM in the server from 8GB to 16GB, for action between 0100-0400 yesterday morning. This was completed successfully within the agreed change window at about 0100 AEDST. There was a minor outage of <10 mins at about 0200 but the server came back up.

The server was then down from 0400-0900 yesterday, which wasn't noticed until 0830, at which time a request for reboot was lodged and the reboot performed. I ran some simple diagnostics and checked error logs but could find no cause. 0400 is the time when the backup and several housekeeping tasks are performed and server load tends to be very high.

The server crashed again at 2330 last night. I contacted the support team, advised that the server was down again and said that I wanted diagnostics run when it was restarted to find the root cause of the problem. I received the following email from support at 00:04 this morning:

"Hello, I apologize for the issues you are having since the RAM upgrade. We can check the hardware as requested but we will need to bring the server offline for 1-2 hours. We can do this 24/7, just reply back letting us know what time you would like us to proceed."

This email came from their Dedicated Customer Care so I replied to say that, as it was 0100 here (when I saw the email), it was fine to proceed as this was out of hours. Stupidly, despite the email saying "just reply back", I have since noted that it came from a no-reply email address and I was supposed to go into the ticket and update it.

I checked the server at 0200 before heading to bed and it was still down. Anyway, as at 0700 this morning when I got up the server was still down, nothing had been done and it hadn't even been rebooted.
I contacted support and spoke to an agent who advised that because the ticket had been reopened and I hadn't replied to the message above, nothing had been done. To quote:

"I don't think the tech was aware that it was already down. I will reopen it and add that information"

It's like Keystone Cops. At the very least the server should have been rebooted; it would probably still be up and we could have scheduled the maintenance window in non-production hours.

As of 0800 local it was still down. I again contacted the support team, to be advised that they were running diagnostics on new RAM after the original upgrade RAM failed the test. This was despite my request that the diagnostics not be performed so that we could do it in a planned maintenance window out of normal hours.

I'm happy to take responsibility for not noticing that the email was from a no-reply address but, in this era, it isn't technically hard to route email to agents and we should be well past using no-reply addresses. There is not a lot of point asking the question in the email if you don't want a response via the same channel.

I have escalated the issue to the support manager, as to do nothing for 7.5 hours and make no further contact is unacceptable.

Hopefully, the server will return to a stable state now that the new memory has been tested, and we shouldn't experience any further issues.

Cheers
Russ
__________________
Observatio Facta Rotae
13 users like this post: 4stanger, csv8, danmc, DJR-351, I6DOHC, Jason[98.EL], Mr G8, naddis01, Silver Ghia, supershifty, WMD351, XRGhia, yakcam