Thursday, September 06, 2007

Why downtime?

So if you have been trying to access it's learning for the last few hours, you have probably caught a glimpse of the following message:

A few of you might be wondering what exactly goes on behind the scenes when we take down it's learning and start upgrading. So for the most curious of you, here's a simplified list of what our eager staff of system engineers and developers (I count eight of us here right now!) are doing:

  1. Backup. Every update starts with a full backup of all customer data. This typically takes a couple of hours, including verifying that the backup has run properly on every customer database.
  2. Patching. Before we start reinstalling we make sure that every server is patched properly.
  3. Databases are updated with the necessary changes for the new version of our application.
  4. Optimization scripts are run on each database (new indexes are added, obsolete data is removed, etc.).
  5. The core application is installed on all of our web-servers.
  6. Connected applications are upgraded (like the exam, mobile, community and import applications).
  7. Backup verification. We make sure that all backups are running properly after the upgrade is finished.
  8. Documentation. All of our configuration documentation is updated so that it reflects our new data center configuration.
  9. Testing. A crew of testers makes sure that the application is installed properly before customers are let back onto the servers.
  10. We let you back in. And funnily enough, even at four in the morning, hundreds of users start logging in :-)
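
For the programmers among you, the list above boils down to an ordered checklist where each step has to succeed before the next one starts. Here is a minimal sketch of that idea in Python. It is not our actual tooling: the function names (`run_upgrade`, `build_demo_steps`) and the step labels are made up for illustration, and each step is just a placeholder that reports success or failure.

```python
from typing import Callable, List, Tuple

# A step is a (name, action) pair; the action returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_upgrade(steps: List[Step]) -> List[str]:
    """Run the steps in order and stop at the first one that fails.

    Returns the names of the steps that completed successfully.
    """
    completed: List[str] = []
    for name, action in steps:
        if not action():
            break  # do not continue past a failed step
        completed.append(name)
    return completed


def build_demo_steps(failing: str = "") -> List[Step]:
    """Build placeholder steps mirroring the list above (hypothetical labels).

    The step whose name equals `failing` reports failure; all others succeed.
    """
    names = [
        "backup", "patching", "database update", "optimization",
        "core install", "connected apps", "backup verification",
        "documentation", "testing", "reopen",
    ]
    # n=n binds the current name to each lambda.
    return [(n, (lambda n=n: n != failing)) for n in names]


if __name__ == "__main__":
    print(run_upgrade(build_demo_steps()))
    print(run_upgrade(build_demo_steps(failing="testing")))
```

The point of the design is the early stop: if testing fails, the "reopen" step never runs, so customers are not let back onto a half-upgraded system.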

2 comments:

Svend Andreas Horgen said...

Very interesting and useful to know - I have often wondered why you (and other vendors) need so much time during an upgrade, but now I understand better. Thanks for the insight!

I probably shouldn't ask in public, but all customers know that something actually went terribly wrong with your update (or after the first hours of usage on Sept. 7) - and I guess many of us are curious to get a hint at what was actually failing? :-)

jab said...

Hi Svend Andreas,

I promise to come back shortly with a detailed post on why we ended up with downtime and had to temporarily roll back to the previous version. It was related to the implementation of a few new security enhancements in 3.2 that put a bit of a strain on a central resource in our data center. It is now fixed :-)