Why downtime?
If you have been trying to access it's learning for the last few hours, you have probably caught a glimpse of the following message:
A few of you might be wondering what exactly goes on behind the scenes when we take down it's learning and start upgrading. So for the most curious of you, here's a simplified list of what our eager staff of system engineers and developers (I count eight of us here right now!) are doing:
- Backup. Every update starts with a full backup of all customer data. This typically takes a couple of hours, including verifying that the backup has run properly on every customer database.
- Patching. Before we start reinstalling we make sure that every server is patched properly.
- Databases are updated with the changes necessary for the new version of our application.
- Optimization scripts are run on each database (new indexes are created, obsolete data is removed, etc.).
- The core application is installed on all of our web-servers.
- Connected applications are upgraded (like the exam, mobile, community and import applications).
- Backup verification. We make sure that all backups are running properly after the upgrade is finished.
- Documentation. All of our configuration documentation is updated to make sure it reflects our new data center configuration.
- Testing. A crew of testers makes sure that the application is installed properly before customers are let back onto the servers.
- We let you back in. And funnily enough, even at four in the morning, hundreds of users start logging in :-)
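For the developers among you, the sequence above can be sketched as a simple ordered pipeline. This is only an illustration, not the actual deployment tooling: the step names and the `run_upgrade` function are hypothetical, and every step is a placeholder that just reports success. The one idea it shows is that the steps run strictly in order and that a failure stops everything after it.

```python
from typing import Callable, List, Tuple

# A step is a (name, action) pair; the action returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_upgrade(steps: List[Step]) -> List[str]:
    """Run the steps in order and stop at the first one that fails.

    Returns the names of the steps that completed successfully.
    """
    completed: List[str] = []
    for name, action in steps:
        if not action():
            break  # later steps must not run on top of a failed one
        completed.append(name)
    return completed


def placeholder() -> bool:
    """Stand-in for the real work, which is not shown here."""
    return True


# Hypothetical names mirroring the list above.
UPGRADE_STEPS: List[Step] = [
    ("backup", placeholder),
    ("patch servers", placeholder),
    ("update databases", placeholder),
    ("run optimization scripts", placeholder),
    ("install core application", placeholder),
    ("upgrade connected applications", placeholder),
    ("verify backups", placeholder),
    ("update documentation", placeholder),
    ("test installation", placeholder),
    ("reopen to users", placeholder),
]

if __name__ == "__main__":
    for done in run_upgrade(UPGRADE_STEPS):
        print("done:", done)
```

Putting the backup first and the reopening last is what matters here: if any step in between fails, users stay locked out and the backup is still available.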
2 comments:
Very interesting and useful to know - I have often wondered why you (and other vendors) need so much time during an upgrade, but now I understand better. Thanks for the insight!
I probably shouldn't ask in public, but all customers know that something actually went terribly wrong with your update (or after the first hours of usage on Sept. 7), and I guess many of us are curious to get a hint at what was actually failing? :-)
Hi Sven Andreas,
I promise to come back shortly with a detailed post on why we ended up with downtime and had to temporarily roll back to the previous version. It was related to the implementation of a few new security enhancements in 3.2 that put a bit of a strain on a central resource in our data center. It is now fixed :-)