Wednesday, September 12, 2007

Downtime, part 2

My last post touched on some of the reasons why we scheduled planned downtime when updating it's learning to version 3.2 on Thursday evening. Unfortunately, the following day we got some unplanned downtime and had to temporarily roll most of our customers back to version 3.1 while we tracked down the bug. Fortunately, we quickly found a resolution to the problem, but it is a textbook example of how easy it is to mess up your performance in a large data center.

To explain what happened, I have to start off with the basics of how our hosting environment works (simplified version!):



1. Content switches. This is the entry point for any request to our web servers. The primary function of the content switches is to route you to a pool of web servers based on which it's learning "site" you belong to (our customers are divided into 4-5 different pools of web servers at the moment; maybe a subject worth blogging about at a later time). The content switches also terminate HTTPS traffic, load-balance the web servers, and cache static files.



2. Web server(s). Every pool consists of 5-8 web servers. This is where the actual application is installed. A request is assigned to a server based on the load on the servers in the pool, so every page you click inside it's learning could be served by a different server.



3. Session state server. HTTP is a stateless protocol. Every request from your browser is opened and closed on its own, so the server does not remember you from one request to the next. To keep track of who you are, a session ID is created. This session ID is stored in a cookie on your computer and on the server. Since we have a lot of web servers and you can be assigned a different server between requests, all sessions are stored on the session state server. When you access http://www.itslearning.com/ you are assigned a session on the session state server. This session will continue to live on our session state server until 20 minutes after you close your browser. So with the amount of traffic we receive, new sessions are created and expire every second.


4. Database server. This is where the customer databases are stored. Depending on the size of the customer, there could be one, two or a heap of customers residing on one database server. It's learning is a very database-dependent application, and the amount of traffic makes it important to have finely tuned database servers.


5. File server. The file server(s) act as a client for the SAN where all the files uploaded into it's learning are stored. The file servers are directly connected with dual fiber cards to a very, very expensive hard drive.
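
To make the session state server in point 3 a little more concrete, here is a minimal sketch in Python. It is not how our hosting environment is implemented; the class `SessionStateServer`, the function `handle_request`, and the in-memory dictionary are all made up for illustration. It only shows the idea: any web server in a pool can handle a request because the session lives in one shared place, and a session disappears after 20 idle minutes.

```python
import uuid

SESSION_TIMEOUT_SECONDS = 20 * 60  # the 20-minute lifetime mentioned in point 3


class SessionStateServer:
    """Toy in-memory stand-in for a shared session state server (hypothetical)."""

    def __init__(self, timeout=SESSION_TIMEOUT_SECONDS):
        self.timeout = timeout
        self._last_seen = {}  # session ID -> time of the most recent request

    def touch(self, session_id, now):
        """Return a valid session ID for this request, creating one if needed."""
        self.expire(now)
        if session_id not in self._last_seen:
            # Missing, unknown, or expired ID: start a new session.
            session_id = uuid.uuid4().hex
        self._last_seen[session_id] = now
        return session_id

    def expire(self, now):
        """Drop every session that has been idle for longer than the timeout."""
        stale = [sid for sid, seen in self._last_seen.items()
                 if now - seen > self.timeout]
        for sid in stale:
            del self._last_seen[sid]

    def active_sessions(self):
        """Number of sessions currently held in memory."""
        return len(self._last_seen)


def handle_request(store, cookie_session_id, now):
    """What any web server in the pool would do: ask the shared store."""
    return store.touch(cookie_session_id, now)
```

The first request arrives without a cookie and gets a fresh session ID. Later requests send that ID back, so it does not matter which web server picks them up, and the number of stored sessions stays at one per user.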


So what happened? The problem came with a new security measure introduced to it's learning. You can now only access files and similar from a separate domain (files.itslearning.com). What we didn't realize was that the implementation created a new session on our session state server for every file that was opened by a user in it's learning. This simply was too much for the session state server, and it froze. We ended up with one of these guys on our servers:


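As a rough illustration of why one session per opened file is so much worse than one session per user, here is a small Python sketch. The function `sessions_created` and the numbers fed into it (1000 users, 30 file opens each) are invented for the example and are not measurements from our servers. The sketch also does not model why the new sessions were created; it simply assumes that each file request fails to reuse the login session.

```python
def sessions_created(users, file_opens_per_user, new_session_per_file):
    """Count sessions held by a toy session server (hypothetical model)."""
    next_id = 0
    live_sessions = set()
    for _ in range(users):
        # One session is created when the user first arrives.
        login_session = next_id
        next_id += 1
        live_sessions.add(login_session)
        for _ in range(file_opens_per_user):
            if new_session_per_file:
                # Assumed faulty behaviour: every file open stores a new session.
                live_sessions.add(next_id)
                next_id += 1
            # Otherwise the file request reuses login_session; nothing new is stored.
    return len(live_sessions)


print(sessions_created(1000, 30, False))  # prints 1000
print(sessions_created(1000, 30, True))   # prints 31000
```

Since every one of those sessions also sticks around for 20 minutes before it expires, the extra entries pile up much faster than they are cleaned out.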