Outage postmortem: 10th of March

Created: 2017/03/15 22:48:56+0000

My website recently suffered an outage and, having identified the likely problem, I thought I would write a postmortem. On the 10th of March, a Friday evening, my website stopped responding to requests. The nginx server remained up, but the requests it forwarded timed out. Service resumed after I restarted the Tomcat server. There is still some uncertainty, but the cause of the failure appears to have been an inability to get database connections from the pool.

This site uses Filters to process each request and gather the information needed for it. Each Filter requiring a DB connection obtained its own from the pool. This meant that a single request might require up to four DB connections to process. The connection pool was configured for a maximum of eight connections. I think that multiple requests each obtained a DB connection but none received enough to complete successfully. This left each thread blocked, holding the connections it already had while waiting for additional DB connections from the pool.
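The suspected failure can be illustrated with a small simulation. This is only a sketch under assumptions: the class and method names are hypothetical, a Semaphore stands in for the real connection pool, and the round-robin ordering is a worst-case interleaving chosen for illustration, not something recovered from the logs.

```java
import java.util.concurrent.Semaphore;

public class PoolStarvationSketch {
    // Pool size taken from the post; everything else here is illustrative.
    static final int POOL_SIZE = 8;

    /**
     * Simulates concurrent requests that each take their connections one at a
     * time (one per Filter) and hold them until the request completes.
     * Returns how many requests obtained every connection they needed.
     */
    public static int completedRequests(int concurrentRequests, int connectionsPerRequest) {
        Semaphore pool = new Semaphore(POOL_SIZE);
        int[] held = new int[concurrentRequests];

        // Worst-case interleaving: every request takes its next connection
        // before any request gets a further one. Nothing is released, because
        // no request has finished yet.
        for (int round = 0; round < connectionsPerRequest; round++) {
            for (int r = 0; r < concurrentRequests; r++) {
                if (held[r] == round && pool.tryAcquire()) {
                    held[r]++;
                }
            }
        }

        int completed = 0;
        for (int h : held) {
            if (h == connectionsPerRequest) {
                completed++;
            }
        }
        return completed;
    }

    public static void main(String[] args) {
        // Eight requests needing four connections each: each holds one, none finishes.
        System.out.println(completedRequests(8, 4));
        // The same eight requests needing one connection each all finish.
        System.out.println(completedRequests(8, 1));
    }
}
```

In the real server the stuck requests would block rather than give up, which would match the timeouts seen from nginx.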

Two main changes were made to address this issue. First, a single DB connection is now shared by all Filters dealing with a request. The first Filter to need a DB connection is responsible for obtaining and releasing it. The connection is passed as a servlet request attribute to the later Filters. Second, abandoned connections are now removed by Tomcat. After a fixed period, Tomcat will identify a DB connection as abandoned and remove it. This should ensure that DB connections can be obtained for requests in future.
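The shared-connection change can be sketched roughly as follows. This is not the site's actual code: to keep it self-contained, a plain Map stands in for the servlet request attributes, a Runnable stands in for the rest of the Filter chain, and the Pool, PooledHandle and attribute name are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Semaphore;

public class SharedConnectionSketch {
    // Hypothetical attribute name under which the shared connection is stored.
    static final String ATTR = "sketch.dbConnection";

    /** Stand-in for a pooled DB connection. */
    public static final class PooledHandle { }

    /** Minimal stand-in for a connection pool that counts how often it is borrowed from. */
    public static final class Pool {
        private final Semaphore permits;
        private int borrowCount = 0;

        public Pool(int size) { permits = new Semaphore(size); }

        public PooledHandle borrow() {
            if (!permits.tryAcquire()) {
                throw new IllegalStateException("pool exhausted");
            }
            borrowCount++;
            return new PooledHandle();
        }

        public void release(PooledHandle handle) { permits.release(); }
        public int available() { return permits.availablePermits(); }
        public int getBorrowCount() { return borrowCount; }
    }

    /** One Filter: reuse the request's connection, or obtain it and take ownership. */
    public static void runFilter(Pool pool, Map<String, Object> requestAttributes, Runnable restOfChain) {
        PooledHandle conn = (PooledHandle) requestAttributes.get(ATTR);
        boolean owner = false;
        if (conn == null) {
            conn = pool.borrow();
            requestAttributes.put(ATTR, conn);
            owner = true;
        }
        try {
            // ... this Filter's own work with conn would go here ...
            restOfChain.run();
        } finally {
            // Only the Filter that obtained the connection releases it.
            if (owner) {
                requestAttributes.remove(ATTR);
                pool.release(conn);
            }
        }
    }

    /** Runs a chain of filterCount Filters for one request. */
    static void runChain(Pool pool, Map<String, Object> attrs, int remaining) {
        if (remaining == 0) {
            return;
        }
        runFilter(pool, attrs, () -> runChain(pool, attrs, remaining - 1));
    }

    /** Returns how many connections one request borrowed from the pool. */
    public static int connectionsUsedByRequest(Pool pool, int filterCount) {
        int before = pool.getBorrowCount();
        runChain(pool, new HashMap<>(), filterCount);
        return pool.getBorrowCount() - before;
    }
}
```

Releasing in a finally block in the owning Filter means the connection goes back to the pool even if a later Filter throws, so a request needs at most one connection however many Filters touch the database.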