---------- Forwarded message ----------
From: The Google Apps Team
Date: Thu, Feb 26, 2009 at 1:37 AM
Subject: Google Apps Update: February 24 Outage Postmortem
To: (Name omitted. I received this group e-mail from the business manager of a local private school).
Dear Google Apps customer,
Between approximately 9AM to 12PM GMT / 1AM to 4AM PST on Tuesday, February 24, 2009, some Google Apps mail users were unable to access their accounts. The actual outage period varied by user because the recovery process was executed in stages. No data was lost during this time.
The root cause of the problem was a software bug that caused an unexpected service disruption during the course of a routine maintenance event. The root cause of this unexpected service disruption has been found and fixed.
Additional Details
A few months ago, new software was implemented to optimize data center functionality to make more efficient use of Google’s computing resources, as well as to achieve faster system performance for users.
Google’s software is designed to allow maintenance work to be done in data centers without affecting users. User traffic that could potentially be impacted by a maintenance event is directed towards another instance of the service. On Tuesday, February 24, 2009, an unexpected service disruption occurred during a routine maintenance event in a data center. In this particular case, users were directed towards an alternate data center in preparation for the maintenance tasks, but the new software that optimizes the location of user data had the unexpected side effect of triggering a latent bug in the Google Mail code. The bug caused the destination data center to become overloaded when users were directed to it, and which in turn caused multiple downstream overload conditions as user traffic was automatically shifted in response to the failures. Google engineers acted quickly to re-balance load across data centers to restore users’ access. This process took some time to complete.
The recently launched Apps Status Dashboard includes greater detail on this February 24th incident, including actions we are taking to continually improve performance.
For a direct link to this Incident Report, visit http://www.google.com/appsstatus/ir/1nsexcr2jnrj1d6.pdf (English only). For ongoing service performance information, please access the Apps Status Dashboard at http://www.google.com/appsstatus (English only).
We are very sorry for the inconvenience that this incident has caused. We understand that system problems are inconvenient and frustrating for customers who have come to rely on our products to do many different things. One of Google’s core values is to focus on the user, so we are working very hard to make improvements to our technology and operational processes so as to prevent service disruptions. We are confident that we will achieve continuous improvements quickly and persistently.
Once again, we apologize for the impact that this incident has caused. Thank you very much for your continued support.
Sincerely,
The Google Apps Team
Email preferences: You have received this mandatory email service announcement to update you about important changes to your Google Apps product or account.
Google, Inc., 1600 Amphitheatre Parkway, Mountain View, CA 94043
Dr. Jen’s Analysis: Google was proactive in offering this apology. It includes clear statements of regret for the problem, acceptance of responsibility, and a commitment to preventing similar problems in the future. The original recipient of this message was so pleasantly surprised to receive this apology that he felt moved to send it to me for my apology files. Thank you!