Microsoft Meltdown – Outage Post Mortem
On Friday, February 22 at 12:44 PM PST the Safefood 360 application became inaccessible to our customers for several hours. Three days earlier we experienced a shorter outage lasting over an hour. For both incidents we are truly sorry.
Our software is hosted on the Microsoft Windows Azure Platform, a global network of state-of-the-art, highly resilient data centers. This platform provides us with the security, stability and performance we need, and had done so flawlessly for two years up to this point.
The first outage occurred on Tuesday, when Microsoft was conducting upgrade work on the SQL Server database system in its East Coast United States data center. The work malfunctioned unexpectedly and, as a result, all databases in that data center became unavailable. We were immediately notified and our failover plan kicked in, which involved:
- Redirecting our login page to a secondary instance of the application (in a different data center), which we keep running for exactly this kind of outage in the primary data center.
- Increasing the number of support staff to field inbound calls from concerned customers.
- Providing a System Update page with a constant stream of detailed updates to keep our customers informed of our progress in rectifying the issue.
Within 90 minutes the database system was restored to full operation and all traffic was redirected back to our primary application. Isolated incidents like this are extremely rare but can still happen from time to time, and we are satisfied that our failover plan provides a workable fallback for them.
What happened yesterday, however, was completely unprecedented and unrelated to the earlier issue. This outage was caused by an expired security certificate in Microsoft’s storage service. Any application, including ours, that relies on accessing this storage layer over HTTPS was immediately brought down because communication could not take place over a secure channel. The same security certificate was used in every data center across the network, so our failover plan could not help. As a result, our application and our customers’ data were inaccessible for several hours, something that was never supposed to happen. Every enterprise software application worldwide that relies on the Azure Platform was in the same boat.
One of the key reasons Microsoft distributes its data centers globally is to provide redundancy in the event of a failure in any single data center. What happened yesterday was considered impossible, and so it was an event we had not planned for. Yet it happened, and it happened because of a simple oversight by a Microsoft employee.
Naturally this has prompted the Safefood 360 technical team to completely rethink our disaster recovery strategy. From today we will be putting in place additional controls and procedures to ensure that we can recover quickly from an unlikely repeat occurrence involving a complete outage of the entire Azure network.
Once again please accept our sincere apologies for this downtime and rest assured that we will not let this happen again.
Philip Gillen, COO.
News Sources: Link | Link | Link | Link
Microsoft statement (Source)
At 12:29 PM PST on February 22nd, 2013 there was a service interruption in all regions that affected customers who were accessing Windows Azure Storage Blobs, Tables and Queues using HTTPS. Availability was restored worldwide by 12:09 AM PST on February 23, 2013.
We apologize for the disruption of service to affected customers and are proactively issuing a service credit to those customers as outlined below.
We are providing more information on the components associated with the interruption, the root cause of the interruption, the recovery process, what we’ve learned from this case, and what we’re doing to improve the service reliability for our customers.
Windows Azure Overview
Before diving into the details of the service interruption, and to provide better context on what happened, we’d first like to share some information on the internal components of Windows Azure associated with this event.
Windows Azure runs many cloud services across various data centers and geographic regions around the globe. Windows Azure Storage runs as a cloud service on Windows Azure. There are multiple physical storage service deployments per geographic region, which we call stamps. Each storage stamp has multiple racks of storage nodes.
The Windows Azure Fabric Controller is the resource provisioning and management layer that manages the hardware, provides resource allocation, deployment and upgrade functions, and management for cloud services on the Windows Azure platform.
Windows Azure uses an internal service called the Secret Store to securely manage the certificates needed to run the service. This internal management service automates the storage, distribution and updating of platform and customer certificates so that, for compliance and security purposes, personnel do not have direct access to the secrets.
Root Cause Analysis
Windows Azure Storage uses a unique Secure Socket Layer (SSL) certificate to secure customer data traffic for each of the main storage types: blobs, tables and queues. The certificates allow for the encryption of traffic for all subdomains which represent a customer account (e.g. myaccount.blob.core.windows.net) via HTTPS. Internal and external services leverage these certificates to encrypt traffic to and from the storage systems. The certificates originate from the Secret Store, are stored locally on each of the Windows Azure Storage Nodes, and are deployed by the Fabric Controller. The certificates for blobs, tables and queues were the same for all regions and stamps.
The expiration times of the certificates in operation last week were as follows:
- *.blob.core.windows.net Friday, February 22, 2013 12:29:53 PM PST
- *.queue.core.windows.net Friday, February 22, 2013 12:31:22 PM PST
- *.table.core.windows.net Friday, February 22, 2013 12:32:52 PM PST
When the expiration time was reached, the certificates became invalid, and HTTPS connections to the storage servers were rejected. Throughout the incident, HTTP transactions remained operational.
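For illustration only (this is not Microsoft's tooling), a minimal Python sketch of how this failure mode can be observed from the outside: it connects to a storage endpoint, reads the certificate being served, and reports the wildcard subject and time remaining until expiry. The account name "myaccount" is the same placeholder used earlier; once a certificate has expired, the TLS handshake itself fails under default verification, which is exactly what HTTPS clients saw while HTTP traffic continued to work.

    import socket
    import ssl
    from datetime import datetime, timezone

    def inspect_certificate(host: str, port: int = 443) -> None:
        context = ssl.create_default_context()
        # With an expired certificate this handshake raises an SSL error,
        # which is how the HTTPS outage presented to clients.
        with socket.create_connection((host, port), timeout=10) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                cert = tls.getpeercert()
        subject = dict(item[0] for item in cert["subject"])
        not_after = datetime.fromtimestamp(
            ssl.cert_time_to_seconds(cert["notAfter"]), tz=timezone.utc)
        remaining = not_after - datetime.now(timezone.utc)
        print(f"subject CN : {subject.get('commonName')}")  # e.g. *.blob.core.windows.net
        print(f"expires    : {not_after.isoformat()}")
        print(f"remaining  : {remaining.days} days")

    inspect_certificate("myaccount.blob.core.windows.net")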
While the expiration of the certificates caused the direct impact to customers, a breakdown in our procedures for maintaining and monitoring these certificates was the root cause. Additionally, since the certificates were the same across regions and were temporally close to each other, they were a single point of failure for the storage system.
Details of how the Storage Certificate was not updated
For context, as a part of the normal operation of the Secret Store, scanning occurs on a weekly basis for the certificates being managed. Alerts of pending expirations are sent to the teams managing the service starting 180 days in advance. From that point on, the Secret Store sends notifications to the team that owns the certificate. The team then refreshes a certificate when notified, includes the updated certificate in a new build of the service that is scheduled for deployment, and updates the certificate in the Secret Store’s database. This process regularly happens hundreds of times per month across the many services on Windows Azure.
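As a rough, hypothetical sketch of the kind of weekly scan described above (the CertificateRecord model and notify helper are invented stand-ins, not the actual Secret Store interfaces), such a scan might walk the certificate inventory and notify the owning team of anything inside the 180-day window:

    from dataclasses import dataclass
    from datetime import datetime, timedelta, timezone

    ALERT_WINDOW = timedelta(days=180)

    @dataclass
    class CertificateRecord:
        name: str            # e.g. "*.blob.core.windows.net"
        owning_team: str     # team that must refresh and redeploy the certificate
        expires_at: datetime

    def notify(team: str, message: str) -> None:
        print(f"[notify {team}] {message}")  # stand-in for a real alerting channel

    def weekly_scan(inventory: list[CertificateRecord]) -> None:
        now = datetime.now(timezone.utc)
        for cert in inventory:
            remaining = cert.expires_at - now
            if remaining <= ALERT_WINDOW:
                notify(cert.owning_team,
                       f"{cert.name} expires in {remaining.days} days; refresh it and "
                       f"schedule a deployment that carries the new certificate.")

    weekly_scan([
        CertificateRecord("*.blob.core.windows.net", "storage",
                          datetime(2013, 2, 22, 20, 29, 53, tzinfo=timezone.utc)),
    ])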
In this case, the Secret Store service notified the Windows Azure Storage service team that the SSL certificates mentioned above would expire on the given dates. On January 7th, 2013 the storage team updated the three certificates in the Secret Store and included them in a future release of the service. However, the team failed to flag the storage service release as a release that included certificate updates. Subsequently, the release of the storage service containing the time-critical certificate updates was delayed behind updates flagged as higher priority, and was not deployed in time to meet the certificate expiration deadline. Additionally, because the certificates had already been updated in the Secret Store, no additional alerts were presented to the team, which was a gap in our alerting system.
Recovering the Storage Service
The incident was detected at 12:44 PM PST through normal monitoring, and the expired certificates were diagnosed as the cause. By 1:15 PM PST, the engineering team had triaged the issue and established several work streams to determine the fastest path to restore the service.
During its normal operation, the Fabric Controller drives nodes to a desired state, also known as a “goal state”. The service definition of a service provides the desired state of the deployment, which enables the Fabric Controller to determine the goal state of nodes (servers) that are a part of the deployment. The service definition is comprised of role instances with their endpoints, configuration, and failure/update domains, as well as references to other artifacts such as code, Virtual Hard Disk (VHD) names, thumbprints of certificates, etc.
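A loose sketch, with invented field names, of the kind of information the description above says a service definition carries: role instances with endpoints and failure/update domains, plus references to artifacts such as VHDs and certificate thumbprints. It is not the actual Fabric Controller schema.

    from dataclasses import dataclass, field

    @dataclass
    class RoleInstance:
        name: str
        endpoints: list[str]
        update_domain: int
        fault_domain: int

    @dataclass
    class ServiceDefinition:
        service_name: str
        code_package: str                                   # reference to the build artifact
        vhd_names: list[str] = field(default_factory=list)
        certificate_thumbprints: list[str] = field(default_factory=list)
        instances: list[RoleInstance] = field(default_factory=list)

    # The Fabric Controller drives every node toward the "goal state" implied by a
    # definition like this, which is why an out-of-band change to a node (such as a
    # hand-installed certificate) can later be reverted.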
During normal operation, a given service would update its build to include the new certificates and then have the Fabric Controller deploy the service by systematically walking the update domains and deploying the service across all nodes. This process is designed to update the software in such a way that external customers experience seamless updates and the published Service Level Agreement (SLA) is met. While some of this work is executed in parallel, the overall time to deploy updates to a global service is many hours.
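A simplified, hypothetical sketch of a rollout that walks update domains one at a time, as described above; deploy_to_node and is_healthy are invented stand-ins for the platform's real deployment and health mechanisms.

    from collections import defaultdict, namedtuple

    Node = namedtuple("Node", "name update_domain")

    def deploy_to_node(node: Node, new_build: str) -> None:
        print(f"deploying {new_build} to {node.name} (update domain {node.update_domain})")

    def is_healthy(node: Node) -> bool:
        return True  # stand-in for the platform's real health probes

    def rolling_update(nodes: list, new_build: str) -> None:
        by_domain = defaultdict(list)
        for node in nodes:
            by_domain[node.update_domain].append(node)
        # Only one update domain is taken down at a time, so the remaining domains
        # keep serving traffic; this is also why a full global rollout takes hours.
        for domain in sorted(by_domain):
            for node in by_domain[domain]:
                deploy_to_node(node, new_build)
            if not all(is_healthy(node) for node in by_domain[domain]):
                raise RuntimeError(f"update domain {domain} unhealthy; halting rollout")

    rolling_update([Node("sn-01", 0), Node("sn-02", 1), Node("sn-03", 0)],
                   "storage-build-with-renewed-certificates")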
During this HTTPS service interruption, the Windows Azure Storage service was still up and functioning for customers who were using HTTP to access their data, and some customers quickly mitigated their HTTPS issues by moving to HTTP temporarily. Care was taken not to impact customers using HTTP while restoring service for others.
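A rough sketch of the temporary mitigation some customers used, with illustrative account, container and blob names: attempt the request over HTTPS and, if the handshake is rejected because of the expired certificate, retry over plain HTTP. This trades away transport encryption, so it is a stopgap rather than a fix.

    import ssl
    import urllib.error
    import urllib.request

    def fetch_blob(account: str, path: str) -> bytes:
        https_url = f"https://{account}.blob.core.windows.net/{path}"
        http_url = f"http://{account}.blob.core.windows.net/{path}"
        try:
            with urllib.request.urlopen(https_url, timeout=10) as resp:
                return resp.read()
        except (ssl.SSLError, urllib.error.URLError):
            # Certificate validation failed (or the TLS handshake was rejected);
            # fall back to the unencrypted endpoint while HTTPS is unavailable.
            with urllib.request.urlopen(http_url, timeout=10) as resp:
                return resp.read()

    data = fetch_blob("myaccount", "mycontainer/myblob.txt")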
After examining several options to restore HTTPS service, two approaches were selected: 1) an update of the certificate on each storage node, and 2) a complete update of the storage service. The first approach optimized for restoring customer service as rapidly as possible.
1) Update of the Certificate
The development team worked through the manual steps required to update the certificate to validate the remediation approach and restore service. This process was complicated by the fact that the Fabric Controller tries to return a node to its goal state. A process that successfully updated the certificates was developed and tested by 6:22 PM PST. A key learning from a previous outage was to take the time upfront to test and validate the fix sufficiently, to prevent complications or secondary outages that would impact other services. During testing of the fix, several issues were found and corrected before it was validated for production deployment.
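A purely hypothetical sketch of what a per-node remediation pass could look like; install_certificate and node_serves_valid_https are invented placeholders, and the real process also had to account for the Fabric Controller driving nodes back toward their goal state.

    def install_certificate(node: str, cert_path: str) -> None:
        print(f"pushing {cert_path} to {node}")  # stand-in for the real certificate push

    def node_serves_valid_https(node: str) -> bool:
        return True                              # stand-in for a real TLS probe of the node

    def remediate_nodes(nodes: list[str], new_cert_path: str) -> None:
        for node in nodes:
            install_certificate(node, new_cert_path)
            if not node_serves_valid_https(node):
                raise RuntimeError(f"{node} still failing HTTPS after certificate update")

    remediate_nodes(["storage-node-001", "storage-node-002"], "renewed-wildcard-cert.pfx")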
Once the automated update process was validated, we applied it to the storage nodes in the US West Data Center at 7:20 PM PST, successfully restoring service there at 8:50 PM PST. We then subsequently rolled it out to all storage nodes globally. This process completed at 10:45 PM PST and restored HTTPS service to the majority of customers. Additional monitoring and validation was done and the Azure dashboard was marked green at 12:09 AM PST on February 23rd, 2013.
2) Complete Update
In parallel to the update of the certificate, a complete update of the storage service with the updated certificate was scheduled and rolled out across the globe. The purpose of this update was to provide the final and correct goal state for all of the storage nodes and ensure the system was in a consistent and normal state. This process was started on February 22nd at 11:00 PM PST and completed on February 23rd at 7:59 PM PST, and, as designed, it did not impact the availability SLA for customers.
Improving the Service
After an incident occurs, we always take the time to analyze the incident and look at ways we can improve our engineering, operations and communications. To learn as much as we can, we do a root cause analysis and analyze all aspects of the incident to improve the reliability of our platform for our customers.
This analysis is organized into four major areas, looking at each part of the incident lifecycle as well as the engineering process that preceded it:
- Detection – how to rapidly surface failures and prioritize recovery
- Recovery – how to reduce the recovery time and impact on our customers
- Prevention – how the system can avoid, isolate, and/or recover from failures
- Response – how to support our customers during an incident
Detection
We will be expanding our monitoring of certificate expiration to include not only the Secret Store but also the production endpoints, in order to ensure that certificates do not expire in production.
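One hedged sketch of what monitoring the production endpoints as well could look like: compare the certificate each endpoint actually serves against the freshest record held for it, so a certificate that has been refreshed but not yet deployed (the gap described earlier) is caught. The expected_fingerprint lookup is an invented stand-in for a query against the certificate inventory, and the endpoint names reuse the "myaccount" placeholder.

    import hashlib
    import socket
    import ssl

    ENDPOINTS = ["myaccount.blob.core.windows.net",
                 "myaccount.table.core.windows.net",
                 "myaccount.queue.core.windows.net"]

    def served_certificate(host: str) -> bytes:
        # Return the DER-encoded certificate the production endpoint is serving.
        context = ssl.create_default_context()
        with socket.create_connection((host, 443), timeout=10) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                return tls.getpeercert(binary_form=True)

    def expected_fingerprint(host: str) -> str:
        # Invented stand-in for a certificate-inventory lookup; would return the
        # SHA-1 fingerprint of the most recently refreshed certificate for this host.
        return "sha1-of-the-refreshed-certificate"

    for host in ENDPOINTS:
        fingerprint = hashlib.sha1(served_certificate(host)).hexdigest()
        if fingerprint != expected_fingerprint(host):
            print(f"{host}: production is serving a different certificate than expected")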
Recovery
Our processes for recovery worked correctly, but we continue to work to improve the performance and reliability of deployment mechanisms.
We will put in place specific mechanisms to do critical certificate updates and exercise these mechanisms regularly to provide a quicker response should an incident like this happen again.
Prevention
We will improve the detection of expiring certificates deployed in production. Any production certificate with less than 3 months until its expiration date will create an operational incident and will be treated and tracked as if it were a Service Impacting Event.
We will also automate any associated manual processes so that builds of services that contain certificate updates are tracked and prioritized correctly. In the interim, all manual processes involving certificates have been reviewed with the teams.
We will examine our certificates and look for opportunities to partition the certificates across a service, across regions and across time so an uncaught expiration does not create a widespread, simultaneous event. And, we will continue to review the system and address any single points of failure.
Response
The multi-level failover procedures for the Windows Azure service dashboard functioned as expected and provided critical updates for customers through the incident. There were 59 progress updates over the period of the incident but we will continue to refine our ability to provide accurate ETAs for issues and updates.
We do our best to post what we know, real-time, on the Windows Azure dashboard and will continuously look for ways to improve our customer communications.
Service Credits
We recognize that this service interruption had a significant impact on affected customers. Due to the nature and duration of this event we will proactively provide SLA credits to affected customers. Credits will cover all impacted services. Customers that were running the following impacted services at the time of the outage will get a 25% service credit for any charges associated with these services for the impacted billing period:
- Storage
- Mobile Services
- Service Bus
- Media Services
- Web Sites
Impacted customers will also receive a 25% credit on any data transfer usage. The credit will be calculated in accordance with our SLA and will be reflected on a subsequent invoice. Customers who have additional questions can contact Windows Azure Support for more information.
Conclusion
The Windows Azure team will continue to review the findings outlined above over the coming weeks and take all steps to continually improve our service.
We sincerely apologize and regret the impact this outage had on our customers. We will continue to work diligently to deliver a highly available service.