We deeply regret the recent outage and apologize for the disruption it caused to our impacted customers’ and partners’ businesses. They entrust us with their critical business communications, and we take this responsibility seriously.
This post reviews the cause of the outage, the measures we’re taking to prevent a recurrence, and how we are improving communications with customers. A full Reason for Outage (RFO) was sent to impacted customers on Wednesday, 4/21. Under our service level agreement, their accounts will be proactively credited by Friday, 4/23.
Summary of the Reason for Outage (RFO)
At approximately 6:15 a.m. PT on Thursday, 4/15, a hardware failure occurred on one of the storage area networks (SANs) in Intermedia’s New Jersey datacenter. The service processor for one of the controller nodes failed, which shifted the entire load for that SAN to the service processor on the redundant controller node. That single service processor did not have enough spare capacity to handle the load of all systems connected to the SAN. This caused performance issues in Domains 20 and 21.
For customers on Domain 21, a backlog of email rapidly developed. This caused major problems with mail delivery throughout Thursday, 4/15.
For customers on Domain 20, the backlog was large enough that it took 32 hours to clear. At approximately 2 p.m. PT on 4/16, all systems were functioning normally and mail delivery was considered to be “real-time.”
Corrective Actions
Our SAN vendor analyzed the system logs for the event and determined that the service processor failed because of a bug in the specific firmware version running on the system. Our vendor performed an emergency upgrade to a newer firmware version that includes a fix for the bug. We are also taking additional corrective actions to ensure that the SAN has enough spare capacity to operate without performance degradation in the event of a single hardware failure.
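The spare-capacity requirement described above can be illustrated with a short sketch. It is not Intermedia's actual tooling: the function names and the capacity figures are hypothetical, and capacity is treated as a single abstract number per service processor. The check asks whether the total load still fits on the remaining processors if the largest one fails.

```python
def max_safe_load(processor_capacities):
    """Largest total load that still fits after losing any one processor.

    Assumes (hypothetically) that capacity is a single number per
    service processor and that load can be redistributed freely.
    """
    if len(processor_capacities) < 2:
        # With fewer than two processors there is no redundancy at all.
        return 0
    # Worst case: the processor with the most capacity is the one that fails.
    return sum(processor_capacities) - max(processor_capacities)


def survives_single_failure(processor_capacities, total_load):
    """Return True if total_load fits on the survivors of a single failure."""
    if len(processor_capacities) < 2:
        return False
    return total_load <= max_safe_load(processor_capacities)


if __name__ == "__main__":
    # Hypothetical figures: two service processors rated at 100 units each.
    capacities = [100, 100]
    print(survives_single_failure(capacities, 140))  # load exceeds one processor
    print(survives_single_failure(capacities, 90))   # load fits on one processor
```

With two equally sized processors, this means the combined load has to stay at or below the capacity of one of them for a failover to proceed without degradation.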
Improving Communications
Intermedia received significant constructive feedback regarding our communication throughout the outage. We recognize how important it is to proactively communicate timely, detailed information that clearly explains the impact on our customers’ service. We recognize that our current client notification tools and processes are more reactive than proactive.
We have taken a number of steps in response. These include developing a new client notification tool that Technical Support will use to proactively notify and communicate with clients during a service interruption. The tool, which includes automated SMS (text message) notification, will be released next week and put into operation in May. We are also revising our communication processes to ensure that clear, non-technical information on service impact is included alongside technical details.