Google to rethink resilient networking following dual failure

In spite of running a dual redundant network, Google's Gmail system has had a major failure.

Google is revamping its network and disaster recovery procedures following the failure that caused Gmail attachments to be delayed by up to two hours.

From the time the company first reported the issue on the Google Apps dashboard, at 3:25pm on 23rd September, it took Google almost 12 hours to update the status of Gmail to 'resolved'.

In a blog post discussing the problem, Sabrina Farmer, senior site reliability engineering manager for Gmail, wrote: “The message delivery delays were triggered by a dual network failure. This is a very rare event in which two separate, redundant network paths both stop working at the same time. The two network failures were unrelated, but in combination they reduced Gmail’s capacity to deliver messages to users, and beginning at 5:54 am PST messages started piling up.”

The outage affected 29% of users. Farmer said that for 1.5% of Gmail messages, the delay in downloading large attachments was up to two hours.

To tackle the problem, she said, the Gmail network team restored some of the lost network capacity and worked to repurpose additional capacity to clear the accumulated message backlog.

While its impact may not have been catastrophic, the outage at Gmail is a cause for concern, especially as businesses are turning to Google and other providers to run cloud-based email and SaaS. Companies like Reed.co.uk and Pearson have swapped out Microsoft Exchange for Gmail because of the lower cost of running cloud-based email.

Going forward, Farmer said: “We're taking steps to ensure that there is sufficient network capacity, including backup capacity for Gmail, even in the event of a rare dual network failure. We also plan to make changes to make Gmail message delivery more resilient to a network capacity shortfall in the unlikely event that one occurs in the future. Finally, we’re updating our internal practices so that we can more quickly and effectively respond to network issues.”

While recruitment site Reed.co.uk was not impacted by the Google outage, Mark Ridley, director of technology at Reed.co.uk, said: "Communication to end users is really key. Had we been impacted we can rely on Jive (our intranet), phone communications to line management, or good, old-fashioned yelling across the office to brief the teams."

For any prolonged issue, he said, the company would need to seek an alternative email system. "However I'm confident that we're in better hands with Google (or Office 365, or a hosted Exchange provider) than we would be if we were to be running email ourselves, even in the shadow of this issue."
