Japan earthquake and tsunami provides first real test of high-tech disaster recovery, says Microsoft

The earthquake and tsunami that struck Japan in March 2011, killing at least 15,000 people and leaving 11,000 missing, was the first real test of disaster recovery and business continuity plans on a large scale in a high-tech country, according to Microsoft.

The quake and seven-metre tsunami damaged major critical infrastructure in the worst hit areas, with disruptions to travel, electricity, communications and mobile networks throughout Japan.

The biggest challenge to keeping Microsoft's Japan-based services running was power fluctuations and outages, said Bruce Cowper, group manager, Trustworthy Computing, at Microsoft.

The priority was to migrate services, especially for nearly 10 million Hotmail users, to take pressure off the power grid as well as Microsoft's datacentre and its support staff north of Tokyo.

More than 800,000 telephone lines, 400,000 fibre-optic lines and 11,000 wireless basestations were out of service.

High demand for online services

In the face of communications disruptions, many people turned to online services to make contact with their families in Japan and elsewhere in the world, resulting in abnormally high demand for Hotmail and Messenger services, said Cowper.

Microsoft's team in Japan worked with partners to create local applications such as J!ResQ to help people find family and friends and to aid relief efforts. The team was therefore not only maintaining existing services, but also expanding others, such as website hosting for aid agencies.

On Windows Azure, Microsoft provided a cloud-based disaster response communications portal that governments and non-profit organisations could use to communicate between agencies and directly with citizens.

While the business continuity and disaster recovery plans were crucial to enabling Microsoft to avoid any major service outages, the focus of such plans was purely on the technical and business impact, whereas in reality, Microsoft teams were having to deal with the human impact too.

"We learned that there is a difference between what is business critical and what people actually need, so consumer services such as Hotmail were a lot more important than expected," said Cowper.

Massive data migration

After consultation with Japanese authorities on compliance issues, Microsoft was able to move Hotmail services to the US very quickly to enable people to send and receive messages. The migration of data associated with those accounts took several days.

Hundreds of staff at Microsoft's Redmond headquarters and thousands around the world were involved in monitoring services and providing remote maintenance and support.

"We were faced with having to move petabytes of data with limited bandwidth and intermittent power, but managed to do so without any service interruption," said Cowper.

Key to achieving this was having the back-up infrastructure in place and communicating with users about what was being done and what they could expect, he said. Microsoft measures its success by the fact that not a single Hotmail user complained.

Beyond maintaining the services, Microsoft was also called upon to provide anti-fraud support as criminals sought to capitalise on the disaster with various social engineering attacks to trick people into donating money and giving up credit card details in the belief they were contributing to relief efforts.

"In the midst of the recovery operation, we were issuing consumer guidance on how to deal with online fraud, and we realised we had to think beyond critical business needs and consider the impact on people and communities," said Cowper.

Although the continuity and recovery plans worked well, Cowper said real-world situations are always going to show that there are better ways of doing things and, above all, the experience highlighted the need for a much broader understanding of the impact of big natural disasters.
