
Cloud-era disaster recovery planning: Assessing risk and business impact

In the first in a series on cloud-era disaster recovery, we provide a step-by-step guide to building firm foundations for the disaster recovery plan, with risk assessment and business impact analysis

This article can also be found in the Premium Editorial Download: Computer Weekly: Using technology to protect human rights

Technology disaster recovery (DR) initiatives provide strategies and procedures that can help organisations protect investments in IT systems and infrastructure. The essential mission for disaster recovery is to return IT operations to an acceptable level of performance as quickly as possible following a disruptive event. The development and rapid acceptance of cloud-based technologies have greatly enhanced the IT DR process.

A disaster recovery plan has a consistent structure which makes it easy to organise and conduct development activity. Let’s examine the flow of a programme.

Figure 1 is adapted from ISO/IEC 27031:2011, Information technology – security techniques – guidelines for information and communication technology readiness for business continuity, published jointly by the International Organisation for Standardisation (ISO) and the International Electrotechnical Commission (IEC). It uses the plan-do-check-act model found in current ISO standards.

As Figure 1 shows, IT disaster recovery (also called information and communications technology, or ICT, continuity) follows a standard process flow based on the plan-do-check-act model.

Figure 1: The plan-do-check-act model for IT disaster recovery

Business impact analyses (BIA) are typically conducted before a risk assessment to identify the most important business functions and the IT systems and assets that support them.

Next, the risk assessment (RA) examines the internal and external threats and vulnerabilities that could negatively impact IT assets. Cloud-based services are typically hosted off-site, outside the IT department’s control, which underscores the importance of performing both of these analytical activities.

Once critical systems, critical business functions and risks associated with each have been defined, the next step is to define strategies to mitigate the risks and threats to those critical assets.

Two examples of such strategies might be to contract for off-site storage of critical data and systems using a third-party cloud services firm such as Amazon Web Services (AWS) or Microsoft Azure, and to source critical IT assets such as servers and routers from multiple suppliers.

DR plans provide a step-by-step process to respond to a disruptive event – as identified in the risk assessment. Response steps are designed to provide an easy-to-use and repeatable process to recover damaged IT assets and return them to normal operation as quickly as possible. This presents an interesting challenge with cloud-based services, in that the IT department has virtually no hands-on control of services provided and must be especially proactive when evaluating – and subsequently managing – a cloud service provider.

Exercises help determine if disaster recovery procedures work as intended. A variety of exercises can be performed, ranging from a table-top review (usually in a conference room) of plans and their associated recovery procedures, to a full-scale “pull the plug” exercise that examines what happens when the real system fails.

In a cloud environment, the DR service provider may offer its own version of DR exercising. Before contracting for a cloud service, it is important to examine what kinds of exercise are possible. It is especially important to find out what resources the vendor will use, how much performance data from the exercise can be provided, and how actively involved users can be during an exercise.

Plan maintenance ensures a process is established that accommodates change management, changes in personnel, and other situations that can affect the plan’s content and effectiveness. Maintenance ensures plans are fit for purpose and aligned with current staffing and business operations.

Cloud-based DR service providers can offer similar kinds of maintenance services to customers, and may offer flexibility during plan development and maintenance activities. It is very important to investigate carefully all services available from a cloud provider, and to compare the costs of third-party management with those of user management.

Standards for BIAs and RAs

ISO publishes guidance on performing a business impact analysis that provides useful guidelines for planning and executing a BIA. It is called ISO/TS 22317:2015, Societal security – business continuity management systems – guidelines for business impact analysis. When conducting a risk assessment, a useful standard is available from the US National Institute of Standards and Technology (NIST). The standard is NIST SP 800-30 Revision 1 (2012), Guide for conducting risk assessments.

Business impact analysis (BIA)

The initial analytical step is the business impact analysis, which identifies the most important (mission-critical) business processes and supporting IT assets.

The BIA also helps identify the wider consequences to an organisation if key business functions are disrupted, including loss of customers, poor financial performance, damage to reputation, and impacts on employees and supply chains. Once the most important business activities and the systems and data that support them are identified, the next step is the risk assessment.

A BIA uses a series of questions presented to leaders and subject matter experts in each operating unit in the company, including IT. Questions should address the following issues, as a minimum:

  • Understanding how each business unit operates.
  • Identification of critical business unit processes that depend on IT (on-site).
  • Identification of critical business unit processes that depend on IT (cloud-based).
  • Financial value of critical business processes (for example, revenues generated per hour).
  • Dependencies on internal organisations.
  • Dependencies on external organisations, especially cloud services.
  • Data requirements.
  • System requirements.
  • Minimum time needed to recover data to previous state of use.
  • Minimum time needed to return IT operations to normal or near-normal following an incident.
  • Minimum number of staff needed to conduct business.
  • Minimum technology needed to conduct business.
  • Maximum amount of time for technology to be unavailable before the organisation can no longer deliver its products and services.

BIA outputs present a clear picture of actual impacts on the business, in terms of potential problems and probable costs. Results of the BIA help determine which areas require protection, the amount of business tolerance to disruptions, the minimum IT service levels needed by the business, and the maximum tolerable amount of IT downtime before the business begins to fail.

Risk assessment (RA)

The IT world typically focuses on one or more of the following risk scenarios, any of which would almost certainly have a negative impact on the organisation’s ability to conduct business:

  • Loss of access.
  • Loss of data.
  • Loss of function.
  • Loss of skills.
  • Loss of control.

Availability of cloud-based services means loss of control is a definite risk for IT departments. On-site DR planning and operations can be managed end-to-end by the IT department. But with the cloud, control of many functions passes to the third party. IT leadership must decide if the risk inherent in using the cloud is worth taking.

Risk assessments identify and analyse the risks, threats and vulnerabilities that can lead to the above outcomes. While there are many ways to perform a risk assessment, Table 1 provides a simple approach that can be easily implemented. The challenge is to validate risk factor assumptions with senior leadership.

Table 1 provides realistic examples of on-site and cloud-based risk events. Based on experience and available statistics, such as from insurance companies or actuarial data, it is possible to estimate the likelihood of specific events occurring on a scale of 0 to 1 (0.0 = will never occur, and 1.0 = will always occur). Then do the same with the impact of the event, using a 0 to 1 range (0.0 = no impact at all, and 1.0 = total loss of operations). The last column lists the product of likelihood multiplied by impact. This becomes a “risk weight factor”. Situations with the highest risk weight factors become the events to be addressed by DR plans.
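
The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: the RiskEvent class, the rank_by_risk_weight function, and the event names and scores are all hypothetical and are not taken from Table 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskEvent:
    """One row of a risk assessment table (hypothetical structure)."""
    name: str
    likelihood: float  # 0.0 = will never occur, 1.0 = will always occur
    impact: float      # 0.0 = no impact at all, 1.0 = total loss of operations

    def __post_init__(self):
        # Reject scores outside the 0 to 1 scale described in the text.
        for field_name in ("likelihood", "impact"):
            value = getattr(self, field_name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{field_name} must be between 0 and 1, got {value}")

    @property
    def risk_weight(self) -> float:
        # Risk weight factor = likelihood multiplied by impact.
        return self.likelihood * self.impact


def rank_by_risk_weight(events):
    """Return events ordered from highest to lowest risk weight factor."""
    return sorted(events, key=lambda e: e.risk_weight, reverse=True)


# Illustrative scores only; real values come from experience and statistics.
events = [
    RiskEvent("Datacentre power failure", likelihood=0.3, impact=0.8),
    RiskEvent("Cloud provider regional outage", likelihood=0.2, impact=0.9),
    RiskEvent("Loss of key IT staff", likelihood=0.5, impact=0.4),
]

for event in rank_by_risk_weight(events):
    print(f"{event.name}: {event.risk_weight:.2f}")
```

The events at the top of the ranked list are the first candidates for coverage in the DR plan, subject to validation of the underlying scores with senior leadership.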

Risk treatments include the following:

  • Prevent – High-probability/high-impact events (actively work to mitigate these).
  • Accept – Low-probability/low-impact events (maintain vigilance).
  • Contain – High-probability/low-impact events (minimise likelihood of occurrence).
  • Transfer – Medium-probability/medium- to high-impact events (transfer risk to a third party, such as an insurance company).
  • Plan – Low-probability/high-impact events (plan steps to take if this occurs).
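
As a rough sketch, the treatment categories above could be applied to scored events as follows. The article does not define where "low", "medium" and "high" begin and end, so the cut-offs of 0.33 and 0.67 used here are assumptions, as are the function names band and suggest_treatment. Combinations the list does not cover (for example, high probability with medium impact) return None so they can be reviewed by hand.

```python
from typing import Optional

# Mapping of (probability band, impact band) to the treatments listed above.
TREATMENTS = {
    ("high", "high"): "Prevent",
    ("low", "low"): "Accept",
    ("high", "low"): "Contain",
    ("medium", "medium"): "Transfer",
    ("medium", "high"): "Transfer",
    ("low", "high"): "Plan",
}


def band(score: float, low_max: float = 0.33, high_min: float = 0.67) -> str:
    """Convert a 0 to 1 score into a band. The thresholds are assumptions."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {score}")
    if score < low_max:
        return "low"
    if score >= high_min:
        return "high"
    return "medium"


def suggest_treatment(likelihood: float, impact: float) -> Optional[str]:
    """Return a suggested treatment, or None if the list does not cover the case."""
    return TREATMENTS.get((band(likelihood), band(impact)))


# Illustrative scores only.
print(suggest_treatment(0.9, 0.9))
print(suggest_treatment(0.1, 0.9))
print(suggest_treatment(0.9, 0.5))
```

Any such mapping is only a starting point for discussion; the bands and the treatment chosen for each event should be agreed with the business.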

The relationship between BIA and RA

When the business impact analysis and risk assessment have been completed, the next step is to correlate the data from each activity into a table (or other format) that presents the critical risks and business impacts caused by the risks occurring.

The following table also includes the recovery time objective (RTO), a metric developed from the BIA that indicates how long a system/process can be unavailable before the organisation is unable to function normally. Table 2 depicts a way to map findings from BIA and RA into a report for top management.
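
One possible way to build such a mapping is sketched below. It assumes that BIA findings and RA findings can be joined on the name of the supporting IT system; the BiaFinding and RiskFinding classes, the correlate function and all sample data are hypothetical and do not reproduce Table 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BiaFinding:
    process: str      # critical business process identified by the BIA
    system: str       # IT system or service that supports it
    rto_hours: float  # recovery time objective, in hours


@dataclass(frozen=True)
class RiskFinding:
    system: str         # IT system or service exposed to the threat
    threat: str         # risk event identified by the RA
    risk_weight: float  # likelihood multiplied by impact


def correlate(bia, risks):
    """Join BIA and RA findings on system name into report rows."""
    rows = [
        {
            "process": b.process,
            "system": b.system,
            "threat": r.threat,
            "risk_weight": r.risk_weight,
            "rto_hours": b.rto_hours,
        }
        for b in bia
        for r in risks
        if r.system == b.system
    ]
    # Shortest RTO first, then highest risk weight (an assumed ordering).
    rows.sort(key=lambda row: (row["rto_hours"], -row["risk_weight"]))
    return rows


# Illustrative data only.
bia = [
    BiaFinding("Online order processing", "E-commerce platform (cloud)", 4),
    BiaFinding("Payroll", "HR system (on-site)", 72),
]
risks = [
    RiskFinding("E-commerce platform (cloud)", "Cloud provider outage", 0.18),
    RiskFinding("HR system (on-site)", "Server hardware failure", 0.12),
    RiskFinding("E-commerce platform (cloud)", "Loss of network access", 0.24),
]

for row in correlate(bia, risks):
    print(f'{row["process"]} | {row["threat"]} | '
          f'weight {row["risk_weight"]:.2f} | RTO {row["rto_hours"]}h')
```

Sorting by RTO before risk weight puts the least downtime-tolerant processes at the top of the report; an organisation could equally choose to lead with risk weight.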

Summary

Business impact analyses and risk assessments are key activities associated with the creation and management of technology disaster recovery programmes. Availability of cloud-based services introduces new variations to traditional datacentre-based analyses. A clear understanding of IT risks, especially regarding cloud services, and their relationship to business operations is essential when developing a disaster recovery plan. 
