What are the key elements of disaster recovery (DR) planning and design? While there's no one-size-fits-all solution,...
a data asset inventory that includes conducting a data classification project and assessing the potential risk for disaster from within your company will help you protect all of your data resources.
In broad terms, you need to determine the recovery point objective (RPO) and recovery time objective (RTO) for different parts of the business and put together an effective data protection plan to achieve them. You should start by performing a business impact analysis.
Business impact analysis
The point of performing a business impact analysis is to understand the organisation's individual DR requirements. This involves establishing what systems are in place, how critical they are, where the risks are, and the company's auditing and compliance requirements. You also need to evaluate the potential impact of a disaster and how the company might recover from this situation.
A key element involves working out how much data the organisation would be prepared to lose (RPO) and how quickly it would need to regain that information (RTO). These two factors are particularly important because they largely determine what plans need to be in place and how much those plans are likely to cost.
"Do a business impact assessment, evaluate the RPO and RTO required for each application, and then do a gap analysis to understand what you need to do moving forward," said Michael Cock, who works for DST International Billing as a contract IT manager for utility company Sutton and East Surrey Water. "Don't skimp on this phase as it lays the foundations for everything else you do."
The information gleaned from undertaking a gap analysis can then be used to define technical specifications that act as the basis of a request for proposal from storage vendors and/or service providers.
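The gap analysis described above can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed method: the `AppRequirement` record, its field names and the sample applications are all hypothetical, and the sketch assumes RPO and RTO are expressed in minutes.

```python
from dataclasses import dataclass


@dataclass
class AppRequirement:
    """Hypothetical record of one application's DR targets and current state."""
    name: str
    required_rpo_min: int  # maximum tolerable data loss, in minutes
    required_rto_min: int  # maximum tolerable downtime, in minutes
    current_rpo_min: int   # data loss the existing setup would incur
    current_rto_min: int   # downtime the existing setup would incur


def gap_analysis(apps):
    """Return (name, rpo_gap, rto_gap) for every application that misses a target."""
    gaps = []
    for app in apps:
        rpo_gap = max(0, app.current_rpo_min - app.required_rpo_min)
        rto_gap = max(0, app.current_rto_min - app.required_rto_min)
        if rpo_gap or rto_gap:
            gaps.append((app.name, rpo_gap, rto_gap))
    return gaps


# Illustrative figures only; real values come out of the business impact analysis.
apps = [
    AppRequirement("billing", 15, 60, 240, 480),
    AppRequirement("intranet", 480, 1440, 480, 1440),
]

for name, rpo_gap, rto_gap in gap_analysis(apps):
    print(f"{name}: RPO short by {rpo_gap} min, RTO short by {rto_gap} min")
```

Applications that already meet both targets drop out of the list, leaving only the shortfalls that need to be addressed in the technical specification.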
Define disaster recovery levels
After devising what Cock describes as a blueprint for action, the next step is to define between two and four disaster recovery levels (such as high, medium and low criticality) before establishing which systems, applications and data fall into those categories based on a discussion with stakeholders.
Such a move ensures not only that the right level of service is provided based on business need, but also that the organisation obtains value for money. "Some people just take a belt-and-braces approach and replicate everything," said Jim Spooner, IT service management practice lead at consulting firm GlassHouse Technologies. "But in today's economic climate, a lot of people are revising the one-size-fits-all approach as they can end up spending too much or too little."
One company that has adopted this tiered storage approach is international construction and consultancy firm Mace. It employs gold, silver and bronze disaster recovery levels that are "defined in terms of the criticality of operations and the order in which they're recovered," said Guy Miller, IT director at the firm. These classifications are then linked to service-level agreements (SLAs).
In Mace's case, the gold level guarantees that data will be replicated every 15 minutes and that no information or service loss will occur beyond that timeframe. Data is replicated every four hours for the silver level, and every eight hours for bronze.
Such an approach not only makes it "simple for the business to understand," Miller said, but provides staff with a "clear running order." This is important, he added, because "you don't know when a disaster is going to strike or its nature," so storage and other experts may not always be on hand.
Mace, which is based in Docklands, is in the final phases of a major infrastructure revamp to support corporate growth and has employed the services of storage integrator B2net. The move included setting up a second data centre for DR purposes using colocation space in Croydon as it didn't make economic sense to build its own facilities.
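To make the link between DR levels and RPO concrete, the sketch below encodes the gold, silver and bronze replication intervals quoted for Mace and picks the least demanding level that still satisfies a given RPO. It assumes that worst-case data loss equals the replication interval; the function and variable names are illustrative and not part of Mace's actual setup.

```python
from datetime import timedelta

# Replication intervals quoted for Mace's three DR levels.
REPLICATION_INTERVAL = {
    "gold": timedelta(minutes=15),
    "silver": timedelta(hours=4),
    "bronze": timedelta(hours=8),
}

# Levels ordered from least to most frequent replication.
LEVELS_LEAST_DEMANDING_FIRST = ["bronze", "silver", "gold"]


def level_for_rpo(rpo):
    """Return the least demanding level whose interval fits within the RPO.

    Assumes worst-case data loss equals the replication interval.
    Returns None when even the gold level cannot meet the RPO.
    """
    for level in LEVELS_LEAST_DEMANDING_FIRST:
        if REPLICATION_INTERVAL[level] <= rpo:
            return level
    return None


print(level_for_rpo(timedelta(hours=5)))     # silver
print(level_for_rpo(timedelta(minutes=30)))  # gold
print(level_for_rpo(timedelta(minutes=5)))   # None
```

Starting from the least demanding level reflects the value-for-money point made earlier: an application only moves up a level when its RPO requires it.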
Attribute pricing to the different DR service levels
The third phase is to understand and attribute high-level pricing to each of the defined DR service levels. This provides storage professionals with a framework to help the business explore the cost implications associated with each service level. It also allows departmental managers to base their decisions on what they truly require rather than what they would like to have.
If you don't do this, GlassHouse Technologies' Spooner said, "they'll always just say that they want the highest level of cover." In his experience, an average of 10% of applications/data fall into the highest disaster recovery level, approximately 20% are in the middle and the rest fall under the lower tiers.
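A rough sketch of how tiered pricing might be compared with a blanket top-level approach is shown below. The 10%/20%/remainder split follows Spooner's figures; the per-application costs are purely hypothetical placeholders chosen for illustration and do not come from any vendor or from this article.

```python
# Hypothetical annual cost per application at each level (placeholder figures).
COST_PER_APP = {"high": 50_000, "medium": 15_000, "low": 3_000}


def split_apps(app_count):
    """Split applications roughly 10% high, 20% medium, remainder low."""
    high = app_count * 10 // 100
    medium = app_count * 20 // 100
    return {"high": high, "medium": medium, "low": app_count - high - medium}


def tiered_cost(app_count):
    """Total cost when each application is priced at its own level."""
    counts = split_apps(app_count)
    return sum(counts[level] * COST_PER_APP[level] for level in counts)


def blanket_cost(app_count):
    """Total cost when every application is given the highest level."""
    return app_count * COST_PER_APP["high"]


print(split_apps(100))
print(tiered_cost(100), blanket_cost(100))
```

Putting the two totals side by side is one way to show departmental managers what asking for the highest level of cover for everything would actually cost.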
Disaster recovery site options
Once these steps have been taken, storage professionals need to explore their disaster recovery site options. These fall into three broad categories: hot site, warm site and cold site. As a rough guide, hot sites offer the fastest recovery times and are the most expensive, while cold sites provide the slowest but cheapest recovery. Recovery in warm sites falls somewhere between that of hot and cold sites.
Other considerations include whether to use remote-office locations, colocation space or managed service provider facilities. Decisions will depend on individual circumstances relating to disaster recovery site availability, cost-benefit calculations and corporate policy in relation to outsourcing.
As a general rule of thumb, organisations with multiple data centres often use one of them as their DR facility as it's more cost-effective to do so, while companies with multiple sites that don't include data centres frequently opt for colocation space or a more expensive managed service provider.
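As a rough illustration of the hot, warm and cold trade-off, the sketch below picks the cheapest site type whose worst-case recovery time still meets a required RTO. The recovery figures are the ones quoted in this article for each site type (30 minutes for hot, up to four hours for warm, and a week for cold, which the article notes can be longer); the function name and the cheapest-first ordering are assumptions made for the example.

```python
# (site type, assumed worst-case recovery time in hours), cheapest first.
# Cold-site recovery is taken as one week here, although it can be longer.
SITE_OPTIONS = [
    ("cold", 168.0),
    ("warm", 4.0),
    ("hot", 0.5),
]


def cheapest_site_for(rto_hours):
    """Return the cheapest site type whose worst-case recovery meets the RTO.

    Returns None if not even a hot site can recover quickly enough.
    """
    for site, worst_case_hours in SITE_OPTIONS:
        if worst_case_hours <= rto_hours:
            return site
    return None


print(cheapest_site_for(200))  # cold
print(cheapest_site_for(4))    # warm
print(cheapest_site_for(2))    # hot
```

Using worst-case figures is deliberately conservative: an application with a two-hour RTO is pushed to a hot site because a warm site may take up to four hours.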
If organisations choose to run a hot site or offer hot site DR provisioning for their mission-critical applications, they'll require dedicated, live data centre space with active, dedicated server and storage systems. Systems are mirrored and data is replicated to the site synchronously in real-time; failover is automatic and immediate, or takes no longer than 30 minutes.
Hot provisioning is generally associated with sectors running time-critical applications that are fundamental to running the business. These include financial services in areas such as share-dealing, telecommunications, Internet service providers, air traffic control and retailers conducting business online.
Such services are generally provided in-house, partly because it's cheaper to do it that way and partly because even the largest suppliers are unwilling to take on the risk associated with offering such high levels of service unless provisioning is part of a wider outsourcing deal.
One organisation that has taken a pragmatic approach to hot site DR provisioning is the Britannia Building Society. While systems are mirrored between its head office in Leek, Staffordshire and its remote disaster recovery site, most systems need some manual intervention to fail over, although they can all be recovered within 24 hours.
But according to Dylan Matthias, Unix and storage manager at Britannia Building Society, some of the systems the company uses (such as IBM's MQSeries middleware) enable the applications and data that run on it to take on hot status because of their built-in failover and clustering capabilities. MQSeries, for example, is used to integrate the building society's customer relationship management (CRM) system -- which is its most critical, as the branches use it to handle customer accounts -- with a data feed from its mortgage and investment system. This ensures that no service interruptions take place should problems occur.
"We've not spent time and money putting in clustering software around things that have it built in as it's expensive software and we have guys in house with the skills to do manual failover," Matthias said. "But quite a lot of middleware includes load-balancing technology and where that's been included, we've taken advantage of it."
In the case of a warm site, organisations pay for live data centre space dedicated to recovery and have all of their equipment in place. But some of the kit may be repurposed for other activities such as development and test; in addition, failover/recovery requires some level of manual intervention. This means that recovery times tend to be within one to four hours. This approach tends to be used for applications of all stripes that are less time-critical than those served by a hot site.
Another consideration is that failover is often subject to governance decisions by the IT director. "For example, if something's happened to your SAN [storage-area network] or disk farm, there might be a data corruption or currency issue, so failing over might make it worse," explained Bill Broadley, client director at business and technology consultancy Morse. "Therefore, the decision on whether to do so or when falls to the IT director."
This kind of DR site along with its cold site cousin is routinely provided both in-house and by third-party providers, with decisions depending on corporate policy, available skills and/or suitable sites, as well as cost-effectiveness -- all of which will vary from organisation to organisation.
A cold site amounts to shared data centre space that can be made available on short notice and includes basic infrastructure such as network connections and UPS systems. In some instances, servers or storage systems will be in place but not configured or switched on, while in others, new equipment will need to be shipped in and worked on from scratch. Recovery times in this context can vary from as little as eight hours to a week or more.
One organisation that decided to go down this route was Carmarthenshire County Council. About four years ago, it joined the Welsh Authority DR Consortia, which also includes Cardiff County and Wrexham County Borough Councils. The three put money into a central pot to rent cold site restore facilities in a former ready-built BT building in Cardiff. The site now houses about six Unix servers to run its key systems, which include revenue, benefits and general ledger.
Peter Fearn, computer services manager for the Carmarthenshire County Council, believes that the local authority could "recover everything in a couple of days if we got a DR team down there and they worked non-stop." But he indicates that the situation, while not ideal, is an improvement on times gone by when the organisation rented space in Hitchin from Guardian iT (which has been acquired by SunGard Availability Services).
"With Guardian iT, we were held to three occasions or five days per year to do disaster recovery testing -- but that was taken up in one visit. We can do practically unlimited testing in Cardiff," Fearn said.
The local authority has likewise consolidated and virtualised 150 x86 file and print servers into 40 blades, as well as introduced two 6 TB Hitachi Data Systems AMS SANs to replace its former direct-attached storage (DAS) arrangement, which was time-consuming and difficult to back up. One SAN is located at County Hall and can fail over to the other at a secondary site; backup is now undertaken centrally to disk using CommVault software.
"We were having a few backup failures and couldn't guarantee their integrity, but this project has helped in that regard," Fearn said. As to whether the Council is likely to upgrade its disaster recovery provisioning further, Fearn believes this is unlikely in the near future. "It's a cost issue," he said. "And I can't see that changing much in the current climate."