Measuring a Successful Disaster Recovery

Recovery Time Objective

Nobody wants natural disasters, hardware failures, or human error, but the reality is that they do happen. With an organization’s IT infrastructure, applications, and valuable data at risk, any of these events could cause irreparable damage.

This is not new information. We all know that “things happen.” So the focus should be on preparing for these “things,” preventing what we can, and having a recovery plan for everything else.

The key to establishing a business continuity and disaster recovery (BCDR) plan is identifying a recovery time objective (RTO) and a recovery point objective (RPO).

The RTO is the maximum acceptable length of time that a computer, system, network, or application can be down after a failure or disaster occurs. In other words, it is the longest time allowed between an unexpected failure or disaster and the resumption of normal operations and service levels.

The RPO is the maximum amount of data, measured in time, that an organization is willing to lose. The RPO is expressed backward in time from the instant at which the failure occurs. It can be specified in seconds, minutes, hours, or days. For example, if the RPO is 30 minutes, then the company has decided that it is willing to accept no more than 30 minutes of data loss. The smaller this number, the more complex and expensive the solution will be.

A surprising number of organizations don’t take time to review these numbers from a business perspective, and instead leave them to the IT department to figure out. But the truth is that they are business decisions that require an IT solution. The business is the only group that really knows how long it can be down, or how much data loss it is willing to accept.

Without a clearly defined RTO and RPO, BCDR is just expensive, wishful thinking.

How to ensure RPO and RTO are achieved

When an outage occurs, the clock starts running in two directions: backward in time, to establish the point from which data needs to be recovered, and forward in time, to measure how long a computer, network, or application can be down before normal operations resume. It is critical that both objectives are met, if not bettered. What are some steps to ensure this happens?
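The two measurements can be sketched in a few lines of Python. This is only an illustration: the function name `measure_recovery`, its parameters, and the timestamps are hypothetical, and it assumes you can identify the last good recovery point, the moment of failure, and the moment service was restored.

```python
from datetime import datetime, timedelta

def measure_recovery(last_recovery_point, failure_time, service_restored, rpo, rto):
    """Compare what actually happened in an outage against the RPO and RTO."""
    # Backward from the failure: how much data (measured in time) was lost.
    data_loss = failure_time - last_recovery_point
    # Forward from the failure: how long the service was down.
    downtime = service_restored - failure_time
    return {
        "data_loss": data_loss,
        "downtime": downtime,
        "rpo_met": data_loss <= rpo,
        "rto_met": downtime <= rto,
    }

# Hypothetical outage: last good copy at 09:40, failure at 10:00, back up at 13:30.
result = measure_recovery(
    last_recovery_point=datetime(2024, 1, 15, 9, 40),
    failure_time=datetime(2024, 1, 15, 10, 0),
    service_restored=datetime(2024, 1, 15, 13, 30),
    rpo=timedelta(minutes=30),
    rto=timedelta(hours=4),
)
print(result["data_loss"], result["rpo_met"])  # 0:20:00 True
print(result["downtime"], result["rto_met"])   # 3:30:00 True
```

In this made-up case, 20 minutes of data were lost against a 30-minute RPO, and the service was down for 3.5 hours against a 4-hour RTO, so both objectives were met.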

  • Establish RPO and RTO early in the disaster recovery planning process. These measurements dictate necessary recovery tools and procedures.
  • Clearly identify the data that is required to be included.
  • Use the appropriate tools for replication. If the RPO is 24 hours or more, a tape, disk, or remote Web server backup may be appropriate. However, if the RPO is much shorter, redundant systems and replication may be necessary. As the RPO approaches zero, clustering and completely redundant systems become possibilities.
  • If needed, compress data or increase network bandwidth to allow for faster transfer of data.
  • Back up applications and data to a third-party recovery center.
  • Ask a third party to oversee the disaster recovery process. A managed services provider can help ensure that sound recovery processes are in place and executed.
  • Test the replication process. This is probably the most important step. Validate, maintain, and adjust the recovery process to meet changing organizational needs.
  • Conduct an annual review of current BCDR policies, including RPO and RTO. The reviewers should include business operations and IT management.
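
Two of the steps above lend themselves to a quick back-of-the-envelope check: matching the recovery approach to the RPO, and estimating whether compression or extra bandwidth shortens a transfer enough. The Python below is a rough sketch, not a sizing tool. The function names are hypothetical, the 24-hour boundary comes from the list above, and the `near_zero_hours` cutoff is an arbitrary assumption chosen only for illustration. The transfer estimate ignores protocol overhead and assumes decimal units (1 GB = 8,000 megabits).

```python
def suggest_approach(rpo_hours, near_zero_hours=0.25):
    """Map an RPO to the broad category of solution described in the list above.

    near_zero_hours is an illustrative assumption; the article gives no number
    for when an RPO counts as "approaching zero". Treating every RPO under
    24 hours as a replication case is also a simplification.
    """
    if rpo_hours >= 24:
        return "tape, disk, or remote backup"
    if rpo_hours > near_zero_hours:
        return "redundant systems and replication"
    return "clustering and completely redundant systems"

def transfer_hours(data_gb, bandwidth_mbps, compression_ratio=1.0):
    """Estimate hours needed to move data_gb over a link of bandwidth_mbps.

    compression_ratio=2.0 means the data shrinks to half its size.
    Protocol overhead and link contention are ignored (assumption).
    """
    megabits = data_gb * 8000 / compression_ratio
    seconds = megabits / bandwidth_mbps
    return seconds / 3600

print(suggest_approach(48))                    # tape, disk, or remote backup
print(suggest_approach(1))                     # redundant systems and replication
print(round(transfer_hours(500, 100), 2))      # 11.11
print(round(transfer_hours(500, 100, 2.0), 2)) # 5.56
print(round(transfer_hours(500, 1000), 2))     # 1.11
```

With these hypothetical figures, moving 500 GB over a 100 Mbps link takes roughly 11 hours, which would already break a 4-hour RTO if a full restore over that link were part of recovery. Compressing the data 2:1 or moving to a faster link changes the picture considerably.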

What can weaken the recovery process?

Many issues can cause RPO and RTO targets to be missed. Testing the disaster recovery process can help eliminate many of them. Here are some obstacles to achieving RPO and RTO goals.

  • Incorrect or incomplete understanding of RPO and RTO. Establishing unreasonable measurements leads to unmet goals.
  • Inexperienced IT staff.
  • Unreliable backup systems.
  • Unavailability or lack of staff to manage the recovery process. In a natural disaster, IT staff may be unable to take part in the recovery in person.
  • Failure to manage all pieces of the puzzle.
  • Failure to document and test the entire plan regularly. This is the most critical obstacle.

Effective BCDR is a core business discipline that establishes a company’s ability to respond and recover in a crisis. It is critical that BCDR is planned and tested before an outage occurs. Guardian Eagle’s experts are here to help plan, prevent and overcome site-wide failures that could cause irrecoverable damage. Put our many years of designing, building and managing highly available, mission-critical systems to work for you.