Cloud Disaster Recovery

A disaster is an unforeseen event in a system's lifetime. It can be caused by natural events (an earthquake, a tsunami or the effects of climate change), human error or hardware/software failures.

This in turn can lead to serious financial loss or even loss of life. As a result, the major objective of Cloud Disaster Recovery (DR) is to provide an organization with an automated and reliable approach to data recovery and failover in the event of a man-made or natural catastrophe.

A model of a disaster recovery system is presented in the figure below.

A Model of Disaster Recovery System


The definitions of Cloud Disaster Recovery are as follows:

  • Techopedia defines cloud disaster recovery as a service that enables the backup and recovery of remote machines on a cloud-based platform.
  • TechTarget defines cloud disaster recovery as a backup and restore strategy that involves storing and maintaining copies of electronic records in a cloud computing environment as a security measure.

Cloud DR is an Infrastructure-as-a-Service (IaaS) solution that provides backup and recovery for critical dedicated servers hosting enterprise-level data applications (for example, Oracle or MySQL), with the backups held on a remote offsite cloud server.

A cloud disaster recovery system is often structured with distributed computing and centralized storage to ensure ready availability of applications and the security of data. Disaster recovery can be categorized into three levels based on different requirements: data-level, system-level and application-level disaster recovery.

Data-level disaster recovery is the most fundamental of the three and guarantees the security of the application data. System-level disaster recovery keeps the recovery time for the application server's operating system as short as possible.

Application-level disaster recovery ensures that recovery occurs in real time, so that users do not notice that a disaster occurred.

Disaster Recovery Requirements

Five major requirements are discussed here: Recovery Point Objective (RPO), Recovery Time Objective (RTO), performance, consistency and geographic separation.

These requirements are influenced by factors like actual cost of system downtime or data loss, correctness and application performance:

Recovery Point Objective (RPO)

The RPO of a DR system is the point in time of the most recent backup prior to a disaster or failure. RPO is a business decision: some applications require no data loss, which is achieved through continuous synchronous replication (RPO = 0), while others can tolerate some data loss, ranging from a few seconds to days.
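As a rough illustration of the definition above, the sketch below measures how much data would be lost if a disaster struck at a given moment, and checks that against an RPO. The function names, backup schedule and timestamps are hypothetical and chosen only for the example.

```python
from datetime import datetime, timedelta


def data_loss_window(backup_times, disaster_time):
    """Time between the latest backup at or before the disaster and the disaster."""
    usable = [t for t in backup_times if t <= disaster_time]
    if not usable:
        raise ValueError("no backup exists before the disaster")
    return disaster_time - max(usable)


def meets_rpo(backup_times, disaster_time, rpo):
    """True if the data lost since the last backup stays within the RPO."""
    return data_loss_window(backup_times, disaster_time) <= rpo


# Hypothetical schedule: backups at 00:00, 06:00 and 12:00 on the same day.
backups = [datetime(2024, 1, 1, hour) for hour in (0, 6, 12)]
disaster = datetime(2024, 1, 1, 15, 30)

loss = data_loss_window(backups, disaster)
ok_with_4h_rpo = meets_rpo(backups, disaster, timedelta(hours=4))
ok_with_1h_rpo = meets_rpo(backups, disaster, timedelta(hours=1))
print(loss, ok_with_4h_rpo, ok_with_1h_rpo)
```

With a six-hour backup interval, a one-hour RPO can only be met by chance, which is why tighter RPOs push a design toward more frequent or continuous replication.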

Recovery Time Objective (RTO)

The RTO is an important business decision that determines how long it may take for an application to be brought back online after a failure.

This includes the time required to detect the failure, configure any required servers at the backup site (virtual or physical), initialize the failed application, and perform the network reconfiguration required to reroute requests from the original site to the backup site so the application can be used.

The application type and backup technique determine which further steps are needed. These may include verifying the integrity of state or performing application-specific data restore operations, and recovery tasks must be scheduled carefully to be done efficiently.

It should be noted that a very low RTO enables business continuity, in which an application seamlessly continues to run despite a disaster.
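The recovery phases described in this section add up to the actual recovery time, which can then be compared with the RTO. The following is a minimal sketch of that arithmetic; the phase names and durations are made-up values for illustration, not measurements.

```python
from datetime import timedelta


def total_recovery_time(phases):
    """Sum the durations of recovery phases that run one after another."""
    return sum(phases.values(), timedelta())


# Hypothetical durations for the phases named in the text.
phases = {
    "failure_detection": timedelta(seconds=30),
    "server_configuration": timedelta(minutes=5),
    "application_initialization": timedelta(minutes=2),
    "network_reconfiguration": timedelta(minutes=1),
}

rto = timedelta(minutes=10)
elapsed = total_recovery_time(phases)
within_rto = elapsed <= rto
slowest_phase = max(phases, key=phases.get)
print(elapsed, within_rto, slowest_phase)
```

The sketch assumes the phases run strictly in sequence. Identifying the slowest phase shows where effort to reduce the recovery time is likely to pay off first.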

Performance

For a disaster recovery service to be useful, it must have minimal impact on the performance of each protected application during failure-free operation.

The impact of DR on performance can be direct or indirect. Direct impact comes from operations such as synchronous replication, in which an application write does not return until it has been committed remotely.

Indirect impact arises when DR consumes disk and network bandwidth resources that the application could otherwise have used.
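To make the direct impact concrete, here is a toy in-memory sketch of synchronous replication. For contrast it also shows asynchronous replication, which the text does not discuss and which is included here only as an assumption for comparison. All class names are hypothetical, and no real network or storage system is involved.

```python
from collections import deque


class Site:
    """A toy data store standing in for a primary or backup site."""

    def __init__(self):
        self.data = {}

    def commit(self, key, value):
        self.data[key] = value


class SyncReplicator:
    """A write returns only after the backup site has committed it."""

    def __init__(self, primary, backup):
        self.primary, self.backup = primary, backup

    def write(self, key, value):
        self.primary.commit(key, value)
        # In a real system this call would wait on a WAN round trip.
        self.backup.commit(key, value)


class AsyncReplicator:
    """A write returns after the local commit; the backup catches up later."""

    def __init__(self, primary, backup):
        self.primary, self.backup = primary, backup
        self.pending = deque()

    def write(self, key, value):
        self.primary.commit(key, value)
        self.pending.append((key, value))  # lost if the primary fails now

    def flush(self):
        while self.pending:
            self.backup.commit(*self.pending.popleft())


sync_primary, sync_backup = Site(), Site()
SyncReplicator(sync_primary, sync_backup).write("order-1", "paid")

async_primary, async_backup = Site(), Site()
async_rep = AsyncReplicator(async_primary, async_backup)
async_rep.write("order-1", "paid")
unreplicated_before_flush = len(async_rep.pending)
async_rep.flush()
print(sync_backup.data, unreplicated_before_flush, async_backup.data)
```

The synchronous variant keeps the backup identical to the primary at every returned write, at the price of adding the remote commit to each write's latency. The asynchronous variant avoids that latency but leaves a window of writes that exist only at the primary.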

Consistency

This ensures that an application regains a consistent state after a failure occurs. To achieve this, the DR mechanism often requires application-specific settings that guarantee proper replication of all relevant state of the application on the backup site.

Alternatively, the DR system may assume that a consistent copy of the pertinent application state is available on disk, and leverage a disk replication scheme to create consistent copies at the backup site.

Geographic Separation

This ensures that the primary and backup sites are located in separate geographic regions, so that a single disaster does not affect both at the same time.

This requirement comes at a cost: greater distance increases both WAN bandwidth costs and network latency.
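The trade-off between separation and latency can be estimated with a short calculation. The sketch below computes the great-circle distance between two sites with the haversine formula and derives a lower bound on round-trip time, assuming signals travel through fiber at about 200,000 km/s (roughly two thirds of the speed of light). The site coordinates are hypothetical, and real latency will be higher because of routing and equipment delays.

```python
import math

EARTH_RADIUS_KM = 6371.0
FIBER_SPEED_KM_PER_S = 200_000.0  # assumed propagation speed in optical fiber


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def min_round_trip_ms(distance_km):
    """Lower bound on round-trip time in milliseconds over a straight fiber path."""
    return 2 * distance_km / FIBER_SPEED_KM_PER_S * 1000


# Hypothetical primary site near New York and backup site near London.
primary = (40.7128, -74.0060)
backup = (51.5074, -0.1278)

distance_km = great_circle_km(*primary, *backup)
rtt_ms = min_round_trip_ms(distance_km)
print(f"{distance_km:.0f} km, at least {rtt_ms:.1f} ms per round trip")
```

With synchronous replication, every write pays at least this round trip, so the distance chosen for safety directly limits write performance.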