AWS Reliability


This article gives an overview of reliability on AWS. It then lists the main points to bear in mind about disaster recovery and closes with a short case study.

Principles of Reliability

Reliability comes down to three capabilities. First, the system can recover from infrastructure or service failures. Second, it can dynamically acquire computing resources to meet demand. Third, it can mitigate disruptions such as misconfigurations or transient network issues.
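To illustrate the third capability, a common way to ride out transient network issues is to retry the failed call with exponential backoff. The following is a minimal sketch in plain Python, not an AWS API: `call_with_retries` and all of its parameters are hypothetical names chosen for this example, and the default delays are arbitrary.

```python
import random
import time


def call_with_retries(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
                      retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Run `operation`, retrying transient failures with exponential backoff.

    All names and defaults here are illustrative assumptions.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # Cap the exponential delay, then add jitter so that many
            # clients do not all retry at the same instant.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
```

Only the exception types listed in `retry_on` are retried; any other error is treated as permanent and propagates immediately.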

These capabilities are supported by five design principles:

  1. Test recovery procedures.
  2. Automatically recover from failure.
  3. Scale horizontally to increase aggregate system availability.
  4. Stop guessing capacity.
  5. Manage change in automation.
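As a rough illustration of principle 3, the sketch below estimates how availability grows when identical instances are added side by side. It rests on two simplifying assumptions that real systems only approximate: instances fail independently, and any single healthy instance can serve the load. The function name is hypothetical.

```python
def aggregate_availability(instance_availability, instance_count):
    """Availability of a pool that is up as long as one instance is up.

    Assumes independent failures (a simplification for illustration).
    """
    if not 0.0 <= instance_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if instance_count < 1:
        raise ValueError("need at least one instance")
    # The pool is down only when every instance is down at the same time.
    all_down = (1.0 - instance_availability) ** instance_count
    return 1.0 - all_down


for count in (1, 2, 3):
    print(count, round(aggregate_availability(0.99, count), 6))
```

Under these assumptions, two instances at 99% each give roughly 99.99% for the pool, which is why several small instances can be more available than one large one.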

Disaster Recovery (DR)

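For disaster recovery (DR), there are several classes of solution, each with a different cost. The trade-off is against the recovery time objective (RTO, how long the system may stay down) and the recovery point objective (RPO, how much recent data may be lost): the cheaper the solution, the higher the RTO/RPO. The options below are listed from cheapest to most expensive.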
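The cost versus RTO/RPO trade-off can be sketched as a small selection problem: pick the cheapest option that still meets both objectives. The example below is only an illustration; the `DrPlan` type, the function name, and every number in it are hypothetical and are not measurements of any AWS service.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class DrPlan:
    name: str
    relative_cost: int            # hypothetical ranking, 1 = cheapest
    recovery_time: timedelta      # expected time to restore service
    data_loss_window: timedelta   # worst-case age of the newest recoverable data


def cheapest_plan_meeting(plans, rto, rpo):
    """Return the cheapest plan that satisfies both objectives, or None."""
    candidates = [
        p for p in plans
        if p.recovery_time <= rto and p.data_loss_window <= rpo
    ]
    return min(candidates, key=lambda p: p.relative_cost, default=None)


# Illustrative figures only.
PLANS = [
    DrPlan("backup and restore", 1, timedelta(hours=8), timedelta(hours=24)),
    DrPlan("pilot light", 2, timedelta(hours=2), timedelta(minutes=15)),
    DrPlan("low-capacity standby", 3, timedelta(minutes=30), timedelta(minutes=5)),
    DrPlan("active-active", 4, timedelta(minutes=1), timedelta(seconds=1)),
]
```

With these made-up figures, a one-hour RTO and a ten-minute RPO rule out the two cheapest options, so `cheapest_plan_meeting(PLANS, timedelta(hours=1), timedelta(minutes=10))` selects the low-capacity standby.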

Backup and Restore: This is the simplest way to get started and the most cost-effective. We take backups of the current systems and store them in S3, then restore from those backups when we need them.
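A minimal sketch of this flow is shown below. It assumes a boto3 S3 client is passed in as `s3_client` and uses only its `upload_file` and `download_file` methods; the helper names, the key layout, and the bucket name in the usage comment are hypothetical.

```python
from datetime import date


def backup_key(prefix, file_name, day):
    """Build a dated object key, e.g. 'backups/2024-01-31/db.bak' (layout is an assumption)."""
    return f"{prefix}/{day.isoformat()}/{file_name}"


def backup_file(s3_client, local_path, bucket, key):
    """Upload one local backup file to S3."""
    s3_client.upload_file(Filename=local_path, Bucket=bucket, Key=key)


def restore_file(s3_client, bucket, key, local_path):
    """Download one backup object from S3 to a local path."""
    s3_client.download_file(Bucket=bucket, Key=key, Filename=local_path)


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   s3 = boto3.client("s3")
#   key = backup_key("backups", "db.bak", date.today())
#   backup_file(s3, "/var/backups/db.bak", "example-backup-bucket", key)
#   restore_file(s3, "example-backup-bucket", key, "/tmp/db.bak")
```

Passing the client in as a parameter keeps the helpers easy to exercise against a stand-in object, which fits the advice to test recovery procedures.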

Pilot Light: This is a step up from backup and restore. The core data set is kept replicated in the DR site, and when a disaster occurs, the remaining resources are automatically brought up around it. We then have to switch over to the new system.

Low-Capacity Standby: This is a fuller version of pilot light. All the servers in the DR site are already running, but at lower capacity. In a disaster, we can therefore fail over the most critical production load immediately.

Active-Active: This is the best solution and the most expensive. At any moment, the DR site can take the full production load.

The best practice for DR sites is to start simple and work our way up. We also have to check for software licensing issues, and we have to exercise the DR solution regularly.

Reliability Case Study

The case study relies on three main services: S3, EBS and EC2. S3 stores data backups and snapshots. EBS snapshot copy is used to back up the data volumes of SQL Server running on Amazon EC2 and to restore them in the disaster recovery region when necessary. On EC2, all instances except the master SQL Server are kept on standby in the DR region.
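The cross-region snapshot copy could look roughly like the sketch below. It assumes a boto3 EC2 client created for the DR region and uses its `copy_snapshot` call; the helper name, the regions, and the snapshot ID in the usage comment are hypothetical placeholders rather than details from the case study.

```python
def copy_snapshot_to_dr(dr_ec2_client, source_region, snapshot_id,
                        description="DR copy"):
    """Copy an EBS snapshot into the region of `dr_ec2_client`.

    The copy is requested from the destination (DR) region and pulls the
    snapshot from `source_region`. Returns the ID of the new snapshot.
    """
    response = dr_ec2_client.copy_snapshot(
        SourceRegion=source_region,
        SourceSnapshotId=snapshot_id,
        Description=description,
    )
    return response["SnapshotId"]


# Usage, assuming boto3 is installed and credentials are configured
# (regions and snapshot ID are placeholders):
#   import boto3
#   dr_ec2 = boto3.client("ec2", region_name="us-west-2")
#   new_id = copy_snapshot_to_dr(dr_ec2, "us-east-1", "snap-0123456789abcdef0")
```

The copy runs asynchronously on the AWS side, so a restore procedure would still need to wait until the new snapshot is complete before creating volumes from it.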


We have briefly reviewed reliability on AWS, discussed the options and best practices for building a good DR site, and looked at how existing AWS services are used in a real example.
