Building Unbreakable Software @ Toronto AWS Meetup

We met at The Score’s office, and Nate Smith talked about ways to structure AWS-hosted applications for availability and consistency.

Here are some points that stood out:

Your software will probably outlive the VM it lives in. Be ready to deploy to another server on short notice, or keep redundant servers ready.
Build systems that get stronger when they break, like human muscle. Read Antifragile by Nassim Nicholas Taleb.
EC2 can have network outages between nodes. Do not trust the network more than you trust an instance
One bad outage example – on April 21 2011, parts of the AWS EBS service were down for 80 hours (despite this, Amazon is still better at sysadminning than you)
When thinking about the CAP theorem and a network of database servers, assume the network will go down. That means P (partition tolerance) is already 1 of your 2 options, so you are really choosing between C (consistency) and A (availability). Shopping cart software usually picks availability as the other option. How to deal with the resulting inconsistency is a business issue.
Check out Jepsen, a series of tests of different databases to see how they react to network partitions.
Oversimplified CAP summary:
- Got 1 MySQL server? You have CP, because your data is consistent (it’s all in 1 place), but not available (1 server goes down = no availability)
- Got 1 read/write server and replicas to read from? You have AP because your data is available (some read replicas can go down and you’ll be OK), but data isn’t consistent across all servers (due to replication lag, network partitions, or other issues)
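
To make the replica case concrete, here is a minimal toy sketch in Python. It is not from the talk, and it is not how MySQL replication actually works; the `Primary` and `Replica` classes, the `partitioned` flag, and the `cart:42` key are all made up for illustration. It only shows why a read replica can hand back stale data while a partition blocks replication.

```python
class Replica:
    """Hypothetical read-only copy of the data."""

    def __init__(self):
        self.data = {}

    def read(self, key):
        return self.data.get(key)


class Primary:
    """Hypothetical read/write server that ships writes to replicas later."""

    def __init__(self, replicas):
        self.data = {}
        self.replicas = replicas
        self.pending = []  # writes accepted locally but not yet replicated

    def write(self, key, value):
        # The write succeeds immediately, whether or not replicas are reachable.
        self.data[key] = value
        self.pending.append((key, value))

    def replicate(self, partitioned=False):
        # During a (simulated) partition nothing reaches the replicas.
        if partitioned:
            return
        for key, value in self.pending:
            for replica in self.replicas:
                replica.data[key] = value
        self.pending.clear()


replica = Replica()
primary = Primary([replica])

primary.write("cart:42", ["book"])
primary.replicate()                        # network healthy, replica catches up

primary.write("cart:42", ["book", "lamp"])
primary.replicate(partitioned=True)        # partition: the update is stuck
stale = replica.read("cart:42")            # replica still answers, with old data

primary.replicate()                        # partition heals
fresh = replica.read("cart:42")            # replica is consistent again
```

The replica kept answering reads during the partition (availability), but the answer was out of date (no consistency). What to do about a customer who saw the old cart is the business question mentioned above.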