This article is part of a multi-post series on Software Design. In the last post, we took our design and scaled it up to meet our unique performance needs. In this article, we will start ensuring that our platform is highly available.
As mentioned in previous articles, this application platform is by example. As we cover the design options and challenges for this application, it is essential to keep in mind that it is an example.
Understanding our Availability Needs
Much like scaling, when designing for availability, it is easy to take a simple design and make it complicated. Complexity often happens due to a lack of understanding of the availability requirements of the platform. Many times people will jump into creating cookie-cutter availability solutions without considering what is required and what isn’t required.
Availability is an area of design where there are standard patterns, but not every approach is necessary or applicable across the board. To create the right availability design for our application, we need to break down and thoroughly understand our needs. As such, one of the first steps in our design process is to review our availability goal.
In the first article in this series, we stated that our example application required 99.9% uptime. This percentage allows for about 40 minutes of downtime per month. While this is a good guideline, it doesn’t give us everything we need.
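As a quick check of that number, here is a minimal sketch of the downtime budget arithmetic. The function name and the 30-day month are assumptions made for illustration; with them, 99.9% works out to roughly 43 minutes, which is where the "about 40 minutes" figure comes from.

```python
def downtime_budget_minutes(availability_pct: float, days: float = 30.0) -> float:
    """Return the minutes of downtime allowed over `days` at a given uptime percentage."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


# Compare a few common availability targets over an assumed 30-day month.
for target in (99.0, 99.9, 99.99):
    print(f"{target}% uptime -> {downtime_budget_minutes(target):.1f} minutes per 30 days")
```

Each extra "nine" cuts the budget by a factor of ten, which is why a stricter goal for the frontend quickly changes the design.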
To truly understand the availability needs, we need to take an in-depth look into our platform.
If we look at the overall design of our platform, it would be easy to break it into three major components.
Frontend Web Application
The first component is our web frontend, which is how our users interact with our calendar service. It’s an essential feature of our platform. Any unavailability in our web application would immediately impact our customers’ ability to use our service.
This service is an area where we want to put a high focus on staying online. We want to be at least 99.9% available, but 40 minutes a month is probably not good enough. We should aim for less downtime per month, as 40 minutes might not seem like a long time, but from a customer experience perspective, it is.
Backend Calendar Services
While the frontend is what our customers see, the backend is what keeps it working. Our calendar API is something we should consider as the heart of our platform. This API is supporting all web traffic and all backend updates.
This service is essential to keeping other services online. Therefore, it requires the same level of availability as the services that depend on it.
The third component is our scheduled updates, which covers our task publisher, message queues, and task workers. These services are essential to keep our calendars up to date, but the availability of these services is less visible to customers. Our updates could be unavailable for periods of time without users noticing.
It is in this segment of our architecture that we have a lower availability need. We should keep this in mind to avoid over-complicating the whole design by trying to keep this segment in line with the others.
Note: While backend updates have a lower availability requirement, that doesn’t mean extended downtime is acceptable.
If extended downtime is not acceptable, why are we discussing how some parts of the platform can be less available than others? Mainly because it helps us make better design decisions.
If we look at our design, our scheduled updates fundamentally work differently than our frontend or backend services. How we solve high availability for asynchronous tasks will differ from how we solve it in our web-based services. Knowing where and when we can make trade-offs is key to creating an effective high availability design.
Since the frontend application requires the highest availability, it seems like the most logical place to start.
The first question to ask is, “How many Availability Zones do we need?”
What is an Availability Zone?
Depending on experience, this term may or may not be familiar to you. An availability zone is a logical datacenter (sometimes physical datacenter) used to host an environment. In general, availability zones are sections of physical datacenters.
Each availability zone has isolated networking and power equipment. This isolation helps reduce the likelihood of two availability zones going offline at the same time. However, there have been, of course, situations where this has occurred. Usually, these outages are due to a physical problem that spans the entire data center. Simple examples of physical issues are weather conditions or natural disasters.
Since our web frontend is critical to our business, we will run out of three availability zones. Why three? Well, that’s simple. Conventional wisdom for availability is three is always better than two. But why is this?
Three is Better than Two
The three vs. two philosophy for availability is a simple concept to understand once explained. This philosophy comes from the thought that if you have two instances of a service, data center, etc., and one is down for maintenance, the risk of a complete outage is very high.
If the second instance were to fail at that time, there would be no instances left to send traffic to. With three instances, you can have one instance down for maintenance and rest easy that even if another instance failed, things would still be ok.
In addition to the maintenance example, there is always the possibility of multiple failures. From my own experience, I can tell you that it is best to plan for two failures coinciding. While not frequent, it can and does happen. Having three instances gives you a safety margin: you can lose up to two instances and still have one serving traffic.
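The reasoning above can be put into rough numbers. This is only a sketch under a strong assumption: each instance is unavailable independently with the same probability, which real zones only approximate since failures are often correlated. The function name and the 1% figure are hypothetical.

```python
def outage_probability(instances: int, p_down: float, in_maintenance: int = 0) -> float:
    """Probability that every remaining instance is down at once.

    Assumes independent failures with the same probability `p_down`,
    which is a simplification made for illustration only.
    """
    remaining = instances - in_maintenance
    if remaining <= 0:
        return 1.0  # nothing left to serve traffic
    return p_down ** remaining


# Hypothetical 1% chance that any single instance is down at a given moment.
p = 0.01
for n in (2, 3):
    normal = outage_probability(n, p)
    during_maintenance = outage_probability(n, p, in_maintenance=1)
    print(f"{n} instances: normal={normal:.6f}, one in maintenance={during_maintenance:.6f}")
```

With two instances, taking one down for maintenance leaves the whole service exposed to a single failure. With three, the service still has a spare during the same maintenance window.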
Now that we know how many instances we will run, we can start looking into directing traffic to those instances. Luckily, this is a no brainer for our web application.
Since users get to our web application via our example.com domain name, we can use a series of load balancers to direct traffic to a running instance. Since the beginning, we have had both a DNS and a local load balancer identified in our design. Now we can explore how those help us with availability.
The local load balancer, which sits directly in front of our web application, will accept requests and balance traffic between the local instances of our web application. It also has the job of performing health checks on our web application instances; any instance that fails is removed from the load balancing list. It is key to remember that the local load balancer is only aware of local web application instances. It does not know of other availability zones.
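To make the local load balancer's behavior concrete, here is a minimal sketch of round-robin balancing with health checks. The class, the instance names, and the health check function are all hypothetical; a real load balancer would probe an HTTP endpoint on a timer rather than call a Python function.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class LocalLoadBalancer:
    """Round-robin balancer that only knows about instances in its own zone."""
    backends: List[str]
    health_check: Callable[[str], bool]  # stand-in for a real HTTP health probe
    _healthy: List[str] = field(default_factory=list)
    _next: int = 0

    def run_health_checks(self) -> None:
        # Instances that fail the check are dropped from the rotation.
        self._healthy = [b for b in self.backends if self.health_check(b)]

    def pick(self) -> str:
        if not self._healthy:
            raise RuntimeError("no healthy local instances")
        backend = self._healthy[self._next % len(self._healthy)]
        self._next += 1
        return backend


# Hypothetical zone with three web instances, one of which is failing.
down = {"web-2"}
lb = LocalLoadBalancer(["web-1", "web-2", "web-3"], lambda b: b not in down)
lb.run_health_checks()
print([lb.pick() for _ in range(4)])
```

Note that when every local instance fails, this balancer can only raise an error. It has no knowledge of other availability zones, so something above it has to redirect traffic.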
That’s where the DNS load balancer comes into play. The DNS load balancer sits outside of our availability zones, usually as a service provided by our hosting solution or a CDN service. The DNS load balancer doesn’t know about our web application; instead, it is only aware of our local load balancers.
This load balancing is how our users navigate between availability zones. Each DNS request gets directed to the DNS load balancer, which returns the IP of the local load balancer in a nearby availability zone.
Many DNS load balancers support routing users to the closest availability zone based on their geolocation. For our web application, we would want to use this to optimize page load times and reduce network latency.
Web Application Dependencies
At this point, we know we want to run the web application out of three availability zones and that we can direct traffic there. But what about other parts of our design that the web application requires to run? Specifically, what about the web database and our calendar API?
Three Cluster Database Tier
Along with our web application, we also need its database. Typically, we want to keep our database and application deployed in the same availability zones. This principle helps reduce network latency on database calls, as network latency significantly reduces database performance.
When our primary use is to provide an interactive user interface, any database latency would be highly visible.
To avoid latency, we will need to deploy our database within each of the three availability zones we plan to run within.
In the past, this would not be easy, but today several databases can handle this kind of topology.
If we look at our web application and calendar API, our databases store customer preferences and calendar entries. This type of data doesn’t require strict consistency.
It’s ok if our databases aren’t always 100% in sync; as long as they eventually sync up, our users are not likely to notice differences.
Suppose our data required strict consistency, even across availability zones; like a financial application, for example. Then replication would be much harder to solve, and our options for database technology would be limited.
But our use case is simple, and our needs can easily be solved using standard open source databases, which is lucky because this keeps our web application design simple.
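To illustrate what "eventually sync up" can look like, here is a small sketch of a last-write-wins merge between two zone replicas. Last-write-wins is an assumption chosen for illustration, not necessarily what a given database does, and the entry IDs, timestamps, and values are made up.

```python
from typing import Dict, Tuple

# Maps entry_id -> (last_modified_timestamp, value).
Replica = Dict[str, Tuple[float, str]]


def merge(a: Replica, b: Replica) -> Replica:
    """Combine two replicas, keeping the most recently written value per entry."""
    merged = dict(a)
    for key, (ts, value) in b.items():
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, value)
    return merged


# Two hypothetical zones that have drifted apart for a moment.
zone_a: Replica = {"e1": (1.0, "Lunch"), "e2": (5.0, "Standup at 9am")}
zone_b: Replica = {"e2": (3.0, "Standup at 10am"), "e3": (2.0, "Design review")}

# Whichever direction replication runs, both zones end up with the same data.
print(merge(zone_a, zone_b) == merge(zone_b, zone_a))
```

This works for calendar entries because a briefly stale entry is tolerable. It would not be acceptable for data such as account balances, where an overwritten write means lost money.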
In the scaling exercise of our design, we split our calendar and web applications. However, our web application still relies on data from the calendar API. While it is certainly possible to have our three availability zone web applications talk to a two availability zone calendar API, it would be a bad practice.
It is a bad practice because the calendar API is so critical to our web frontend; ideally, we should deploy them together. That means the calendar API exists in every availability zone where the web frontend exists. It also means our calendar API must have at least the same availability as our frontend web application.
From our previous design, we can see that the calendar API is used by the Web application to show calendar entries. This usage is driving our availability zone requirements for the calendar API.
We can also see that our scheduled updates are using the calendar API. Because this work is asynchronous and in the background, we don’t need to worry about latency. Our scheduled updates can call a calendar API from anywhere.
Since all of our calendar API traffic is HTTP based, we can use a similar setup as our web frontend.
Traffic Management for Backend Services
To manage our traffic and ensure it ends up in the right places, we will use a DNS load balancer and local load balancer as described previously. This setup will fit both the web frontend and scheduled updates calls to the calendar API, but for different reasons.
Our web requests are all synchronous and need to happen quickly. Our scheduled updates are asynchronous; they run in the background and don’t have the same availability needs as our web-based services.
Since our availability needs are different, we should also assume that traffic will always be from a different zone. If we use this design for our interactions, whether we end up running both in the same availability zone or not doesn’t matter.
To do this, we will have our scheduled updates access the calendar API through the DNS load balancer. Traditionally, if our two services ran in the same datacenter, we would skip the DNS load balancer and only use the local load balancer. The problem with this approach is that if our local calendar API is completely down, either for maintenance or unexpectedly, our scheduled updates can’t leverage other instances in different availability zones.
By assuming our scheduled updates run out of a different availability zone and using the DNS load balancer, any local failure will cause our traffic to route to another availability zone. This approach gives our scheduled updates a hassle-free method of identifying which instance to use. All application instances use the DNS address, and the load balancer will figure it out from there.
With our web application, our needs are a little different. We still want our latency to be as low as possible by using the local calendar API. But we also want to fail over to another availability zone if the local nodes are unavailable. For these requirements, we can still use the DNS load balancer.
Many DNS load balancing solutions offer multiple load balancing algorithms. For the web application needs, we will want to use geolocation or latency based load balancing. These settings will direct the web application traffic to the closest instance of the calendar API. Most times, this will mean the local instance; however, if the local instance is down, it will mean the closest availability zone.
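The selection logic described above can be sketched in a few lines. This is an illustration of the idea rather than any provider's actual algorithm; the zone names, latency figures, and the `resolve` function are hypothetical.

```python
from typing import Dict, Optional


def resolve(latency_ms: Dict[str, float], healthy: Dict[str, bool]) -> Optional[str]:
    """Return the lowest-latency zone that is currently healthy, or None if all are down."""
    candidates = [zone for zone in latency_ms if healthy.get(zone, False)]
    if not candidates:
        return None
    return min(candidates, key=lambda zone: latency_ms[zone])


# Latency as measured from a web application running in zone-a (made-up numbers).
latency = {"zone-a": 1.0, "zone-b": 12.0, "zone-c": 35.0}

# Normal operation: the local calendar API wins.
print(resolve(latency, {"zone-a": True, "zone-b": True, "zone-c": True}))

# The local calendar API is down: traffic moves to the next closest zone.
print(resolve(latency, {"zone-a": False, "zone-b": True, "zone-c": True}))
```

In practice, the speed of this failover also depends on DNS record TTLs, since clients keep using a cached answer until it expires.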
This load balancing setup gives the web application both the performance it needs and a simple method for controlling failover.
Many would look at our use case and say that it is the perfect example of needing a service mesh. That might be true if you already have an established service mesh setup and running. For our simple use case, DNS load balancing is the right balance of simple and effective. It works, and it’s easy to manage.
Note: There may be other reasons to implement a service mesh such as performance and load balancing. But for the act of redirecting traffic to another availability zone, our solution will work with a DNS load balancer.
While our web application and calendar API are tied together from an availability perspective, our scheduled updates components are not. We have a different set of requirements for scheduled updates.
The first question we should ask is, what level of availability do we need? Unlike our web application, if our scheduled updates are unavailable, it is less noticeable to our end users. While downtime in our scheduled updates will break new calendar additions, the existing calendar entries will still work. As a result, this component needs to be relatively highly available, but not as highly available as our other components.
One thing to explore is that this segment of our architecture is an area where we can get away with two instead of three instances. The key to this is the background nature of scheduled updates. They all happen in the background, and it doesn’t matter which availability zone does the work as long as some availability zone eventually does the job.
Suppose one availability zone goes down for maintenance and the other crashes. We have some time to recover before it becomes a problem. We can choose to run out of two availability zones instead of three. This choice will save on infrastructure costs and complexity, but when something breaks, we run a higher risk of impacting calendar additions and updates.
As mentioned before, we should assume that the scheduled updates always run in a different availability zone than our other components. This assumption will help us design and implement interactions with other systems that work whether or not the components share an availability zone.
As such, all interactions with the calendar API will be via the DNS load balancer. All other exchanges, such as the task publisher or message queues, will happen locally. Each availability zone will act as a disconnected system pushing updates to whichever calendar API is available.
By keeping the two instances disconnected and independent, we simplify the design of our scheduled updates. It’s entirely possible that both instances will schedule the same update at the same time, but the nature of our data allows the calendar API to course-correct on its own. Duplicating the work is also not a problem, as the cost of synchronizing work between two physically distant instances is often much higher than just blindly running a second copy.
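One way duplicate work can stay harmless is if updates are idempotent. The sketch below assumes each update carries a version number and that the store ignores anything it has already applied; this mechanism, the `CalendarStore` class, and the sample data are assumptions made for illustration, not a description of the actual calendar API.

```python
from typing import Dict, Tuple


class CalendarStore:
    """Toy stand-in for the calendar API's storage with versioned, idempotent upserts."""

    def __init__(self) -> None:
        self.entries: Dict[str, Tuple[int, str]] = {}

    def upsert(self, entry_id: str, version: int, payload: str) -> bool:
        """Apply the update only if it is newer than what is stored. Returns True if applied."""
        current = self.entries.get(entry_id)
        if current is not None and current[0] >= version:
            return False  # duplicate or stale update: safely ignored
        self.entries[entry_id] = (version, payload)
        return True


store = CalendarStore()

# Both availability zones independently produce the same scheduled update.
update = ("holiday-2024", 7, "Office closed")
applied_by_zone_1 = store.upsert(*update)
applied_by_zone_2 = store.upsert(*update)

print(applied_by_zone_1, applied_by_zone_2, len(store.entries))
```

The second zone's write costs one redundant call and changes nothing, which is usually far cheaper than coordinating the two zones so that only one of them runs.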
Availability Design in Summary
In this article, we broke our platform down into three main components. The web application, backend API, and scheduled updates all have unique availability needs. We grouped the backend API with our web application tier, as the web application is our most visible service to end-users, and it requires the backend API to function. These tiers use DNS and local load balancers to direct traffic to the nearest active availability zone. A simple but effective design.
Our scheduled updates are a simple two availability zone design where neither zone talks to the other. They both act independently and perform actions on the backend API via DNS load balancers. A strategy that doesn’t follow the “three is better than two” philosophy but makes up for it in lack of complexity. Since each availability zone is independent with no data or task replication, it is much less likely to fail.
The most complicated part of our design is the database tier, which must replicate across three availability zones. This replication is problematic because not many databases can handle multi-zone replication reliably. We will also have to structure our database data to avoid replication conflicts, as writes across many availability zones are always ripe for contention.
With our high availability design complete, we only have one more article, which will explore how to choose the technology used to build our platform.