Cloud outages are unpredictable, inevitable, and keep site reliability engineering (SRE) teams up at night. Deploying your applications around the globe would improve availability, but paying for the required server footprint is prohibitively expensive for most organizations.
Public cloud remains the most popular deployment model within the cloud-native community, with multi-cloud growing in adoption. Adopting a multi-cloud strategy isn’t as simple as hitting the “go” button, however. Despite best efforts at building out redundancy, cloud providers cannot guarantee 100% uptime.
In fact, it’s not a question of if your servers or services will go down, but when. You may face difficulties outside your public-cloud provider’s control, such as a Domain Name System (DNS) failure or connectivity issues with your upstream Internet provider. Human factors may lead to downtime, such as a code-deployment mistake that is difficult to roll back. Or a natural disaster may strike, taking down services.
As a result, organizations spend significant amounts of time and money preparing for that next inevitable cloud outage.
Disaster recovery to the rescue (maybe)
The vast majority of organizations use one of four disaster-recovery strategies when responding to an outage.
- Active/active deployment strategy: If your primary server goes down, flip the switch on your DNS and your request goes to a second active server. While this is the fastest and least disruptive option, consider yourself lucky if your IT budget supports this option!
- Active/passive deployment strategy: This is similar to active/active but cheaper, because you’re not paying for the passive hosting when you’re not using it. Before service is restored, however, you have to spin up the passive instance and flip the switch on your DNS (see the sketch after this list), which delays the return of service.
- Periodic database-backup strategy: With this option, when your service goes down, you must first spin up your code, restore your backups, and then continue serving as normal. While viable, this is not a rapid response and can significantly extend service outages. The only thing worse is . . .
- No disaster-recovery strategy: Far too many organizations fall into this category. It’s understandable; you’re busy building features and don’t have time to think about disaster recovery. When something happens, you’ll figure it out!
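Returning to the active/passive option: “flipping the switch on your DNS” typically means repointing a record at the standby. Here’s a minimal sketch of that flip, assuming your authoritative server accepts RFC 2136 dynamic updates (many managed DNS providers require their own APIs or TSIG-signed updates instead). The zone, record, and addresses are placeholders.

```python
# failover_flip.py -- a sketch of an active/passive DNS flip via RFC 2136
# dynamic update, using the third-party dnspython package.
# All names and addresses below are placeholders.
import dns.query
import dns.update  # pip install dnspython

PASSIVE_IP = "198.51.100.20"   # the standby server you just spun up
AUTH_SERVER = "203.0.113.53"   # your authoritative DNS server

update = dns.update.Update("example.com")
# Point app.example.com at the standby with a short 60-second TTL so
# resolvers pick up the change quickly (if they honor the TTL).
update.replace("app", 60, "A", PASSIVE_IP)

response = dns.query.tcp(update, AUTH_SERVER)
print("DNS update response code:", response.rcode())
```

Even a scripted flip still depends on resolvers respecting your TTL, which is exactly the weakness the approach later in this article works around.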
Disaster recovery requires a high level of discipline. Your team must know exactly what to do when an outage occurs, and even the best-laid plans require some level of human intervention to restore service. As you add new features or components to your system, you’ll need to test your disaster-recovery plan to account for changes. Ideally, this should happen at least every quarter—but it’s easy to put off review until it’s too late.
Multi-cluster disaster recovery
Let’s assume you’re running a modern containerized application on Kubernetes. Let’s further assume that your application runs on multiple distributed clusters to maximize availability and performance. How does that affect disaster recovery?
Unfortunately, multi-cluster doesn’t mean automatic failover during an outage. DNS servers often become unavailable, and even when they stay up, DNS configuration can cause problems. DNS relies on time-to-live (TTL) settings to control how long resolvers cache your records, and there’s no guarantee that every resolver will honor your TTL. The result: distributed clusters can be healthy yet effectively invisible during an outage, because clients keep resolving to the address they cached.
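You can watch that caching behavior directly. As a quick sketch using the third-party dnspython package (with example.com standing in for your own record), you can ask your resolver what TTL it currently reports:

```python
# ttl_check.py -- inspect the TTL a resolver reports for an A record.
# "example.com" is a placeholder for your own domain.
import dns.resolver  # pip install dnspython

answer = dns.resolver.resolve("example.com", "A")
print(f"TTL reported by resolver: {answer.rrset.ttl} seconds")
for record in answer:
    print("A record:", record.address)
```

Run it twice in quick succession and the second TTL will usually be lower: that’s the resolver’s cache counting down, and until it expires, clients keep being sent to the address they already have.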
What if there was another approach to disaster recovery?
BGP + Anycast = A match made in heaven
Imagine a disaster-recovery approach that is self-healing, doesn’t require human intervention, involves no single point of failure, and anticipates that anything could go down at any time—including your DNS servers.
The solution? Border Gateway Protocol (BGP) and Anycast Internet Protocol (IP) addresses used together.
First, you’ll need to purchase an IP address range, and your cloud provider must allow you to bring your own IP range (something that most public clouds support). Then there’s the learning curve associated with implementing BGP. The Internet is essentially a network of networks, where each large constituent network under one administrative routing policy is an autonomous system. BGP is how autonomous systems tell one another which IP ranges they can reach, so that when one server sends Transmission Control Protocol (TCP) packets to another, routers along the way can choose an efficient route to the destination.
This happens by way of BGP “speakers” that announce the range of IP addresses within their autonomous system to all other autonomous systems. Within a few seconds, the entire Internet knows where each specific IP range resides. When you have a packet that needs to reach a specific IP address, every system knows where to send it based on the IP range it falls within. When the packet reaches the autonomous system with the correct IP range, internal routing finds the exact server with the exact IP address and sends your packet through to its destination.
As a failover mechanism, BGP offers significant time savings. It’s not uncommon for DNS-based failover to take five minutes or more because of cached records; BGP convergence takes just seconds. When a BGP speaker announces an IP range, the whole world soon knows about it. Similarly, when it stops announcing that range, the entire Internet learns of the withdrawal just as quickly.
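One common way to wire this up (an illustration, not something prescribed here) is to pair a BGP speaker such as ExaBGP with a local health check: the speaker announces your range while the service is healthy and withdraws it as soon as the check fails. A minimal sketch, with the prefix and health endpoint as placeholder assumptions:

```python
# anycast_health.py -- a sketch of the ExaBGP health-check pattern. ExaBGP
# runs this script as a configured process and relays whatever it prints on
# stdout to its BGP peers. The prefix and health URL are placeholders.
import sys
import time
import urllib.request

PREFIX = "203.0.113.0/24"                     # hypothetical bring-your-own range
HEALTH_URL = "http://127.0.0.1:8080/healthz"  # assumed local readiness endpoint
announced = False

while True:
    try:
        urllib.request.urlopen(HEALTH_URL, timeout=2)
        healthy = True
    except OSError:
        healthy = False

    if healthy and not announced:
        # Tell peers this site serves the range.
        sys.stdout.write(f"announce route {PREFIX} next-hop self\n")
        announced = True
    elif not healthy and announced:
        # Stop announcing; traffic converges on the next-closest site.
        sys.stdout.write(f"withdraw route {PREFIX} next-hop self\n")
        announced = False
    sys.stdout.flush()  # ExaBGP reads commands line by line
    time.sleep(1)
```

Because the withdrawal propagates through BGP itself, there is no cached DNS record anywhere to wait out; that is where the seconds-versus-minutes difference comes from.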
There are a few addressing methods for sending packets across networks, including:
- Unicast: One server sends a packet over the network to exactly one destination server.
- Multicast: One server sends a packet that reaches many different destinations.
- Anycast: Many servers around the world share the exact same public IP address, and the network routes your packet to the nearest one (nearest in routing terms, which usually tracks geography).
Anycast and BGP enable a world of possibilities, offering built-in failover to automatically adapt to cloud outages. While you can’t guarantee 100% uptime, combining BGP with Anycast will bring you close to the holy grail with minimal effort.
BGP + Anycast in action
To better understand the benefits, let’s look at a simple scenario: a small Kubernetes test application that requests a response every second. We’ll deploy three clusters across different clouds and regions: one in New York City, another in Amsterdam, and a third in Sydney. In a healthy state, all three clusters announce the same Anycast public IP address.
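A per-second probe like the one described here might look like the following minimal sketch. The address is a documentation-range placeholder, and we assume each cluster’s response body names its region:

```python
# probe.py -- hit the shared Anycast IP once per second and print which
# region answered. The address is a placeholder; we assume each cluster's
# HTTP response body identifies its region.
import time
import urllib.request

ANYCAST_IP = "203.0.113.10"  # hypothetical address from your own range

while True:
    stamp = time.strftime("%H:%M:%S")
    try:
        with urllib.request.urlopen(f"http://{ANYCAST_IP}/", timeout=1) as resp:
            print(stamp, "response from", resp.read().decode().strip())
    except OSError as exc:
        # A brief error here is the BGP convergence window after a
        # region stops announcing the range.
        print(stamp, "no response:", exc)
    time.sleep(1)
```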
With our test app, if you’re located in Zurich, you’ll receive a response from Amsterdam because it’s the closest location. If, instead, you're located in Cairo, you’ll also get a response from Amsterdam because it is again the closest location.
If you repeat this scenario, sending a request to each cluster every second, and stop announcing the IP range for Amsterdam (to simulate one region going down), the app will start getting responses from New York, the next-closest location, in less than a second. Repeat the same process and take down the New York cluster, and you’ll start receiving responses from Sydney almost instantly. The system will have recovered on its own in under a second.
You will have automatically rerouted traffic to healthy clusters in real time without having to touch a single thing. No disaster-recovery strategy required.