Design for Availability – Game of 9s

Recently in one of the discussions I heard a statement – “for our solution, we require near 100% availability”. But do we really understand, what’s near 100% really means. For me, anything above 99% is near 100. But in reality, there is huge difference in 99% availability and 99.9999% availability.

Let’s look at definition of Availability – “Availability is the percentage of time that the infrastructure, system or a solution remains operational under normal circumstances in order to serve its intended purpose.

The mathematical formula for Availability is: Percentage of availability = (total elapsed time – sum of downtime)/total elapsed time

That means, for an SLA of 99.999 percent availability (the famous five nines), the yearly service downtime could be as much as 5.256 minutes.

availability 1

As an IT leader, we should be aware of differences between nines’ and define requirements properly for the development team. As higher the nines, higher will be operational and development cost.

Another statement I heard during discussion – “cloud provider mostly provides 99.95% availability, so our system also provides same.”. Really? That may be true, if you are using SaaS solution from any of the cloud provider. But if you are developing your own solution over any cloud provider’s IaaS or PaaS services, then consider following two things,

  1. SLA defined by cloud providers is of their individual services only. That means, combined SLA need to be calculated based on cloud services you have consumed within your solution. We will further see how this is calculated in the next section.
  2. Suppose you are using only PaaS services in your solution, then you still own Application and Data layer, any bug or issue in your code, will result in non-availability. That also need to be considered while calculating your solution availability.

Combined SLA for consumed cloud services

Suppose you are developing a simple web application using Azure PaaS services, such as Azure App Service and Azure SQL Database. Taken in isolation, these services usually provide something in the range of three to four nines of availability,

  • Azure App Service: 99.95%
  • Azure SQL Database: 99.99%
  • Azure Traffic Manager: 99.99%

However, when these services are combined within architecture there is possibility that any one component could suffer an outage, bringing overall solution availability lower than individual availability.

Services in Serial

In following example where App Service and SQL Database are connected in serial, each service is a failure mode. There could be three possibilities of failure,

  1. App Service may go down, SQL Database may still be up and running
  2. App Service may be up and running, SQL Database may go down
  3. Both App Service and SQL Database may go down together

availability 2

So, to calculate combines availability for serial connected services, simply multiply individual availability percentage, i.e.

Availability of App Service * Availability of SQL Database

=

99.95% * 99.99%

=

99.94%

Observation – combined availability of 99.94% is lesser than individual services availability.

Services in Parallel

Now to make this solution highly available, you can have same replica of this solution deployed in another region and add traffic manager to dynamically redirect traffic into one of the region. This may add larger failure modes, but we will see how it will enhance/increase solution availability.

As we calculated,

  • Availability across services in Region A = 99.94%
  • Availability across services in Region B (replica of Region A) = 99.94%

Both Region A and Region B are parallel to each other. So, to calculate combined availability for parallel services, use following formula,

1 – ((1 – Region-A availability) * (1 – Region-B Availability))

=

1 – ((1 – 99.4%) * (1 – 99.4%))

=

99.9999%

availability 3

Also observe, Traffic Manager is in series to both parallel regions. So combines solution availability will be,

Availability of Traffic Manager * Combined availability of both regions

=

99.99% * 99.9999%

=

99.99 %  

Observation – we are able to increase availability from three nines to four nines by adding a new region in parallel.

Please note, above is the combined availability of services (you have chosen) provided by Azure. This availability doesn’t include your custom code. Remember following diagram, which explains what is owned by cloud providers and what is owned by you based on cloud platform you choose,

availability 4

Going back to our web application example, using App Services and SQL Database, we have opted for PaaS platform. In that case, the availability we have calculated is from Runtime to Networking layers, which doesn’t include your custom code for Applications and Data layers. So those layers you still must design for high availability. You can refer some of the following techniques, which are useful while designing for high availability solution,

  1. Auto-scaling – design solution to increase and decrease instances, based on active load
  2. Self-Healing – dynamically identify failures and redirect traffic to healthy instances
  3. Exponential backoff – implement retries on the requester side, this simple technique increases the reliability of the application, and takes care of intermittent failures
  4. Broker pattern – implement message passing architecture using queues and allow decoupling of components

The price of availability

Please remember one thing, availability has a cost associated with it. The more available your solution need to be, the more complexity is required, and so forth more expensive it will be.

availability 5

High available solution requires high degree of automation and self-healing capabilities, which requires significant development, testing and validation. This will require time, money and right resources, and all this will impact cost.

In the last, analyzing your system and calculating theoretical availability will help you understand your solution capabilities and help you take right design decisions. However, this availability can be highly affected by your ability to react to failures and recover the system, either manually or through self-healing processes.

Cloud Agnostic – Not a myth anymore, just architect solution differently

Vendor Lock-In – a term which every IT decision maker is either talking and thinking about. No one wants to be locked with one cloud provider but like to have flexibility to move around or at-least have an option to move around enterprise workloads between different cloud providers.

Multi-Cloud Strategy

Knowingly or unknowingly many organizations already find themselves of adopting multi-cloud strategy. Meaning they are using different cloud providers for their multiple workloads, example – one workload in AWS and another workload in Azure, but don’t have the flexibility to move workloads from AWS to Azure or vice-versa.

1

Example of Multi-cloud strategy

Within the above multi-cloud strategy example, if IT wants to move Workload-2 from Azure to AWS, it will not be straightforward, and it will require some effort. Reason for that – because Workload-2 was not designed to be cloud agnostic.

So, what is Cloud Agnostics?

A workload that can be migrated seamlessly around different vendor clouds will be called a true cloud agnostics workload. But is it possible to be true cloud agnostic? Yes, it is, you just have to design and architect it to be cloud agnostic. But before we look into how to design a cloud agnostic solution, let’s look at different models in the way a workload can be cloud agnostics.

Dev/Test and Production Segregation

Running development and testing on one environment and production on another environment is one of the most common scenarios. Benefits of this setup is, running development and testing on cloud environments which are cheaper and don’t require scalability, and only use Production where scalability is required.

2

Example of Dev/Test and Production segregation

Disaster Recovery in another cloud platform

Leveraging another cloud provider for Disaster Recovery environment is the most useful scenario for being cloud agnostics. It requires running production environment on one cloud provider and running same copy of application on another cloud provider. If production environment goes down, all requests should be diverted to DR environment hosted on another cloud provider. While designing this cloud agnostics DR strategy, you have to keep RTO (Recovery Time Objective) and RPO (Recovery Point objective) in mind and design it accordingly, which will drive decisions like – how frequently data sync need to happen across DR site, etc.

3

Example of Disaster recovery in another cloud platform

Production in multi-cloud (truly cloud agnostic)

It requires running application simultaneously on different vendor clouds and both sites are up and running simultaneously and sharing user load. Benefits of this approach is – in case one cloud provider runs into issue and goes down, your workload is still up and running without any or with minimal RPO (Recovery Point Objective) impact. Only drawback of this approach will be – cost, as you will be paying double to both cloud providers. But it’s an ideal solution of business mission critical applications.

4

Example of Production environments in multi-cloud

How to architect workloads to be cloud agnostics?

Before you decide designing your solution, you need to understand what all services are common across different cloud providers. To understand that, you can categorize cloud services broadly within three categories,

  1. Base Services – Services which has become standard to provide any cloud-based services, such as virtualization, networking etc.
  2. Broadly Accepted Services – Services which has been industry accepted and mostly available across all cloud provides. Example of these type of services are, Dockers, Kubernetes, MongoDB, PostgreSQL, etc. As the cloud adoption increases, you will see more an more services getting moved within this category.
  3. Unique Services – These are services which are unique to each cloud provider and becomes their selling point and differentiator. Examples of these are AWS Lambda, Azure IoT Hub, Azure Cognitive Services, etc.

5

Segregation or categorization of services

Now to design/architect your solution and be cloud agnostic, prefer to choose services from first two categories. If you choose to use any of the unique services provided by cloud providers, in that case you are automatically tied up with that provider and you are no more cloud agnostic.

Managed Services

Question arises, if you restrict yourself up to broadly accepted services, in that case you may miss out most of the managed services provided by cloud platforms. If any of the managed service provides huge advantage in comparison to building on your own, in that case it makes sense to use it, rather than not using it for the sake of vendor lock-in. If you still want to be vendor independent, in that case there are two options,

  1. Look for managed services which accepts common protocols. Example of this is – Azure Cosmos DB, which can be accessed using MongoDB wire protocol. So, your application can have flexibility to leverage Azure Cosmos DB using MongoDB APIs within Azure, while also leveraging Mongo DB in AWS, getting a benefit of managed service within Azure.
  2. Second option is to have your own layer of segregation before leveraging managed services, so that you can switch to another provider later.

6

Example of custom providers for different cloud platforms

 

Microservices – especially what it is NOT !!

Microservice is another term in the industry, which was picked up in last few years, and everyone using it in one context or another. Some are using it in right context, and some are using to look cool, without properly understanding the concept.

Microservice in itself doesn’t have a standard, it’s an architectural style, and there are many flavors of interpretations and implementations of microservice.

According to Martin Fowler: “Microservice architectural style is an approach to developing a single application as a suite of small services, each running in its own process and communicating with lightweight mechanisms, often an HTTP resource API. These services are built around business capabilities and independently deployable by fully automated deployment machinery. There is a bare minimum of centralized management of these services, which may be written in different programming languages and use different data storage technologies.

To put it in simple statement – breakdown application into small services that each perform one business function and can be developed, deployed, and scaled independently.

Microservice properties

There is already lot of literature, books and articles available to understand Microservices. One of the book I will personally recommend will be Microservices with Azure, authored by Namit TS and Rahul Rai, and I personally got a chance to tech review it.

But rather than going into Microservices details, let me list down few things which doesn’t qualify as microservices, and technologist should understand that and stop creating confusion.

What is NOT a Microservice

  • A published API is not a microservice. Microservice could be an architectural implementation behind API, but that detail is independent of API specification. You don’t say “here is Microservice ready to be consumed by other applications”, you only say “here is API specifications ready to be consumed by other applications”
  • A functionality exposed as an API by another system is not a microservice, it’s just an API. It doesn’t matter how it is implemented in the backend, for consumer application it’s just an API.
  • A simple API implemented as part of a monolithic application is not a microservice, again it’s just an API.
  • If a service is dependent on another service, it’s not a microservice
  • If a service is not independently deployable, it’s not a microservice.
  • If a service can’t be updated independently, it’s not a microservice.
  • A monolithic set of services grouped together and deployed in a Docker container is not a microservice.

and in the end,

  • Stop putting terms like microservices within Conceptual Architecture diagrams. Conceptual architecture is all about appropriate decomposition of the system without going into specifics of interface specification or implementation details.

Cloud Solution – Cost Optimization and Capacity planning

Building cost aware cloud based solutions is very different than the traditional IT approach. While developing applications targeted for traditional local data center, we use to plan to allocate servers/hardware to meet maximum capacity required by the application. Even if maximum capacity required by the application is only used for few hours/days in a year, even in that case we usually allocate maximum capacity hardware. We use to do that because allocating hardware within traditional data centers was a tedious and time consuming job.

With evolution of cloud, allocation of hardware is just a click away. So applying the traditional IT approach of allocating maximum capacity in cloud is not recommended and will not even be feasible from cost perspective, you will end up paying for idle and redundant resources most of the time, and will make it very-very expensive for you.

0 Cloud consumption

So there is a big difference in traditional IT and Cloud based resource allocation approaches. As per traditional IT approach, you plan for worst case scenario, but in cloud based approach, you plan for immediate requirement only, but plan for on-demand scale up for worst case scenario.

Now the question arises, how do you plan resources and do cost monitoring for cloud based applications. To do that, consider following points,

  • Right Sizing – be Granular
  • Make Consumption Cost one of your Architecture’s Quality Attribute
  • Elasticity – Automate Scaling of Resources
  • Tie Cost with Business Returns
  • Continuous Monitoring
  • Purchasing Options
  • Architecture/Solution should be portable

Right Sizing – be Granular

While planning for capacity, be as much granular as much as you can. Each cloud service comes in different sizes, cost and features. There could be chances that multiple small instances will suffice your requirement, rather than one large instance, or vice versa. But how do you decide? You can decide only by understanding your requirements and by calculating expected load on the system.

Example
Suppose you are designing an IoT solution on Azure platform, and leveraging Azure IoT Hub for message ingress. IoT Hub comes in different flavors (free / S1 / S2 / S3), and each have its own pricing and limitations around number of messages supported per day. So question is which one to use and what would be expected price? For that you will have to do quick calculation to understand your message ingress load. Suppose you have half million devices and each device will send out one message per hour to the cloud. So total messages per day will become 12 million messages.

1 IoT Messages

Now take this total number of messages and calculate how many number of units will be required for each S1, S2 or S3 to support this 12M message load per day.

2 IoT Messages Calculation

Above calculation shows that 2 instance of S2 will easily suffice 12M messages load, and would be the cheapest too. Each cloud service is unique, so identifying right size for each service is first step towards cost optimization.

Make Consumption Cost one of your Architecture’s Quality Attribute

While architecting a solution, we consider usual quality attributes, such as Availability, Scalability, Security, Performance, Usability, etc. These are very well applicable for solutions either designed for cloud or traditional data centers. But for cloud, one more attribute plays a bigger role – cost of consumption. My suggestion is to consider “cost of consumption” as one of the architecture quality attribute and design solution accordingly.

There could be a scenario, where you may prefer to give priority to cost over other attribute such as performance. One such example I personally came across – we were using Azure Data Lake Analytics service for data analysis. Azure Data Lake Analytics is an on-demand analytics job service, where you only pay for the job’s running time and nothing else. In the first design approach, we gave priority to performance attribute and created couple of parallel running jobs, which were actually running on same data source, but doing different analytics. We achieved high performance, but the cost was high. Then we changed the approached and merged these jobs into one, which resulted in a slight latency in performance, but resulted in almost 75% of saving on cost.

Elasticity – Automate Scaling of Resources

Cloud brings the flexibility to allocate resources on demand whenever you need it and also release resources when you don’t require. As discussed earlier, allocate resources only what you need right now, but plan for automation to scale up when the demand increases. Almost all cloud providers provide this elasticity.

There are different ways application can be scaled,

Tie Cost with Business Returns

Depending upon the nature of the application, cost of consumption of the application should justify the business need and returns from that application. E.g. suppose you are hosting an eCommerce application, and have configured auto-scaling for that. Now if user load increases, and application scales up, cost of consumption will also increase. But if the user load is increasing on an eCommerce application, than there should be increase in direct profit also, which should take care of increase in cost of consumption.

Now if the increase in cost of consumption is not proportional to the increase in business, then it’s time to revisit architecture, review each component and its cost consumption in details, i.e. going back to point #1 – be granular.

Continuous Monitoring

Until unless you monitor, you will not even know how much you are consuming, and may get a big surprise at the end of the month. Monitoring doesn’t mean monitoring at application level. Monitoring has to be at granular level, i.e. at each resource level which is getting consumed within an application. Best way is to tag each resource, and monitor its usage on daily basis. Based on the monitoring data, analyze consumption and identify optimization options. After identifying optimization options, reconfigure your deployment, and start monitoring again. So it’s a recursive cycle of Monitoring, Analyze, Reconfigure and monitor again.

3 Monitoring

Purchasing Options

Each cloud provider is getting innovative in pricing and coming up with very lucrative options, such Monthly Commitment (for Azure) or Reserve Capacity (for AWS), along with usual Pay as you Go model. But can you commit for capacity on the very first day? Answer is “No”, you should first start with Pay as you go model, observe or benchmark your application consumption in production/similar to production usage, and identity how much minimum commitment you want to go with. If you can identify your application usage pattern, you can easily leverage commitment offerings from cloud providers and save significant amount.

Architecture/Solution should be portable

Cloud is evolving, new services are getting launched every day or every week. That time is gone, when you use to design solution once and expect it to run for next 10 years uninterrupted. Not only new cloud services, but chances are even for existing cloud services there may be new plans or SLAs been offered. So your architecture need to be agile and flexible to adopt new changes and leverage benefits of new services. Example, if your architecture is designed for microservices patterns, than you can leverage serverless cloud computing like Azure Functions or AWS Lambda.

 

So in short, cost optimization for cloud based solution starts from the Architecture phase itself, and continue during its life cycle in production usage. It’s not a one-time activity, it’s an on-going activity, where everyone contributes, be it architect, designer, developer, tester, and support engineer.