Design for Reliability – driven by RTO and RPO

Reliability, maintainability, and availability (RAM) are important aspects of any software development. We discussed availability in detail in the last post. In this post, let's talk about reliability.

Let's first look at the definition of reliability – "Reliability is the probability that a product will continue to work normally over a specified interval of time, under specified conditions."

The recent disruption of Garmin services through a ransomware attack affected not only sports activities but also critical services like flight navigation and routing. This attack highlighted the importance of reliability in software development. Failures are inevitable; reliable solutions are those that can recover from failures with minimal impact on users or consumer services.

Reliability and maintenance are inversely related: a more reliable product requires less time and effort for maintenance. So, there is often an NFR trade-off between reliability and maintainability.

Reliability Metrics

Traditionally, system reliability was measured through the MTTF (mean time to failure) and MTTR (mean time to restoration) metrics. MTTF is the expected time until the next failure, and MTTR is a measure of how quickly the system can be restored after a failure.

Nowadays, another pair of similar metrics is used to define reliability requirements – RTO (recovery time objective) and RPO (recovery point objective).

  • Recovery Time Objective (RTO) defines how long your business can afford to be down after a disaster occurs; if the downtime exceeds the RTO, the business will suffer irreparable harm. The objective is to restore services to operation within this RTO after the disaster occurs.
  • Recovery Point Objective (RPO) defines the maximum tolerable amount of data that can be lost. It determines the maximum interval that can elapse between the last data backup and the disaster, without causing significant damage to the business.
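
As a minimal sketch of how these two objectives translate into checks (the function names and example thresholds are illustrative, not from the original post):

```python
from datetime import timedelta

def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """Worst-case data loss equals the backup interval, so it must not exceed the RPO."""
    return backup_interval <= rpo

def meets_rto(estimated_recovery: timedelta, rto: timedelta) -> bool:
    """Estimated end-to-end recovery time must fit within the RTO."""
    return estimated_recovery <= rto

# Example: hourly backups against a 4-hour RPO, a 30-minute runbook against a 1-hour RTO
print(meets_rpo(timedelta(hours=1), rpo=timedelta(hours=4)))     # True
print(meets_rto(timedelta(minutes=30), rto=timedelta(hours=1)))  # True
```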

Both RTO and RPO are business metrics; getting them defined early in the development cycle will help you better design and architect the solution. In the next section, let's look at a process that can help you design solutions for reliability.

Design for Reliability (DfR)

DfR (design for reliability) is a process for ensuring that a system performs its function within a given environment over its expected lifetime. The DfR process has two primary benefits:

  1. Assurance of reliability – the DfR process embeds specific activities within the development lifecycle, ensuring reliability is baked into the final product.
  2. Cost control and profit preservation – it helps keep the budget under control, which ultimately helps preserve market share.

DfR is divided into 5 key phases:

  1. Define – define clear and quantifiable reliability requirements that meet business and end-user needs. Many factors play a role in defining reliability requirements, such as cost, customer expectations, competitive analysis, and benchmarks. Once defined, the requirements are translated further into design, development, validation, monitoring, and disaster recovery requirements.
  2. Identify and Design – whether it's a new product or an upgrade project, the purpose of this phase is to identify key reliability risk items, prioritize them, and detail corrective actions to mitigate those risks through design decisions. One tool that really helps in identifying these risks is DFMEA (design failure mode and effect analysis), elaborated in the next section of this post.
  3. Analyze – design changes or new design decisions arising from the DFMEA are analyzed by evaluating them against previous failure data or against different design concepts. The focus is to explore, discover, and reveal design weaknesses so that design changes can improve product reliability.
  4. Verify – this phase starts when the design changes or new designs are implemented by the development team. The changes are validated through load testing, performance testing, stress testing, and DR (disaster recovery) drills. During this testing, identified failure scenarios are simulated and corrective actions are verified. If any test fails, go back to the design phase and take corrective action to mitigate those failures.
  5. Sustain (monitor and control) – once all changes are released to production, the product is continuously monitored for failures, either by monitoring system performance, through synthetic testing, or by monitoring degradation of health parameters. This is an important phase, as it helps you measure product reliability in production and improve it in the future. If a disaster happens, measure the actual RTO and RPO; that will help you gauge the product's reliability against the specifications defined at the start of the project.

DFMEA

Design failure mode and effect analysis (DFMEA) is a systematic activity used to recognize and evaluate potential system, product, or process failures. DFMEA identifies the effects and outcomes of these failures and defines actions to mitigate them.

Components of a DFMEA template:

  1. Item – component or sub-system of the product to be analyzed; it will consist of one or many functions.
  2. Function – a function within the product/item; it will consist of one or many requirements.
  3. Requirement – a requirement of the function; it will have one or many potential failure modes.
  4. Potential Failure Mode – the way the component may fail to meet the requirement; it will result in one or many potential effects.
  5. Potential Effects – the ways the customer or consumer services are affected by the failure.
  6. Severity – ranking of the failure based on its most severe potential effect.
  7. Class – categorization of the failure as high or low risk impact.
  8. Cause – the reason for the failure. There can be multiple causes for a single failure.
  9. Control Method
    • Prevention Control – design or architecture action to prevent the potential failure from occurring.
    • Detection Control – design or architecture action to detect the failure.
  10. Correction
    • Corrective Action – action to remove or reduce the chance of the failure mode's cause.
    • Responsibility – team or individual responsible for implementing the recommended corrective action.
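
The template above can be captured as a simple record; a sketch, where the occurrence and detection rankings (standard in FMEA practice, though not listed in the template above) are added so the conventional risk priority number (RPN) can be computed:

```python
from dataclasses import dataclass

@dataclass
class DfmeaRow:
    item: str
    function: str
    requirement: str
    failure_mode: str
    potential_effect: str
    severity: int     # 1 (minor) .. 10 (catastrophic)
    occurrence: int   # 1 (rare) .. 10 (frequent) – standard FMEA ranking
    detection: int    # 1 (certain to detect) .. 10 (undetectable)

    @property
    def rpn(self) -> int:
        # Risk Priority Number: the conventional FMEA prioritization score
        return self.severity * self.occurrence * self.detection

row = DfmeaRow("Checkout service", "Process payment", "Complete within 2s",
               "Payment gateway timeout", "Order not placed",
               severity=8, occurrence=4, detection=3)
print(row.rpn)  # 96
```

Rows with the highest RPN are the ones to attack first with corrective actions.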

DFMEA should be a living document during the development process and should be kept updated as the product life cycle progresses.

Recovery or failover approaches

With cloud adoption and cloud-native applications becoming the norm, designing for failure is more important than ever. That is what makes the DfR process so valuable for any development effort: it lets you incorporate and plan for failure as early as possible in the development cycle. Based on the RTO and RPO defined for the product, you will have to identify and implement failure and recovery approaches as well. Following is a brief overview of three failover approaches to consider:

Backup and redeploy on disaster – this is the most straightforward disaster recovery strategy. In this approach, only the primary region has product services running. Data backups are taken periodically as per the defined RPO. The secondary region is not set up for automatic failover, so when a disaster occurs, you must spin up all parts of the product services in the new region. This includes setting up new product services, restoring the data, and configuring the network. Although this is the most affordable of the multi-region options, it has the worst RTO and RPO characteristics.

Active/Passive (warm spare) – the active-passive approach is the choice that many companies favor. It improves the RTO with a relatively small increase in cost over the redeployment pattern. In this scenario, there is again a primary and a secondary region. All traffic goes to the active deployment in the primary region. The secondary region is better prepared for disaster recovery because the data services are running in both regions and kept in sync. This standby approach has two variations: a database-only deployment or a complete light deployment in the secondary region.

Active/Active (Hot spare) – In an active-active approach, the product services and database are fully deployed in both regions. Unlike the active-passive model, both regions receive user traffic. This option yields the quickest recovery time. The product services are already scaled to handle a portion of the load at each region. Network traffic configurations are already enabled to use the secondary region. It is also, however, the most expensive approach, but you will achieve the best RTO and RPO.

Design for Availability – Game of 9s

Recently, in one of my discussions, I heard the statement – "for our solution, we require near 100% availability". But do we really understand what near 100% means? For me, anything above 99% is near 100. In reality, though, there is a huge difference between 99% availability and 99.9999% availability.

Let's look at the definition of availability – "Availability is the percentage of time that the infrastructure, system, or solution remains operational under normal circumstances in order to serve its intended purpose."

The mathematical formula for availability is: Percentage of availability = (total elapsed time – sum of downtime) / total elapsed time × 100

That means, for an SLA of 99.999 percent availability (the famous five nines), the yearly service downtime could be as much as 5.256 minutes.
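A minimal sketch turning the formula above into a downtime calculator (using a 365-day year, which matches the 5.256-minute figure for five nines):

```python
# Yearly downtime budget implied by an availability SLA
def allowed_downtime_minutes(availability_pct: float) -> float:
    total_minutes = 365 * 24 * 60  # 525,600 minutes in a 365-day year
    return total_minutes * (1 - availability_pct / 100)

for sla in (99.0, 99.9, 99.99, 99.999):
    print(f"{sla}%: {allowed_downtime_minutes(sla):,.3f} minutes of downtime per year")
```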

[Image: availability 1]

As IT leaders, we should be aware of the differences between the nines and define requirements properly for the development team. The higher the nines, the higher the operational and development cost.

Another statement I heard during a discussion – "the cloud provider mostly provides 99.95% availability, so our system also provides the same". Really? That may be true if you are using a SaaS solution from a cloud provider. But if you are developing your own solution over a cloud provider's IaaS or PaaS services, consider the following two things:

  1. The SLAs defined by cloud providers apply to their individual services only. That means a combined SLA needs to be calculated from the cloud services you have consumed within your solution. We will see how this is calculated in the next section.
  2. Even if you are using only PaaS services in your solution, you still own the application and data layers, and any bug or issue in your code will result in non-availability. That also needs to be considered while calculating your solution's availability.

Combined SLA for consumed cloud services

Suppose you are developing a simple web application using Azure PaaS services, such as Azure App Service and Azure SQL Database. Taken in isolation, these services usually provide something in the range of three to four nines of availability:

  • Azure App Service: 99.95%
  • Azure SQL Database: 99.99%
  • Azure Traffic Manager: 99.99%

However, when these services are combined within an architecture, any one component could suffer an outage, bringing the overall solution availability lower than the individual availabilities.

Services in Serial

In the following example, where App Service and SQL Database are connected in series, each service is a failure mode. There are three possible failures:

  1. App Service goes down while SQL Database is still up and running
  2. App Service is up and running while SQL Database goes down
  3. Both App Service and SQL Database go down together

[Image: availability 2]

So, to calculate combined availability for serially connected services, simply multiply the individual availability percentages, i.e.

Availability of App Service * Availability of SQL Database

=

99.95% * 99.99%

=

99.94%

Observation – the combined availability of 99.94% is lower than each individual service's availability.
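
The serial calculation above can be sketched in a few lines:

```python
# Combined availability of services in series: the product of the individual availabilities
def serial_availability(*components: float) -> float:
    result = 1.0
    for availability in components:
        result *= availability
    return result

app_service, sql_database = 0.9995, 0.9999
combined = serial_availability(app_service, sql_database)
print(f"{combined:.4%}")  # 99.9400%
```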

Services in Parallel

Now, to make this solution highly available, you can deploy a replica of the solution in another region and add a traffic manager to dynamically redirect traffic to one of the regions. This adds more failure modes, but we will see how it increases the solution's availability.

As we calculated,

  • Availability across services in Region A = 99.94%
  • Availability across services in Region B (replica of Region A) = 99.94%

Region A and Region B are parallel to each other. So, to calculate combined availability for parallel services, use the following formula:

1 – ((1 – Region-A availability) * (1 – Region-B Availability))

=

1 – ((1 – 99.94%) * (1 – 99.94%))

=

99.9999%

[Image: availability 3]

Also observe that Traffic Manager is in series with both parallel regions. So the combined solution availability will be:

Availability of Traffic Manager * Combined availability of both regions

=

99.99% * 99.9999%

=

99.99%

Observation – we were able to increase availability from three nines to four nines by adding a new region in parallel.
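
The parallel formula and the full chain above can be sketched as:

```python
# Availability of redundant regions in parallel: the system is down only when
# every replica is down at the same time
def parallel_availability(*replicas: float) -> float:
    failure_probability = 1.0
    for availability in replicas:
        failure_probability *= (1 - availability)
    return 1 - failure_probability

region = 0.9995 * 0.9999                 # serial availability within one region (99.94%)
both_regions = parallel_availability(region, region)
solution = 0.9999 * both_regions         # Traffic Manager sits in series with the two regions
print(f"two regions in parallel: {both_regions:.6%}")
print(f"with Traffic Manager:    {solution:.4%}")
```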

Please note, the above is the combined availability of the Azure services you have chosen; it doesn't include your custom code. Remember the following diagram, which explains what is owned by the cloud provider and what is owned by you, based on the cloud platform you choose:

[Image: availability 4]

Going back to our web application example using App Service and SQL Database, we have opted for a PaaS platform. In that case, the availability we calculated covers the runtime through networking layers; it doesn't include your custom code in the application and data layers. You still must design those layers for high availability. The following techniques are useful when designing a highly available solution:

  1. Auto-scaling – design the solution to increase and decrease instances based on active load
  2. Self-healing – dynamically identify failures and redirect traffic to healthy instances
  3. Exponential backoff – implement retries on the requester side; this simple technique increases the reliability of the application and handles intermittent failures
  4. Broker pattern – implement a message-passing architecture using queues, allowing decoupling of components
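
As a sketch of the exponential backoff technique above (the function and parameter names are illustrative, not from any specific library):

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Retry a flaky operation, doubling the wait between attempts and adding
    jitter so that many clients do not retry in lockstep."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last failure
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay + random.uniform(0, delay))  # backoff plus jitter
```

In practice you would catch only transient error types (timeouts, throttling responses) rather than every exception, so genuine bugs still fail fast.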

The price of availability

Please remember one thing: availability has a cost associated with it. The more available your solution needs to be, the more complexity is required, and the more expensive it will be.

[Image: availability 5]

A highly available solution requires a high degree of automation and self-healing capability, which in turn requires significant development, testing, and validation. That takes time, money, and the right resources, all of which impact cost.

Finally, analyzing your system and calculating theoretical availability will help you understand your solution's capabilities and make the right design decisions. However, actual availability is highly affected by your ability to react to failures and recover the system, either manually or through self-healing processes.

Multi-Tenancy – Authentication and Authorization

In the last post, we saw how to design a multi-tenant solution and what factors influence design decisions. One of the questions I received on that post – what about authentication and authorization in a multi-tenant scenario?

To understand authentication and authorization in a multi-tenant scenario, let's refer back to the example of the apartment society, where each apartment is classified as a single tenant within the society. Each apartment may have multiple residents, who can be classified as users, and all of them are authenticated before entering the apartment society. Each of them can share the common resources of the society. But when they want to enter an apartment, they are authorized first, meaning that after authorization they can only enter their own apartment, not any other apartment. So, in short, authentication occurs at the time of entering the apartment society, and authorization occurs at the time of entering an apartment.

Now, for a multi-tenant solution, this authentication and authorization experience can vary, depending on when the user selects the tenant/organization to which they belong. The experience falls into three major categories:

  • Tenant selection before authentication – in this case, the user is asked to provide/select a tenant name along with their authentication details. The system processes authorization along with authentication for this type of user.
  • Tenant selection after authentication – in this case, the user is authenticated first. After that, the user is prompted to provide/select a tenant name, based on which he/she is authorized.

[Image: Blog 2 - 1]

  • Automatic tenant selection based on domain – in this case, during authentication, the system identifies the user's sub-domain or organization from his/her email ID, and based on that information the user is automatically authorized.
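
The third category above can be sketched in a few lines; the domain-to-tenant table here is hypothetical and would normally come from a directory service or configuration store:

```python
from typing import Optional

# Hypothetical mapping of email domains to tenants (illustrative names)
DOMAIN_TENANTS = {
    "contoso.com": "contoso",
    "fabrikam.com": "fabrikam",
}

def resolve_tenant(email: str) -> Optional[str]:
    """Automatic tenant selection: derive the tenant from the user's email domain."""
    domain = email.rsplit("@", 1)[-1].lower()
    return DOMAIN_TENANTS.get(domain)

print(resolve_tenant("alice@contoso.com"))  # contoso
```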

Now the question is: is there a simple way to implement this authentication and authorization? The answer is YES – within Azure, you have two options: Azure AD B2B and Azure AD B2C.

  • Azure AD B2B is for scenarios where you would like to share organization resources with external users so they can collaborate. https://docs.microsoft.com/en-us/azure/active-directory/b2b/what-is-b2b
  • Azure AD B2C is primarily for customer-facing applications. Azure AD B2C can be leveraged as a full-featured identity system for your application, where different tenant/organization identities can be supported.

Sign-in journey using Azure AD B2C

Following is an example of a sign-in journey using Azure AD B2C:

[Image: Blog 2 - 2]

  • Step 1 – user selects an identity provider
  • Step 2 – user provides username and password
  • Step 3 – leverage Azure AD B2C for authentication, which internally connects to multiple identity providers. Please refer to the tutorial on adding identity providers – https://docs.microsoft.com/en-us/azure/active-directory-b2c/tutorial-add-identity-providers
  • Step 4 – authorize the user based on tenant and additional attributes collected from any CRM system
  • Step 5 – issue an Azure AD B2C token to the calling application
  • Step 6 – the calling application receives the token, parses the claims, and grants access to the user accordingly
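
Step 6 above boils down to decoding the token's claims segment; this sketch is for illustration only, since it skips signature validation (a real application must validate tokens with a maintained library against the issuer's signing keys):

```python
import base64
import json

def decode_claims(token: str) -> dict:
    """Decode the claims (payload) segment of a JWT.
    Illustration only: does NOT validate the signature."""
    payload_b64 = token.split(".")[1]
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(padded))

# Build a sample (unsigned) token just to show the decoding step
claims = {"tid": "tenant-a", "sub": "user-1"}
payload = base64.urlsafe_b64encode(json.dumps(claims).encode()).rstrip(b"=").decode()
token = f"eyJhbGciOiJub25lIn0.{payload}."
print(decode_claims(token)["tid"])  # tenant-a
```

A tenant-identifying claim such as `tid` is what the application would use to scope authorization decisions.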

Multi-tenancy – Simplified

Recently I was having a conversation regarding multi-tenancy, and one of the responses I got was – "our application already supports multiple users, so it's already multi-tenant". Do multiple users mean multi-tenant – is that correct?

Before we explore multi-tenancy further, let's understand the difference between a user and a tenant.

  • User – the user is the smallest unit of classification. A user can be independent or can belong to a role or a tenant. The simplest way to identify a user: a user has their own username and password to log into the system.
  • Role – a role is a classification within a tenant, and multiple users can belong to a role. E.g., within a system, a group of users may belong to an administrator role, a sales role, etc. One role may overlap with another role, but a role can't be a subset of another role.
  • Tenant – the tenant is the largest classification. A tenant is generally a department or an enterprise, which enforces its own policies or rules within a system. Each tenant consists of a different set of users; users are not shared across tenants.

[Image 1]

An apartment society is a perfect example of multi-tenancy. The society has centralized administration for security (CCTV, security at the entrance, etc.), electricity, water, and other facilities. These facilities are governed by the apartment association and shared by tenants. One apartment can be classified as one tenant, for whom an individual electricity/water bill is generated. But each apartment may have multiple users, i.e. the family members staying in the apartment. And what is the benefit of an apartment? It provides security and privacy to a tenant, even while they use shared resources of the society like electricity, water, and parking.

So, what is Multi-tenancy with respect to Software?

When a single application is shared by multiple tenants (i.e. organizations or departments), it is termed a multi-tenant application, with each tenant oblivious to the others. A multi-tenant application should be smart enough to partition data, compute, and user experience per tenant.

Benefits of multi-tenancy

  1. Reduced operational cost – shared infrastructure is leveraged by multiple tenants.
  2. Centralized management – one shared infrastructure needs to be managed, rather than separate infrastructure for each tenant.
  3. Reduced turnaround time to on-board a new tenant.

Things to consider carefully while designing a multi-tenant solution

  1. Data privacy – strict authentication and authorization are required, so that customers can only access their own data.
  2. Serviceability and maintainability – a single update to the application affects many different tenants, so any update must be planned carefully.

What factors influence multi-tenant architecture?

Before we define a multi-tenant architecture, let's look at the factors that influence architecture patterns for multi-tenancy:

  1. Views – tenant can define styling of the application
  2. Business Rules – tenant can define their own business rules and logic
  3. Data Schema – tenant can define their own database schema for the application
  4. Users and Groups – tenant can define their own rules to achieve data/system level access controls
  5. User or system load – each tenant may have different user-load requirements. E.g., Tenant-A has a huge user base, while Tenant-B and Tenant-C combined have a small user base.

Types of design patterns to implement multi-tenancy

Multi-tenancy with a single multi-tenant database

This is the simplest form of multi-tenancy. It consists of a single app instance (both UI and business layers), along with a single database shared across multiple tenants. Tenant data is usually segregated within the database using a tenant key/ID.

[Image 2]
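
A minimal sketch of the tenant-key segregation described above, using an in-memory SQLite database (the table and tenant names are illustrative):

```python
import sqlite3

# In-memory database standing in for the shared multi-tenant database
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (tenant_id TEXT NOT NULL, order_no TEXT, amount REAL)")
db.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    ("tenant-a", "A-001", 120.0),
    ("tenant-a", "A-002", 80.0),
    ("tenant-b", "B-001", 45.0),
])

def orders_for(tenant_id: str):
    # Every query is scoped by tenant_id, so one tenant never sees another tenant's rows
    return db.execute(
        "SELECT order_no, amount FROM orders WHERE tenant_id = ?", (tenant_id,)
    ).fetchall()

print(orders_for("tenant-a"))  # [('A-001', 120.0), ('A-002', 80.0)]
```

In a real application this scoping would be enforced centrally (e.g. in a data-access layer), not repeated by hand in every query.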

Multi-tenancy with single database per tenant

The second form of multi-tenancy is a single app instance (both UI and business layers), with a segregated database for each tenant. The cost is higher than the first option, but the individual databases make it easier to manage data schema changes for each tenant.

[Image 3]

One way to implement this pattern is through the Azure SQL Database elastic pool feature – https://docs.microsoft.com/en-us/azure/sql-database/sql-database-elastic-pool

Multi-tenancy with single-tenant business compute instances with single database per tenant

The third form of multi-tenancy is a single UI app instance, with the business compute layer and database segregated for each tenant. The cost increases over the second option, but the individual databases and per-tenant business compute instances make it easy to implement and deploy business rules for each tenant.

[Image 4]

With the adoption of containerization and Docker technology, it has become easier to segregate compute instances and still manage them as one unit. Examples are Azure Service Fabric and Kubernetes, distributed systems platforms that make it easy to package, deploy, and manage scalable and reliable micro-services and containers.

Multi-tenancy with single tenant compute instances with single database per tenant

The fourth form of multi-tenancy segregates all three layers – UI app instance, business compute layer, and database – for each tenant. The cost is highest for this option, but it provides complete segregation of security, privacy, compute, and data for each tenant.

[Image 5]

Hybrid Approach for Multi-tenancy

We have looked at four different patterns to implement multi-tenancy. There is another factor that heavily influences pattern selection: the user/system load per tenant. Suppose the user load for Tenant-A is so huge that we can't share its instances with any other tenant, while the load for Tenant-B and Tenant-C can be served by a single shared instance. For these scenarios, you can consider a hybrid approach, i.e. mix and match options 1 to 4. Example – for Tenant-A, go for segregated compute and database instances, but for Tenant-B and Tenant-C, go for shared compute instances.

[Image 6]
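
The hybrid routing decision above can be sketched as a simple lookup (the tenant and deployment names are hypothetical):

```python
# Hypothetical tenant-to-deployment routing for the hybrid approach: the heavy
# tenant gets dedicated instances while small tenants share a pool
TENANT_DEPLOYMENTS = {
    "tenant-a": "dedicated-a",    # huge user base: its own compute and database
    "tenant-b": "shared-pool-1",  # small tenants share compute instances
    "tenant-c": "shared-pool-1",
}

def deployment_for(tenant_id: str) -> str:
    # New tenants land in the shared pool by default until their load justifies more
    return TENANT_DEPLOYMENTS.get(tenant_id, "shared-pool-1")

print(deployment_for("tenant-a"))  # dedicated-a
```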

So, in the end, a multi-tenant architecture will help you serve everyone, from small customers to large enterprises, using shared resources, at reduced cost, and with operational efficiency.