Design for Availability – Game of 9s

Recently, in one of our meetings, I heard the statement “for our solution, we require near 100% availability”. But do we really understand what near 100% availability means? For me, anything above 99% is near 100%. In reality, though, there is a huge difference between 99% availability and 99.9999% availability.

Let’s look at the definition of availability: “Availability is the percentage of time that the infrastructure, system or solution remains operational under normal circumstances in order to serve its intended purpose.”

The mathematical formula for availability is: Percentage of availability = ((total elapsed time – sum of downtime) / total elapsed time) × 100

That means, for an SLA of 99.999 percent availability (the famous five nines), the yearly service downtime could be as much as 5.256 minutes.
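
To make the arithmetic concrete, here is a minimal Python sketch of the formula above. It assumes a 365-day year, and the function name yearly_downtime_minutes is only illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def yearly_downtime_minutes(availability_percent: float) -> float:
    """Maximum downtime per year that an availability target still allows."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# Compare how quickly the downtime budget shrinks with each extra nine.
for target in (99.0, 99.9, 99.99, 99.999):
    print(f"{target}% -> {yearly_downtime_minutes(target):,.3f} minutes/year")
```

Each additional nine cuts the yearly downtime budget by a factor of ten, which is why the gap between 99% and 99.999% is so large.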

availability 1

As IT leaders, we should be aware of the differences between the nines and define requirements properly for the development team. The higher the number of nines, the higher the operational and development cost.

Another statement I heard during the discussion: “cloud providers mostly offer 99.95% availability, so our system provides the same.” Really? That may be true if you are using a SaaS solution from a cloud provider. But if you are developing your own solution on top of a cloud provider’s IaaS or PaaS services, then consider the following two things,

  1. The SLA defined by a cloud provider covers its individual services only. That means a combined SLA needs to be calculated based on the cloud services you consume within your solution. We will see how this is calculated in the next section.
  2. Even if you use only PaaS services in your solution, you still own the Application and Data layers, and any bug or issue in your code will result in non-availability. That also needs to be considered while calculating your solution’s availability.

Combined SLA for consumed cloud services

Suppose you are developing a simple web application using Azure PaaS services, such as Azure App Service and Azure SQL Database. Taken in isolation, these services usually provide something in the range of three to four nines of availability,

  • Azure App Service: 99.95%
  • Azure SQL Database: 99.99%
  • Azure Traffic Manager: 99.99%

However, when these services are combined within an architecture, there is a possibility that any one component could suffer an outage, bringing the overall solution availability lower than the availability of the individual services.

Services in Serial

In the following example, where App Service and SQL Database are connected in series, each service is a failure mode. There are three possible failure scenarios,

  1. App Service may go down, SQL Database may still be up and running
  2. App Service may be up and running, SQL Database may go down
  3. Both App Service and SQL Database may go down together

availability 2

So, to calculate the combined availability of serially connected services, simply multiply the individual availability percentages, i.e.

Availability of App Service * Availability of SQL Database

=

99.95% * 99.99%

=

99.94%

Observation – the combined availability of 99.94% is lower than the availability of either individual service.
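
The same multiplication can be written as a small helper. This is only a sketch: the function name serial_availability is illustrative, availabilities are passed as fractions (0.9995 for 99.95%), and the calculation assumes the services fail independently of each other.

```python
from math import prod


def serial_availability(*availabilities: float) -> float:
    """Combined availability (as a fraction) when every service must be up."""
    return prod(availabilities)


# App Service (99.95%) in series with SQL Database (99.99%).
combined = serial_availability(0.9995, 0.9999)
print(f"{combined:.2%}")
```

Adding more services to the chain can only lower the result, because every factor is below 1.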

Services in Parallel

Now, to make this solution highly available, you can deploy an identical replica of it in another region and add Traffic Manager to dynamically redirect traffic to one of the regions. This adds more failure modes, but we will see how it increases solution availability.

As we calculated,

  • Availability across services in Region A = 99.94%
  • Availability across services in Region B (replica of Region A) = 99.94%

Region A and Region B are parallel to each other. So, to calculate the combined availability of parallel services, use the following formula,

1 – ((1 – Region-A availability) * (1 – Region-B Availability))

=

1 – ((1 – 99.94%) * (1 – 99.94%))

=

99.9999%

availability 3

Also observe that Traffic Manager is in series with the two parallel regions. So the combined solution availability will be,

Availability of Traffic Manager * Combined availability of both regions

=

99.99% * 99.9999%

=

99.99%

Observation – we are able to increase availability from three nines to four nines by adding a new region in parallel.
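
The whole calculation for the two-region design can be sketched in a few lines of Python. The function name parallel_availability and the constant names are illustrative, and the formula assumes that the two regions fail independently of each other.

```python
def parallel_availability(*availabilities: float) -> float:
    """Availability (as a fraction) when at least one replica must be up."""
    all_down = 1.0
    for availability in availabilities:
        all_down *= 1 - availability  # probability that every replica is down
    return 1 - all_down


APP_SERVICE = 0.9995
SQL_DATABASE = 0.9999
TRAFFIC_MANAGER = 0.9999

region = APP_SERVICE * SQL_DATABASE                   # services in series
both_regions = parallel_availability(region, region)  # regions in parallel
solution = TRAFFIC_MANAGER * both_regions             # Traffic Manager in series

print(f"single region:        {region:.4%}")
print(f"two regions:          {both_regions:.5%}")
print(f"with Traffic Manager: {solution:.4%}")
```

Note that Traffic Manager now caps the result: the overall figure can never be higher than the availability of the single component that sits in front of both regions.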

Please note that the above is the combined availability of the Azure services you have chosen. It doesn’t include your custom code. Remember the following diagram, which explains what is owned by the cloud provider and what is owned by you, based on the cloud platform you choose,

availability 4

Going back to our web application example using App Service and SQL Database, we have opted for the PaaS platform. In that case, the availability we calculated covers the Runtime through Networking layers and doesn’t include your custom code in the Application and Data layers. So you still must design those layers for high availability. The following techniques are useful when designing a highly available solution,

  1. Auto-scaling – design solution to increase and decrease instances, based on active load
  2. Self-Healing – dynamically identify failures and redirect traffic to healthy instances
  3. Exponential backoff – implement retries on the requester side with increasing delays between attempts; this simple technique increases the reliability of the application and takes care of intermittent failures
  4. Broker pattern – implement message passing architecture using queues and allow decoupling of components
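
As an illustration of the exponential backoff technique from the list above, here is a minimal Python sketch. The function name call_with_backoff and its parameters are hypothetical, the random jitter is my own addition (a common refinement, not something the technique strictly requires), and catching every Exception is a simplification; real code would retry only on errors known to be transient.

```python
import random
import time


def call_with_backoff(operation, max_attempts=5, base_delay=0.5,
                      max_delay=30.0, sleep=time.sleep):
    """Call operation(), retrying failures with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # The delay ceiling doubles on every attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Jitter (assumption): sleep a random time up to the ceiling so
            # that many clients do not retry at the same moment.
            sleep(random.uniform(0, ceiling))


# Usage: an operation that fails twice with a transient error, then succeeds.
attempts = {"count": 0}

def flaky_operation():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

print(call_with_backoff(flaky_operation, base_delay=0.01))
```

The sleep parameter exists only so that the delay can be replaced in tests; production code can leave it at its default.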

The price of availability

Please remember one thing: availability has a cost associated with it. The more available your solution needs to be, the more complexity is required, and therefore the more expensive it will be.

availability 5

A highly available solution requires a high degree of automation and self-healing capabilities, which require significant development, testing and validation. This takes time, money and the right resources, all of which impact cost.

In the end, analyzing your system and calculating its theoretical availability will help you understand your solution’s capabilities and take the right design decisions. However, actual availability is highly affected by your ability to react to failures and recover the system, either manually or through self-healing processes.

Non-Functional Requirements – most neglected aspect of Software Development

Everyone working in the software industry knows what Non-Functional Requirements (NFRs) are. Even so, I have seen many cases where a solution is designed, developed and delivered without considering key NFRs, with very poorly defined NFRs, or with NFRs defined very late in the development cycle. Ultimately, either the solution fails or the business spends extra time and budget getting the solution fixed to meet these missing non-functional requirements.

In this article, we are going to discuss 3 important things about NFRs,

  • Why and When to capture NFRs
  • How to define measurable and testable NFRs
  • NFR trade-off matrix and its importance

Why NFRs and When to capture NFRs?

A Non-Functional Requirement does not describe what the system will do, but how the system will do it, such as performance requirements, design constraints, scalability requirements, etc.

Missing NFRs will have a direct impact on adoption of the system, such as,

  • System not scaling up to the customer’s needs, so the system slows down and becomes unresponsive
  • Security breach of confidential data
  • System not available at the time when it’s required most, resulting in direct impact to the business
  • Disaster Recovery and backup not configured, resulting in data loss
  • And many more

Non-functional requirements (NFRs) should be gathered as early as possible in the development cycle, preferably along with functional requirements.

One more question that is asked very frequently: whom should I contact to define NFRs, the customer’s IT team or the customer’s business folks? The answer is BOTH.

  • The IT team will provide details like the limitations of the current IT infrastructure and portability requirements, such as portability across different cloud platforms like AWS, Azure, etc.
  • The business will provide details related to performance and scalability, such as how much user/market growth they expect in the future, how this application can change the business and user interaction, etc.

So, contacting both business and IT is of the utmost importance when capturing NFRs.

Approach to define NFRs

Converting vague ideas about quality into something measurable is both an art and a science. Start by identifying which quality attribute you want to elaborate. Next, identify the metrics that will be used to measure that quality attribute. Once you have identified a measurable metric, use it to define a requirement that is both measurable and fulfills the customer’s need.

Following are a couple of examples elaborating this process,

2
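
A measurable NFR is also a testable one. As a hypothetical illustration (the requirement itself is my own example, not taken from the figure above), suppose the performance NFR reads "95% of requests must complete within 2 seconds". A minimal Python sketch of checking measured response times against it, using the nearest-rank percentile method, could look like this:

```python
def percentile(samples, pct: int) -> float:
    """Nearest-rank percentile of a non-empty list of numbers (pct in 1..100)."""
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct% of n
    return ordered[max(rank, 1) - 1]


def meets_response_time_nfr(response_times_s, pct: int = 95,
                            limit_s: float = 2.0) -> bool:
    """True when the pct-th percentile response time is within the limit."""
    return percentile(response_times_s, pct) <= limit_s


# Hypothetical measurements in seconds: 95 fast requests and 5 slow ones.
measured = [0.5] * 95 + [3.0] * 5
print(meets_response_time_nfr(measured))
```

The point is less the code than the shape of the requirement: a quality attribute (performance), a metric (95th percentile response time) and a threshold (2 seconds) that a test can check.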

Trade-Off among NFRs

Assume you are developing an application for an enterprise that can be accessed only by its employees. You want these employees to be productive and able to access the application from anywhere. But security is also important, as only company employees should be able to access the application and its data. So security and usability are both important, but there is potentially a trade-off to be made here. It would be convenient to just pick up any device and access the application without a password; alternatively, the application could be secured by requiring two-factor authentication every time it is accessed. These requirements contradict each other.

In this scenario, a trade-off matrix helps us identify and communicate these trade-offs so that you can deal with them intelligently. Following is an example of a trade-off matrix among 5 NFRs,

1

Read this table from left to right and you can see there is a negative relationship between Security and Usability. That means when there is a conflict between Security and Usability, preference is given to Security. That doesn’t mean you give no weight to Usability. Ask yourself: how can you maximize Usability without compromising Security? For the above example of two-factor authentication, supporting fingerprint ID or face recognition as the second factor could improve usability without compromising security.

I highly recommend preparing this trade-off matrix during the requirements phase itself and reviewing it with your stakeholders, so that everyone understands these trade-offs and there are no surprises during or at the end of the engagement. I am sure you will need to refer to this trade-off matrix multiple times during your development phases.

 

In short, defining measurable and effective NFRs requires some thought and creativity. I highly recommend planning to define NFRs early in the development cycle and including NFRs in all phases of your software development, from requirement gathering to design to development and all the way through testing.

 

Multi-Tenancy – Authentication and Authorization

In the last post, we saw how to design a multi-tenant solution and which factors influence the design decisions. One of the questions I received on that post was: what about authentication and authorization in a multi-tenant scenario?

To understand authentication and authorization in a multi-tenant scenario, let’s go back to the example of the Apartment Society, where each apartment is classified as a single tenant within the society. Each apartment may have multiple residents, who can be classified as users, and all of them are authenticated before entering the apartment society. Each of them can share the common resources of the society. But when they want to enter an apartment, they are authorized first, which means they can only enter their own apartment and not any other. In short, authentication occurs when entering the apartment society, and authorization occurs when entering an apartment.

For a multi-tenant solution, this authentication and authorization experience can vary, depending on the point at which the user selects the tenant/organization they belong to. This experience can be grouped into three major categories,

  • Tenant selection before authentication – In this case, the user is asked to provide/select the tenant name along with the authentication details. The system processes authorization along with authentication for this type of user.
  • Tenant selection after authentication – In this case, the user is authenticated first. After that, the user is prompted to provide/select the tenant name, and is authorized based on it.

Blog 2 - 1

  • Automatic tenant selection based on domain – In this case, at the time of authentication, the system identifies the user’s sub-domain or organization from his/her email ID, and the user is automatically authorized based on that information.
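
The third category, automatic tenant selection based on domain, can be sketched as a simple lookup. This is a minimal illustration only: the domain names, tenant IDs and the function name resolve_tenant are all hypothetical, and a real system would load the mapping from its tenant registry.

```python
from typing import Optional

# Hypothetical mapping from email domain to tenant ID.
DOMAIN_TO_TENANT = {
    "contoso.com": "tenant-contoso",
    "fabrikam.com": "tenant-fabrikam",
}


def resolve_tenant(email: str) -> Optional[str]:
    """Return the tenant for an email address, or None if it is unknown."""
    _, separator, domain = email.rpartition("@")
    if not separator:
        return None  # not an email address at all
    return DOMAIN_TO_TENANT.get(domain.strip().lower())


print(resolve_tenant("asha@Contoso.com"))
print(resolve_tenant("someone@unknown.example"))
```

When the lookup returns None, the application can fall back to one of the first two categories and ask the user to select a tenant explicitly.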

Now the question is: is there a simple way to implement this authentication and authorization? The answer is YES. Within Azure, you have two options: Azure AD B2B and Azure AD B2C.

  • Azure AD B2B is for scenarios where you would like to share your organization’s resources with external users so they can collaborate. https://docs.microsoft.com/en-us/azure/active-directory/b2b/what-is-b2b
  • Azure AD B2C is primarily for customer-facing applications. Azure AD B2C can be leveraged as a full-featured identity system for your application, where identities from different tenants/organizations can be supported.

Sign-in journey using Azure AD B2C

Following is an example of a sign-in journey using Azure AD B2C,

Blog 2 - 2

  • Step 1 – the user selects an identity provider
  • Step 2 – the user provides a username and password
  • Step 3 – leverage Azure AD B2C for authentication, which internally connects to multiple identity providers. Please refer to the tutorial on how to add identity providers – https://docs.microsoft.com/en-us/azure/active-directory-b2c/tutorial-add-identity-providers
  • Step 4 – authorize the user based on tenant and additional attributes collated from any CRM system
  • Step 5 – issue an Azure AD B2C token to the calling application
  • Step 6 – the calling application receives the token, parses the claims and grants access to the user accordingly
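
Step 6 can be sketched as follows. This is an assumption-heavy illustration: it takes the claims as an already decoded dictionary and assumes the token signature, issuer and expiry were validated earlier by a proper token library. The claim names tenant_id and roles and the function name is_authorized are hypothetical, not the names Azure AD B2C actually issues.

```python
def is_authorized(claims: dict, requested_tenant: str, required_role: str) -> bool:
    """Allow access only to the user's own tenant and only with the right role."""
    # Hypothetical claim names; real claim names depend on the token issuer.
    if claims.get("tenant_id") != requested_tenant:
        return False  # a user must never reach another tenant's data
    return required_role in claims.get("roles", [])


# Claims as they might look after the token has been validated and decoded.
claims = {"sub": "user-123", "tenant_id": "tenant-a", "roles": ["sales"]}

print(is_authorized(claims, "tenant-a", "sales"))
print(is_authorized(claims, "tenant-b", "sales"))
```

Checking the tenant before the role mirrors the apartment analogy: being let into the society does not open every apartment door.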

Positive Vs Negative Attitude

Recently there have been instances where I felt I was surrounded by a lot of negative folks who were trying hard to pull me into their negative thinking zone. This experience made me think: “how do I identify this negativity, and can I do something about it?”

How to identify negative vs positive attitude?

| Positive Attitude Individual | Negative Attitude Individual |
| --- | --- |
| In case of a problem, the individual will come up with solutions. | In case of a problem, the individual will come up with excuses and will try supporting them with hypothetical statements. |
| Individual will look for long-term goals | Individual will look for short-term gains |
| Individual will take ownership of work and demonstrate leadership traits | Individual will try to run away from ownership |
| Individual looks for opportunities | Individual looks for limitations |

How to retain positive thinking?

  1. Focus on problem solving for existing problems
  2. Spend time with positive people
  3. Stop complaining
  4. Be curious and embrace learning
  5. Look for long term, instead of short term goals
  6. Assume responsibility and choose your response
  7. Stay away from negativity (including folks with negative attitude)
  8. Always support people around you

 

Our Vedic scripture also has a reference to this,

“Vitarka badhane pratipaksha bhavanam”

Vitarka = improper thoughts; Badhane = troubling; Pratipaksha = opposite; Bhavanam = side.

“When improper thoughts trouble you, then take the opposite side.”

Summary – the mind is capable of thinking about only one thing at a time, so make it positive. Whenever you have a chain of bad thoughts, just think of the opposite, break the chain, and go get some fresh air.

Multi-tenancy – Simplified

Recently I was having a conversation about multi-tenancy, and one of the responses I got was: our application already supports multiple users, so it’s already multi-tenant. Do multiple users mean multi-tenant? Is that correct?

Before we go further and explore multi-tenancy, let’s understand the difference between a user and a tenant.

  • User – A user is the smallest unit of classification. A user can be independent or can belong to a role or a tenant. The simplest way to identify a user: a user has its own username and password to log in to the system.
  • Role – A role is a classification within a tenant, and multiple users can belong to a role. E.g. within a system, a group of users may belong to an administrator role, a sales role, etc. One role may overlap with another role, but a role can’t be a subset of another role.
  • Tenant – A tenant is the largest classification. A tenant is generally used to represent a department or an enterprise that enforces its own policies or rules within a system. Each tenant consists of a different set of users. Users are not shared across tenants.

1

An Apartment Society is a perfect example of multi-tenancy. An Apartment Society has centralized administration for security (CCTV, security at the entrance, etc.), electricity, water, and other facilities. These facilities are governed by the apartment association and shared by tenants. One apartment can be classified as one tenant, for whom an individual electricity/water bill is generated. But each apartment may have multiple users, i.e. the family members staying in the apartment. And what is the benefit of an apartment? It provides security and privacy to a tenant, even though they use shared resources of the apartment society like electricity, water, parking, etc.

So, what is Multi-tenancy with respect to Software?

When a single application is shared by multiple tenants (i.e. organizations or departments), with each tenant oblivious of the others, it is termed a multi-tenant application. A multi-tenant application should be smart enough to partition data, compute and user experience by tenant.

Benefits of multi-tenancy

  1. Reduction of operational cost – as shared infrastructure is leveraged by multiple tenants.
  2. Centralized management – as shared infrastructure needs to be managed, rather than multiple infrastructures for each tenant.
  3. Reduced turnaround time – onboarding a new tenant is faster.

Things to be considered carefully while designing multi-tenant solution

  1. Data Privacy – strict authentication and authorization are required, so that customers can only access their own data.
  2. Serviceability and Maintainability – single update to the application will affect many different tenants, so any update must be planned carefully.

What factors influence multi-tenant architecture?

Before we define a multi-tenant architecture, let’s look at the factors that influence architecture patterns for multi-tenancy,

  1. Views – tenant can define styling of the application
  2. Business Rules – tenant can define their own business rules and logic
  3. Data Schema – tenant can define their own database schema for the application
  4. Users and Groups – tenant can define their own rules to achieve data/system level access controls
  5. User or System Load – each tenant may have different user-load requirements. E.g. Tenant-A has huge user base, and in comparison, Tenant-B & Tenant-C combined has a small user base.

Type of design patterns to implement Multi-tenancy

Multi-tenancy with a single multi-tenant database

This is the simplest form of multi-tenancy. It consists of a single app instance (both UI and business layers), along with a single database shared across multiple tenants. Usually tenant data is segregated within the database using a tenant key/ID.
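
Segregation by tenant key can be sketched with Python’s built-in sqlite3 module. The table, column and tenant names are hypothetical, and an in-memory SQLite database stands in for whatever shared database the application really uses.

```python
import sqlite3

# One shared database for all tenants; every row carries a tenant key.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, tenant_id TEXT NOT NULL, item TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO orders (tenant_id, item) VALUES (?, ?)",
    [("tenant-a", "laptop"), ("tenant-b", "monitor"), ("tenant-a", "keyboard")],
)


def orders_for_tenant(connection, tenant_id: str) -> list:
    """Return only the rows that belong to the given tenant."""
    rows = connection.execute(
        "SELECT item FROM orders WHERE tenant_id = ? ORDER BY id", (tenant_id,)
    )
    return [item for (item,) in rows]


print(orders_for_tenant(conn, "tenant-a"))
print(orders_for_tenant(conn, "tenant-b"))
```

The weak point of this pattern is visible in the sketch: every query must remember the tenant filter, so it is worth funnelling all data access through helpers like this one rather than writing ad hoc queries.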

2

Multi-tenancy with single database per tenant

The second form of multi-tenancy is a single app instance (both UI and business layers), but with a separate database for each tenant. Cost will be higher than the first option, but operational complexity is reduced because of the individual databases, which also make it easy to manage data schema changes for each tenant.

3

One of the solutions to implement this pattern is through Azure SQL Database Elastic Pool feature – https://docs.microsoft.com/en-us/azure/sql-database/sql-database-elastic-pool

Multi-tenancy with single-tenant business compute instances with single database per tenant

The third form of multi-tenancy is a single UI app instance, but with the business compute layer and database segregated for each tenant. Cost will be higher than the second option, but operational complexity is reduced because of the individual databases and the easy-to-manage business compute instances for each tenant. It also makes it easy to implement and deploy business rules for each tenant.

4

With the adoption of containerization and Docker technology, it becomes easier to segregate compute instances and even then manage them as one unit. Examples are Azure Service Fabric and Kubernetes, which are distributed systems platforms that make it easy to package, deploy, and manage scalable and reliable microservices and containers.

Multi-tenancy with single tenant compute instances with single database per tenant

Fourth form of multi-tenancy is – all three layers, UI app instance, business compute layer and database are segregated for each tenant. Cost will be highest for this option, but will provide complete segregation of security, privacy, compute and data for each tenant.

5

Hybrid Approach for Multi-tenancy

We looked at four different patterns to implement multi-tenancy. Another factor that strongly influences the selection of a pattern is the user/system load for a tenant. Suppose the user load for Tenant-A is so huge that we can’t share its instances with any other tenant, but the user load for Tenant-B and Tenant-C can be served by a single shared instance. For these scenarios, you can consider a hybrid approach, i.e. mix and match options 1 to 4. Example: for Tenant-A, go for segregated compute and database instances, but for Tenant-B and Tenant-C, go for shared compute instances.
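
In code, the hybrid approach boils down to a placement lookup: heavy tenants get dedicated resources and everyone else falls back to the shared ones. The sketch below is illustrative only; the Placement class, the instance names and the placement_for function are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    """Where a tenant's requests and data are served from."""
    compute: str
    database: str


# Default placement shared by all tenants with a modest load.
SHARED = Placement(compute="shared-compute", database="shared-db")

# Tenants whose load justifies dedicated compute and database instances.
DEDICATED = {
    "tenant-a": Placement(compute="tenant-a-compute", database="tenant-a-db"),
}


def placement_for(tenant_id: str) -> Placement:
    """Return the dedicated placement if one exists, else the shared one."""
    return DEDICATED.get(tenant_id, SHARED)


for tenant in ("tenant-a", "tenant-b", "tenant-c"):
    print(tenant, placement_for(tenant))
```

Keeping the decision in one lookup means a growing tenant can be moved to dedicated instances later by changing configuration rather than application code.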

6

So in the end, a multi-tenant architecture will help you serve everyone, from small customers to large enterprises, using shared resources, at reduced cost and with greater operational efficiency.

Cloud Agnostic – Not a myth anymore, just architect solution differently

Vendor lock-in is a term that every IT decision maker is either talking or thinking about. No one wants to be locked in with one cloud provider; everyone would like the flexibility, or at least the option, to move enterprise workloads between different cloud providers.

Multi-Cloud Strategy

Knowingly or unknowingly, many organizations already find themselves adopting a multi-cloud strategy, meaning they use different cloud providers for different workloads (for example, one workload in AWS and another in Azure) but don’t have the flexibility to move workloads from AWS to Azure or vice-versa.

1

Example of Multi-cloud strategy

Within the above multi-cloud strategy example, if IT wants to move Workload-2 from Azure to AWS, it will not be straightforward and will require some effort. The reason is that Workload-2 was not designed to be cloud agnostic.

So, what is Cloud Agnostic?

A workload that can be migrated seamlessly between different vendor clouds can be called a truly cloud agnostic workload. But is it possible to be truly cloud agnostic? Yes, it is; you just have to design and architect the workload to be cloud agnostic. But before we look into how to design a cloud agnostic solution, let’s look at the different models in which a workload can be cloud agnostic.

Dev/Test and Production Segregation

Running development and testing in one environment and production in another is one of the most common scenarios. The benefit of this setup is that development and testing run on cloud environments that are cheaper and don’t require scalability, while production runs where scalability is required.

2

Example of Dev/Test and Production segregation

Disaster Recovery in another cloud platform

Leveraging another cloud provider for the Disaster Recovery environment is the most useful scenario for being cloud agnostic. It requires running the production environment on one cloud provider and running a copy of the same application on another cloud provider. If the production environment goes down, all requests should be diverted to the DR environment hosted on the other cloud provider. While designing this cloud agnostic DR strategy, you have to keep RTO (Recovery Time Objective) and RPO (Recovery Point Objective) in mind and design accordingly, which will drive decisions such as how frequently data needs to be synced to the DR site.

3

Example of Disaster recovery in another cloud platform

Production in multi-cloud (truly cloud agnostic)

It requires running the application simultaneously on different vendor clouds, with both sites up and sharing the user load. The benefit of this approach is that if one cloud provider runs into an issue and goes down, your workload is still up and running with no or minimal RPO (Recovery Point Objective) impact. The only drawback of this approach is cost, as you will be paying both cloud providers. But it’s an ideal solution for business mission-critical applications.

4

Example of Production environments in multi-cloud

How to architect workloads to be cloud agnostic?

Before you start designing your solution, you need to understand which services are common across different cloud providers. To do that, you can broadly group cloud services into three categories,

  1. Base Services – Services which have become standard for providing any cloud-based service, such as virtualization, networking, etc.
  2. Broadly Accepted Services – Services which have been accepted by the industry and are available across most cloud providers. Examples of this type of service are Docker, Kubernetes, MongoDB, PostgreSQL, etc. As cloud adoption increases, you will see more and more services moving into this category.
  3. Unique Services – These are services which are unique to each cloud provider and become their selling point and differentiator. Examples of these are AWS Lambda, Azure IoT Hub, Azure Cognitive Services, etc.

5

Segregation or categorization of services

Now, to design/architect your solution to be cloud agnostic, prefer services from the first two categories. If you choose to use any of the unique services provided by a cloud provider, you are automatically tied to that provider and are no longer cloud agnostic.

Managed Services

The question arises: if you restrict yourself to broadly accepted services, you may miss out on most of the managed services provided by cloud platforms. If a managed service provides a huge advantage compared to building it on your own, it makes sense to use it rather than avoid it for the sake of vendor lock-in. If you still want to be vendor independent, there are two options,

  1. Look for managed services that accept common protocols. An example is Azure Cosmos DB, which can be accessed using the MongoDB wire protocol. Your application then has the flexibility to use Azure Cosmos DB through MongoDB APIs within Azure, getting the benefit of a managed service there, while using MongoDB in AWS.
  2. The second option is to put your own abstraction layer in front of the managed services, so that you can switch to another provider later.
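
The second option can be sketched as a small provider-neutral interface with one adapter per cloud. Everything named here is hypothetical (BlobStore, InMemoryBlobStore, save_report), and the in-memory adapter is only a stand-in so the example runs; real adapters would wrap each cloud provider’s own SDK behind the same interface.

```python
from abc import ABC, abstractmethod


class BlobStore(ABC):
    """Provider-neutral interface that the application codes against."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...


class InMemoryBlobStore(BlobStore):
    """Stand-in adapter; a real one would call a cloud storage SDK."""

    def __init__(self) -> None:
        self._blobs = {}

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def get(self, key: str) -> bytes:
        return self._blobs[key]


def save_report(store: BlobStore, report_id: str, content: bytes) -> None:
    """Application code depends only on BlobStore, never on a specific cloud."""
    store.put(f"reports/{report_id}", content)


store = InMemoryBlobStore()
save_report(store, "2024-q1", b"quarterly numbers")
print(store.get("reports/2024-q1"))
```

Switching providers then means writing one new adapter and changing which one is constructed at startup, while save_report and the rest of the application stay untouched.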

6

Example of custom providers for different cloud platforms

 

Microservices – especially what it is NOT !!

Microservice is another term in the industry that has been picked up in the last few years, and everyone is using it in one context or another. Some use it in the right context, and some use it to look cool without properly understanding the concept.

Microservice in itself doesn’t have a standard, it’s an architectural style, and there are many flavors of interpretations and implementations of microservice.

According to Martin Fowler: “Microservice architectural style is an approach to developing a single application as a suite of small services, each running in its own process and communicating with lightweight mechanisms, often an HTTP resource API. These services are built around business capabilities and independently deployable by fully automated deployment machinery. There is a bare minimum of centralized management of these services, which may be written in different programming languages and use different data storage technologies.”

To put it in a simple statement: break down an application into small services that each perform one business function and can be developed, deployed, and scaled independently.

Microservice properties

There is already a lot of literature, books and articles available to understand microservices. One book I personally recommend is Microservices with Azure, authored by Namit TS and Rahul Rai, which I got a chance to tech review.

But rather than going into microservices in detail, let me list a few things that don’t qualify as microservices, so that technologists can understand the distinction and stop creating confusion.

What is NOT a Microservice

  • A published API is not a microservice. A microservice could be the architectural implementation behind an API, but that detail is independent of the API specification. You don’t say “here is a microservice ready to be consumed by other applications”, you only say “here is an API specification ready to be consumed by other applications”.
  • A functionality exposed as an API by another system is not a microservice, it’s just an API. It doesn’t matter how it is implemented in the backend; for the consumer application it’s just an API.
  • A simple API implemented as part of a monolithic application is not a microservice; again, it’s just an API.
  • If a service is dependent on another service, it’s not a microservice.
  • If a service is not independently deployable, it’s not a microservice.
  • If a service can’t be updated independently, it’s not a microservice.
  • A monolithic set of services grouped together and deployed in a Docker container is not a microservice.

and in the end,

  • Stop putting terms like microservices within Conceptual Architecture diagrams. Conceptual architecture is all about appropriate decomposition of the system without going into specifics of interface specification or implementation details.