Design for Reliability – driven by RTO and RPO

Reliability, availability, and maintainability (RAM) are important aspects of any software solution. We discussed availability in detail in the last post; in this post, let’s talk about reliability.

Let’s first look at the definition of reliability – “Reliability is the probability that a product will continue to work normally over a specified interval of time, under specified conditions.”

The recent ransomware attack on Garmin disrupted not only sports activities but also critical services like flight navigation and routing. The incident highlighted the importance of reliability in software development. Failures are inevitable; reliable solutions are those that can recover from failures with minimal impact on users or consuming services.

Reliability and maintenance effort are inversely proportional to each other: a more reliable product requires less time and effort to maintain. So there is often an NFR trade-off to be made between reliability and maintainability.

Reliability Metrics

Traditionally, system reliability was measured through the MTTF (mean time to failure) and MTTR (mean time to restoration) metrics. MTTF measures the expected time until the next failure, and MTTR measures how quickly the system can be restored after a failure.

Nowadays, another set of similar metrics is used to define reliability requirements – RTO (recovery time objective) and RPO (recovery point objective).

  • Recovery Time Objective (RTO) defines how long the business can afford to be down after a disaster occurs; if the downtime exceeds the RTO, the business suffers irreparable harm. The objective is to bring services back to an operational state within this time limit after the disaster occurs.
  • Recovery Point Objective (RPO) defines the maximum tolerable amount of data that can be lost. It determines the maximum interval that can elapse between a data backup and a disaster without causing significant damage to the business.

Both RTO and RPO are business metrics; getting them defined early in the development cycle helps you design and architect the solution better. In the next section, let’s look at a process that can help you design solutions for reliability.
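To make these two metrics concrete, here is a minimal Python sketch that checks the outcome of a disaster recovery drill against agreed RTO and RPO values. The specific targets and the helper name are illustrative assumptions, not part of any standard.

```python
from datetime import timedelta

# Illustrative targets agreed with the business (hypothetical values)
RTO = timedelta(minutes=30)   # maximum tolerable downtime after a disaster
RPO = timedelta(minutes=15)   # maximum tolerable window of data loss

def meets_objectives(observed_downtime: timedelta, data_loss_window: timedelta) -> bool:
    """Return True when a recovery drill stays within both RTO and RPO."""
    return observed_downtime <= RTO and data_loss_window <= RPO

# Example: the restore took 22 minutes and the latest backup was
# 10 minutes old at the moment of failure.
print(meets_objectives(timedelta(minutes=22), timedelta(minutes=10)))  # True
```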

Design for Reliability (DfR)

Design for Reliability (DfR) is a process for ensuring that a system performs its function within a given environment over its expected lifetime. The DfR process has two primary benefits:

  1. Assurance of reliability – the DfR process embeds specific activities within the development lifecycle, ensuring reliability is baked into the final product.
  2. Cost control and profit preservation – it helps keep the budget under control, which ultimately helps preserve market share.

DfR is divided into five key phases:

  1. Define – define clear and quantifiable reliability requirements that meet business and end-user needs. Many factors play a role here, such as cost, customer expectations, competitive analysis, and benchmarks. Once defined, these requirements are translated further into design, development, validation, monitoring, and disaster recovery requirements.
  2. Identify and Design – whether it is a new product or an upgrade project, the purpose of this phase is to identify key reliability risks, prioritize them, and detail the corrective actions that mitigate those risks through design decisions. One tool that really helps in identifying these risks is DFMEA (design failure mode and effects analysis), elaborated in the next section of this post.
  3. Analyze – design changes or new design decisions arising from the DFMEA are analyzed by evaluating them against previous failure data or against alternative design concepts. The focus is to explore, discover, and reveal design weaknesses so that design changes can improve product reliability.
  4. Verify – this phase starts when the design changes or new designs are implemented by the development team. The changes are validated through load testing, performance testing, stress testing, and DR (disaster recovery) drills, during which the identified failure scenarios are simulated and the corrective actions verified. If any test fails, go back to the design phase and take corrective actions to mitigate those failures.
  5. Sustain (monitor and control) – once all changes are released to production, the product is continuously monitored for failures, whether by monitoring system performance, through synthetic testing, or by tracking degradation of health parameters. This phase is important because it lets you measure product reliability in production and improve it over time. If a disaster does occur, measure the actual RTO and RPO; this tells you how reliable the product is relative to the specifications defined at the start of the project.

DFMEA

Design failure mode and effects analysis (DFMEA) is a systematic activity used to recognize and evaluate potential system, product, or process failures. DFMEA identifies the effects and outcomes of these failures and defines actions to mitigate them.

Components of a DFMEA template:

  1. Item – the component or sub-system of the product to be analyzed; it consists of one or many functions.
  2. Function – a function within the product/item; it consists of one or many requirements.
  3. Requirement – a requirement of the function; it consists of one or many potential failure modes.
  4. Potential Failure Mode – the way the component may fail to meet the requirement; it results in one or many potential effects.
  5. Potential Effects – the way the customer or consuming services are affected by the failure.
  6. Severity – ranking of the failure based on its most severe potential effect.
  7. Class – categorization of the failure as high- or low-risk impact.
  8. Cause – the reason for the failure. There can be multiple causes for a single failure.
  9. Control Method
    • Prevention Control – a design or architecture action to prevent the potential failure from occurring.
    • Detection Control – a design or architecture action to detect the failure.
  10. Correction
    • Corrective Action – an action to remove or reduce the chance of the cause of the failure mode.
    • Responsibility – the team or individual responsible for implementing the recommended corrective action.

DFMEA should be a living document during the development process and should be kept updated as the product life cycle progresses.
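To give a feel for how these fields fit together, here is a minimal Python sketch of a single DFMEA worksheet row. The class name and field types are assumptions for illustration; the field names simply mirror the template components above.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DfmeaEntry:
    """One row of a DFMEA worksheet, mirroring the template fields above."""
    item: str                  # component or sub-system under analysis
    function: str              # function within the item
    requirement: str           # requirement of the function
    failure_mode: str          # how the component may fail to meet the requirement
    potential_effects: str     # impact on the customer or consuming services
    severity: int              # ranking of the most severe potential effect
    risk_class: str            # high- or low-risk categorization
    causes: List[str]          # a single failure mode can have multiple causes
    prevention_control: str    # design action to prevent the failure
    detection_control: str     # design action to detect the failure
    corrective_action: str     # action to remove or reduce the cause
    responsibility: str        # team or individual owning the corrective action
```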

Recovery or failover approaches

With cloud adoption and cloud-native applications becoming the norm, designing for failure is more important than ever. That is what makes the DfR process so valuable: it lets you incorporate and plan for failure as early as possible in the development cycle. Based on the RTO and RPO defined for the product, you will also need to identify and implement failure and recovery approaches. The following is a brief overview of three failover approaches worth considering:

Backup and redeploy on disaster – This is the most straightforward disaster recovery strategy. Only the primary region has product services running, and data is backed up periodically in line with the defined RPO. The secondary region is not set up for automatic failover, so when a disaster occurs you must spin up all parts of the product in the new region: deploying the product services, restoring the data, and configuring the network. Although this is the most affordable of the multiple-region options, it has the worst RTO and RPO characteristics.

Active/Passive (Warm spare) – The active-passive approach is the choice that many companies favor. It improves the RTO with a relatively small increase in cost over the redeployment pattern. There is again a primary and a secondary region, and all traffic goes to the active deployment in the primary region. The secondary region is better prepared for disaster recovery because the data services run in both regions and stay synchronized. The standby can take two forms: a database-only deployment or a complete light deployment in the secondary region.

Active/Active (Hot spare) – In an active-active approach, the product services and database are fully deployed in both regions and, unlike the active-passive model, both regions receive user traffic. This option yields the quickest recovery time: the product services are already scaled to handle a portion of the load in each region, and network traffic is already configured to use the secondary region. It is also the most expensive approach, but it achieves the best RTO and RPO.
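As a rough illustration of how RTO and RPO drive the choice among these three approaches, here is a hedged Python sketch. The thresholds are purely illustrative assumptions and will differ for every business; the function name is made up for this example.

```python
from datetime import timedelta

def suggest_failover_approach(rto: timedelta, rpo: timedelta) -> str:
    """Map recovery objectives to one of the three approaches described above.
    The thresholds below are illustrative assumptions, not recommendations."""
    if rto <= timedelta(minutes=5) and rpo <= timedelta(minutes=1):
        return "Active/Active (hot spare): both regions serve traffic"
    if rto <= timedelta(hours=1):
        return "Active/Passive (warm spare): synchronized data, standby region"
    return "Backup and redeploy: restore from backups into a new region"

print(suggest_failover_approach(timedelta(minutes=30), timedelta(minutes=15)))
```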

Technical Interview using Visual Studio Code

In this difficult time of Covid-19, working from home has become the norm. Offices are inaccessible, and all meetings have moved online, including technical interviews. Technical interviews are a critical criterion for hiring software engineers; they are primarily used to judge a candidate’s problem-solving skills and critical thinking. In the software industry, a technical interview usually consists of the interviewer posing a puzzle or algorithm problem that the candidate must solve by writing a program.

In online interviews over video/audio calls, it becomes really difficult to conduct such live technical interviews, where the candidate solves the problem in real time while having a conversation with the interviewer. There are commercial off-the-shelf solutions on the market that allow interviewers to conduct live programming interviews, but they have two limitations: first, both interviewer and candidate face a learning curve to get familiar with the platform before they can have a productive interview; second, these platforms carry a recurring cost, which can be an obstacle for smaller companies that don’t interview frequently.

So how about conducting these live technical interviews using an IDE that is hugely popular among industry programmers and comes at no additional cost, i.e. Visual Studio Code? These interviews can be conducted using Visual Studio Code Live Share collaboration sessions. In the subsequent sections of this post, you will see how easy it is to set up a Live Share collaboration session between the interviewer and the candidate. Most importantly, the candidate doesn’t need to install anything; they can participate in the Live Share session using just a browser.

Prerequisite – the interviewer needs to have Visual Studio Code installed, along with the Live Share extension pack.

Step 1

Set up the environment with the problem definition and a basic program structure


Step 2

Schedule the interview by creating a Live Share session


Step 3

Copy and share the Live Share collaboration link with the candidate over email or chat


Step 4

The candidate opens the Live Share collaboration link and joins using the browser option


Step 5

Once the candidate joins the session in the browser, they can collaborate in the same way as from the desktop client. The collaboration happens in real time, i.e. whatever changes the candidate makes in the code, the interviewer can see in real time on their end.


Step 6

The candidate codes the solution to the problem and finally executes/runs it within the browser itself to test it out


Step 7

During the interview, the interviewer also has the option to start an audio call with the candidate directly from within Visual Studio Code, or to collaborate using the chat session within Visual Studio Code.


Step 8

Finally, when the live interview is over, the interviewer needs to end the Live Share session


 

Benefits of using Visual Studio Code Live Share Sessions for Live Interviews

  1. No extra cost associated with conducting live interviews.
  2. Flexibility to conduct the interview in any of the programming languages supported by Visual Studio Code.

Please note – Visual Studio Live Share currently doesn’t support creation of sessions in advance.

Design for Availability – Game of 9s

Recently, in one of my discussions, I heard the statement – “for our solution, we require near 100% availability”. But do we really understand what near 100% means? For me, anything above 99% is near 100%. In reality, though, there is a huge difference between 99% availability and 99.9999% availability.

Let’s look at the definition of availability – “Availability is the percentage of time that the infrastructure, system, or solution remains operational under normal circumstances in order to serve its intended purpose.”

The mathematical formula for availability is:

Percentage of availability = (total elapsed time – sum of downtime) / total elapsed time

That means, for an SLA of 99.999 percent availability (the famous five nines), the yearly service downtime could be as much as 5.256 minutes.
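Using the formula above, a few lines of Python show how quickly the permitted downtime shrinks as you add nines. This simply evaluates the formula over one year; the list of SLA levels is illustrative.

```python
# Annual downtime permitted by an availability SLA:
# downtime = total elapsed time * (1 - availability)
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

for availability in (99.0, 99.9, 99.95, 99.99, 99.999):
    downtime = MINUTES_PER_YEAR * (1 - availability / 100)
    print(f"{availability}% availability -> up to {downtime:,.3f} minutes of downtime per year")
```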


As IT leaders, we should be aware of the differences between the nines and define the requirements properly for the development team. The higher the nines, the higher the operational and development cost.

Another statement I heard during a discussion – “the cloud provider provides 99.95% availability, so our system also provides the same.” Really? That may be true if you are using a SaaS solution from a cloud provider. But if you are building your own solution on a cloud provider’s IaaS or PaaS services, then consider the following two things:

  1. The SLAs defined by cloud providers cover their individual services only. That means a combined SLA needs to be calculated from the cloud services you consume within your solution. We will see how this is calculated in the next section.
  2. Even if you use only PaaS services in your solution, you still own the application and data layers, and any bug or issue in your code will result in unavailability. That also needs to be considered while calculating your solution’s availability.

Combined SLA for consumed cloud services

Suppose you are developing a simple web application using Azure PaaS services, such as Azure App Service and Azure SQL Database. Taken in isolation, these services usually provide something in the range of three to four nines of availability:

  • Azure App Service: 99.95%
  • Azure SQL Database: 99.99%
  • Azure Traffic Manager: 99.99%

However, when these services are combined within an architecture, there is a possibility that any one component could suffer an outage, bringing the overall solution availability lower than that of any individual service.

Services in Serial

In the following example, where App Service and SQL Database are connected in series, each service is a failure mode. There are three possibilities of failure:

  1. App Service may go down, SQL Database may still be up and running
  2. App Service may be up and running, SQL Database may go down
  3. Both App Service and SQL Database may go down together

[Diagram: App Service and SQL Database connected in series]

So, to calculate the combined availability of services connected in series, simply multiply the individual availability percentages, i.e.

Availability of App Service * Availability of SQL Database
= 99.95% * 99.99%
= 99.94%

Observation – the combined availability of 99.94% is lower than the availability of either individual service.
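The same serial calculation in Python, using the SLA figures listed above:

```python
# Services in series: the solution works only if every service works,
# so the combined availability is the product of the individual SLAs.
app_service = 0.9995    # Azure App Service
sql_database = 0.9999   # Azure SQL Database

serial_availability = app_service * sql_database
print(f"{serial_availability:.4%}")  # 99.9400%
```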

Services in Parallel

Now, to make this solution highly available, you can deploy an identical replica of the solution in another region and add Traffic Manager to dynamically redirect traffic to one of the regions. This adds more failure modes, but we will see how it increases the solution’s availability.

As we calculated,

  • Availability across services in Region A = 99.94%
  • Availability across services in Region B (replica of Region A) = 99.94%

Both Region A and Region B are parallel to each other. So, to calculate the combined availability of parallel services, use the following formula:

1 – ((1 – Region A availability) * (1 – Region B availability))
= 1 – ((1 – 99.94%) * (1 – 99.94%))
= 99.9999%

[Diagram: Traffic Manager in front of two parallel regions]

Also observe that Traffic Manager is in series with both parallel regions. So the combined solution availability will be:

Availability of Traffic Manager * Combined availability of both regions
= 99.99% * 99.9999%
= 99.99%

Observation – we were able to increase availability from three nines to four nines by adding a new region in parallel.
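And the parallel calculation in Python, including Traffic Manager in series with the two regions:

```python
# One region = App Service and SQL Database in series.
region = 0.9995 * 0.9999                 # ~99.94%

# Two identical regions in parallel: the solution fails only if both fail.
both_regions = 1 - (1 - region) ** 2     # ~99.9999%

# Traffic Manager sits in series in front of the parallel regions.
overall = 0.9999 * both_regions          # ~99.99%

print(f"{both_regions:.6%}")  # 99.999964%
print(f"{overall:.4%}")       # 99.9900%
```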

Please note, the above is the combined availability of the Azure services you have chosen. It does not include your custom code. Remember the following diagram, which explains what is owned by the cloud provider and what is owned by you, depending on the cloud platform you choose:

[Figure: responsibility split between cloud provider and customer across IaaS, PaaS, and SaaS layers]

Going back to our web application example using App Service and SQL Database, we opted for a PaaS platform. In that case, the availability we calculated covers the runtime through networking layers; it does not include your custom code in the application and data layers. Those layers you still must design for high availability. The following techniques are useful when designing a highly available solution:

  1. Auto-scaling – design the solution to increase and decrease instances based on the active load
  2. Self-healing – dynamically identify failures and redirect traffic to healthy instances
  3. Exponential backoff – implement retries on the requester side; this simple technique increases the reliability of the application and takes care of intermittent failures (see the sketch after this list)
  4. Broker pattern – implement a message-passing architecture using queues, allowing components to be decoupled
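A minimal sketch of the exponential backoff technique mentioned above. The function name and parameters are illustrative; in real code you would retry only on specific, transient exceptions rather than on every error.

```python
import random
import time

def call_with_backoff(operation, max_attempts=5, base_delay=0.5):
    """Retry an operation with exponential backoff and a little jitter.
    Illustrative only; catch specific transient exceptions in real code."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            # Delay doubles on every attempt: 0.5s, 1s, 2s, 4s, ...
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)
```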

The price of availability

Please remember one thing: availability has a cost associated with it. The more available your solution needs to be, the more complexity is required, and the more expensive it will be.

[Figure: the price of availability]

A highly available solution requires a high degree of automation and self-healing capability, which in turn requires significant development, testing, and validation. That takes time, money, and the right resources, all of which impact cost.

Finally, analyzing your system and calculating its theoretical availability will help you understand your solution’s capabilities and make the right design decisions. However, actual availability is highly affected by your ability to react to failures and recover the system, either manually or through self-healing processes.

Non-Functional Requirements – the most neglected aspect of Software Development

Everyone working in the software industry knows what non-functional requirements (NFRs) are, yet I have seen many cases where a solution is designed, developed, and delivered without considering key NFRs, with very poorly defined NFRs, or with NFRs defined very late in the development cycle. Ultimately, either the solution fails or the business spends extra time and budget fixing it to meet these missing non-functional requirements.

In this article, we are going to discuss three important things about NFRs:

  • Why and When to capture NFRs
  • How to define measurable and testable NFRs
  • NFR trade-off matrix and its importance

Why NFRs and When to capture NFRs?

A non-functional requirement does not describe what the system will do, but how the system will do it – for example, performance requirements, design constraints, and scalability requirements.

Missing NFRs have a direct impact on the adoption of the system, for example:

  • The system does not scale to the customer’s needs; it slows down and becomes unresponsive
  • Security breach of confidential data
  • The system is not available when it is needed most, resulting in a direct impact on the business
  • Disaster recovery and backups are not configured, resulting in data loss
  • And many more…

Non-functional requirements (NFRs) should be gathered as early as possible in the development cycle, preferably along with functional requirements.

One more question that is asked very frequently – whom should I contact to define NFRs, the customer’s IT team or the customer’s business folks? The answer is – BOTH.

  • The IT team will provide details like the limitations of the current IT infrastructure and portability requirements, such as portability across different cloud platforms like AWS, Azure, etc.
  • The business will provide details related to performance and scalability, such as how much user/market growth they expect in the future, how the application can change their business and user interaction, etc.

So, contacting both business and IT is of the utmost importance when capturing NFRs.

Approach to define NFRs

Converting vague ideas about quality into something measurable is both an art and a science. Start by identifying which quality attribute you want to elaborate. Next, identify the metrics that will be used to measure that quality attribute. Once you have a measurable metric, use it to define a requirement that is both measurable and fulfills the customer’s needs.

The following are a couple of examples elaborating this process.

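The specific figures below are illustrative rather than prescriptive; the point is the progression from quality attribute to metric to testable requirement.

  • Performance – metric: 95th-percentile response time; requirement: “95% of page requests complete within 2 seconds under a load of 1,000 concurrent users.”
  • Availability – metric: uptime percentage per month; requirement: “the order-processing service is available 99.9% of the time, measured monthly, excluding planned maintenance windows.”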

Trade-Off among NFRs

Assume you are developing an application for an enterprise that can be accessed only by its employees. You want these employees to be productive and able to access the application from anywhere. But security is also important: only company employees should be able to access the application and its data. So security and usability are both important, yet there is potentially a trade-off to be made here. It would be convenient to pick up any device and access the application without a password; alternatively, the application could be secured by requiring two-factor authentication every time it is accessed. These requirements contradict each other.

In this scenario, a trade-off matrix helps identify and communicate these trade-offs so that you can deal with them intelligently. The following is an example of a trade-off matrix among five NFRs,

[Table: example trade-off matrix among five NFRs]

Reading this table from left to right, you can see there is a negative relationship between security and usability. That means when there is a conflict between security and usability, preference is given to security. It doesn’t mean you give no weight to usability; ask yourself – how can you maximize usability without compromising security? For the multi-factor authentication example above, solutions like supporting fingerprint or face recognition as the second factor can improve usability without compromising security.

I highly recommend preparing this trade-off matrix during the requirements phase itself and reviewing it with your stakeholders, so that everyone understands the trade-offs and there are no surprises during or at the end of the engagement. I am sure you will need to refer back to this trade-off matrix multiple times during your development phases.

 

In short, defining measurable and effective NFRs requires some thought and creativity. I highly recommend planning to define NFRs early in the development cycle and including them in all phases of your software development, from requirements gathering to design to development and finally all the way to testing.