Design for Reliability – driven by RTO and RPO

Reliability, maintainability, and availability (RAM) are important aspects of any software system. We discussed availability in detail in the last post; in this post, let's talk about reliability.

Let's first look at the definition of reliability: "Reliability is the probability that a product will continue to work normally over a specified interval of time, under specified conditions."

The recent disruption of Garmin services by a ransomware attack affected not only sports activities but also critical services such as flight navigation and routing. The incident highlighted the importance of reliability in software development. Failures are inevitable; reliable solutions are those that can recover from failures with minimal impact on users or consumer services.

Reliability and maintenance effort are inversely related: a more reliable product requires less time and effort to maintain. Reliability versus maintainability is therefore a common NFR trade-off.

Reliability Metrics

Traditionally, system reliability was measured through the MTTF (mean time to failure) and MTTR (mean time to restoration) metrics. MTTF measures the expected time until the next failure, and MTTR measures how quickly the system can be restored after a failure.
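The relationship between these two metrics and availability can be made concrete. A minimal sketch, using the standard steady-state formula availability = MTTF / (MTTF + MTTR); the numbers are hypothetical:

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is operational (steady state)."""
    return mttf_hours / (mttf_hours + mttr_hours)

# A system that fails every 1,000 hours on average and takes
# 1 hour to restore is available ~99.9% of the time.
print(round(availability(1000.0, 1.0), 5))
```

Note that improving MTTR (restoring faster) raises availability just as effectively as improving MTTF (failing less often), which is why both metrics matter.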

Nowadays, another pair of metrics is commonly used to define reliability requirements: RTO (recovery time objective) and RPO (recovery point objective).

  • Recovery Time Objective (RTO) defines how long the business can afford to be down after a disaster occurs; if the downtime exceeds the RTO, the business suffers irreparable harm. The objective is to restore services to an operational state within this time limit after the disaster occurs.
  • Recovery Point Objective (RPO) defines the maximum tolerable amount of data loss, expressed as time. It bounds the interval that may elapse between the last data backup and a disaster without causing significant damage to the business.
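In practice, both objectives become pass/fail criteria for disaster-recovery drills. A hypothetical sketch, with made-up thresholds and measurements:

```python
RTO_MINUTES = 60   # assumed: services must be restored within 1 hour
RPO_MINUTES = 15   # assumed: at most 15 minutes of data may be lost

def drill_meets_objectives(restore_minutes: float, data_loss_minutes: float) -> bool:
    """True if a measured recovery satisfies both the RTO and the RPO."""
    return restore_minutes <= RTO_MINUTES and data_loss_minutes <= RPO_MINUTES

print(drill_meets_objectives(45, 10))  # restored in 45 min, lost 10 min of data
print(drill_meets_objectives(45, 20))  # data loss exceeds the 15-minute RPO
```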

Both RTO and RPO are business metrics; getting them defined early in the development cycle helps you design and architect the solution better. In the next section, let's look at a process that can help you design solutions for reliability.

Design for Reliability (DfR)

DfR (design for reliability) is a process for ensuring that a system performs its function within a given environment over its expected lifetime. The DfR process has two primary benefits:

  1. Assurance of reliability – the DfR process embeds specific activities within the development lifecycle that help ensure reliability is baked into the final product.
  2. Cost control and profit preservation – it helps keep the budget under control, which ultimately helps preserve market share.

DfR is divided into five key phases:

  1. Define – define clear, quantifiable reliability requirements that meet business and end-user needs. Many factors play a role in defining them, such as cost, customer expectations, competitive analysis, and benchmarks. Once defined, the requirements are translated further into design, development, validation, monitoring, and disaster-recovery requirements.
  2. Identify and Design – whether for a new product or an upgrade, the purpose of this phase is to identify key reliability risks, prioritize them, and detail corrective actions that mitigate those risks through design decisions. One tool that really helps in identifying these risks is DFMEA (design failure mode and effects analysis), elaborated later in this post.
  3. Analyze – design changes or new design decisions arising from DFMEA are analyzed by evaluating them against previous failure data or against alternative design concepts. The focus is to explore, discover, and reveal design weaknesses so that design changes can improve product reliability.
  4. Verify – this phase starts when the design changes or new designs are implemented by the development team. The changes are validated through load testing, performance testing, stress testing, and DR (disaster recovery) drills. During this testing, the identified failure scenarios are simulated and the corrective actions verified. If any test fails, go back to the design phase and take corrective action to mitigate the failure.
  5. Sustain (monitor and control) – once all changes are released to production, the product is continuously monitored for failures, whether by monitoring system performance, through synthetic testing, or by watching for degradation of health parameters. This phase is important because it lets you measure product reliability in production and improve it in the future. If a disaster does happen, measure the actual RTO and RPO; comparing them with the values defined at the start of the project measures the reliability of the product.
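The Sustain phase's synthetic testing can be as simple as a periodic health probe whose failure streaks approximate observed downtime. An illustrative sketch; the probe here consumes simulated health-check results rather than hitting a real endpoint:

```python
from typing import Iterable

def longest_outage(samples: Iterable[bool]) -> int:
    """Longest run of consecutive failed probes (a proxy for downtime).

    Each sample is one probe result taken at a fixed interval,
    e.g. once per minute; False marks a failed health check.
    """
    longest = current = 0
    for healthy in samples:
        current = 0 if healthy else current + 1
        longest = max(longest, current)
    return longest

# One probe per minute: a 3-minute outage mid-window.
probes = [True, True, False, False, False, True, True]
print(longest_outage(probes))  # → 3
```

Comparing this observed outage length against the defined RTO is one way to track whether production reliability is meeting its specification.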


Design failure mode and effects analysis (DFMEA) is a systematic activity used to recognize and evaluate potential system, product, or process failures. DFMEA identifies the effects and outcomes of these failures and defines actions to mitigate them.

Components of a DFMEA template:

  1. Item – the component or sub-system of the product to be analyzed; it consists of one or many functions.
  2. Function – a function within the product/item; it consists of one or many requirements.
  3. Requirement – a requirement of the function; it consists of one or many potential failure modes.
  4. Potential Failure Mode – the way the component may fail to meet the requirement; it results in one or many potential effects.
  5. Potential Effects – how the customer or consumer services are affected by the failure.
  6. Severity – a ranking of the failure based on its most severe potential effect.
  7. Class – a categorization of the failure as high- or low-risk impact.
  8. Cause – the reason for the failure. There may be multiple causes for a single failure.
  9. Control Method
    • Prevention Control – a design or architecture action to prevent the potential failure from occurring.
    • Detection Control – a design or architecture action to detect the failure.
  10. Correction
    • Corrective Action – an action to remove or reduce the chance of the failure mode's cause.
    • Responsibility – the team or individual responsible for implementing the recommended corrective action.
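A worksheet row built from these components can be modeled directly. A minimal sketch; the risk-priority number (severity × occurrence × detection) is a common DFMEA convention for ranking failure modes, though the template above tracks severity and class only, and all field values here are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class DfmeaRow:
    item: str
    function: str
    requirement: str
    failure_mode: str
    potential_effect: str
    severity: int      # 1 (minor) .. 10 (hazardous)
    occurrence: int    # 1 (rare) .. 10 (frequent)
    detection: int     # 1 (certain to detect) .. 10 (undetectable)
    corrective_action: str = ""
    owner: str = ""

    @property
    def rpn(self) -> int:
        """Risk priority number used to rank failure modes for action."""
        return self.severity * self.occurrence * self.detection

row = DfmeaRow(
    item="Order API", function="Place order", requirement="Respond within 2 s",
    failure_mode="Database connection pool exhausted",
    potential_effect="Orders time out",
    severity=8, occurrence=4, detection=3,
    corrective_action="Add pool monitoring and a circuit breaker",
    owner="Platform team",
)
print(row.rpn)  # → 96
```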

DFMEA should be a living document during the development process and should be kept updated as the product life cycle progresses.

Recovery or failover approaches

With cloud adoption and cloud-native applications becoming the norm, designing for failure is increasingly important. That makes the DfR process valuable for any development effort, so that you can plan for failure as early as possible in the development cycle. Based on the RTO and RPO defined for the product, you will also have to identify and implement failure and recovery approaches. Following is a brief overview of three failover approaches to consider:

Backup and redeploy on disaster – this is the most straightforward disaster recovery strategy. In this approach, only the primary region has product services running. Data is backed up on a periodic basis as per the defined RPO. The secondary region is not set up for automatic failover, so when a disaster occurs you must spin up all parts of the product in the new region: deploying the product services, restoring the data, and configuring the network. Although this is the most affordable of the multiple-region options, it has the worst RTO and RPO characteristics.
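In this strategy, the backup interval bounds the worst-case data loss, so it must be derived from the RPO. A hypothetical sketch; the safety margin is an assumption to leave headroom for backup duration and transfer delays:

```python
def max_backup_interval(rpo_minutes: float, safety_margin: float = 0.8) -> float:
    """Largest backup interval that keeps worst-case data loss within the RPO.

    The margin (assumed here, not a standard) reserves headroom for the
    time the backup itself takes to complete and replicate.
    """
    return rpo_minutes * safety_margin

# With a 60-minute RPO, schedule backups at most every 48 minutes.
print(max_backup_interval(60))
```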

Active/Passive (warm spare) – the active-passive approach is the choice that many companies favor. It improves RTO with a relatively small increase in cost over the redeployment pattern. In this scenario there are again a primary and a secondary region. All traffic goes to the active deployment in the primary region. The secondary region is better prepared for disaster recovery because the data services run in both regions and are kept in sync. The standby can take two forms: a database-only deployment, or a complete light deployment in the secondary region.

Active/Active (hot spare) – in an active-active approach, the product services and database are fully deployed in both regions. Unlike the active-passive model, both regions receive user traffic. This option yields the quickest recovery time: the product services are already scaled to handle a portion of the load in each region, and network traffic configurations already route to the secondary region. It is also the most expensive approach, but it achieves the best RTO and RPO.
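The routing behavior that gives active/active its fast recovery can be sketched in a few lines. An illustrative toy, not a real traffic manager; region names and health states are made up:

```python
def pick_region(regions: dict, request_id: int) -> str:
    """Spread requests across healthy regions; fail if none are healthy."""
    healthy = sorted(name for name, up in regions.items() if up)
    if not healthy:
        raise RuntimeError("no healthy region available")
    return healthy[request_id % len(healthy)]

regions = {"east-us": True, "west-europe": True}
print(pick_region(regions, 0))  # both healthy: traffic alternates between regions
regions["east-us"] = False      # simulate a regional outage
print(pick_region(regions, 0))  # all traffic now flows to the surviving region
```

Because both regions were already serving traffic, "failover" is just the routing layer dropping the unhealthy region, which is why RTO approaches zero.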
