The Art of Change Management: A Product Manager’s Perspective

I am sure you have heard this many times in your professional journey – “Change is the only constant.” The idea is not new; it goes back to the ancient Greek philosopher Heraclitus some 2,500 years ago – “There is nothing permanent except change.” Though the quote is roughly twenty-five centuries old, it holds true today and will remain relevant forever.

“Change is the only constant!”

As a Product Manager, you are constantly faced with evolving market demands, technological advancements, and organizational shifts. Effective change management is crucial for navigating these changes successfully and ensuring the long-term success of your products. In this blog post, you will learn about the importance of change management, different techniques, and strategies to identify and influence supporters and resisters within the organization.

Why Change Management is Important and Should be Planned Way Ahead in Product Development

Change management refers to the structured approach to transitioning individuals, teams, and organizations from the current state to a desired future state. It plays a pivotal role in product development for several reasons,

  • Stakeholder Engagement and Buy-in: By involving key stakeholders from the beginning, you can build consensus, address concerns, and cultivate champions who will advocate for and drive the change forward, fostering buy-in, collaboration, and support across the organization.
  • Risk Mitigation: Change can introduce risks such as resistance from stakeholders, disruption in workflows, and potential errors in implementation. Early planning enables you to identify potential challenges, risks, and barriers to change. By anticipating these issues, you can develop proactive strategies to mitigate risks and ensure smoother transitions.
  • Communication: Early change management planning will enable you to develop and execute communication strategies that keep stakeholders informed, engaged, and aligned with the change initiatives.
  • Training: Timely training will ensure that employees are prepared to embrace new processes, technologies, or ways of working, reducing resistance and accelerating adoption.

Different Techniques of Change Management

Below are a few basic techniques of change management that Product Managers can adopt:

Top-Down Approach

The top-down approach involves senior leadership driving change initiatives and cascading them down through the organization. This approach is effective for implementing strategic changes and ensuring alignment with organizational goals. The benefit of a top-down approach is that it ensures alignment and faster adoption, but it carries the risk of resistance.

Example: A Product Manager wants to introduce a new agile development methodology across the organization. Senior leadership champions the initiative, provides necessary resources, and communicates the benefits to the entire organization, fostering buy-in and adoption.

Open-Source Approach

The open-source, or bottom-up, approach encourages collaboration, transparency, and participation from all levels of the organization. It empowers employees to contribute ideas, provide feedback, and take ownership of change initiatives. The benefit of the open-source approach is that it fosters ownership and creativity, but reaching alignment across the board can be a very time-consuming process.

Example: A Product Manager initiates a collaborative workshop where cross-functional teams brainstorm ideas to improve product features. Employees from different departments collaborate, share insights, and co-create solutions, leading to innovative outcomes.

Hybrid Approach

A hybrid approach combines elements of both top-down and open-source techniques, leveraging the strengths of each to drive successful change initiatives.

Example: A Product Manager leads a strategic initiative to adopt a new project management tool. Senior leadership provides direction and support, while teams across the organization are involved in the selection process, customization, and implementation, ensuring alignment with user needs and organizational requirements.

Beyond the techniques listed above, there are also a few change management planning frameworks you can leverage, such as:

Kotter’s 8-Step Change Model

Developed by John Kotter, this model outlines eight sequential steps for leading change, including creating urgency, building a guiding coalition, and anchoring changes in the organizational culture. The benefit of Kotter’s 8-Step Change Model is that it provides a comprehensive roadmap for leading change, but it requires sustained commitment and may be perceived as a rigid process. The eight steps are:

  • Create urgency.
  • Build a guiding coalition.
  • Form a strategic vision and initiatives.
  • Enlist a volunteer army.
  • Enable action by removing barriers.
  • Generate short-term wins.
  • Sustain acceleration.
  • Institute changes.

Example: A Product Manager identifies a need to improve product quality and customer satisfaction. They create a sense of urgency by highlighting market trends and customer feedback, form a cross-functional team to lead the initiative, and implement quality improvement processes, reinforcing the changes through continuous monitoring and feedback.

Lewin’s Change Management Model

This model proposes a three-stage process for managing change: Unfreeze (preparing for change), Change (implementing new processes or behaviors), and Refreeze (stabilizing the change and embedding it into the organizational culture). The benefit of Lewin’s Change Management Model is that it offers a straightforward and intuitive framework for managing change, but it requires careful planning and execution to ensure each stage is effectively managed.

Unfreeze – Change – Refreeze

Example: A Product Manager introduces a new product development framework. They communicate the need for change (Unfreeze), facilitate workshops and training sessions to introduce the new framework (Change), and monitor performance and gather feedback to ensure the new processes are effectively adopted and integrated into the organization (Refreeze).

Identifying Supporters and Resisters Early in the Process

Identifying supporters and resisters early in the change process is crucial for effective change management. Supporters are individuals or groups who are enthusiastic about the change and willing to actively contribute to its success. Resisters, on the other hand, are skeptical or opposed to the change and may resist adoption or implementation.

Influencing Strategies to Convert Resisters into Supporters

  • Support and Empathy: Show empathy towards resisters’ concerns, try to understand implicit reasons for their resistance and provide emotional support to help them navigate through the change process.
  • Education and Training: Provide resisters with the necessary knowledge, skills, and training to adapt to the change effectively and alleviate fears of the unknown.
  • Open Communication: Engage in open and honest dialogues with resisters to understand their concerns, address misconceptions, and communicate the benefits of the change.
  • Involvement and Participation: Involve resisters in the change process by seeking their input, incorporating their ideas, and giving them a sense of ownership and control over the change initiative.

By identifying supporters and resisters early, you can tailor your change management strategies to address specific needs, alleviate concerns, and foster a collaborative environment favorable to successful change adoption.

Conclusion

Change management is a critical competency for Product Managers to master. Remember, change is not just about implementing new processes or technologies; it’s about leading people through transitions, fostering collaboration, and creating a culture that embraces continuous improvement and adaptation. As Product Managers, let’s embrace change management as a strategic imperative and empower our organizations to thrive in an ever-evolving marketplace.

Design for Reliability – driven by RTO and RPO

Reliability, availability, and maintainability (RAM) are important aspects of any software development. We discussed availability in detail in the last post; in this post, let’s talk about reliability.

Let’s first look at the definition of reliability – “Reliability is the probability that a product will continue to work normally over a specified interval of time, under specified conditions.”

The recent disruption of Garmin services by a ransomware attack affected not only sports activities but also critical services like flight navigation and routing. The attack highlighted the importance of reliability in software development. Failures are inevitable; reliable solutions are those that can recover from failures with minimal impact on users or consumer services.

Reliability and maintenance effort are inversely proportional: a more reliable product requires less time and effort for maintenance. So, there is often an NFR (non-functional requirement) trade-off between reliability and maintainability.

Reliability Metrics

Traditionally, system reliability was measured through the MTTF (mean time to failure) and MTTR (mean time to restoration) metrics. MTTF measures the expected time until the next failure, and MTTR measures how quickly the system can be restored after a failure.

Nowadays, another pair of related metrics is used to define reliability requirements – RTO (recovery time objective) and RPO (recovery point objective).

  • Recovery Time Objective (RTO) defines how long your business can afford to be down after a disaster occurs; if the downtime crosses the defined RTO limit, the business will suffer irreparable harm. The objective is to recover services to an operational state within this defined RTO after the disaster occurs.
  • Recovery Point Objective (RPO) defines the maximum tolerable amount of data that can be lost, expressed as the maximum time interval that can elapse between the last data backup and the disaster without causing significant damage to the business. Both metrics are illustrated in the sketch after this list.
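
To make these two metrics concrete, here is a minimal Python sketch (with hypothetical target values) of how you might check whether a recovery event stayed within the RTO and RPO agreed with the business:

```python
from datetime import datetime, timedelta

# Hypothetical targets agreed with the business, for illustration only.
RTO = timedelta(hours=4)      # services must be operational again within 4 hours
RPO = timedelta(minutes=30)   # at most 30 minutes of data may be lost

def check_recovery(last_backup: datetime, disaster: datetime, restored: datetime) -> None:
    downtime = restored - disaster             # how long the service was unavailable
    data_loss_window = disaster - last_backup  # data written after the last backup is lost
    print(f"Downtime: {downtime} (RTO met: {downtime <= RTO})")
    print(f"Data loss window: {data_loss_window} (RPO met: {data_loss_window <= RPO})")

# Example: backup at 01:00, disaster at 01:20, services restored at 04:50.
check_recovery(
    last_backup=datetime(2020, 8, 1, 1, 0),
    disaster=datetime(2020, 8, 1, 1, 20),
    restored=datetime(2020, 8, 1, 4, 50),
)
```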

Both RTO and RPO are business metrics; getting them defined early in the development cycle will help you design and architect the solution better. In the next section, let’s look at a process that can help you design solutions for reliability.

Design for Reliability (DfR)

DfR (design for reliability) is a process for ensuring that a system performs its function within a given environment over its expected lifetime. The DfR process has two primary benefits:

  1. Assurance of reliability – the DfR process embeds specific activities within the development lifecycle, which helps ensure reliability is baked into the final product.
  2. Cost control and profit preservation – it helps keep the budget under control, which ultimately helps preserve market share.

DfR is divided into five key phases:

  1. Define – define clear and quantifiable reliability requirements that meet business and end-user needs. Many factors play a role in defining reliability requirements, such as cost, customer expectations, competitive analysis, and benchmarks. Once defined, the requirements are translated further into design, development, validation, monitoring, and disaster recovery requirements.
  2. Identify and Design – whether it is a new product development or an upgrade project, the purpose of this phase is to identify key reliability risk items, prioritize them, and detail corrective actions that mitigate those risks through design decisions. One tool that really helps in identifying these risks is DFMEA (design failure mode and effects analysis), elaborated in the next section of this post.
  3. Analyze – design changes or new design decisions coming out of the DFMEA are analyzed by evaluating them against previous failure data or against alternative design concepts. The focus is to explore, discover, and reveal design weaknesses so that design changes can improve product reliability.
  4. Verify – this phase starts when the design changes or new designs are implemented by the development team. The changes are validated through load testing, performance testing, stress testing, and DR (disaster recovery) drills. During these tests, the identified failure scenarios are simulated and the corrective actions are verified. If any test fails, go back to the design phase and take corrective actions to mitigate those failures.
  5. Sustain (monitor and control) – once all changes are released to production, the product is continuously monitored for failures, whether by monitoring system performance, through synthetic testing, or by watching for degradation of health parameters. This is an important phase, as it helps you measure product reliability in production and improve it in the future. If a disaster does happen, measure the actual RTO and RPO; that will tell you how reliable the product is against the specifications defined at the start of the project.

DFMEA

Design failure mode and effects analysis (DFMEA) is a systematic activity used to recognize and evaluate potential system, product, or process failures. DFMEA identifies the effects and outcomes of these failures and defines actions to mitigate them.

Components of a DFMEA template:

  1. Item – component or sub-system of the product to be analyzed; it will consist of one or many functions.
  2. Function – function within the product/item; it will consist of one or many requirements.
  3. Requirement – requirement of the function; it will consist of one or many potential failure modes.
  4. Potential Failure Mode – the way the component may fail to meet the requirement; it will result in one or many potential effects.
  5. Potential Effects – the way the customer or consumer services are affected by the failure.
  6. Severity – ranking of the failure based on its most severe potential effect.
  7. Class – failure categorization based on high or low risk impact.
  8. Cause – the reason for the failure. There can be multiple causes for a single failure.
  9. Control Method
    • Prevention Control – design or architecture action to prevent the potential failure from occurring.
    • Detection Control – design or architecture action to detect the failure.
  10. Correction
    • Corrective Action – action to remove or reduce the chance of the cause of the failure mode.
    • Responsibility – team or individual responsible for implementing the recommended corrective action.

DFMEA should be a living document during the development process and should be kept updated as the product life cycle progresses.
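
To make the template tangible, here is a minimal sketch of one DFMEA row modeled as a Python dataclass; the field names mirror the components listed above, and the example values are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class DfmeaRow:
    """One row of a DFMEA worksheet, mirroring the template components above."""
    item: str                # component or sub-system being analyzed
    function: str            # function within the item
    requirement: str         # requirement of the function
    failure_mode: str        # how the component may fail to meet the requirement
    potential_effect: str    # how customers or consumer services are affected
    severity: int            # ranking of the most severe potential effect (e.g. 1-10)
    cause: str               # reason for the failure
    prevention_control: str  # design action to prevent the failure from occurring
    detection_control: str   # design action to detect the failure
    corrective_action: str   # action to remove or reduce the cause
    responsibility: str      # team or individual owning the corrective action

# Hypothetical example: an order service losing writes during a database failover.
row = DfmeaRow(
    item="Order service",
    function="Persist customer orders",
    requirement="No order is lost during a database failover",
    failure_mode="Writes fail while the database is failing over",
    potential_effect="Customer orders are silently dropped",
    severity=9,
    cause="No retry or queuing on write failures",
    prevention_control="Buffer writes in a durable queue and replay after failover",
    detection_control="Alert on elevated write error rate",
    corrective_action="Introduce a message queue in front of the database writes",
    responsibility="Platform team",
)
print(row.failure_mode, "->", row.corrective_action)
```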

Recovery or failover approaches

With cloud adoption and cloud-native applications becoming the norm, designing for failure is more important than ever. That is what makes the DfR process so valuable for any development effort: it lets you incorporate and plan for failure as early as possible in the development cycle. Based on the RTO and RPO defined for the product, you will also have to identify and implement failover and recovery approaches for your product. Following is a brief overview of three approaches to consider:

Backup and redeploy on disaster – This is the most straightforward disaster recovery strategy. In this approach, only the primary region has the product services running. Data is backed up on a periodic basis as per the defined RPO. The secondary region is not set up for automatic failover, so when a disaster occurs, you must spin up all the parts of the product services in the new region. This includes setting up new product services, restoring the data, and configuring the network. Although this is the most affordable of the multiple-region options, it has the worst RTO and RPO characteristics.

Active/Passive (Warm spare) – The active-passive approach is the choice that many companies favor. This pattern improves the RTO with a relatively small increase in cost over the redeployment pattern. In this scenario, there is again a primary and a secondary region. All traffic goes to the active deployment in the primary region, while the secondary region is better prepared for disaster recovery because the data services are running in both regions and kept in sync. The standby setup comes in two variations: a database-only deployment or a complete light deployment in the secondary region.

Active/Active (Hot spare) – In an active-active approach, the product services and database are fully deployed in both regions. Unlike the active-passive model, both regions receive user traffic. This option yields the quickest recovery time: the product services are already scaled to handle a portion of the load in each region, and network traffic configurations are already enabled to use the secondary region. It is also the most expensive approach, but it achieves the best RTO and RPO.

Technical Interview using Visual Studio Code

In this difficult time of Covid-19, working from home has become the norm. Offices are inaccessible, and all meetings have moved online, including technical interviews. Technical interviews are a critical part of hiring software engineers; they are primarily used to judge a candidate’s problem-solving skills and critical thinking abilities. In the software industry, a technical interview usually consists of the interviewer posing a puzzle or algorithm problem that the candidate has to solve by writing a program.

In online interviews over video/audio calls, it becomes really difficult to conduct such live technical interviews, where the candidate solves the problem in real time while talking with the interviewer. There are commercial off-the-shelf solutions on the market that allow interviewers to conduct live programming interviews, but they come with two limitations: first, both interviewer and candidate face a learning curve to get familiar with the platform before they can have a productive interview; second, those platforms carry a recurring cost, which can be an obstacle for smaller companies where interviews are not conducted frequently.

So how about conducting these live technical interviews using an IDE that is already popular with industry programmers and comes at no additional cost – Visual Studio Code? These interviews can be conducted using Visual Studio Code Live Share collaboration sessions. In the following sections, you will see how easy it is to set up a Live Share collaboration session between interviewer and candidate. Most importantly, the candidate doesn’t need to install anything; they can join the Live Share session from a browser.

Prerequisite – the interviewer needs to have Visual Studio Code installed, along with the Live Share extension pack.

Step 1

Set up the environment with the problem definition and a basic program structure.
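
As an illustration, the starter file might look something like the sketch below: a short problem statement, a function stub for the candidate to fill in, and a couple of quick checks. The problem and names here are purely hypothetical examples, not something Live Share requires:

```python
# Problem: given a list of integers, return the length of the longest
# strictly increasing contiguous run. Example: [1, 2, 2, 3, 4] -> 3.

def longest_increasing_run(numbers: list) -> int:
    # The candidate implements this during the interview.
    raise NotImplementedError

if __name__ == "__main__":
    # Quick checks the candidate can run from the Live Share session.
    print(longest_increasing_run([1, 2, 2, 3, 4]))  # expected: 3
    print(longest_increasing_run([5, 4, 3]))        # expected: 1
```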

Step 2

Schedule the interview by creating a Live Share session.

Step 3

Copy and share the Live Share collaboration link with the candidate over email or chat.

Step 4

The candidate accesses the Live Share collaboration link and joins using the browser option.

Step 5

Once the candidate joins the session in the browser, they can collaborate in the same way as from the desktop client. The collaboration happens in real time, i.e. whatever changes the candidate makes to the code, the interviewer sees on their end as they happen.

Step 6

The candidate codes the solution to the problem, then executes/runs it within the browser itself to test it out.

Step 7

During the interview, the interviewer also has the option to start an audio call with the candidate directly from within Visual Studio Code, or to collaborate using the chat session within Visual Studio Code.

Step 8

Finally, when the live interview is over, the interviewer needs to end the Live Share session.


Benefits of using Visual Studio Code Live Share Sessions for Live Interviews

  1. No extra cost associated with conducting live interviews.
  2. Flexibility to conduct the interview in any of the programming languages supported by Visual Studio Code.

Please note – Visual Studio Live Share currently doesn’t support creation of sessions in advance.

Design for Availability – Game of 9s

Recently, in one of my discussions, I heard the statement – “for our solution, we require near-100% availability”. But do we really understand what “near 100%” means? To me, anything above 99% sounds near 100. In reality, though, there is a huge difference between 99% availability and 99.9999% availability.

Let’s look at the definition of availability – “Availability is the percentage of time that the infrastructure, system, or solution remains operational under normal circumstances in order to serve its intended purpose.”

The mathematical formula for Availability is: Percentage of availability = (total elapsed time – sum of downtime)/total elapsed time

That means, for an SLA of 99.999 percent availability (the famous five nines), the yearly service downtime could be as much as 5.256 minutes.
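
The arithmetic behind that number is straightforward to reproduce. Here is a small Python sketch that converts an availability SLA into the maximum allowed downtime per year, assuming a 365-day year as in the figure above:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a 365-day year

def max_yearly_downtime_minutes(availability_percent: float) -> float:
    """Maximum downtime per year permitted by a given availability SLA."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)

for sla in (99.0, 99.9, 99.95, 99.99, 99.999):
    print(f"{sla}% -> {max_yearly_downtime_minutes(sla):.3f} minutes/year")
# 99.999% works out to 5.256 minutes/year, the five-nines figure above.
```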

As IT leaders, we should be aware of the differences between the nines and define requirements properly for the development team. The higher the nines, the higher the operational and development cost.

Another statement I heard during the discussion – “the cloud provider mostly provides 99.95% availability, so our system also provides the same.” Really? That may be true if you are using a SaaS solution from a cloud provider. But if you are developing your own solution on top of a cloud provider’s IaaS or PaaS services, then consider the following two things:

  1. The SLAs defined by cloud providers cover their individual services only. That means a combined SLA needs to be calculated based on the cloud services you have consumed within your solution. We will see how this is calculated in the next section.
  2. Even if you are using only PaaS services in your solution, you still own the application and data layers; any bug or issue in your code will result in unavailability. That also needs to be considered while calculating your solution’s availability.

Combined SLA for consumed cloud services

Suppose you are developing a simple web application using Azure PaaS services, such as Azure App Service and Azure SQL Database. Taken in isolation, these services usually provide something in the range of three to four nines of availability:

  • Azure App Service: 99.95%
  • Azure SQL Database: 99.99%
  • Azure Traffic Manager: 99.99%

However, when these services are combined within an architecture, there is a possibility that any one component could suffer an outage, bringing the overall solution availability lower than the availability of the individual services.

Services in Serial

In the following example, where the App Service and SQL Database are connected in series, each service is a failure mode. There are three possible failure scenarios:

  1. App Service may go down, SQL Database may still be up and running
  2. App Service may be up and running, SQL Database may go down
  3. Both App Service and SQL Database may go down together

[Diagram: App Service and SQL Database connected in series]

So, to calculate the combined availability of serially connected services, simply multiply the individual availability percentages, i.e.

Availability of App Service * Availability of SQL Database = 99.95% * 99.99% ≈ 99.94%

Observation – the combined availability of 99.94% is lower than the availability of either individual service.
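
The same multiplication can be written as a small Python helper, which the parallel example below builds on:

```python
def serial_availability(*availabilities: float) -> float:
    """Combined availability of services in series (as fractions, e.g. 0.9995)."""
    combined = 1.0
    for a in availabilities:
        combined *= a
    return combined

app_service, sql_database = 0.9995, 0.9999
region = serial_availability(app_service, sql_database)
print(f"{region:.4%}")  # ~99.94%
```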

Services in Parallel

Now, to make this solution highly available, you can deploy a replica of the solution in another region and add Traffic Manager to dynamically redirect traffic to one of the regions. This adds more failure modes, but we will see how it increases the overall solution availability.

As we calculated above:

  • Availability across services in Region A = 99.94%
  • Availability across services in Region B (replica of Region A) = 99.94%

Region A and Region B run in parallel with each other. So, to calculate the combined availability of parallel services, use the following formula:

1 – ((1 – Region A availability) * (1 – Region B availability)) = 1 – ((1 – 99.94%) * (1 – 99.94%)) ≈ 99.9999%

[Diagram: two regions deployed in parallel behind Traffic Manager]

Also observe that Traffic Manager is in series with the two parallel regions. So the combined solution availability will be:

Availability of Traffic Manager * Combined availability of both regions = 99.99% * 99.9999% ≈ 99.99%

Observation – we are able to increase availability from three nines to four nines by adding a new region in parallel.
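
Continuing the sketch from the serial example, the parallel formula and the end-to-end numbers work out as follows:

```python
def parallel_availability(*availabilities: float) -> float:
    """Combined availability of redundant deployments in parallel."""
    combined_unavailability = 1.0
    for a in availabilities:
        combined_unavailability *= (1 - a)
    return 1 - combined_unavailability

region = 0.9994            # serial result for one region (App Service * SQL Database)
traffic_manager = 0.9999

both_regions = parallel_availability(region, region)  # ~99.99996%
overall = traffic_manager * both_regions              # Traffic Manager sits in series
print(f"Both regions: {both_regions:.5%}, overall: {overall:.2%}")
```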

Please note, the above is the combined availability of the Azure services you have chosen. It doesn’t include your custom code. Remember the following diagram, which explains what is owned by the cloud provider and what is owned by you, based on the cloud platform you choose:

[Diagram: layers owned by the cloud provider vs. by you, across IaaS, PaaS, and SaaS]

Going back to our web application example using App Service and SQL Database, we have opted for a PaaS platform. In that case, the availability we calculated covers the runtime through networking layers; it doesn’t include your custom code in the application and data layers. You still have to design those layers for high availability. The following techniques are useful when designing a highly available solution:

  1. Auto-scaling – design the solution to increase and decrease instances based on the active load.
  2. Self-healing – dynamically identify failures and redirect traffic to healthy instances.
  3. Exponential backoff – implement retries on the requester side; this simple technique increases the reliability of the application and takes care of intermittent failures (a minimal sketch follows this list).
  4. Broker pattern – implement a message-passing architecture using queues, allowing decoupling of components.
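
As a concrete illustration of the exponential backoff item above, here is a minimal, generic retry sketch in Python. It is not tied to any particular cloud SDK (most SDKs ship their own retry policies), and the flaky operation it wraps is hypothetical:

```python
import random
import time

def call_with_backoff(operation, max_attempts: int = 5, base_delay: float = 0.5):
    """Retry a flaky operation with exponential backoff and a little jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            # Exponential delay (0.5s, 1s, 2s, ...) plus random jitter.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)

# Usage: wrap an intermittently failing call, e.g. an HTTP request or DB query.
# result = call_with_backoff(lambda: flaky_service.get_order("42"))
```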

The price of availability

Please remember one thing: availability has a cost associated with it. The more available your solution needs to be, the more complexity is required, and the more expensive it will be.

A highly available solution requires a high degree of automation and self-healing capability, which in turn requires significant development, testing, and validation. That takes time, money, and the right resources, all of which add to the cost.

Finally, analyzing your system and calculating its theoretical availability will help you understand your solution’s capabilities and make the right design decisions. However, the actual availability is highly affected by your ability to react to failures and recover the system, either manually or through self-healing processes.