Introduction

‘Root cause analysis’ is a common term heard across a variety of industries today. Manufacturers developed this system in the 1950s for a better understanding of industrial events.

For instance, Toyota invented the 5 Whys technique, a root cause analysis tool.

Over time, root cause analysis has found use in industries such as safety management, software development, business management, cybersecurity, data science, engineering, healthcare, maintenance, quality control, risk management and shipping.

By reading this article, you will gain a deeper understanding of the core concept of root cause analysis and become familiar with the root cause analysis process. You will also go through a list of the popular analysis techniques used in the industry today.

Let’s start from the beginning.

What is root cause analysis (RCA)?

Root cause analysis is an approach to fault finding and analysis that helps us identify the root cause behind any kind of incident.

It is a reactive process, meaning that it is performed after an incident has occurred. But as root causes are found and fixed over time, it turns proactive as we are now preventing problematic incidents from occurring in the future.

RCA states that behind every incident, there may be one or more root causes. These root causes lead to faults that lead to further faults.

This kind of cascading effect leads to a final fault that is far more devastating than the initial faults and can cause major loss of life, damage to property and harm to the environment.

RCA finds the first point of failure, where it all begins. These points are usually far easier to fix than the final fault.

Fixing them also prevents several other faults that originate from the same root causes. These events haven’t occurred yet but could if we don’t fix the root causes of the problem.

To visualise RCA, let us take a look at the following example.

Visual representation of root cause analysis

The graphic shows that one or more root causes can lead to bigger events/faults that lead to still bigger faults resulting in a high-consequence event.

This high-consequence event is the visible fault/symptom that rears its head when the safety barriers have failed, forcing us to deal with it.

When we eliminate root causes, we eliminate or at least reduce the possibility of larger events and the final event.

Root cause analysis example

When a bearing overheats in a machine, there could be several reasons. It could be poor quality, misalignment, overloading, lubrication failure, improper mounting and so on.

All of these causes lead to an overheated bearing.

Now, an overheated bearing may still go unnoticed. But when the inevitable bearing failure occurs, it becomes a visible fault or symptom that leads to increased losses.

Thus in our example:

  1. Root causes = poor quality, misalignment, overloading, lubrication failure, improper mounting
  2. Intermediate cause = overheated bearing
  3. Final event = bearing failure

Visual representation of the root cause analysis example

Suppose we did notice the bearing overheating, and changed the bearing in time. This will prevent bearing failure in the short term but the root causes stay as they are.

If the overheating was due to shaft misalignment, it is prudent to fix that in order to not only prevent bearing failure but also overheating and frequent replacement.

Thus, identifying and rectifying root causes is the most effective approach to not only prevent incidents but keep operating costs low.

This is not to say that we must not fix the visible symptom. But we must not lose sight of the root causes that led to the incident.

We must also put barriers in place to prevent them from happening again.
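
The bearing example can also be written down as a small cause graph and traced back programmatically. The following is a minimal sketch, not a prescribed method: the `CAUSES` dictionary and the `find_root_causes` helper are hypothetical names introduced here for illustration, and the graph contains only the causes listed in the example above.

```python
# Hypothetical cause graph for the bearing example.
# Each key is an effect; each value lists its direct causes.
CAUSES = {
    "bearing failure": ["overheated bearing"],
    "overheated bearing": [
        "poor quality",
        "misalignment",
        "overloading",
        "lubrication failure",
        "improper mounting",
    ],
}


def find_root_causes(event, causes):
    """Return the causes of `event` that have no recorded cause of their own."""
    roots = []
    seen = set()
    stack = [event]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        parents = causes.get(node, [])
        # A node with no recorded parents is treated as a root cause,
        # unless it is the event we started from.
        if not parents and node != event:
            roots.append(node)
        stack.extend(parents)
    return sorted(roots)


if __name__ == "__main__":
    for cause in find_root_causes("bearing failure", CAUSES):
        print(cause)
```

Replacing the bearing only removes the "overheated bearing" node for a while. The entries returned by `find_root_causes` are the ones that need a permanent fix.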

When to use root cause analysis?

RCA can sometimes be a lengthy process spanning several days or even weeks, and it does not always guarantee a conclusive result.

This can prevent organizations from taking this approach in every case and restrict its use to certain special cases. Some of the important cases where root cause analysis can provide immense value are as follows.

Frequent faults of a similar nature

Sometimes, fixing the visible symptom also rectifies the root cause and prevents the fault from happening again.

Let’s go back to our bearing overheating problem. If the root cause was improper mounting and it was fixed when the bearing was replaced, then it is unlikely that the problem will reappear.

But if the bearing still overheats after repeated replacements, the problem is worth investigating further to reduce the risk of bearing failure.

For instance, if a pump bearing is overheating due to small particles in the pumped fluid, we can fix the root cause by placing a filter on the inlet line.

Such fixes apply to a variety of cases and prevent frequent failures.

Impact to safety

In cases where a safety incident has occurred, a root cause analysis is highly recommended.

This allows the leadership to eliminate the systemic or underlying root causes of the incident rather than only fixing the immediate or apparent cause.

Such an approach will reduce the frequency of, or eliminate, the same or similar incidents and significantly improve workplace safety.

There are several advantages to using RCA to reduce the number of safety incidents.

Besides preventing injuries and casualties, a vigorous safety program results in effective hazard control, higher staff retention, revenues and process reliability, and lower downtime as well as lower production, maintenance and insurance premium costs.

Critical failures

Critical failures are failures that result in a significant halt to production systems, catastrophic repair costs and high impacts on safety, security and continuation of services at recommended levels.

Such failures, when they happen, must be analyzed using root cause analysis.

Even in cases where they have not happened but you have identified inconsistencies that may lead to such an event, root cause analysis (RCA) is a worthwhile approach to prevent the incident as well as the causes leading to it.

Whether or not to carry out an RCA to anticipate a critical failure depends on how strongly the piece of equipment or subprocess in question affects the entire system.

How to do a root cause analysis?

Root cause analysis may come across as an esoteric term, but it is one of the simplest and most straightforward processes out there.

Its application may not necessarily be limited to a small number of high-consequence cases. We can indeed use it for routine tasks once we clearly understand and apply the steps.

Such application can not only yield significant results in our day-to-day lives but also prepare us for root cause analysis of high-complexity problems when the need arises.

The core principles remain the same across all disciplines.

The root cause analysis process is made up of six simple steps. These are as follows:

  1. Problem definition
  2. Data collection
  3. Identification of contributing factors
  4. Root cause identification
  5. Find and apply solutions
  6. Verify their effectiveness

Problem definition

The first step in any problem-solving process is defining the problem. In root cause analysis as well, we first define the problem by asking several questions such as:

  • What is happening?
  • What are the visible symptoms?

Straightforward answers when defining the problem will help us set the scope for the root cause analysis.

Data collection

The next step is to collect as much data as possible from multiple reliable sources. You should be able to answer the following questions:

  • How long has the problem existed?
  • How does the problem compromise established safety/operational standards?
  • Was the task performed according to set procedures or were there any deviations? If so, how many were there and how serious was each?
  • Has anybody dealt with this problem before? What actions and control measures were put in place?

By the end of this step, we should have a pretty good idea of the factors that led to the problem.

Identification of contributing factors

In this step, we investigate the factors or causes that may have contributed to the incident.

We can use several tools such as fault tree analysis, 5 Whys and Pareto analysis to find the various causes and their relative weight in causing the incident.

These causes may be dependent or independent of each other.

The important point is to not stop until you have a long list of factors that can contribute to the incident. The more factors you can list in this step, the more effective the RCA will be.
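
Pareto analysis, one of the tools mentioned above, ranks contributing factors by how often they occur and highlights the few that account for most incidents. The sketch below is one simple way to do that. The `pareto` function, the 80% threshold and the incident counts in `EXAMPLE_COUNTS` are all assumptions made up for illustration, not measured data.

```python
def pareto(counts, threshold=0.8):
    """Rank causes by count and return those that together reach
    `threshold` of all recorded incidents.

    Each returned item is (cause, count, cumulative_share).
    """
    total = sum(counts.values())
    if total == 0:
        return []
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    vital = []
    cumulative = 0
    for cause, count in ranked:
        cumulative += count
        vital.append((cause, count, cumulative / total))
        if cumulative / total >= threshold:
            break
    return vital


# Hypothetical incident counts per contributing factor (illustrative only).
EXAMPLE_COUNTS = {
    "lubrication failure": 42,
    "misalignment": 31,
    "overloading": 12,
    "improper mounting": 9,
    "poor quality": 6,
}

if __name__ == "__main__":
    for cause, count, share in pareto(EXAMPLE_COUNTS):
        print(f"{cause}: {count} incidents, cumulative {share:.0%}")
```

With these made-up numbers, the first three factors cover 85% of incidents, so they would be the natural place to start the search for root causes.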

Root cause identification

Once we have the list of contributing factors, we trace each one back until we have found the root cause of the problem.

The 5 Whys technique is a good starting point. We take a visible symptom or fault and keep asking why to get to the first level cause, then higher level causes and finally to the root cause.

The root cause is the first cause, the one that leads to all other causes and finally to the safety incident.

Allow us to elaborate on this with a root cause analysis example of an equipment failure.

Finding the root cause using the 5 Whys technique

Thus, one of the root causes appears to be the use of a paper-based maintenance program due to which the maintenance slipped through the cracks.

In other words, the lack of an automated maintenance program that reminds us when service is due led to all of the failures and to the final, most undesirable event: equipment failure.
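
The 5 Whys walk can be sketched in a few lines of code. In the example below, only the starting symptom (equipment failure) and the final answer (a paper-based maintenance program) come from the text above. The intermediate answers in `ANSWERS` are hypothetical placeholders, and `five_whys` is a helper name introduced here for illustration.

```python
def five_whys(symptom, answers):
    """Walk from a symptom through successive 'why' answers.

    Returns the question/answer lines and the last answer, which is
    the candidate root cause.
    """
    if not answers:
        raise ValueError("at least one answer is needed")
    lines = []
    current = symptom
    for depth, answer in enumerate(answers, start=1):
        lines.append(f"Why {depth}: why did '{current}' happen? -> {answer}")
        current = answer
    return lines, current


# Hypothetical chain; only the first symptom and the last answer
# are taken from the article.
ANSWERS = [
    "a worn component was not replaced",
    "the scheduled service was not carried out",
    "nobody knew the service was due",
    "there was no reminder for due service sessions",
    "the maintenance program is paper-based",
]

if __name__ == "__main__":
    lines, root_cause = five_whys("equipment failure", ANSWERS)
    for line in lines:
        print(line)
    print("Candidate root cause:", root_cause)
```

The number five is a guideline rather than a rule. The chain stops when the next "why" no longer yields a cause that the organization can act on.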

Find and apply solutions

Once we have all of the root causes responsible for the final incident, we find effective barriers or controls to prevent those root causes in the future.

We may turn to the Hierarchy of Controls to find effective control measures. The Hierarchy arranges five types of controls from most effective to least effective in controlling risks from root causes.

The five types of controls are Elimination, Substitution, Engineering, Administrative and PPE controls. We shall learn about each of these types of controls in a subsequent article.
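
One way to apply the Hierarchy when several candidate controls are on the table is to sort them by control type. The sketch below assumes the ordering given above. The `Control` enum, the `rank_controls` helper and the candidate measures are hypothetical and serve only as an illustration, with the inlet filter borrowed from the pump bearing example earlier in the article.

```python
from enum import IntEnum


class Control(IntEnum):
    """Control types, numbered from most effective (1) to least effective (5)."""
    ELIMINATION = 1
    SUBSTITUTION = 2
    ENGINEERING = 3
    ADMINISTRATIVE = 4
    PPE = 5


def rank_controls(candidates):
    """Sort (description, Control) pairs from most to least effective type."""
    return sorted(candidates, key=lambda item: item[1])


# Hypothetical candidate measures for the pump bearing example.
CANDIDATES = [
    ("Issue heat-resistant gloves to technicians", Control.PPE),
    ("Add a weekly bearing temperature check", Control.ADMINISTRATIVE),
    ("Install a filter on the pump inlet line", Control.ENGINEERING),
]

if __name__ == "__main__":
    for description, control in rank_controls(CANDIDATES):
        print(f"{control.name}: {description}")
```

The ranking only reflects the type of control. Cost and feasibility still have to be weighed separately before a measure is chosen.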

Verify effectiveness

The job doesn’t end once we have placed controls.

We must verify that these controls are working as intended. This ensures the root cause is no longer part of the process.

By repeating the above steps every time an incident occurs, we can eliminate most of the root causes, increasing the organization’s safety performance dramatically in the process.
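
A simple way to check whether a control works is to compare incident rates before and after it was introduced. The sketch below is a rough illustration, not a statistical test. The function names, the rate per 1,000 operating hours, the 50% reduction threshold and the sample figures are all assumptions chosen for the example.

```python
def incident_rate(incidents, operating_hours):
    """Incidents per 1,000 operating hours."""
    if operating_hours <= 0:
        raise ValueError("operating_hours must be positive")
    return incidents / operating_hours * 1000


def control_is_effective(before, after, min_reduction=0.5):
    """Compare two (incidents, operating_hours) periods.

    Returns True when the incident rate fell by at least `min_reduction`
    (0.5 means a 50% drop). The threshold is an arbitrary assumption.
    """
    rate_before = incident_rate(*before)
    rate_after = incident_rate(*after)
    if rate_before == 0:
        raise ValueError("no incidents in the baseline period to compare against")
    reduction = 1 - rate_after / rate_before
    return reduction >= min_reduction


if __name__ == "__main__":
    # Hypothetical figures: 6 incidents in 4,000 hours before the control,
    # 1 incident in 4,000 hours after it.
    print(control_is_effective((6, 4000), (1, 4000)))
```

With small incident counts, a drop like this can also happen by chance, so the comparison is best repeated over a longer observation period.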

Root cause analysis methods

There are many effective tools available for us to carry out root cause analysis. Some of the most common ones are:

  • 5 Whys
  • Fishbone diagram
  • Change analysis/Event analysis
  • Scatter chart
  • Barrier analysis
  • Pareto analysis
  • Fault tree analysis
  • DMAIC
  • 8D problem solving
  • Affinity diagram
  • Kepner-Tregoe Problem Solving and Decision Making

Conclusion

Root cause analysis is a powerful tool to deal with undesirable incidents of all kinds and prevent their recurrence.

But we can also use root cause analysis to analyze positive events, get to their root cause and replicate it for more positive outcomes.

For instance, if there were unusually high sales on a particular day, we can use root cause analysis to glean greater insight into why it happened and what we can do to increase its frequency and magnitude.

Thus, root cause analysis provides a great analytical and systematic approach to finding why incidents occurred and what can be done to increase or decrease their frequency.