Incident Management
Incident management is the process of detecting and handling negative events. The goal is to find a quick resolution or workaround that reduces losses. This contrasts with problem management, which addresses the root cause of an incident to prevent it from recurring. For example, if a system is down, an incident response team may reboot a machine to resolve the incident. The incident is closed when service is restored. Problem management would then investigate why the machine was malfunctioning to determine whether further corrective action is required. The problem is closed when the root cause of the incident is addressed.
Root Cause Analysis
When an incident occurs, there are often several layers of cause. For example, the causes of a late flight may include:
The aircraft requires maintenance.
A sensor is malfunctioning.
An electrochemical reaction damaged a component of the sensor.
The sensor wasn't tested at last maintenance.
The sensor is low quality.
The sensor has a flawed design.
The airline lacks the capability to evaluate quality in any meaningful way when purchasing aircraft.
The manufacturer rushed the design and production of the aircraft.
The manufacturer has a culture that prioritizes time to market over quality.
The manufacturer has executive pay incentives for achieving time to market but does not meaningfully penalize executives for quality failures.
Root cause analysis tends to be a complex, open-ended exercise, so any two teams that look at the same problem are likely to reach different conclusions. As a rule of thumb, the goal is to find the cause with the greatest explanatory power that is within your ability to fix. For example, the cause "the sensor wasn't tested at last maintenance" is likely to be selected because the airline can address it to prevent future incidents.
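The rule of thumb above can be sketched in a few lines of Python. This is a minimal illustration, not a standard method: the `Cause` class, the `select_root_cause` function, the numeric `explanatory_power` scores and the `fixable` flags are all assumptions made for the example, and a different team could reasonably score the same causes differently.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Cause:
    description: str
    explanatory_power: float  # hypothetical 0-1 score assigned by the review team
    fixable: bool             # whether the team has the ability to address it


def select_root_cause(causes: List[Cause]) -> Optional[Cause]:
    """Return the fixable cause with the greatest explanatory power, or None."""
    fixable = [c for c in causes if c.fixable]
    if not fixable:
        return None
    return max(fixable, key=lambda c: c.explanatory_power)


# Illustrative scores only, from the airline's point of view.
late_flight_causes = [
    Cause("A sensor is malfunctioning.", 0.3, True),
    Cause("The sensor wasn't tested at last maintenance.", 0.6, True),
    Cause("The sensor has a flawed design.", 0.7, False),
    Cause(
        "The manufacturer has a culture that prioritizes time to market over quality.",
        0.9,
        False,
    ),
]

selected = select_root_cause(late_flight_causes)
print(selected.description if selected else "No fixable cause identified")
```

Filtering on `fixable` before ranking reflects the point that deeper causes, such as the manufacturer's culture, may explain more but are outside the airline's control.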
Corrective Action
Corrective action is an action that solves a current problem. For example, replacing a faulty sensor on an aircraft.
Preventative Action
Preventative action is an action that prevents future incidents. For example, testing sensors on a monthly basis to prevent safety issues and flight delays.
Design Thinking
Problems can often be solved with design practices such as reliability engineering. For example, redesigning a user interface to prevent latent human error.
Resilience
Resilience is an approach to solving problems by designing your society, city, organization, processes and practices in a fundamentally sound way. For example, a city that uses land in a high-risk tsunami zone as an easily evacuated park, as opposed to a city that builds hospitals, schools, houses, nuclear power facilities and other vulnerable structures on the same land.
Continuous Improvement
In many cases, a problem isn't resolved with a single action but requires an ongoing and sustained program of improvement. For example, a series of pervasive customer service incidents that require training and improvements to your customer service culture that may take years to fully achieve.
Knowledge Management
Problem management tends to generate a great deal of knowledge. For example, you may identify process gaps that aren't prioritized to be fixed. This knowledge can be captured, shared and communicated.
Known Problem Management
Known problem management is the process of monitoring for incidents related to a known problem in order to apply a standard workaround or fix. For example, a manual workaround that a team can use to complete their work when a system is experiencing availability issues.
Problem Review
Problem review is the process of reviewing each problem to identify organizational weaknesses that can be improved.
Problem Communication
Problems tend to capture the attention of stakeholders such as executive management, business units and customers. As such, communicating the status of problems and managing relationships with stakeholders is a key element of problem management. For example, managing communication with a customer who has reported a problem.
Risk Management
Risk management is the process of identifying potential incidents and treating them before they occur. This can be integrated with problem management as problem management teams can contribute to the identification and reduction of risk.
Quality Assurance
Quality assurance is the practice of addressing the root cause of quality failures. This is essentially problem management under a different name or vice versa.
Summary
Problem management is the process of investigating and fixing the root cause of issues and incidents. This is often a formal process that is covered by standards such as ITIL and ISO/IEC 20000.
Overview: Problem Management
Definition: The process of resolving and preventing the root cause of incidents.