ITIL problem management guide
IT problem management is the ITIL practice that reduces the likelihood and impact of incidents by identifying actual and potential causes and by managing workarounds and known errors. In plain English, incident management restores service fast, while problem management makes repeat failures less likely and less damaging.
This page keeps the official ITIL framing, but it also pulls in evidence from large-scale cloud operations, postmortem practice, safety science, and organizational learning so the best practices are not just process theory.
Quick List: ITIL Problem Management Best Practices
If you need the short version first, these are the practices that matter most. The detailed sections below explain what each one means, why it works, and how to apply it in an ITIL-aligned workflow.
1. Separate restore from prevent
Restore service quickly through incident management, then continue the deeper cause analysis through problem management.
2. Work reactively and proactively
Use repeat incidents, weak signals, monitoring gaps, and risk trends to open problems before the next outage.
3. Open records early
Start a problem record as soon as a pattern or risk is real enough to manage, even if the cause is already partly known.
4. Prioritize by impact and recurrence
Do not run the backlog by age alone. Focus on blast radius, business criticality, repeat rate, and cost of delay.
5. Use blameless analysis
Look for contributing causes across people, process, tooling, monitoring, design, and dependencies instead of hunting for one person to blame.
6. Publish workarounds fast
Known errors and workarounds should be operational assets that service desk and on-call teams can reuse immediately.
7. Favor strong actions
Prefer automation, guardrails, detection, safer defaults, and design changes over reminders and one-time retraining alone.
8. Measure the right outcomes
Track repeat incidents, time to workaround, time to fix identified, action completion, and backlog health, not just closure speed.
One-line summary
Good ITIL problem management is not slow paperwork after an outage. It is the disciplined work of reducing repeat pain through faster learning, better workarounds, stronger fixes, and better risk decisions.
What Is IT Problem Management in ITIL?
Officially, a problem is a cause, or potential cause, of one or more incidents. The purpose of problem management is to reduce the likelihood and impact of incidents by identifying actual and potential causes and by managing workarounds and known errors.
What is a problem?
In ITIL language, a problem is not the visible outage itself. It is the underlying cause, or possible cause, behind one or more incidents.
What is problem management?
Problem management is the practice that investigates those causes, reduces the chance of repeat incidents, and lowers impact through fixes, workarounds, and managed known errors.
What does it produce?
Typical outputs include a problem record, linked incidents, analysis notes, workarounds, known error information, improvement actions, and where appropriate a change or engineering fix.
What success looks like
A mature practice means fewer repeat incidents, faster mitigation when incidents do happen, clearer operational knowledge, and better decisions about when to fix, mitigate, improve process, or accept risk.
Official idea in plain English
Incident management answers, “How do we get service back now?” Problem management answers, “Why did this happen, how do we stop it happening again, and what can we do in the meantime?”
Important nuance for accuracy
Not every problem ends with a permanent code fix. In ITIL-aligned practice, a problem can also close because the impact is mitigated with a workaround, a process or training improvement solves the issue, or the remaining exposure is deliberately accepted as a risk.
- Fix: remove the underlying cause.
- Known error: enough understanding exists to guide diagnosis and response.
- Workaround: reduce impact while the full fix is pending, unnecessary, or not yet approved.
Incident Management vs Problem Management in ITIL
These two practices work together, but they are not interchangeable. Treating them as the same thing usually slows restoration, weakens learning, or both.
| Area | Incident management | Problem management |
|---|---|---|
| Main mission | Restore normal service operation as quickly as possible and minimize negative impact. | Reduce the likelihood and impact of incidents by addressing underlying causes and managing workarounds and known errors. |
| Main question | How do we get users back to service now? | Why is this happening, and how do we reduce recurrence or impact next time? |
| Time horizon | Immediate and short-term. | Near-term and long-term. |
| Typical trigger | Service disruption, degradation, alert, or user report. | Repeat incidents, major incidents, weak signals, trend analysis, or identified service risk. |
| Common outputs | Mitigation, service restoration, communications, incident record. | Problem record, linked incidents, workaround, known error, change request, action plan, closure rationale. |
| Good success signal | Fast detection and restoration with controlled impact. | Fewer repeat incidents, lower impact, better knowledge reuse, and strong preventive actions. |
Fast mental model
Incident management restores. Problem management learns, prevents, mitigates, and improves. Mature teams do both without forcing one practice to do the other practice’s job.
ITIL Problem Management Process: A Practical Lifecycle
ITIL does not force one rigid workflow for every organization, but the practical shape is usually similar. The lifecycle below keeps the work structured without turning it into bureaucracy.
Detect a pattern or a risk
Open the door with repeat incidents, a major incident, a post-incident review, a near miss, a monitoring gap, a supplier issue, capacity risk, or any other signal that the same pain may happen again.
Open the problem record
Create one place to manage the issue. Link related incidents, services, alerts, changes, knowledge articles, and teams so the whole pattern is visible instead of scattered.
Assess impact and priority
Decide how urgently the problem should be worked based on user impact, recurrence, critical services, blast radius, regulatory exposure, supplier dependency, and cost of delay.
Investigate contributing causes
Build a timeline, review recent changes, examine telemetry, inspect dependencies, and identify the technical and non-technical conditions that allowed the incident pattern to happen.
Publish workaround and known error information
If a safe workaround exists, publish it early. Make it searchable and usable by service desk, operations, engineering, and any supplier teams that will face the same issue again.
Fix, mitigate, improve, or accept risk
The end state may be a permanent engineering fix, infrastructure change, operational guardrail, process improvement, training improvement, or explicit decision to accept the remaining risk.
Why this lifecycle matters
Mature problem management keeps enough structure to measure bottlenecks, action quality, and recurrence, but it stays flexible enough to handle technical, process, supplier, and human-factor problems in one operating model.
ITIL Problem Management Best Practices With Science-Backed Rationale
Each practice below ties together official ITIL guidance and evidence from real operations research. The goal is not to make the article academic. The goal is to ensure the advice survives contact with reality.
Separate incident recovery from problem resolution
Restore service first when users are hurting. Then keep the problem work going until the underlying pattern is understood and a deliberate end state is chosen.
What it means
Do not hold restoration hostage to a perfect diagnosis. Rollback, reroute, restart, or use a safe workaround if that is the fastest path back to acceptable service.
Why it works
Operational studies of large cloud services show that many incidents are mitigated without an immediate code fix. That makes fast stabilization and later deeper analysis a practical necessity, not a compromise.
How to apply it
Create a clear handoff from major incident to problem record. Capture the workaround, affected services, and next owner before the incident bridge closes.
Work both reactively and proactively
Problem management should not wait for the next serious outage. It should also mine risks from monitoring gaps, trend changes, near misses, and repeated low-severity incidents.
What it means
Treat repeated tickets, weak alerts, supplier instability, certificate expiry, capacity drift, and recurring manual fixes as valid intake paths for a problem record.
Why it works
ITIL explicitly covers actual and potential causes. Empirical cloud research also shows that tuning thresholds and adding telemetry can materially improve automated detection.
How to apply it
Run a weekly proactive review of repeat incidents, “should-have-detected-earlier” events, and high-risk service trends. Feed the output straight into the problem backlog.
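As a rough illustration of the weekly review, the sketch below groups a week of incidents by service and symptom and flags any pair that repeats often enough to deserve a problem record. The `Incident` fields, the normalized `symptom` label, and the threshold of three repeats are all assumptions made for this example, not ITIL requirements or fields from any particular ticketing tool.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    incident_id: str
    service: str
    symptom: str  # hypothetical normalized label, e.g. "cert-expiry"


def proactive_candidates(incidents, min_repeats=3):
    """Return (service, symptom) pairs seen at least min_repeats times.

    Results are ordered by repeat count (highest first), then alphabetically.
    """
    counts = Counter((i.service, i.symptom) for i in incidents)
    return sorted(
        (pair for pair, n in counts.items() if n >= min_repeats),
        key=lambda pair: (-counts[pair], pair),
    )


# Illustrative data only.
week = [
    Incident("INC-1", "checkout", "timeout"),
    Incident("INC-2", "checkout", "timeout"),
    Incident("INC-3", "checkout", "timeout"),
    Incident("INC-4", "login", "cert-expiry"),
]
candidates = proactive_candidates(week)
```

In practice the threshold would likely differ per service tier, and low-count but high-risk signals such as certificate expiry would still need a human look.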
Open problem records early and link everything related
One good problem record is a control point. It gives the organization one place to connect incidents, changes, services, workarounds, owners, and decisions.
What it means
Raise the record as soon as there is enough signal to manage, whether the cause is still a mystery or already partly known. If the cause is partly known but further action is needed, the problem still exists.
Why it works
Research on linked incidents shows that related incident information helps teams find mitigation paths and identify causes faster. Axelos’s DWP case also emphasizes linking incidents to assess total impact.
How to apply it
Require links to related incidents, changes, services, and knowledge articles. Add fields for recurrence pattern, affected tiers, workaround status, and closure rationale.
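One way to picture those links and fields is a small record type. This is a minimal sketch, not the schema of any ITSM tool: the class name, field names, the `workaround_status` values, and the choice to treat only incident and service links as mandatory before closure are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProblemRecord:
    problem_id: str
    title: str
    owner: str
    linked_incidents: List[str] = field(default_factory=list)
    linked_changes: List[str] = field(default_factory=list)
    affected_services: List[str] = field(default_factory=list)
    knowledge_articles: List[str] = field(default_factory=list)
    recurrence_pattern: str = ""
    affected_tiers: List[str] = field(default_factory=list)
    workaround_status: str = "none"  # assumed values: none, draft, published
    closure_rationale: Optional[str] = None

    def missing_links(self) -> List[str]:
        """Names of the link fields assumed mandatory that are still empty."""
        required = {
            "linked_incidents": self.linked_incidents,
            "affected_services": self.affected_services,
        }
        return [name for name, value in required.items() if not value]

    def can_close(self) -> bool:
        """Closable only with the mandatory links and a recorded rationale."""
        return not self.missing_links() and bool(self.closure_rationale)
```

A record created with only an ID, title, and owner reports both link fields as missing, and `can_close()` stays false until a closure rationale is written down.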
Prioritize by impact, recurrence, and risk instead of age alone
The most important problems are not always the oldest ones. They are the ones that create the most risk, burn the most operational time, or threaten the most important services.
What it means
Rank problems using user impact, service criticality, recurrence, blast radius, supplier dependency, compliance exposure, and the cost of leaving the issue unresolved.
Why it works
Studies of online services show that incidental incidents can consume a large share of maintenance effort, while poor triage and reassignment can greatly inflate mitigation time.
How to apply it
Use a simple scoring model that combines impact, frequency, and strategic importance. Review the model with service owners so prioritization reflects business reality, not ticket gravity alone.
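A simple scoring model of this kind might look like the sketch below. The 1 to 5 scales and the weights are hypothetical placeholders; the point is only that the score combines impact, frequency, and strategic importance rather than ticket age.

```python
from dataclasses import dataclass


@dataclass
class ProblemScoreInput:
    problem_id: str
    impact: int                # 1 (minor) to 5 (severe), assumed scale
    frequency: int             # 1 (rare) to 5 (constant), assumed scale
    strategic_importance: int  # 1 (low) to 5 (critical service), assumed scale


# Hypothetical integer weights; agree the real ones with service owners.
WEIGHTS = {"impact": 5, "frequency": 3, "strategic_importance": 2}


def priority_score(p: ProblemScoreInput) -> int:
    """Weighted sum of the three ratings; higher means work it sooner."""
    for name in WEIGHTS:
        value = getattr(p, name)
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be between 1 and 5, got {value}")
    return sum(weight * getattr(p, name) for name, weight in WEIGHTS.items())


# Illustrative backlog only.
backlog = [
    ProblemScoreInput("PRB-101", impact=2, frequency=5, strategic_importance=1),
    ProblemScoreInput("PRB-102", impact=5, frequency=3, strategic_importance=5),
    ProblemScoreInput("PRB-103", impact=1, frequency=1, strategic_importance=2),
]
ranked = sorted(backlog, key=priority_score, reverse=True)
```

With these sample weights, a frequent but low-impact problem (PRB-101) still ranks below a severe problem on a critical service (PRB-102), which is the kind of trade-off worth reviewing with service owners.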
Use blameless, multi-factor analysis instead of single-cause blame
Most serious failures are not explained well by one person, one step, or one “root cause.” Strong analysis looks at the full set of contributing conditions that made the failure possible.
What it means
Build a timeline and ask what changes, assumptions, missing signals, permissions, dependencies, process gaps, and communication conditions shaped the event.
Why it works
Google’s SRE guidance recommends blameless postmortems focused on contributing causes and preventive actions. Psychological safety research also shows that people learn more when the environment is safe for admitting mistakes, asking for help, and speaking up.
How to apply it
Use a facilitator, a shared live document, and cross-functional participation. Ask, “What made this action reasonable at the time?” before asking, “What should change?”
Treat workarounds, known errors, and troubleshooting guides as first-class assets
In mature operations, workaround quality is part of reliability. If teams cannot find or trust the guidance during an incident, the same problem keeps costing time.
What it means
Document symptoms, scope, safe commands, rollback conditions, side effects, owners, last review date, and links to the related problem record and services.
Why it works
In Axelos’s DWP case, workarounds are allowed at any stage of the problem lifecycle. Microsoft incident research also found that weak documentation, procedures, and manual effort are real causes of mitigation delay.
How to apply it
Version known error articles, review them after major changes, and make them available to service desk, on-call, and supplier teams. Retire stale guidance aggressively.
Turn reviews into strong actions and automation, not weak reminders
The value of a problem review is not the document. It is the quality and follow-through of the actions that reduce future risk.
What it means
Prefer stronger controls such as automation, safer defaults, better observability, permissions fixes, deployment guardrails, capacity protection, and simpler rollback paths.
Why it works
Safety science literature warns that education, reminders, and new policies alone are usually weak actions. Google’s postmortem guidance also centers follow-up actions as the real prevention mechanism.
How to apply it
Label actions by strength, assign a single owner, add a due date, and review implementation status until the change is in production and the risk reduction is observable.
Measure balanced outcomes and keep learning social
Metrics should help teams improve the system, not game the process. The best dashboards balance speed, quality, recurrence, knowledge reuse, and action completion.
What it means
Track recurrence, time to workaround, time to fix identified, action-item completion, backlog health, and how often major or repeat incidents are linked to a problem record.
Why it works
Axelos’s DWP case shows that single KPI targets can distort behavior, including delaying the opening of a problem record just to satisfy the metric. Balanced metrics reduce that risk.
How to apply it
Review the metrics with incident management, change enablement, service desk, knowledge management, service owners, and suppliers so learning is shared across the people who can actually change the system.
The pattern behind all eight practices
Strong problem management shortens the path from “this happened” to “we know what to do next time” and then to “the system is now harder to break in the same way.” That is why workarounds, shared knowledge, and preventive action quality matter just as much as root cause notes.
Metrics That Make IT Problem Management Better
Metrics should help decision-making and improvement. They should not push teams into hiding problems, delaying records, or optimizing for a number that looks good but changes nothing.
Metrics worth tracking
1. Time to workaround: How fast can the organization publish a safe response once the problem is recognized?
2. Time to fix identified: How long does it take to reach a credible remediation path?
3. Repeat incidents after closure: Did the problem really stop hurting people?
4. Linked major or repeat incidents: Are important incidents being connected to problem records consistently?
5. Action completion rate: Are the preventive actions actually landing on time?
6. Backlog age by priority: Is risk accumulating in the wrong places?
7. Known error article health: Are workarounds reviewed, current, and usable?
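Three of these metrics can be computed from basic timestamps. The sketch below assumes problems, actions, and incidents are available as plain dictionaries with the keys shown; those key names and the sample data are invented for illustration and do not come from any specific tool.

```python
from datetime import datetime, timedelta
from statistics import median


def median_hours_to_workaround(problems):
    """Median hours from problem recognition to published workaround.

    Problems without a published workaround are skipped.
    """
    hours = [
        (p["workaround_published"] - p["recognized"]).total_seconds() / 3600
        for p in problems
        if p.get("workaround_published") is not None
    ]
    return median(hours) if hours else None


def action_completion_rate(actions):
    """Share of actions completed on or before their due date."""
    if not actions:
        return None
    on_time = sum(
        1 for a in actions
        if a["completed"] is not None and a["completed"] <= a["due"]
    )
    return on_time / len(actions)


def repeat_incidents_after_closure(problem, incidents):
    """Count incidents linked to the problem that opened after it closed."""
    closed = problem.get("closed")
    if closed is None:
        return 0
    return sum(
        1 for i in incidents
        if i["problem_id"] == problem["id"] and i["opened"] > closed
    )


# Illustrative data only.
t0 = datetime(2024, 1, 1, 9, 0)
problems = [
    {"id": "PRB-1", "recognized": t0,
     "workaround_published": t0 + timedelta(hours=2),
     "closed": t0 + timedelta(days=10)},
    {"id": "PRB-2", "recognized": t0,
     "workaround_published": t0 + timedelta(hours=6), "closed": None},
    {"id": "PRB-3", "recognized": t0,
     "workaround_published": None, "closed": None},
]
actions = [
    {"due": t0 + timedelta(days=5), "completed": t0 + timedelta(days=4)},
    {"due": t0 + timedelta(days=5), "completed": t0 + timedelta(days=7)},
    {"due": t0 + timedelta(days=5), "completed": None},
    {"due": t0 + timedelta(days=5), "completed": t0 + timedelta(days=5)},
]
incidents = [
    {"problem_id": "PRB-1", "opened": t0 + timedelta(days=12)},
    {"problem_id": "PRB-1", "opened": t0 + timedelta(days=1)},
    {"problem_id": "PRB-2", "opened": t0 + timedelta(days=12)},
]
```

Using the median rather than the mean keeps one very slow workaround from hiding an otherwise healthy trend, though either choice is a judgment call.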
Official example metric from ITIL guidance
Axelos’s measurement guidance gives an example productivity metric for the problem management practice. It is useful as a throughput signal, but it should never be the only number running the practice.
- Use carefully: throughput is useful, but it does not prove impact reduction by itself.
- Balance it: pair throughput with recurrence, workaround speed, and action quality.
Metric trap to avoid
A low count of problem records is not proof of operational health. If incidents are still frequent or hard to resolve, a low problem count can simply mean the organization is failing to open and manage the work explicitly.
Problem Management Roles
Problem management breaks when ownership is unclear. Not because people are lazy, but because nobody knows who is supposed to drive the work once the incident is over.
Clear ownership keeps problems from sitting in a backlog with no movement.
Core Roles In Practice
- Problem Owner
  - Owns the problem record end to end
  - Keeps progress moving
  - Ensures the issue does not get abandoned
- Service Owner
  - Decides business impact and priority
  - Flags critical services and customer risk
- Engineering / Operations
  - Investigates causes
  - Builds fixes or mitigations
  - Validates technical changes
- Service Desk
  - Uses and validates workarounds
  - Feeds repeat incident signals back into the problem
- Change / Release Owners
  - Control rollout of fixes
  - Ensure safe implementation
Non-Negotiable Rule
- One owner per problem
- One owner per action
- Shared input from all relevant teams
If ownership is split or vague, the problem will stall. Every time.
Fix vs Mitigate vs Accept
The Real Decision
Not every problem should be fixed immediately.
Problem management is about deciding whether to:
- Fix it permanently
- Mitigate it safely
- Improve process instead
- Accept the risk
How To Decide
Evaluate the problem across:
- Frequency → How often does it happen
- Impact → How bad is it when it happens
- Blast Radius → How many users or services are affected
- Operational Cost → How much time it keeps burning
- Risk Type → Security, compliance, financial
Decision Paths
- Fix
  - High impact
  - High recurrence
  - Critical systems
- Mitigate
  - Reliable workaround exists
  - Fix is complex or risky
  - Impact can be controlled
- Process Improvement
  - Root cause is human or workflow-related
  - Training, automation, or guardrails solve it
- Accept Risk
  - Low impact
  - Rare occurrence
  - Fix cost outweighs benefit
Critical Rule
If you accept risk:
- Document it
- Assign an owner
- Set a review point
Silent acceptance is not strategy. It is just neglect.
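The decision paths and the risk-acceptance rule can be sketched in a few lines. Everything here is a simplification for illustration: the boolean inputs, the order in which the paths are checked, and the class and field names are assumptions, and the output is a starting point for discussion rather than a verdict.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Assessment:
    high_impact: bool = False
    high_recurrence: bool = False
    critical_system: bool = False
    reliable_workaround: bool = False
    fix_is_risky: bool = False
    cause_is_workflow: bool = False
    fix_cost_outweighs_benefit: bool = False


def suggest_path(a: Assessment) -> str:
    """Suggest a path using an assumed check order; not a substitute for judgment."""
    if a.cause_is_workflow:
        return "process improvement"
    if a.reliable_workaround and a.fix_is_risky:
        return "mitigate"
    if a.high_impact or a.high_recurrence or a.critical_system:
        return "fix"
    if a.fix_cost_outweighs_benefit:
        return "accept risk"
    return "needs review"


@dataclass
class RiskAcceptance:
    problem_id: str
    rationale: str
    owner: str
    review_on: Optional[date]

    def gaps(self) -> List[str]:
        """List what is missing: documentation, an owner, or a review point."""
        missing = []
        if not self.rationale.strip():
            missing.append("rationale")
        if not self.owner.strip():
            missing.append("owner")
        if self.review_on is None:
            missing.append("review_on")
        return missing
```

A risk acceptance with an owner but no rationale and no review date would report both gaps, which is exactly the silent acceptance the rule above warns against.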
Common IT Problem Management Mistakes
These mistakes show up in real organizations because they feel efficient in the short term. They usually make reliability worse over time.
Treating problem management as delayed incident management
When the same people, goals, and queues handle both practices, teams often either rush the analysis or slow down restoration. Separate the goals even if the same humans participate in both.
Waiting for certainty before opening a problem
If you only open a record after the cause is perfectly known, the organization loses traceability, impact visibility, and the chance to coordinate recurring pain in one place.
Searching for one root cause and one guilty person
Single-cause storytelling feels neat, but complex failures usually need a systems view with multiple contributing causes and multiple intervention points.
Assuming every problem must end in a code change
Many problems are resolved or contained through rollback, configuration, infrastructure action, automation, process improvement, or managed risk acceptance instead.
Keeping workarounds in private chat or tribal memory
A workaround that only one engineer knows is not part of your operating model. Publish it, review it, and make sure the next responder can actually use it safely.
Running the practice from one KPI
One target often creates gaming. Balanced metrics produce better behavior because they reward risk reduction, learning quality, and usable knowledge, not just motion.
FAQ: ITIL Problem Management Questions People Actually Ask
These answers are intentionally concise, but each one is detailed enough to stand on its own in search results and internal documentation.
What is IT problem management in ITIL?
It is the ITIL practice that reduces the likelihood and impact of incidents by identifying actual and potential causes of incidents and by managing workarounds and known errors.
What is the difference between incident and problem management?
Incident management restores service as quickly as possible. Problem management investigates causes, manages workarounds, and reduces the chance or impact of repeat incidents.
Is problem management reactive or proactive?
Both. It reacts to incidents that already happened and proactively addresses risks or weak signals that could generate future incidents.
Does every problem require a permanent fix?
No. Some problems close through a workaround, process improvement, training improvement, infrastructure change, or explicit risk acceptance. The key is that the closure decision is deliberate and recorded.
Why are blameless reviews important?
Because people share more useful information when the goal is learning instead of punishment. That makes the analysis deeper and the preventive actions stronger.
What metrics should a mature team track?
Focus on repeat incidents, time to workaround, time to fix identified, action completion, backlog health, and knowledge quality. Use throughput metrics as part of the picture, not the whole picture.
Best short answer for an intro paragraph
ITIL problem management is the practice of finding and managing the causes of incidents so the organization can reduce repeat failures, lower service impact, and respond faster with better knowledge the next time something goes wrong.
Sources Behind This Article
These links were chosen to keep the article grounded in current official ITIL guidance and credible evidence from operations research, postmortem practice, safety science, and organizational learning.
Official ITIL context and definitions
PeopleCert ITIL framework and current certification landscape
Used for the official problem management purpose, scope framing, and the current Version 5 plus ITIL 4 practice-module context.
ITIL in practice
Used for lifecycle examples, closure logic, KPI cautions, workarounds, risk acceptance, and practical operating patterns.
Postmortem and learning culture
Used for the case for blameless reviews, focus on contributing causes, and the importance of follow-up actions and shared learning.
Empirical cloud incident research
Microsoft Research: How to Fight Production Incidents?
Microsoft Research: An Empirical Investigation of Incident Triage
Used for evidence about non-code causes, mitigation patterns, documentation quality, linked incidents, and the operational cost of poor triage.
Safety science and corrective action quality
Used for the caution against weak actions, single-cause thinking, and analysis that stops at paperwork instead of system change.
Organizational learning and psychological safety
Amy Edmondson: Psychological Safety and Learning Behavior in Work Teams
Used for the connection between psychologically safe teams and the learning behaviors needed for better incident and problem analysis.
Use problem management to reduce repeat pain, not just close records
The best ITIL problem management practices create a loop: detect patterns, publish usable workarounds, understand contributing causes, implement stronger actions, and verify that recurrence and impact really go down.