Rise above firefighter mode with structured incident management to enable effective problem management
Besides the small introduction, subscribers and consulting clients within this management domain have access to:
Step-by-step methodology to identify existing challenges, clarify process and role expectations, and create concise, effective process documentation to drive improvement. Review the executive brief at the start of the slide deck for an overview of the methodology and the value it can provide your organization.
Document process and role expectations to drive consistent and effective incident response.
Create incident response workflows to clarify steps and identify opportunities to improve.
Use these examples to guide your KB article templates and to clarify appropriate level of detail.
Modify these examples to suit your requirements and expedite incident status communications.
Define your problem management process, roles, and techniques.
Clarify problem intake and action steps in a workflow format that is easier for stakeholders to consume.
Use this example as a guide to create a problem ticket template in your ITSM tool.
Capture initiatives to educate staff and drive buy-in from senior leadership on improvements to your incident and problem management processes.
Translate ideas into specific initiatives to improve your incident and problem management processes.
[infographic]
Workshops offer an easy way to accelerate your project. If you are unable to do the project yourself, and a Guided Implementation isn't enough, we offer low-cost delivery of our project workshops. We take you through every phase of your project and ensure that you have a roadmap in place to complete your project successfully.
Improve how tickets are logged, categorized, and prioritized.
Efficient ticket processing and consistent treatment of tickets based on severity.
1.1 Review the incident lifecycle and your current challenges.
1.2 Improve how you identify, log, and categorize incidents.
1.3 Define a ticket prioritization scheme.
1.4 Consistent ticket prioritization scheme.
1.5 Drive more efficient ticket intake.
Challenges summary.
Action items to improve initial ticket processing.
Streamline how users submit tickets.
Clarify incident management steps, roles, and responsibilities.
Incident Management SOP and Workflows documented to drive consistent and effective incident response.
2.1 Document your target-state Incident Management Workflow.
2.2 Document your target-state Critical Incident Response Workflow.
2.3 Define SLOs and escalation rules.
Incident Management Workflow.
Critical Incident Response Workflow.
SLOs and escalation timelines.
Efficient and effective problem management, reducing incident recurrence and impact.
3.1 Identify knowledgebase article candidates and create templates to expedite incident response.
3.2 Identify opportunities to improve efficiency with shift-left and automation.
3.3 Define problem management.
3.4 Standardize your problem intake process.
3.5 Standardize your problem action process (investigate, root cause analysis, resolve).
Knowledgebase article candidates identified.
Action items to explore shift-left and automation opportunities.
Problem management parameters defined.
Problem intake process documented.
Plan how you will implement improvements.
Translate ideas into action, with specific steps to implement tangible improvements in the areas of people (training), process, and technology.
4.1 Establish appropriate problem management governance.
4.2 Create a plan to communicate process changes.
4.3 Create a project roadmap to implement improvements.
4.4 Review workshop results.
Problem Management SOP updated.
Initiatives to communicate process improvements.
Project roadmap to improve incident and problem management.
Workshop outcomes and next steps summarized.
Incident management teams often find themselves too busy to create the knowledgebase (KB) articles or track the incident data that will save them time in the future. It becomes a vicious cycle that keeps them constantly in firefighter mode.
The key to breaking this cycle is to keep it simple as you seek to implement better structure and processes and right-size your approach. For example, avoid complex categorization schemes, and start with KB articles for known recurring incidents. Don’t jump to automation before you have the processes and resources to support it.
Similarly, when it comes to problem management, keep it simple by starting with Sev 1 tickets and recurring incidents that are obvious candidates for problem management. Support problem management with a consistent, structured approach that enables you to prioritize your limited resources.
As you build momentum with quick wins and better structure, improved incident management will drive more effective problem management and reduce future incidents as the incident-problem lifecycle comes full circle.
Frank Trovato
Research Director, Infrastructure and Operations
Info-Tech Research Group
Establish a consistent incident management process to better categorize, prioritize, and resolve incidents.
Enable faster resolution time through well-defined escalation protocols.
Prevent incidents from happening in the first place by identifying and resolving the underlying root cause via problem management.
Leverage event management to predict problems before potential incidents occur.
IT managers have conflicting accountabilities. It can be difficult to set aside time for preventing incidents (i.e., problem management) when staff are already busy resolving existing incidents and working on projects.
Resolving incidents quickly boosts confidence in IT, but recurring incidents erode confidence, as does the need to use cumbersome workarounds.
Implement structured incident management to drive efficiency (e.g., effective use of categorization to drive appropriate ticket routing), and build out a knowledgebase to expedite future incident response.
On the problem management side, acknowledge that you have limited time for this, so start with obvious problems (e.g., recurring incidents) and then expand from there as problem management starts to reduce incident volume.
Info-Tech Insight
Effective problem management drives business value by preventing incidents, but it starts with good incident management that produces the data needed to identify problems that are driving recurring and related incidents. Specifically, logical categorization and resolution codes drive effective trend analysis to identify problems, and documenting troubleshooting, resolution details, and known errors provides a solid starting point for root cause analysis via problem management.
Unresolved issues
Low productivity
Poor planning
ITIL Incident Mgmt. Lifecycle | Key data and documentation to improve incident management |
---|---|
1. Detection (identify, triage) | Improve ticket intake methods and triage to gather better data upfront (e.g., a web portal that can make required data mandatory). |
2. Registration (log ticket) | Capture as much detail as you can (e.g., context, affected system) to expedite troubleshooting, post-incident review, & problem management. |
3. Classification (categorize, prioritize) | Define a categorization scheme that drives appropriate ticket routing and helps identify recurring incidents, but keep it simple: three layers max. |
4. Diagnosis (investigate) | Document known errors and KB articles for common incidents to increase first-call resolution and expedite troubleshooting. |
5. Resolution (solve, validate) | *Record solution details, update the category if necessary, and assign a resolution code to ensure more-accurate trends reporting. |
6. Closure (final updates) | Determine if a KB would expedite future troubleshooting or incident resolution. Don’t let lessons learned float away into the ether. |
*Category and resolution can also be updated at Closure if needed.
The Info-Tech difference:
Specifically, the following are pre-requisites for this blueprint:
 | 1. Optimize Ticket Intake and Routing | 2. Standardize and Streamline Incident Response | 3. Establish Effective Problem Management | 4. Implement Improvements |
---|---|---|---|---|
Phase Steps | | | | |
Phase Deliverables | | | | |
A good knowledge base expedites incident resolution and supports “shift-left” (e.g., enabling Tier 1 to solve incidents that would otherwise escalate to Tier 2 or 3).
Every incident is potentially an opportunity to document a solution or troubleshooting steps, or to establish relevant operational documentation needed to solve the incident.
If you capture this information only in the ticket or your own personal repository, you limit the ability to shift left and expedite future incident resolution.
All hands on deck doesn’t mean abandoning processes. Instead, supplement your existing incident management processes to maintain structure in your response. For example:
Time must be allocated to problem management to get the long-term benefits. It’s not going to be driven by the urgency of an outage, but rather the foresight to predict and prevent future incidents.
Effective problem management follows a structured process to get the most out of the time allocated to this proactive effort. This includes appropriate prioritization, a root cause analysis methodology, and a decision point on whether to adopt a workaround or continue to pursue a permanent solution.
If problem management is ad-hoc or “when I have time,” something else will always take precedence.
Use the examples as a guide for your KB article templates.
Workflows are critical to communicating process expectations and driving consistent execution.
Modify our examples to suit your requirements.
Identify, prioritize, and present initiatives to improve incident and problem management.
Clarify process and role expectations to improve consistency, efficiency, and effectiveness.
“Our team has already made this critical project a priority, and we have the time and capability, but some guidance along the way would be helpful.”
“Our team knows that we need to fix a process, but we need assistance to determine where to focus. Some check-ins along the way would help keep us on track.”
“We need to hit the ground running and get this project kicked off immediately. Our team has the ability to take this over once we get a framework and strategy in place.”
“Our team does not have the time or the knowledge to take this project on. We need assistance through the entirety of this project.”
Diagnostics and consistent frameworks used throughout all four options
A Guided Implementation (GI) is a series of calls with an Info-Tech analyst to help implement our best practices in your organization.
A typical GI is between eight and 12 calls over the course of four to six months.
Phase 1 | Phase 2 | Phase 3 | Phase 4 |
---|---|---|---|
Call #1: Scope requirements, objectives, and your specific challenges. | Call #3: Incident Management Workflows. | Call #6: Problem ticket sources. | Call #9: Plan how you will communicate changes. |
Call #2: Incident ticket intake and routing. | Call #4: Critical Incident Workflows. | Call #7: Problem management workflows. | Call #10: Create a project roadmap to implement improvements. |
 | Call #5: Complete the Incident Management SOP. | Call #8: Complete the Problem Management SOP. | |
Contact your account representative for more information.
workshops@infotech.com 1-888-670-8889
 | Day 1 | Day 2 | Day 3 | Day 4 |
---|---|---|---|---|
Activities | Optimize Ticket Intake and Routing | Standardize and Streamline Incident Response | Incident Wrap-Up and Establish Effective Problem Management | Problem Management Wrap-Up and Next Steps |
 | 1.1 Review the incident lifecycle and your current challenges. 1.2 Improve how you identify, log, and categorize incidents. 1.3 Define a ticket prioritization scheme. 1.4 Drive more efficient ticket intake. | 2.1 Document your target-state Incident Management Workflow. 2.2 Document your target-state Critical Incident Response Workflow. 2.3 Define SLOs and escalation rules. | 2.4 Identify knowledgebase article candidates and create templates to expedite incident response. 2.5 Identify opportunities to improve efficiency with shift-left and automation (introduction). 3.1 Define problem management. 3.2 Standardize your problem intake process. 3.3 Standardize your problem action process (investigate, root cause analysis, resolve). | 3.4 Establish appropriate problem management governance. 4.1 Create a plan to communicate process changes. 4.2 Create a project roadmap to implement improvements. 4.3 Review workshop results. |
Deliverables | | | | |
Optimize Ticket Intake and Routing
Standardize and Streamline Incident Response
Establish Effective Problem Management
Implement Improvements
This phase will walk you through the following steps:
Improve Incident and Problem Management
1.1.1 Identify challenges with your existing incident management processes
This step will guide you through the following content and activities:
This step involves the following participants:
You will use this data as you work through this blueprint to help you make decisions on what the target state of your incident management program looks like.
You will need:
This image will help remind you to search through your own ticket data to help guide your decisions during the design phase of incident management.
1) Detection: User reporting an issue, event triggering an alert, and so on. Conduct initial triage/discovery. Confirm it’s an incident (for service requests, follow a separate process).
2) Registration: Create/update the ticket based on initial triage (e.g., incident details) or monitoring system that generated the alert (e.g., relevant system).
3) Classification: Categorize, prioritize, and conduct initial investigation (e.g., check KB for known errors). Escalate or re-assign if necessary.
4) Diagnosis: Additional investigation if solution not already identified. Peer discussion, check KB, and/or consult vendor. Escalate or re-assign if necessary.
5) Resolution: Apply solution (permanent fix or workaround) to restore service. If applicable, submit a change request to move the fix into production.
6) Closure: Finalize ticket details, including status (Closed). Provide final update to affected users. Identify if a KB is needed to expedite future troubleshooting or incident resolution.
Note: Ideally, steps 1 to 3 are executed by Tier 1 staff so that Tier 2 and 3 are included only when an issue needs to be escalated. This drives lower-cost resolution and frees time for Tier 2 and 3 to focus on project work, more-complex incidents, and problem management. Ticket updates occur throughout and are finalized as needed at Closure.
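The six lifecycle steps and the tier note above can be sketched as a small ordered state model. This is a minimal illustration only, not part of the blueprint's templates; the names `Stage`, `TIER1_STAGES`, and `next_stage` are hypothetical, and the "skip Diagnosis when a solution is already known" rule is one reading of step 4.

```python
from enum import IntEnum


class Stage(IntEnum):
    """Incident lifecycle steps, numbered as in the list above."""
    DETECTION = 1
    REGISTRATION = 2
    CLASSIFICATION = 3
    DIAGNOSIS = 4
    RESOLUTION = 5
    CLOSURE = 6


# Per the note above, steps 1 to 3 are ideally handled by Tier 1.
TIER1_STAGES = {Stage.DETECTION, Stage.REGISTRATION, Stage.CLASSIFICATION}


def next_stage(current: Stage, solution_known: bool = False) -> Stage:
    """Return the stage that follows `current`.

    If a solution is already identified at Classification (e.g., from a
    KB article), Diagnosis is skipped and the ticket moves to Resolution.
    """
    if current is Stage.CLOSURE:
        raise ValueError("Closure is the final stage")
    if current is Stage.CLASSIFICATION and solution_known:
        return Stage.RESOLUTION
    return Stage(current + 1)
```

A model like this makes it easy to report how often tickets leave the Tier 1 stages without a known solution, which is one rough indicator of where KB articles would help.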
1-3 hours
1.2.1 Review and update your categorization scheme
1.2.2 Define resolution codes to further improve reporting
This step will guide you through the following content and activities:
This step involves the following participants:
Outcomes of this step
Incidents
Are unexpected disruptions to normal business processes and require attempts to restore services as soon as possible (e.g., the printer is not working).
Service Requests
Are tasks that don’t involve something that is broken or has an immediate impact on services. They can typically be scheduled (e.g., request for new software).
Incidents | Key Differences | Service Requests |
---|---|---|
Incidents will be prioritized based on urgency and impact to the organization. | Prioritization | Service requests will be scheduled and only increase in prioritization if there is a request process issue (e.g., I forgot to request Visio and I need it for a presentation today). Track these exceptions and report on non-compliance. |
Did incidents get resolved according to prioritization rules? | Service Level Agreement | Did service requests get completed on time? |
Incidents will typically need to be triaged at the service desk unless specific types of issues are set up to go directly to a specialist. | Routing of tickets | Service requests don’t need triage (typically) and can be routed automatically for approvals and fulfillment. |
Keep these guidelines in mind:
Type | Category | Subcategory |
---|---|---|
Hardware | Mobile Device | Surface |
 | | iPad |
 | Desktop | Laptops |
 | | Monitor |
 | | CPU |
 | Accessories | Docking Station |
 | | USB Drives |
 | | Webcams |
Infrastructure | Network | Switches/Routers |
 | | Connectivity/ISP |
 | | Wi-Fi |
 | | LAN/WAN Appliances |
Info-Tech Insight
Think about how you will use the data to determine which components need to be included in reports. If components won’t be used for reporting, routing, or warranty, reporting down to the component level adds little value.
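To illustrate how a three-level scheme can drive routing, the sketch below encodes the example table above as a nested mapping and validates a ticket's categorization before routing it. The queue names in `ROUTING` and the function name `route` are hypothetical; routing by Category alone is an assumption made to keep the example small.

```python
# Type -> Category -> Subcategories, as read from the example table above.
SCHEME = {
    "Hardware": {
        "Mobile Device": ["Surface", "iPad"],
        "Desktop": ["Laptops", "Monitor", "CPU"],
        "Accessories": ["Docking Station", "USB Drives", "Webcams"],
    },
    "Infrastructure": {
        "Network": [
            "Switches/Routers",
            "Connectivity/ISP",
            "Wi-Fi",
            "LAN/WAN Appliances",
        ],
    },
}

# Hypothetical queue names; routing here is by Category only.
ROUTING = {
    "Mobile Device": "service-desk",
    "Desktop": "service-desk",
    "Accessories": "service-desk",
    "Network": "network-team",
}


def route(ticket_type: str, category: str, subcategory: str) -> str:
    """Validate the three-level path and return the queue for its category."""
    if subcategory not in SCHEME.get(ticket_type, {}).get(category, []):
        raise ValueError(
            f"Unknown categorization: {ticket_type} > {category} > {subcategory}"
        )
    return ROUTING[category]
```

Rejecting unknown paths at intake keeps free-text categories out of the data, which is what makes later trend reporting usable.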
1-3 hours
When building the categories, ask these questions:
Download the Service Desk Ticket Categorization Schemes
A resolution code is a field within the ticketing system that clarifies the primary way the ticket was resolved – e.g., incident resolution required a configuration change or training for the user, etc. See the list to the right or the Resolution Codes section in the SOP for examples.
The resolution code improves reporting by adding another level to the categorization scheme. Use reporting by category and resolution code to identify knowledgebase article candidates, training needs, or potential problem ticket candidates.
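As a rough sketch of reporting by category and resolution code, the snippet below counts closed tickets per (category, resolution code) pair and surfaces the pairs that recur. The ticket data, the "Hardware replacement" code, and the threshold of three are illustrative assumptions, not values from the SOP.

```python
from collections import Counter

# Hypothetical closed tickets as (category, resolution_code) pairs.
tickets = [
    ("Wi-Fi", "Configuration change"),
    ("Wi-Fi", "Configuration change"),
    ("Wi-Fi", "User training"),
    ("Laptops", "Hardware replacement"),
    ("Wi-Fi", "Configuration change"),
]


def recurring_pairs(rows, min_count=3):
    """Return ((category, resolution_code), count) for pairs seen at least min_count times."""
    counts = Counter(rows)
    return [(pair, n) for pair, n in counts.most_common() if n >= min_count]


print(recurring_pairs(tickets))
# [(('Wi-Fi', 'Configuration change'), 3)]
```

A pair that keeps recurring with the same resolution code is a candidate for a KB article; one that recurs with a training-related code points to a training need instead.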
Activity Instructions:
Example Resolution Codes
Download the Incident Management and Service Desk SOP
1.3.1 Define your impact and urgency scales
This step will guide you through the following content and activities:
This step involves the following participants:
Tip: Four severity levels work well for most organizations. This allows Severity 1 to be reserved for truly critical incidents (which may require invoking your DRP or BCP if they can’t be resolved soon), with the three remaining levels covering High, Medium, and Low severities.
Severity Level = Impact x Urgency
Impact = The effect of the incident on the organization
Urgency = How time-sensitive the incident’s impact is
Impact \ Urgency | Critical | High | Medium | Low |
---|---|---|---|---|
Extensive | Severity 1 | 2 | 3 | 3 |
Significant | 2 | 2 | 3 | 4 |
Moderate | 3 | 3 | 3 | 4 |
Low | 3 | 4 | 4 | 4 |
Refer to the Incident Management and Service Desk SOP template for an example
Example:
Note: The example above and in the SOP are reasonable but not universal. Adjust the scales and/or the severities assigned to each cross-section if necessary to suit your requirements or circumstances.
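In practice, "Impact x Urgency" is a table lookup rather than a multiplication. The sketch below encodes the example matrix from this section so a severity can be assigned consistently; the function name `severity` is hypothetical, and the values should be adjusted to whatever scheme you settle on.

```python
IMPACT = ["Extensive", "Significant", "Moderate", "Low"]
URGENCY = ["Critical", "High", "Medium", "Low"]

# Rows follow IMPACT, columns follow URGENCY (example values from this section).
SEVERITY_MATRIX = [
    [1, 2, 3, 3],  # Extensive
    [2, 2, 3, 4],  # Significant
    [3, 3, 3, 4],  # Moderate
    [3, 4, 4, 4],  # Low
]


def severity(impact: str, urgency: str) -> int:
    """Look up the severity level for an impact/urgency pair."""
    return SEVERITY_MATRIX[IMPACT.index(impact)][URGENCY.index(urgency)]
```

For example, `severity("Extensive", "Critical")` returns 1 and `severity("Significant", "Low")` returns 4. An unrecognized label raises `ValueError`, which is usually preferable to silently assigning a default severity.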
1.4.1 Identify action items to improve ticket intake
This step will guide you through the following content and activities:
This step involves the following participants:
Outcomes of this step
The web portal is the most efficient intake method, but ensure it is user friendly before promoting it.
Maintain the phone for users from other locations and for critical incidents, but encourage users who call in to submit a ticket through the portal.
Email works well if it automatically creates a ticket in your ticketing system, but users often don’t provide enough information in unstructured emails. Use required fields and ticket templates to ensure the ticket is properly categorized.
If walk-ins are permitted, formalize the support so it can be scheduled and managed rather than interrupt driven. Ensure all interrupt-driven work is ticketed for proper workload management.
If chat is available, make it structured through the ticket queue management. Otherwise, it can lead to interruptions and prioritization challenges.
Formalize walk-ins
Build a self-service portal
How is the phone used?
Deal with email efficiently
Info-Tech Best Practice
The two most efficient intake channels should be encouraged for most tickets.
1-3 hours
Note: These potential initiatives will feed the project roadmap exercise in Phase 4 of this blueprint.
This phase will guide you through the following steps:
Activities
2.1.1 Use tabletop planning to capture your current-state workflow and gaps
2.1.2 Document your target-state workflow and where change needs to occur
2.1.3 Complete the RACI chart in your SOP to clarify expectations for each role
This step will guide you through the following content and activities:
This step involves the following participants:
Outcomes of this step
Workflow elements include:
Example workflow
Download Incident Management Workflow Library
1-3 hours
Tabletop planning is a walk-through exercise. In this case, we will be walking through how you would respond to an actual incident using the incident lifecycle steps.
1. Detection (identify, triage)
2. Registration (log ticket)
3. Classification (categorize, prioritize)
4. Diagnosis (investigate)
5. Resolution (solve, validate)
6. Closure (final updates)
1-3 hours
Download Incident Management Workflow Library
Download Incident Management and Service Desk SOP
Specifically, the RACI chart documents:
RACI chart example from the Incident Management and Service Desk SOP in this blueprint.
Tier | Duties | Example |
---|---|---|
Tier 1 | Ticket intake (initial triage, categorization, and assigning tickets if beyond Tier 1 expertise). Resolve low-complexity incidents or where a KB enables Tier 1 first-call resolution. | |
Tier 2 | More senior incident response, though not specialists. Tier 2 provides all of the capabilities of Tier 1 plus the ability to resolve incidents that require deeper knowledge. | |
Tier 2 (Specialist) | Reports to the infrastructure manager or the applications manager, but not Tier 3 expertise. Tier 2 specialists are required when certain permissions or expertise are required beyond the general Tier 2 staff capabilities. | |
Tier 3 | Reports to the infrastructure manager or the applications manager. Handles the most challenging incidents. | |
1-3 hours
Activities
2.2.1 Use tabletop planning to capture your current-state workflow and gaps for critical incidents
2.2.2 Document your target-state critical incident workflow and where change needs to occur
This step will guide you through the following content and activities:
This step involves the following participants:
Outcomes of this step
Workflow elements once an incident is identified as a Severity 1 include the following (in addition to normal non-critical incident management elements):
Download Incident Management Workflow Library
Example workflow
1-3 hours
1-3 hours
Download Incident Management Workflow Library
Activities
2.3.1 Define SLOs and escalation timelines for each severity level
2.3.2 Identify system owners to expedite escalations
This step will guide you through the following content and activities:
This step involves the following participants:
Outcomes
Use metrics to measure existing operational processes (e.g., time to respond to a ticket, time to resolve, etc.) to identify bottlenecks, drive improvement, and ultimately establish reasonable service level targets. Define those targets as Service Level Objectives (SLOs), which are internal IT-facing metrics to keep the focus initially on process improvement. You can choose to make SLOs business-facing to set expectations, but they are a goal, not a commitment.
As you mature your incident management program, you can be more confident about establishing business-facing commitments in a Service Level Agreement (SLA).
The table below further clarifies the differences between SLOs and SLAs.
Service Level Objectives (SLOs) | Service Level Agreements (SLAs) |
---|---|
Internal objectives within IT. | Service levels agreed to with your customer. |
SLOs can be defined for components of an overall service. | SLAs measure customer-facing service levels, not the timeline for IT sub-steps required to meet the SLA. |
SLO breaches are tracked to identify opportunities for improvement. | SLA breaches/compliance metrics are typically reported to the customer. |
For both SLOs and SLAs, escalation timelines are defined to ensure added resources are applied when needed for the best chance of meeting the overall SLO or SLA. |
Note: For additional guidance on metrics, including the use of tension metrics to avoid gaming the system or driving unintended behavior, please refer to the Standardize the Service Desk blueprint.
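The relationship between an escalation timeline and the overall SLO can be sketched as a simple elapsed-time check. Every number in `TARGETS` below is a placeholder chosen for illustration, not a recommended target; the function name `ticket_status` and the status labels are also hypothetical.

```python
from datetime import timedelta

# Placeholder targets per severity level (assumptions, not recommendations).
# The escalation point always falls before the resolution SLO.
TARGETS = {
    1: {"escalate_after": timedelta(minutes=30), "resolve_slo": timedelta(hours=4)},
    2: {"escalate_after": timedelta(hours=2), "resolve_slo": timedelta(hours=8)},
    3: {"escalate_after": timedelta(hours=8), "resolve_slo": timedelta(days=2)},
    4: {"escalate_after": timedelta(days=2), "resolve_slo": timedelta(days=5)},
}


def ticket_status(severity: int, elapsed: timedelta) -> str:
    """Classify an open ticket by how long it has been open."""
    target = TARGETS[severity]
    if elapsed >= target["resolve_slo"]:
        return "slo_breached"
    if elapsed >= target["escalate_after"]:
        return "escalate"
    return "on_track"
```

With these placeholder values, a Severity 2 ticket open for three hours returns `"escalate"`: added resources are applied while there is still time to meet the eight-hour objective.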
Example – Sev 2 incident timeline:
1-3 hours
Example SLO and Escalation Timelines
1-3 hours
To automate or further streamline ticket routing and escalations, also do the following:
Activities
2.4.1 Identify and assign candidates for KB articles
2.4.2 Create incident status templates to simplify communication
2.4.3 Create an incident report template for critical incidents
This step will guide you through the following content and activities:
This step involves the following participants:
Outcomes of this step
Knowledge Management
Knowledgebase
Use the knowledgebase to document:
Service desk teams are often overwhelmed by the idea of building and maintaining a comprehensive integrated knowledgebase that covers an extensive amount of information.
Don’t let this idea stop you from building a knowledgebase! It takes time to build a comprehensive knowledgebase and you have to start somewhere.
Start with existing documentation or knowledge that is easy to document and you will soon see the benefits.
Then continue to build and improve from there. Eventually, knowledge management will be a part of the culture.
Note: This section focuses on getting started with capturing KB articles.
For more details on building and maintaining a knowledgebase, refer to the blueprint Standardize the Service Desk.
Use the Incident Knowledge Base Article Examples document in this blueprint as a guide to create templates in your ITSM tool Knowledge Base module or an equivalent tool that allows for version control, triggering reviews, and role-based access to automate at least some of the knowledge management tasks.
Key elements to include in your template and KB articles:
Example Incident KB Article – download the Incident Knowledge Base Article Examples document for more details and examples
1-3 hours
For more information about setting up a Knowledge Base, see the blueprint Standardize the Service Desk.
1-3 hours
Use the Incident Status Updates and Incident Report Templates document in this blueprint as a guide to create communication templates (e.g., in your ITSM tool) to simplify and standardize status updates.
Download the Incident Status Updates and Incident Report Templates to see the examples indicated above. Note: The Incident report is described on a separate slide.
1-3 hours
Incident reports are typically created only for severity 1 issues or as requested by senior leadership: i.e., where the impact of the incident warrants providing a formal report to senior leadership. The Incident Status Updates and Incident Report Templates document provides an example (see the excerpt to the right).
Download the Incident Status Updates and Incident Report Templates to see the full example and guidelines.
2.5.1 Identify shift-left and automation initiatives for your organization
This step will guide you through the following content and activities:
This step involves the following participants:
Outcomes of this step
Download Optimize the Service Desk with a Shift-Left Strategy for a more in-depth look
Don’t get swept away by the hype.
It’s easy to fall into the trap of thinking that AI will seamlessly automate all your processes and solve all your problems. AI and automation will certainly support your shift-left strategy, but they need to be implemented carefully and slowly, with the right foundations behind them, in order to reap the benefits.
AI is a long-term investment and takes time and resources to plan and execute. The best way to start to realize the benefits of AI is by building your AI-enabled capabilities around the goals of your shift-left project and organizational goals.
The scope of AI is also beyond just the service desk, so consider the full business benefits of automation solutions before starting an automation project.
Optimize the Service Desk With a Shift-Left Strategy - The best type of service desk ticket is the one that doesn’t exist.
Accelerate Your Automation Processes - Integrate automation solutions and take the first steps to building an automation suite.
Build a Chatbot Proof of Concept - Create value for your business with your chatbot implementation.
See the blueprints above for more details on adding automation to your service desk.
1-3 hours
Automation Goal/Objective | Tasks/Projects/Implementations |
---|---|
Expediting self-service and ticket intake | |
Automatically categorizing incidents based on issue | |
Automatically routing tickets to the right queue/agent | |
This phase will walk you through the following steps:
Activities
3.1.1 Outline your problem management lifecycle challenges
This step will guide you through the following content and activities:
This step involves the following participants:
Outcomes of this step
*Problem Identification (Intake)
Problem Control (Action)
Error Control (Output / Value Created)
*ITIL refers to problem identification, problem control, and error control. For greater clarity, this blueprint refers to these phases as intake, action, and output (value). In addition, ITIL uses the term problem categorization to refer to capturing relevant details as part of logging the ticket. To avoid confusion with incident categories, this blueprint refrains from describing that process as “categorization.”
Potential sources for problem tickets:
Problem Management
1-3 hours
3.2.1 Validate your process to identify related (or recurring) incidents
3.2.2 Identify opportunities for proactive and non-IT problem management
3.2.3 Set guidelines for identifying problem tickets
3.2.4 Create a problem ticket template
3.2.5 Define problem prioritization guidelines based on risk
This step will guide you through the following content and activities:
This step involves the following participants:
Outcomes of this step
This example includes several intake channels, including Sev 1 tickets, system monitoring, and a monthly review of incidents to identify problem ticket candidates.
If you aren’t ready yet for more advanced problem intake, such as via systems monitoring, then set that as a future goal and adjust your workflow accordingly.
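A periodic review for problem ticket candidates can start as simply as the sketch below, which flags every Sev 1 ticket plus any category that recurs within the review period. The `Incident` fields, the function name `problem_candidates`, and the recurrence threshold of three are assumptions for illustration; your ITSM tool's reporting would normally supply this data.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Incident:
    ticket_id: str
    severity: int
    category: str


def problem_candidates(incidents, recurrence_threshold=3):
    """Return (Sev 1 ticket IDs, categories recurring at or above the threshold)."""
    sev1_ids = [i.ticket_id for i in incidents if i.severity == 1]
    counts = Counter(i.category for i in incidents)
    recurring = sorted(c for c, n in counts.items() if n >= recurrence_threshold)
    return sev1_ids, recurring


# Hypothetical incidents from one monthly review.
month = [
    Incident("INC-101", 3, "Wi-Fi"),
    Incident("INC-102", 1, "Connectivity/ISP"),
    Incident("INC-103", 3, "Wi-Fi"),
    Incident("INC-104", 4, "Laptops"),
    Incident("INC-105", 2, "Wi-Fi"),
]

sev1, recurring = problem_candidates(month)
print(sev1, recurring)
# ['INC-102'] ['Wi-Fi']
```

The output is a list of candidates for review, not an automatic decision; whether a candidate becomes a problem ticket still depends on your intake guidelines and prioritization.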
The key elements to include in your intake workflow are:
Download the Problem Management Workflow
1-3 hours
An effective event management program will have these characteristics:
Example Event Categories and Potential Actions
Event Categories | Potential Actions |
---|---|
Normal operation. E.g.: | Identified as Informational. |
Exceptions (alarms indicate failure). E.g.: | Identified as a Warning. |
Thresholds exceeded. E.g.: | Identified as a Caution. |
1-3 hours
1-3 hours
Note:
Example Problem Ticket Intake Guidelines
1-3 hours
Use the example Problem Ticket Template as a guide to create a template in your ITSM tool to clarify the information to capture to support problem investigation and tracking. Depending on your ITSM tool, the template might also facilitate auto-filling some fields.
Below are recommended fields to include at a minimum (besides auto-generated fields such as a ticket number):
Example Problem Ticket Prioritization Scheme
Activities
3.3.1 Conduct a series of RCAs to clarify how and when to use each technique
3.3.2 Establish how you will decide between a permanent solution and a workaround or if you will leave it as a known error
3.3.3 Document your problem intake and action workflow
Key elements to include:
Download the Problem Management Workflow Library
Method | Description | Effort | When to Use |
---|---|---|---|
Brainstorm and Eliminate | Brainstorm possibilities from a wide variety of perspectives. Eliminate unlikely causes. | Low | Use as a starting point. This might be all you need if the solution is easily identified, or if your environment has low complexity. |
Five Whys | Ask Why? five times in an effort to dig deeper than the initial suspected root cause. Helps clarify the issue and potential solutions. | Medium | Use this if brainstorming does not generate a suitable solution. Drill down further with the Five Whys technique. |
Ishikawa/Fishbone Diagram | Use a fishbone diagram to capture the problem statement (the spine), problem categories (the ribs), and brainstorm potential causes (branches off the ribs). A visual method to organize potential causes. | High | Use this where there are potentially multiple root causes, or where the other approaches do not generate a suitable outcome. |
Process
Example Output
Problem Statement: The microwave isn’t working; everyone’s fish is cold.
Potential Causes (Brainstormed)
The strikethroughs represent unlikely causes or causes that have been eliminated empirically by investigation.
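The brainstorm-and-eliminate technique can be tracked with a very small data structure. The sketch below is illustrative only: the `Cause` class is a hypothetical helper, and the listed causes are invented for this example rather than taken from the example output.

```python
from dataclasses import dataclass


@dataclass
class Cause:
    description: str
    eliminated: bool = False  # corresponds to a strikethrough
    reason: str = ""          # why the cause was ruled out


def remaining(causes):
    """Return descriptions of causes that have not been eliminated yet."""
    return [c.description for c in causes if not c.eliminated]


# Hypothetical causes for the microwave problem statement.
causes = [
    Cause("Microwave is unplugged", eliminated=True, reason="Checked: plugged in"),
    Cause("Circuit breaker tripped"),
    Cause("Door latch is broken", eliminated=True, reason="Door closes and latches"),
    Cause("Internal component failure"),
]
print(remaining(causes))
```

Recording the reason for each elimination keeps the investigation auditable if the problem recurs.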
Process
Example Output
In this example, the ultimate root cause appears to be a cold office. However:
Process
Example Output
1-3 hours
Example Problems
1-3 hours
Example: CRM not sending meeting invites
Decision Inputs | Example |
---|---|
Root cause | Conflicting SalesOps CRM processes overwriting invitation send. |
Current problem risk | Five incidents per month. Typically the issue is detected, but when it’s not, a client meeting is missed. |
Permanent solution | Configuration change requiring approximately 30 minutes of dev time. Low risk of affecting other aspects of the CRM. Straightforward roll back if it causes unexpected issues. |
Workaround | Developer can manually push out invite when the issue is reported via an incident ticket (~30 minutes to resolve). No risk with the workaround; the risk is that the issue is not always detected. |
Decision | Low risk and time commitment for the permanent solution. Task assigned and will go through change management for approval. |
Note: The initial workaround might come from the incident resolution. Problem management would seek to find a permanent solution or a better workaround.
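The decision inputs in the example above can be compared with a simple rule of thumb. This is a sketch, not a prescribed method: the field names, the 12-month horizon, and the rule itself are assumptions for illustration, and any real decision would still go through change management.

```python
from dataclasses import dataclass


@dataclass
class DecisionInputs:
    incidents_per_month: int
    incident_minutes: int  # effort to handle each incident today
    fix_minutes: int       # one-time effort for the permanent solution
    fix_risk: str          # "low", "medium", or "high"
    has_workaround: bool


def recommend(inputs, horizon_months=12):
    """Hypothetical rule: fix permanently when the fix is not high risk and
    costs no more than the recurring effort over the chosen horizon."""
    recurring = inputs.incidents_per_month * inputs.incident_minutes * horizon_months
    if inputs.fix_risk != "high" and inputs.fix_minutes <= recurring:
        return "permanent solution"
    return "workaround" if inputs.has_workaround else "known error"


# Values taken from the CRM example: five incidents per month, about
# 30 minutes per workaround, about 30 minutes of dev time for the fix.
crm = DecisionInputs(5, 30, 30, "low", True)
print(recommend(crm))
```

Effort is only one input. The example also weighs the risk that an undetected incident causes a missed client meeting, which a simple effort comparison does not capture.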
1-3 hours
Problem management has fairly standard components, so you can use the example workflow in this blueprint as your starting point. Follow these steps to create your workflow:
Activities
3.4.1 Identify key performance indicators to track problem management success
3.4.2 Identify what you will track on a Problem Management dashboard
3.4.3 Identify problem management roles and responsibilities
3.4.4 Create a meeting schedule for the problem management team
1-3 hours
Example Problem Management KPIs
Key performance indicator | Description |
---|---|
Number of incidents per problem | How many incidents are linked to each problem ticket? |
Mean time to root cause (MTRC) | How long does it take the problem management team to find the root cause of the problem? |
Average root cause analysis effort | How many hours does it take to identify the root cause (base this on mean time to root cause and the number of FTEs involved)? |
Percentage of problems not resolved | How often is a problem returned to the backlog with no permanent solution or workaround identified? |
Average problem severity | How many problems are at the higher end of the risk scale? |
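Several of the KPIs in the table above are simple aggregates over problem tickets. The following is a minimal sketch, assuming a hypothetical record schema (`linked_incidents`, `hours_to_root_cause`, `ftes`, `resolved`). Your ITSM tool will expose different field names.

```python
from statistics import mean


def problem_kpis(problems):
    """Compute example problem management KPIs from a list of dicts."""
    # Only problems with an identified root cause count toward MTRC and effort.
    with_rc = [p for p in problems if p["hours_to_root_cause"] is not None]
    return {
        "incidents_per_problem": mean(p["linked_incidents"] for p in problems),
        "mtrc_hours": mean(p["hours_to_root_cause"] for p in with_rc),
        # Rough effort estimate: time to root cause multiplied by FTEs involved.
        "avg_rca_effort_hours": mean(
            p["hours_to_root_cause"] * p["ftes"] for p in with_rc
        ),
        "pct_not_resolved": 100 * sum(not p["resolved"] for p in problems)
        / len(problems),
    }


# Illustrative data only.
problems = [
    {"linked_incidents": 5, "hours_to_root_cause": 8.0, "ftes": 2, "resolved": True},
    {"linked_incidents": 3, "hours_to_root_cause": 4.0, "ftes": 1, "resolved": True},
    {"linked_incidents": 1, "hours_to_root_cause": None, "ftes": 1, "resolved": False},
    {"linked_incidents": 7, "hours_to_root_cause": 12.0, "ftes": 3, "resolved": False},
]
kpis = problem_kpis(problems)
print(kpis)
```

Multiplying elapsed time by FTEs overstates effort when staff work on a problem part-time, so treat that figure as an upper bound.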
1-3 hours
Tip: Make the problem dashboard available to all members of the problem management team. This will help team leaders manage the tickets assigned to them and report successes.
1-3 hours
Problem management is a part-time role for most, which makes it even more important to clarify expectations. A RACI chart is an effective tool for setting and communicating those expectations.
Problem Management RACI Chart Example
1-3 hours
Optimal problem management involves holding regular (rather than ad hoc) meetings that are consistent in membership, focused, and retrospective.
Optimize Ticket Intake and Routing
Standardize and Streamline Incident Response
Establish Effective Problem Management
Implement Improvements
This phase will guide you through the following steps:
Improve Incident and Problem Management
Activities
4.1.1 Identify and assign communication action items
1-3 hours
Incident and problem management depend on collaboration with a wide range of stakeholders; for most of them, incident and problem management is not their primary concern, so shifting behavior will require effective communication and some perseverance.
With the above in mind, use the Communication Initiatives Template to identify specific initiatives to communicate process changes, your expectations of stakeholders, and what’s in it for them, as outlined below:
Communication Initiatives Template
4.2.1 Prioritize initiatives based on factors such as effort, cost, and risk
4.2.2 Review the dashboard to fine-tune the roadmap
Use the Incident and Problem Management Project Roadmap Tool in this blueprint as your template to build your project roadmap to improve incident and problem management. At a high level, the steps are:
Download the Incident and Problem Management Project Roadmap Tool
Note: This tool is based on the DRP Roadmap Tool (although it’s labeled for DRP, the same tool can be used to create any project roadmap as we have done here). For additional instructions if needed and any updates to the source project roadmap tool, refer to Info-Tech’s Create a Right-Sized Disaster Recovery Plan blueprint.
1-3 hours
Use the Incident and Problem Management Project Roadmap Tool to prioritize initiatives:
Project Roadmap Data Entry Example
1-3 hours
Review your project roadmap results:
Project Roadmap Dashboard Example
This blueprint helped you define or improve your incident and problem management processes and create supporting documentation such as KB articles and status update templates.
Phase 1: Optimize Ticket Intake and Routing
Phase 2: Standardize and Streamline Incident Response
Phase 3: Establish Effective Problem Management
Phase 4: Implement Improvements
Create a consistent customer service experience for service desk patrons and increase efficiency, first-call resolution, and end-user satisfaction with the Service Desk.
Develop and Implement a Security Incident Management Program: Create a scalable incident response program for a wide range of potential security incidents. Refer to this blueprint for additional details on overall security incident management.
Create a Ransomware Incident Response Plan: Take a deeper dive specifically into ransomware readiness and incident response.
Create a Right-Sized Disaster Recovery Plan: Avoid over- or under-provisioning your disaster recovery (DR) solution. Prioritize business requirements, determine your ability to meet those requirements, and then identify projects to close the gap between your current and required DR capabilities.
Implement Crisis Management Best Practices: Don’t be another example of what not to do. Implement an effective crisis response plan to minimize the impact on business continuity, reputation, and profitability.
Hardy Baker, Incident and Problem Manager, Waste Management
Rob England, Managing Director, Two Hills Ltd; Blogger at Itskeptic.org
Rishi Bhargava, Co-Founder, Demisto Inc.
Steven Ingram, Data Engineer, Wave HQ
George Jucan, Founder, Organizational Performance Enablers Network
Rick Moroz, Associate Director, Information Systems, University of Guelph
Note: In addition to the above, several anonymous external interviewees contributed to this project.