Build an event management practice that is situated in the larger service management environment. Purposefully choose valuable events to track and predefine their associated actions to cut down on data clutter.
Event management is useless in isolation. The goals come from the pain points of other ITSM practices. Build handoffs to other service management practices to drive the proper action when an event is detected.
Create a repeatable framework to define monitored events, their root cause, and their associated action. Record your monitored events in a catalog to stay organized.
Besides the small introduction, subscribers and consulting clients within this management domain have access to:
Engineer your event management practice with tracked events informed by the business impact of the related systems, applications, and services. This storyboard will help you properly define and catalog events so you can properly respond when alerted.
Use this tool to define your workflow for adding new events to track. This cookbook includes the considerations you need to include for every tracked event as well as the roles and responsibilities of those involved with event management.
Use this tool to record your tracked events and alerts in one place. This catalog allows you to record the rationale, root-cause, action, and data governance for all your monitored events.
Use this template to help define your event management handoffs to other service management practices including change management, incident management, and problem management.
Use this tool to implement and continually improve upon your event management process. Record, prioritize, and assign your action items from the event management blueprint.
Workshops offer an easy way to accelerate your project. If you are unable to do the project yourself, and a Guided Implementation isn't enough, we offer low-cost delivery of our project workshops. We take you through every phase of your project and ensure that you have a roadmap in place to complete your project successfully.
Determine goals and challenges for event management and set the scope to business-critical systems.
Defined system scope of Event Management
Roles and responsibilities defined
1.1 List your goals and challenges
1.2 Monitoring and event management RACI
1.3 Abbreviated business impact analysis
Event Management RACI (as part of the Event Management Cookbook)
Abbreviated BIA (as part of the Event Management Cookbook)
Define your in-scope configuration items and their operational conditions
Operational conditions, related CIs and dependencies, and CI thresholds defined
2.1 Define operational conditions for systems
2.2 Define related CIs and dependencies
2.3 Define conditions for CIs
2.4 Perform root-cause analysis for complex condition relationships
2.5 Set thresholds for CIs
Event Management Catalog
Pre-define actions for every monitored event
Thresholds and actions tied to each monitored event
3.1 Set thresholds to monitor
3.2 Add actions and handoffs to event management
Event Catalog
Event Management Workflows
Effectively implement event management
Establish an event management roadmap for implementation and continual improvement
4.1 Define your data policy for event management
4.2 Identify areas for improvement and establish an implementation plan
Event Catalog
Event Management Roadmap
Event management is useless in isolation.
Event management creates no value when implemented in isolation. However, that does not mean event management is not valuable overall. It must simply be integrated properly in the service management environment to inform and drive the appropriate actions.
Every step of engineering event management, from choosing which events to monitor to actioning the events when they are detected, is a purposeful and explicit activity. Ensuring that event management has open lines of communication and actions tied to related practices (e.g. problem, incident, and change) allows efficient action when needed.
Catalog your monitored events using a standardized framework to allow you to know:
Properly engineering event management allows you to effectively monitor and understand your IT environment and bolster the proactivity of the related service management practices.
Benedict Chang
Research Analyst, Infrastructure & Operations
Info-Tech Research Group
Strive for proactivity. Implement event management to reduce response times of technical teams to solve (potential) incidents when system performance degrades.
Build an integrated event management practice where developers, service desk, and operations can all rely on event logs and metrics.
Define the scope of event management including the systems to track, their operational conditions, related configuration items (CIs), and associated actions of the tracked events.
Managed services, subscription services, and cloud services have reduced the traditional visibility of on-premises tools.
System complexity and integration with the above services have increased, making true cause and effect difficult to ascertain.
Clearly define a limited number of operational objectives that may benefit from event management.
Focus only on the key systems whose value is worth the effort and expense of implementing event management.
Understand what event information is available from the CIs of those systems and map those against your operational objectives.
Write a data retention policy that balances operational, audit, and debugging needs against cost and data security needs.
More is NOT better. Even in an AI-enabled world, every event must be collected with a specific objective in mind. Defining the purpose of each tracked event will cut down on data clutter and response time when events are detected.
In 2020, 33% of organizations listed network monitoring as their number one priority for network spending. 27% of organizations listed network monitoring infrastructure as their number two priority.
Source: EMA, 2020; n=350
33% of all IT organizations reported that end users detected and reported incidents before the network operations team was aware of them.
Source: EMA, 2020; n=350
64% of enterprises use 4-10 monitoring tools to troubleshoot their network.
Source: EMA, 2020; n=350
Define how event management informs other management practices.
Monitoring and event management can be used to establish and analyze your baseline. The more you know about your system baselines, the easier it will be to detect exceptions.
Events can inform needed changes to stay compliant or to resolve incidents and problems. However, this does not mean that changes can be implemented without proper authorization.
The best use case for event management is to detect and resolve incidents and problems before end users or IT are even aware.
Events sitting in isolation are useless if there isn’t an effective way to pass potential tickets off to incident management to mitigate and resolve.
Events can identify problems before they become incidents. However, you must establish proper data logging to inform problem prioritization and actioning.
1. Situate Event Management in Your Service Management Environment | 2. Define Your Monitoring Thresholds and Accompanying Actions | 3. Start Monitoring and Implement Event Management | |
Phase Steps |
1.1 Set Operational and Informational Goals 1.2 Scope Monitoring and States of Interest |
2.1 Define Conditions and Related CIs 2.2 Set Monitoring Thresholds and Alerts 2.3 Action Your Events |
3.1 Define Your Data Policy 3.2 Define Future State |
Event Cookbook Event Catalog |
|||
Phase Outcomes |
Monitoring and Event Management RACI Abbreviated BIA |
Event Workflow |
Event Management Roadmap |
The goals come from the pain points of other ITSM practices. Build handoffs to other service management practices to drive the proper action when an event is detected.
Trying to organize a catalog of events is difficult when working from the bottom up. Start with the business drivers of event management to keep the scope manageable.
Defining tracked events with their known conditions, root cause, and associated actions allows you to be proactive when events occur.
Start small if need be. It is better and easier to track a few items with proper actions than to try to analyze events as they occur.
Even in an AI-enabled world, every event must be collected with a specific objective in mind. Defining the purpose of each tracked event will cut down on data clutter and response time when events are detected.
Supplement the predictive value of a single event by aggregating it with other events.
Each step of this blueprint is accompanied by supporting deliverables to help you accomplish your goals:
Event Management Cookbook
Use the framework in the Event Management Cookbook to populate your event catalog with properly tracked and actioned events.
Event Management RACI
Define the roles and responsibilities needed in event management.
Event Management Workflow
Define the lifecycle and handoffs for event management.
Event Catalog
Consolidate and organize your tracked events.
Event Roadmap
Roadmap your initiatives for future improvement.
INDUSTRY - Research and Advisory
SOURCE - Anonymous Interview
One staff member’s workstation had been infected with a virus that was probing the network with a wide variety of usernames and passwords, trying to find an entry point. Along with the obvious security threat, there existed the more mundane concern that workers occasionally found themselves locked out of their machine and needed to contact the service desk to regain access.
The system administrator wrote a script that runs hourly to see if there is a problem with an individual’s workstation. The script records the computer's name, the user involved, the reason for the password lockout, and the number of bad login attempts. If the IT technician on duty notices a greater than normal volume of bad password attempts coming from a single account, they will reach out to the account holder and inquire about potential issues.
The IT department has proactively managed two distinct but related problems. First, it has prevented several instances of unplanned work by reaching out to users at risk of lockout before an incident report is received. Second, it has leveraged event management to probe for indicators of a security threat before there is a breach.
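The hourly check described in this case study could be sketched roughly as follows. This is a minimal illustration, not the administrator's actual script: the record fields, the `flag_accounts` helper, the sample data, and the baseline of five bad attempts per run are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class LockoutRecord:
    """One bad-login record, mirroring the fields the case study says are logged."""
    computer: str
    user: str
    reason: str
    bad_attempts: int


def flag_accounts(records, baseline=5):
    """Return accounts whose total bad attempts in this run exceed the baseline.

    `baseline` is a hypothetical "normal" volume; tune it to your environment.
    """
    totals = Counter()
    for record in records:
        totals[record.user] += record.bad_attempts
    return sorted(user for user, count in totals.items() if count > baseline)


# Hypothetical sample data for one hourly run.
hourly_records = [
    LockoutRecord("WS-014", "jdoe", "bad password", 2),
    LockoutRecord("WS-022", "asmith", "bad password", 9),
    LockoutRecord("WS-031", "asmith", "bad password", 4),
]

accounts_to_contact = flag_accounts(hourly_records)
print(accounts_to_contact)
```

Totalling attempts per account, rather than per workstation, matches the case study's focus on a single account generating an unusual volume of failures across the network.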
“Our team has already made this critical project a priority, and we have the time and capability, but some guidance along the way would be helpful.”
“Our team knows that we need to fix a process, but we need assistance to determine where to focus. Some check-ins along the way would help keep us on track.”
“We need to hit the ground running and get this project kicked off immediately. Our team has the ability to take this over once we get a framework and strategy in place.”
“Our team does not have the time or the knowledge to take this project on. We need assistance through the entirety of this project.”
Phase 1 | Phase 2 | Phase 3 |
---|
Call #1: Scope requirements, objectives, and your specific challenges. |
Call #2: Introduce the Cookbook and explore the business impact analysis. |
Call #4: Define operational conditions. |
Call #6: Define actions and related practices. |
Call #8: Identify and prioritize improvements. |
Call #3: Define system scope and related CIs/ dependencies. |
Call #5: Define thresholds and alerts. |
Call #7: Define data policy. |
A Guided Implementation (GI) is a series of calls with an Info-Tech analyst to help implement our best practices in your organization.
A typical GI is between 6 and 12 calls over the course of 4 to 6 months.
Contact your account representative for more information.
workshops@infotech.com 1-888-670-8889
Day 1 | Day 2 | Day 3 | Day 4 | Day 5 | |
---|---|---|---|---|---|
Situate Event Management in Your Service Management Environment | Define Your Event Management Scope | Define Thresholds and Actions | Start Monitoring and Implement Event Management | Next Steps and Wrap-Up (offsite) | |
Activities |
1.1 Introductions 1.2 Operational and Informational Goals and Challenges 1.3 Event Management Scope 1.4 Roles and Responsibilities |
2.1 Define Operational Conditions for Systems 2.2 Define Related CIs and Dependencies 2.3 Define Conditions for CIs 2.4 Perform Root-Cause Analysis for Complex Condition Relationships 2.5 Set Thresholds for CIs |
3.1 Set Thresholds to Monitor 3.2 Add Actions and Handoffs to Event Management |
4.1 Define Your Data Policy for Event Management 4.2 Identify Areas for Improvement and Future Steps 4.3 Summarize Workshop |
5.1 Complete In-Progress Deliverables From Previous Four Days 5.2 Set Up Review Time for Workshop Deliverables and to Discuss Next Steps |
Deliverables |
|
|
|
|
|
Phase 1 | Phase 2 | Phase 3 |
---|---|---|
1.1 Set Operational and Informational Goals |
2.1 Define Conditions and Related CIs |
3.1 Define Your Data Policy |
Engineer Your Event Management Process
1.1.1 List your goals and challenges
1.1.2 Build a RACI chart for event management
1.2.1 Set your scope using business impact
Infrastructure management team
IT managers
1.1.1 List your goals and challenges
1.1.2 Build a RACI chart for event management
Set the overall scope of event management by defining the governing goals. You will also define who is involved in event management as well as their responsibilities.
Infrastructure management team
IT managers
Define the goals and challenges of event management as well as their data proxies.
Have a RACI matrix to define roles and responsibilities in event management.
Event management needs to interact with the following service management practices:
Event management may log real-time data for operational goals and non-real-time data for informational goals.
Event Management |
||||
---|---|---|---|---|
Operational Goals (real-time) |
Informational Goals (non-real time) |
|||
Incident Response & Prevention |
Availability Scaling |
Availability Scaling |
Modeling and Testing |
Investigation/ Compliance |
Gather a diverse group of IT staff in a room with a whiteboard.
Have each participant write down their top five specific outcomes they want from improved event management.
Consolidate similar ideas.
Prioritize the goals.
Record these goals in your Event Management Cookbook.
Priority | Example Goals |
---|---|
1 | Reduce response time for incidents |
2 | Improve audit compliance |
3 | Improve risk analysis |
4 | Improve forecasting for resource acquisition |
5 | More accurate RCAs |
The infrastructure team is accountable for deciding which events to track, how to track, and how to action the events when detected.
The service desk may respond to events that are indicative of incidents. Setting a root cause for events allows for quicker troubleshooting, diagnosis, and resolution of the incident.
Problem and change management may be involved with certain event alerts as the resultant action could be to investigate the root cause of the alert (problem management) or build and approve a change to resolve the problem (change management).
Download the Event Management Cookbook
Event Management Task | IT Manager | SME | IT Infrastructure Manager | Service Desk | Configuration Manager | (Event Monitoring System) | Change Manager | Problem Manager |
Defining systems and configuration items to monitor | R | C | AR | R | ||||
Defining states of operation | R | C | AR | C | ||||
Defining event and event thresholds to monitor | R | C | AR | I | I | |||
Actioning event thresholds: Log | A | R | ||||||
Actioning event thresholds: Monitor | I | R | A | R | ||||
Actioning event thresholds: Submit incident/change/problem ticket | R | R | A | R | R | I | I | |
Close alert for resolved issues | AR | RC | RC |
1.2.1 Set your scope using business impact
Situate Event Management in Your Service Management Environment
Tracking too many events across too many tools could decrease your responsiveness to incidents. Start tracking only what is actionable to keep the signal-to-noise ratio of events as high as possible.
Source: Riverbed, 2016
Systems/Services/Applications | Tier | |
---|---|---|
1 | Core Infrastructure | Gold |
2 | Internet Access | Gold |
3 | Public-Facing Website | Gold |
4 | ERP | Silver |
… | ||
15 | PaperSave | Bronze |
It might be tempting to jump ahead and preselect important applications. However, even if an application is not on the top 10 list, it may have cross-dependencies that make it more valuable than originally thought.
For a more comprehensive BIA, see Create a Right-Sized Disaster Recovery Plan
2.1.1 Define performance conditions
2.1.2 Decompose services into related CIs
For each monitored system, define the conditions of interest and related CIs.
Business system owners
Infrastructure manager
IT managers
List of conditions of interest and related CIs for each monitored system.
2.2.1 Verify your CI conditions with a root-cause analysis
2.2.2 Set thresholds for your events
Set monitoring thresholds for each CI related to each condition of interest.
Business system managers
Infrastructure manager
IT managers
Service desk manager
List of events to track along with their root cause.
Separate the serious from trivial to keep the signal-to-noise ratio high.
You must set your own monitoring criteria based on operational needs. Events triggering an action should be reviewed via an assessment of the potential project and associated risks.
Examples:
Web server – pages served per minute
Network – Mbps
Storage – I/O reads/writes per second
Web Server – page load failures
Network – packets dropped
Storage – disk errors
Web Server – % load
Network – % utilization
Storage – % full
RCAs postulate why systems go down; use the RCA to inform yourself of the events leading up to the system going down.
Dependency | CIs | Tool | Metrics |
---|---|---|---|
ISP | WAN | SNMP Traps | Latency |
Telemetry | Packet Loss | ||
SNMP Polling | Jitter | ||
Network Performance | Web Server | Response Time | |
Connection Stage Errors | |||
Web Server | Web Page | DOM Load Time | |
Performance | |||
Page Load Time |
At the end of the day, most of us can only monitor what our systems let us. Some (like Exchange Servers) offer a crippling number of parameters to choose from. Others (like MPLS connections) are opaque black boxes giving up only the barest of information. The metrics you choose are largely governed by the art of the possible.
Exhaustive RCAs proved that 54% of issues were not caused by storage.
INDUSTRY - Enterprise IT
SOURCE - ESG, 2017
Despite a laser focus on building nothing but all-flash storage arrays, Nimble continued to field a dizzying number of support calls.
Variability and complexity across infrastructure, applications, and configurations – each customer install being ever so slightly different – meant that the problem of customer downtime seemed inescapable.
Nimble embedded thousands of sensors into its arrays, both at a hardware level and in the code. Thousands of sensors per array multiplied by 7,500 customers meant millions of data points per second.
This data was then analyzed against 12,000 anonymized app-data gap-related incidents.
Patterns began to emerge, ones that persisted across complex customer/array/configuration combinations.
These patterns were turned into signatures, then acted on.
54% of app-data gap related incidents were in fact related to non-storage factors! Sub-optimal configuration, bad practices, poor integration with other systems, and even VM or hosts were at the root cause of over half of reported incidents.
Establishing that your system is working fine is more than IT best practice – by quickly eliminating potential options, the right team can get working on the right system faster, thus restoring the service more quickly.
Event data determined to be of minimal predictive value is shunted aside.
De-duplication and combination of similar events to trigger a response based on the number or value of events, rather than for individual events.
Ignoring events that occur downstream of a known failed system. Relies on accurate models of system relationships.
Initiating the appropriate response. This could be simple logging, any of the exception event responses, an alert requiring human intervention, or a pre-programmed script.
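The four handling stages described above (filtering, aggregation, suppression of downstream events, and triggering a response) can be chained into a small pipeline. The sketch below is illustrative only: the `Event` fields, the `predictive` flag, the `upstream_of` dependency map, and the count threshold of three are assumptions for the example, not part of the Cookbook.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    ci: str            # configuration item that emitted the event
    kind: str          # e.g. "packet_loss"
    predictive: bool   # False when the event has minimal predictive value


def process(events, failed_cis, upstream_of, min_count=3):
    """Filter, aggregate, suppress, and trigger; returns the responses to initiate."""
    # 1. Filter: set aside events of minimal predictive value.
    kept = [e for e in events if e.predictive]
    # 2. Aggregate: de-duplicate similar events into a count per (CI, kind).
    counts = Counter((e.ci, e.kind) for e in kept)
    # 3. Suppress: ignore events from CIs directly downstream of a known failure.
    #    (Only the direct parent is checked here; a real model may walk the chain.)
    counts = {
        key: n for key, n in counts.items()
        if upstream_of.get(key[0]) not in failed_cis
    }
    # 4. Trigger: respond to the number of events, not to each individual event.
    return [
        (ci, kind, "alert" if n >= min_count else "log")
        for (ci, kind), n in sorted(counts.items())
    ]


# Hypothetical dependency model and event stream.
upstream_of = {"web-page": "web-server", "web-server": "wan"}
events = (
    [Event("wan", "packet_loss", True)] * 3
    + [Event("web-page", "load_failure", True)] * 4
    + [Event("wan", "heartbeat", False)]
    + [Event("web-server", "response_time", True)]
)
responses = process(events, failed_cis={"web-server"}, upstream_of=upstream_of)
print(responses)
```

As the text notes, suppression is only as good as the dependency model behind it: an inaccurate `upstream_of` map would hide events that deserve attention.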
If the event management team toggles the threshold for an alert too low (e.g. one is generated every time a CPU load reaches 60% capacity), they will generate too many false positives and create far too much work for themselves, generating alert fatigue. If they go the other direction and set their thresholds too high, there will be too many false negatives – problems will slip through and cause future disruptions.
Dependency | Metrics | Threshold |
---|---|---|
Network Performance | Latency | 150ms |
 | Packet Loss | 10% |
 | Jitter | >1ms |
Web Server Performance | Response Time | 750ms |
 | Connection Stage Errors | 2 |
Web Page Performance | DOM Load time | 1100ms |
 | Page Load time | 1200ms |
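A threshold table like the one above translates directly into a lookup that readings are compared against. The following is a minimal sketch: the dictionary key names and the `breaches` helper are hypothetical, and it assumes that a breach means a reading strictly greater than the listed value.

```python
# Thresholds taken from the example table above (units noted in the key names).
THRESHOLDS = {
    "latency_ms": 150,
    "packet_loss_pct": 10,
    "jitter_ms": 1,
    "response_time_ms": 750,
    "connection_stage_errors": 2,
    "dom_load_time_ms": 1100,
    "page_load_time_ms": 1200,
}


def breaches(readings, thresholds=THRESHOLDS):
    """Return the metrics whose reading exceeds its threshold.

    Assumes "exceeds" means strictly greater than. Metrics without a
    defined threshold are ignored rather than guessed at.
    """
    return {
        metric: value
        for metric, value in readings.items()
        if metric in thresholds and value > thresholds[metric]
    }


# Hypothetical readings; "cpu_pct" has no catalogued threshold, so it is skipped.
sample = {"latency_ms": 180, "packet_loss_pct": 2, "jitter_ms": 0.4, "cpu_pct": 95}
print(breaches(sample))
```

Ignoring uncatalogued metrics is deliberate: it keeps the check aligned with the principle that every tracked event has a defined purpose and threshold.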
2.3.1 Set actions for your thresholds
2.3.2 Build your event management workflow
With your list of tracked events from the previous step, build associated actions and define the handoff from event management to related practices.
Event management team
Infrastructure team
Change manager
Problem manager
Incident manager
Event management workflow
For informational alerts, log the event for future analysis.
For a warning or exception event or a set of events with a well-known root cause, you may have an automated resolution tied to detection.
For warnings and exceptions, human intervention may be needed. This could include manual monitoring or a handoff to incident, change, or problem management.
Outcome | Metrics | Threshold | Response (s) | |
---|---|---|---|---|
Network Performance | Latency | 150ms | Problem Management | Tag to Problem Ticket 1701 |
Web Page Performance | DOM Load time | 1100ms | Change Management |
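The three response types described in this step (log, automated resolution, and human intervention or handoff) can be expressed as a simple decision function alongside a catalog lookup. This is a sketch under stated assumptions: the severity labels, the `known_root_cause` flag, and both helper names are hypothetical, and the two catalog rows are copied from the example table above.

```python
from enum import Enum


class Response(Enum):
    LOG = "log for future analysis"
    AUTOMATE = "run automated resolution"
    HANDOFF = "hand off for human intervention"


def choose_response(severity, known_root_cause=False):
    """Pick a response type for an event.

    severity: "informational", "warning", or "exception".
    known_root_cause: True when a well-known cause has a scripted resolution.
    """
    if severity == "informational":
        return Response.LOG
    if severity in ("warning", "exception"):
        return Response.AUTOMATE if known_root_cause else Response.HANDOFF
    raise ValueError(f"unknown severity: {severity!r}")


# Handoff targets for catalogued events, keyed by (outcome, metric).
CATALOG = {
    ("Network Performance", "Latency"): "Problem Management",
    ("Web Page Performance", "DOM Load time"): "Change Management",
}


def handoff_target(outcome, metric):
    """Return the practice that receives the event, or None if uncatalogued."""
    return CATALOG.get((outcome, metric))


print(choose_response("warning"), handoff_target("Network Performance", "Latency"))
```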
Download the Event Management Catalog
Data Fields |
|
---|---|
Device |
Date/time |
Component |
Parameters in exception |
Type of failure |
Value |
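The data fields listed above map naturally onto a structured record. The sketch below shows one possible shape, assuming JSON lines as the log format; the field names, the sample values, and the choice of ISO 8601 timestamps are illustrative assumptions rather than a prescribed schema.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class EventRecord:
    """One logged event, with a field for each data field in the list above."""
    device: str
    timestamp: str               # date/time, stored as an ISO 8601 string
    component: str
    parameter_in_exception: str
    failure_type: str
    value: float


# Hypothetical example event.
record = EventRecord(
    device="web-01",
    timestamp=datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc).isoformat(),
    component="nginx",
    parameter_in_exception="response_time_ms",
    failure_type="threshold exceeded",
    value=912.0,
)

# Serialize with sorted keys so every log line has a consistent field order.
line = json.dumps(asdict(record), sort_keys=True)
print(line)
```

Recording the same fields for every event makes later aggregation and root-cause analysis considerably easier than parsing free-form messages.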
3.1.1 Define data policy needs
3.2.1 Build your roadmap
Business system owners
Infrastructure manager
IT managers
Activities
3.1.1 Define data policy needs
Your overall goals from Phase 1 will help define your data retention needs. Document these policy statements in a data policy.
CIO
Infrastructure manager
IT managers
Service desk manager
Outcomes of this step
Data retention policy statements for event management
Logs |
Metrics |
||
---|---|---|---|
A log is a complete record of events from a period:
|
Missing entries in logs can be just as telling as the values existing in other entries. | A metric is a numeric value that gives information about a system, generally over a time series. | Adjusting the time series allows different views of the data. |
Logs are generally internal constructs to a system:
|
Completeness and context make logs excellent for:
|
As a time series, metrics operate predictably and consistently regardless of system activity. |
This independence makes them ideal for:
|
Large amounts of log data can make it difficult to:
|
Context insensitivity means we can apply the same metric to dissimilar systems:
|
Source: SolarWinds
Security | Logs may contain sensitive information. Best practice is to ensure logs are secure at rest and in transit. Tailor your security protocol to your compliance regulations (PCI, etc.). |
---|---|
Architecture and Availability | When production infrastructure goes down, logging tends to go down as well. Holes in your data stream make it much more difficult to determine root causes of incidents. An independent secondary architecture helps solve problems when your primary is offline. At the very least, system agents should be able to buffer data until the pipeline is back online. |
Performance | Log data grows: organically with the rest of the enterprise and geometrically in the event of a major incident. Your infrastructure design needs to support peak loads to prevent it from being overwhelmed when you need it the most. |
Access Control | Events have value for multiple process owners in your enterprise. You need to enable access but also ensure data consistency as each group performs their own analysis on the data. |
Retention | Near-real time data is valuable operationally; historic data is valuable strategically. Find a balance between the two, keeping in mind your obligations under compliance frameworks (GDPR, etc.). |
Metrics/Log | Retention Period | Data Sensitivity | Data Rate |
---|---|---|---|
Latency | 150ms | No | |
Packet Loss | 10% | No | |
Jitter | >1ms | No | |
Response Time | 750ms | No | |
HAProxy Log | 7 days | Yes | 3GB/day |
DOM Load time | 1100ms | ||
Page Load time | 1200ms | ||
User Access | 3 years | Yes |
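Retention period and data rate together determine how much storage a policy statement commits you to. As a rough worked example using the HAProxy log row above (7 days at 3GB/day), the sketch below estimates retained volume. The `peak_factor` parameter is a hypothetical allowance for the growth during a major incident mentioned under Performance, not a figure from the source.

```python
def retained_volume_gb(rate_gb_per_day, retention_days, peak_factor=1.0):
    """Estimate storage needed to hold one log source for its retention period.

    peak_factor is an assumed multiplier for incident-driven growth (1.0 = none).
    """
    if rate_gb_per_day < 0 or retention_days < 0 or peak_factor < 1.0:
        raise ValueError("rate and retention must be >= 0, peak_factor >= 1.0")
    return rate_gb_per_day * retention_days * peak_factor


# HAProxy log from the example table: 3GB/day retained for 7 days.
steady_state_gb = retained_volume_gb(3, 7)
# The same source if volume doubled for the whole window (assumed factor).
peak_gb = retained_volume_gb(3, 7, peak_factor=2.0)
print(steady_state_gb, peak_gb)
```

Rows without a recorded data rate, such as User Access, cannot be sized this way until the rate is measured.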
3.2.1 Build your roadmap
Event management maturity is slowly built over time. Define your future actions in a roadmap to stay on track.
CIO
Infrastructure manager
IT managers
Event management roadmap and action items
Engineer your event management practice to be predictive. For example:
If the expected consequence is not observed there are three places to look:
While impractical to look at every action resulting from an alert, a regular review process will help improve your process. Effective alerts are crafted with specific and measurable outcomes.
False positives are worse than missed positives as they undermine confidence in the entire process from stakeholders and operators. If you need a starting point, action your false positives first.
Mind Your Event Management Errors
Source: IEEE Communications Magazine March 2012
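A regular review of alert outcomes can be summarized with a few counts. The sketch below assumes each reviewed case is recorded as a pair of booleans (whether an alert fired, and whether a real issue existed); that record format, the `review_summary` helper, and the sample data are hypothetical.

```python
def review_summary(outcomes):
    """Summarize reviewed cases.

    outcomes: iterable of (alert_fired, real_issue) boolean pairs.
    Returns false positive and false negative counts, plus precision
    (the share of fired alerts that were real), or None if none fired.
    """
    true_pos = false_pos = false_neg = 0
    for alert_fired, real_issue in outcomes:
        if alert_fired and real_issue:
            true_pos += 1
        elif alert_fired and not real_issue:
            false_pos += 1
        elif real_issue:
            false_neg += 1
    fired = true_pos + false_pos
    return {
        "false_positives": false_pos,
        "false_negatives": false_neg,
        "precision": true_pos / fired if fired else None,
    }


# Hypothetical review sample: five cases from one review period.
reviewed = [(True, True), (True, False), (True, False), (False, True), (False, False)]
summary = review_summary(reviewed)
print(summary)
```

Tracking these counts per alert over successive reviews shows which thresholds to adjust first, starting with the alerts that generate the most false positives.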
You now have several core systems, their CIs, conditions, and their related events listed in the Event Catalog. Keep the Catalog as your single reference point to help manage your tracked events across multiple tools.
The Event Management Cookbook is designed to be used over and over. Keep your tracked events standard by running through the steps in the Cookbook.
An additional step you could take is to pull the Cookbook out for event tracking for each new system added to your IT environment. Adding events in the Catalog during application onboarding is a good way to manage and measure configuration.
Use the framework in the Event Management Cookbook to populate your event catalog with properly tracked and actioned events.
Add the following in-scope goals for future improvement. Include owner, timeline, progress, and priority.
You now have a structured event management process and the start of a properly tracked and actioned event catalog. This will help you detect issues before they become incidents, identify changes needed in the IT environment, and catch problems before they spread.
Continue to use the Event Management Cookbook to add new monitored events to your Event Catalog. This ensures future events will be held to the same or better standard, which allows you to avoid drowning in too much data.
Lastly, stay on track and continually mature your event management practice using your Event Management Roadmap.
If you would like additional support, have our analysts guide you through other phases as part of an Info-Tech Workshop.
To accelerate this project, engage your IT team in an Info-Tech workshop with an Info-Tech analyst team.
Info-Tech analysts will join you and your team at your location or welcome you to Info-Tech’s historic Toronto office to participate in an innovative onsite workshop.
Contact your account representative for more information.
workshops@infotech.com 1-888-670-8889
The following are sample activities that will be conducted by Info-Tech analysts with your team:
Define and document the roles and responsibilities in event management.
Define and prioritize in-scope systems and services for event management.
Improve customer service by driving consistency in your support approach and meeting SLAs.
Don’t let persistent problems govern your department
Build a service configuration management practice around the IT services that are most important to the organization.
DeMattia, Adam. “Assessing the Financial Impact of HPE InfoSight Predictive Analytics.” ESG, Softchoice, Sept. 2017. Web.
Hale, Brad. “Estimating Log Generation for Security Information Event and Log Management.” SolarWinds, n.d. Web.
Ho, Cheng-Yuan, et al. “Statistical Analysis of False Positives and False Negatives from Real Traffic with Intrusion Detection/Prevention Systems.” IEEE Communications Magazine, vol. 50, no. 3, 2012, pp. 146-154.
ITIL Foundation: ITIL 4 Edition. The Stationery Office, 2019.
McGillicuddy, Shamus. “EMA: Network Management Megatrends 2016.” Riverbed, April 2016. Web.
McGillicuddy, Shamus. “Network Management Megatrends 2020.” Enterprise Management Associates, APCON, 2020. Web.
Rivas, Genesis. “Event Management: Everything You Need to Know about This ITIL Process.” GB Advisors, 22 Feb. 2021. Web.
“Service Operations Processes.” ITIL Version 3 Chapters, 21 May 2010. Web.