Event handling is monitoring for and detecting events (changes of state that are significant for stakeholders, services, or service management system components, including informational events, warnings, and exceptions) and raising alerts (notifications), reports, and escalations. Informational and warning events may trigger automated or manual control actions; exceptions trigger incident handling.
Event handling in OSM is not just about capturing infrastructure or application events, but about capturing and notifying about all “things worth managing”. For example, stakeholders are “things worth managing”, and users are one type of stakeholder, so a sudden drop in users’ overall satisfaction with a service is in scope for event handling in OSM.
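The routing described above (every event raises an alert; informational and warning events may lead to a control action; exceptions go to incident handling) can be sketched in a few lines. This is only an illustration under stated assumptions: the names EventType, Event and handle_event, and the response strings, are hypothetical and not part of OSM or any tool.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class EventType(Enum):
    INFORMATIONAL = "informational"
    WARNING = "warning"
    EXCEPTION = "exception"


@dataclass
class Event:
    source: str            # the "thing worth managing" that changed state
    event_type: EventType
    message: str


def handle_event(event: Event) -> List[str]:
    """Return the responses one event triggers (hypothetical labels)."""
    # Every detected event raises an alert (notification).
    responses = [f"alert [{event.event_type.value}] {event.source}: {event.message}"]
    if event.event_type is EventType.EXCEPTION:
        # Exceptions always hand over to incident handling.
        responses.append("trigger incident handling")
    else:
        # Informational and warning events only *possibly* need a control action.
        responses.append("consider automated or manual control action")
    return responses


if __name__ == "__main__":
    for e in (
        Event("nightly batch", EventType.INFORMATIONAL, "cron job just ended"),
        Event("users", EventType.WARNING, "overall satisfaction dropped suddenly"),
        Event("order entry system", EventType.EXCEPTION, "server down"),
    ):
        print(handle_event(e))
```

Note that the second event comes from a stakeholder rather than from infrastructure; the same routing applies to both.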
Performance is effective when…
We have the right set of event detection and alert notification mechanisms set up, so that we can detect all changes of state in stakeholders, services, and the SMS that are significant and may require a control response. These range from normal situations, e.g., the informational “cron job just ended”, through warnings, e.g., “you set a threshold of 15% utilization on this storage array and told me to tell you when it happened; I am telling you”, to exception / abnormal situations, e.g., “server down”. We have these set up to monitor all qualities of all services, including financials, service levels, continuity, availability, throughput, configuration, security, and compliance, as well as the components of services, including environment (e.g., temperature, fire, water detection), hardware, system software, networks, applications, and DBM.
The general trend is to take point solutions with an API and a command line interface and string them together into a toolchain, rather than seeking monolithic solutions. Part of the interesting shift in recent years has been that tools built on the assumption of a physical on-prem environment haven’t necessarily made it over to the cloud, because their core model is so different, while cloud solutions are being produced more through iteration (leading initially to smaller, more point-solution offerings); this seems to have led to the trend.
Why do it?
Event management is a thing worth managing because you want to know when a thing worth managing changes state in a material way, so that you can take action where appropriate. This applies whether the event generates an informational alert, e.g., “cron job finished”; a warning based on a threshold you set, e.g., “circuit utilization is at 26%” or “user sat scores have dropped below 85% rating us a ‘5’ for overall satisfaction”; or an exception, e.g., “server down” or “service down”.
How to do it
Make a list of all the specific things worth managing, under the categories stakeholders (customers, users, provider, suppliers), services, and the SMS.
For each, identify the 3 types of events you want to be informed of, with alerts: informational, warning, exception. Rate the events by priority.
Put in place mechanisms that trap these events and generate alerts, automating the mechanisms in priority order.
Monitor and control events. Establish a rhythmic pattern of reporting and review and action.
Practice continuous improvement, based on what your telemetry / instrumentation is telling you.
Aim to establish monitoring and control in priority order, with a system you can manage: not all possible monitors and controls, but the most important ones right now, just as you would do for your health.
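The steps above can be sketched as a small event catalog: each entry names a thing worth managing, the type of event, and a priority. The sketch then reports which of the three event types are still undefined for each thing, and picks the top entries to automate first. Everything here is an assumption for illustration: the class and function names, the example things, and the priority numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List

EVENT_TYPES = frozenset({"informational", "warning", "exception"})


@dataclass(frozen=True)
class EventDefinition:
    thing: str        # the thing worth managing
    category: str     # e.g. "stakeholder", "service", "sms"
    event_type: str   # one of EVENT_TYPES
    description: str
    priority: int     # 1 = most important (hypothetical scale)


# Example entries reuse the alerts mentioned in the text; priorities are made up.
CATALOG = [
    EventDefinition("nightly batch", "service", "informational", "cron job finished", 4),
    EventDefinition("WAN circuit", "service", "warning", "circuit utilization is at 26%", 3),
    EventDefinition("users", "stakeholder", "warning", "satisfaction scores dropped below target", 2),
    EventDefinition("order entry server", "service", "exception", "server down", 1),
]


def missing_event_types(catalog: List[EventDefinition]) -> Dict[str, FrozenSet[str]]:
    """For each thing, list the event types that have no definition yet."""
    seen: Dict[str, set] = {}
    for d in catalog:
        seen.setdefault(d.thing, set()).add(d.event_type)
    return {t: EVENT_TYPES - types for t, types in seen.items() if EVENT_TYPES - types}


def automation_backlog(catalog: List[EventDefinition], capacity: int) -> List[EventDefinition]:
    """Pick the most important definitions to automate first, up to capacity."""
    return sorted(catalog, key=lambda d: d.priority)[:capacity]


if __name__ == "__main__":
    for d in automation_backlog(CATALOG, capacity=2):
        print(d.priority, d.thing, d.description)
    print(missing_event_types(CATALOG))
```

The capacity argument reflects the advice to build a system you can manage: automate the most important monitors first and revisit the rest during review.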
IRL (Real world example)
In real life, you might be concerned with maintaining a state of health for a process, like change handling; an IT-led service, like your order entry system; a human-led service, like moves, adds and changes; a function, like your team; tooling, like your ticketing system; individual moments of truth, like your post mortems; or your employee, customer, user and supplier satisfaction.
For example, Walt owns the health of the chat box system that appears on the elearning website for sales inquiries, support and so on. For event handling, he is most concerned when users have technical difficulties with the site. His priorities are to know of and fix any issues quickly: first issues that affect all users, then issues that affect individual courses, and lastly issues that affect individual users. So Walt builds that prioritization scheme into the chat ticketing system. Next, Walt wants to put some thresholds in place to make sure certain things stay under control. For example, the elearning host charges additional fees for each active user over 500, so he sets up a notification for when 450 active users is reached, so that he can go in, find users who haven’t logged in for over a year, and mark them inactive to free up headroom. Lastly, Walt wants to be informed about key statistics like concurrent users, usage by hour, by day and by region, most popular topics, quiz pass rates and so on, so he stands up these analytics in the LMS.
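Walt's rules are simple enough to sketch in code. The figures 500 and 450 and the one-year idle period come from the example; the function names, the scope labels, and the data shapes are hypothetical, and a real LMS or ticketing system would expose this differently.

```python
from datetime import date, timedelta
from typing import Dict, List, Optional, Tuple

FEE_FREE_ACTIVE_USERS = 500   # the host charges extra per active user above this
WARNING_THRESHOLD = 450       # Walt's early-warning level

# Ticket scopes in Walt's priority order (labels are assumptions).
SCOPE_PRIORITY = {"all_users": 1, "course": 2, "user": 3}


def active_user_alert(active_users: int) -> Optional[str]:
    """Return a warning message once the threshold is reached, else None."""
    if active_users >= WARNING_THRESHOLD:
        return (f"warning: {active_users} active users "
                f"(extra fees apply above {FEE_FREE_ACTIVE_USERS})")
    return None


def stale_users(last_logins: Dict[str, date], today: date,
                max_idle_days: int = 365) -> List[str]:
    """List users whose last login is more than max_idle_days ago."""
    cutoff = today - timedelta(days=max_idle_days)
    return sorted(user for user, last in last_logins.items() if last < cutoff)


def triage(tickets: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Order (scope, summary) tickets: all users, then courses, then individuals."""
    return sorted(tickets, key=lambda t: SCOPE_PRIORITY[t[0]])


if __name__ == "__main__":
    print(active_user_alert(452))
    logins = {"ana": date(2022, 1, 15), "ben": date(2024, 5, 20)}
    print(stale_users(logins, today=date(2024, 6, 1)))
    print(triage([("user", "cannot reset password"),
                  ("all_users", "site login fails"),
                  ("course", "quiz will not load")]))
```

Setting the warning at 450 rather than 500 gives Walt time to mark stale accounts inactive before any fee is incurred.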
Where to go
To learn more about this thing worth managing
General Information
Event Management
Generic Job Aids for this Practice
The MOF Service Monitoring & Control Function document within the MOF 4.0 core content download is a good job aid for this.
Vendor, Tool and Platform-specific Guidance & Job Aids for this Practice
Spiceworks has a free network monitoring tool.