Event Definitions

Best practices for creating event definitions in SPM.

Overview

FireScope offers considerable flexibility in how events can be defined. This range of capabilities can be daunting for new users, so this guide covers the key considerations for creating your own Event Definitions (EDs).

Whenever possible, create EDs at the Blueprint level so that common event logic applies to all monitored assets of a similar type. This also simplifies future updates, because adjusting the ED at the Blueprint level changes every Configuration Item (CI) that uses the Blueprint.

Eliminating Flapping and False Positives

Flapping describes scenarios where attribute values change sharply over a short period of time, as CPU utilization often does. There are many innocuous situations where CPU and memory utilization show temporary spikes, such as when applications are started or services are restarted, and these do not warrant a notification.

When determining the best function to use when creating an ED, consider how frequently the related attribute data changes and what constitutes a realistic scenario that you want to identify. For frequently changing attributes, try to use functions that evaluate multiple data samples to eliminate temporary peaks and valleys.

CPU, Memory and Network Performance Events

These three types of attributes can be highly volatile, so we recommend using the ‘Average of all values in the last {t} seconds or samples’ function. Select Samples rather than Seconds for this ED to avoid having to adjust the ED whenever attribute collection intervals are updated, or when using flexible data collection schedules. Averaging these attributes over 3 or more samples gives a much better chance of identifying persistently high load while avoiding flapping.
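
The effect of averaging over samples can be sketched outside the product. The following Python is only an illustration of the idea, not FireScope's implementation; the function names and the 90% threshold are assumptions chosen for the example.

```python
# Illustrative sketch only: the names and the threshold are hypothetical,
# not part of FireScope.

def average_of_last_samples(samples, n=3):
    """Return the mean of the most recent n samples, or None if fewer exist."""
    recent = samples[-n:]
    if len(recent) < n:
        return None
    return sum(recent) / n


def is_sustained_high(samples, threshold=90.0, n=3):
    """True only when the average of the last n samples exceeds the threshold."""
    avg = average_of_last_samples(samples, n)
    return avg is not None and avg > threshold


# A single spike to 98% is diluted by the two normal samples around it.
spike = [35.0, 40.0, 98.0]
# Three consecutive high samples indicate persistent load.
sustained = [92.0, 95.0, 97.0]

print(is_sustained_high(spike))
print(is_sustained_high(sustained))
```

Because the window is counted in samples, the same logic holds whether the attribute is collected every 30 seconds or every 5 minutes.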

For even better event analysis, create a Percentile attribute for the core attribute (95th for peaks such as high CPU utilization, 5th for lows such as free memory), and use the complex ED builder to compare the average of the core attribute to the Percentile. In effect, the Percentile derives a baseline for normal recent activity, and the ED looks for sustained, abnormal activity that merits investigation.
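
As a rough illustration of the percentile comparison, the sketch below derives a 95th-percentile baseline from recent history and compares a short-window average against it. It uses the nearest-rank percentile method, which is an assumption made for the example; how FireScope computes its Percentile attribute is not stated here. All names and sample data are hypothetical.

```python
# Illustrative sketch only: nearest-rank percentile is an assumed method,
# and every name and value here is hypothetical.

def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    # Integer ceiling of pct * len / 100, avoiding float rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def exceeds_baseline(history, recent, pct=95):
    """True when the average of the recent samples is above the percentile baseline."""
    baseline = percentile_nearest_rank(history, pct)
    recent_avg = sum(recent) / len(recent)
    return recent_avg > baseline


# Twenty recent CPU samples, including one isolated spike to 85%.
history = [30, 32, 35, 31, 33, 36, 34, 38, 40, 37,
           35, 33, 39, 41, 36, 34, 32, 85, 38, 37]

print(percentile_nearest_rank(history, 95))      # baseline for "normal" peaks
print(exceeds_baseline(history, [36, 38, 40]))   # normal recent activity
print(exceeds_baseline(history, [88, 91, 90]))   # sustained abnormal load
```

For lows such as free memory, the same idea applies with the 5th percentile and the comparison reversed, so the check looks for an average below the baseline.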

On / Off Events

A few attributes, such as Network TCP Check, Ping, Filesystem File Exists, Network DNS Availability, Network TCP Port Check, Network TCP Service Check and others, return 0 for Failed or 1 for Pass. However, network latency or other factors may cause an occasional fail response that does not indicate an issue. As with CPU and memory, we want to evaluate multiple samples to eliminate flapping, but an average of values isn’t the best approach. In these cases, use “evaluate the number of times a desired value was returned”, with 0 as the target value, T=5 and N > 2. This looks for scenarios where 3 or more of the last 5 data samples failed.
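
The counting logic can be sketched as follows. This is a minimal illustration of "N or more failures in the last T samples", not the product's own code; the function names are hypothetical, and the defaults mirror the values suggested above (target 0, T=5, N > 2).

```python
# Illustrative sketch only: function names are hypothetical.

def count_matching(samples, target, t=5):
    """Count how many of the last t samples equal the target value."""
    return sum(1 for value in samples[-t:] if value == target)


def is_failing(samples, target=0, t=5, n=2):
    """True when more than n of the last t samples equal the target value."""
    if len(samples) < t:
        return False  # not enough data to evaluate the full window
    return count_matching(samples, target, t) > n


print(is_failing([1, 1, 0, 1, 1]))  # one isolated failure
print(is_failing([1, 0, 1, 0, 1]))  # two failures, still below the bar
print(is_failing([0, 1, 0, 0, 1]))  # three failures out of five
```

Only the last case crosses the N > 2 bar, so an occasional dropped ping or slow DNS response does not raise an event.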

The Current Value Function is Not Your Friend

While attractive, the Current Value evaluation function is highly susceptible to flapping, so we strongly recommend avoiding it unless the attribute being evaluated should not fluctuate at all in normal operations. Windows Service State using the FireScope Agent and Last Failed Step in User Experience Checks (in which 0 indicates no errors) are among the few scenarios where any value other than 0 represents a realistic operational issue.
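
To make the difference concrete, the sketch below counts how often an event would change state on a noisy CPU series when evaluated on the current value versus a 3-sample average. The data, the 90% threshold, and all function names are assumptions made for illustration.

```python
# Illustrative sketch only: data, threshold, and names are hypothetical.

def count_state_changes(states):
    """Count transitions between consecutive event states."""
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)


def current_value_states(samples, threshold):
    """Event state per sample when only the latest value is evaluated."""
    return [value > threshold for value in samples]


def rolling_average_states(samples, threshold, n=3):
    """Event state per sample when the average of the last n samples is evaluated."""
    states = []
    for i in range(n - 1, len(samples)):
        window = samples[i - n + 1 : i + 1]
        states.append(sum(window) / n > threshold)
    return states


# CPU utilization that spikes on alternate samples but is never persistently high.
cpu = [40, 95, 42, 96, 41, 94, 43, 45]

print(count_state_changes(current_value_states(cpu, 90)))    # many flaps
print(count_state_changes(rolling_average_states(cpu, 90)))  # no flaps
```

On this series the current-value evaluation toggles on every spike, while the averaged evaluation never triggers.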