How to set up alarms and notifications, and how to handle noisy alarms

July 31, 2018 Kristopher Heinrich

The complexity of enterprise applications has been increasing steadily due to the adoption of microservice-oriented architectures and cloud technologies. DevOps teams supporting these infrastructures require a smarter way to manage this growing complexity. NMLStream’s Quadrant uses backward chaining technology to reduce complexity through rapid localization and root cause analysis of problems across all monitored metrics. However, knowing what constitutes a problem remains crucial to understanding the health of a complex distributed system. Quadrant allows DevOps engineers to monitor resource health in much the same way as many of the familiar APM providers. Users can define critical conditions and receive notifications when a metric or service degrades in performance, allowing quick response and remediation before a problem worsens.

To monitor specific system performance characteristics, users define Health Rules for aspects of their system and associate alerting policies that determine how the system responds when a rule is violated. Setting up a Health Rule involves three steps: defining a threshold condition, describing the enforcement details for the rule, and configuring notification alerts for when the rule is violated.

First, a user defines a threshold value which is used to classify a metric's health. Thresholds may be defined as either static conditions (for example, greater than 300 ms) or dynamic conditions (for example, above 90% of the range). Dynamic thresholds may be calculated either as a percentage of a metric's historic range of values, or as a number of standard deviations above or below the baseline value for a particular time of day. Quadrant also shows users how much time a given condition would have spent in alarm in the recent past, so users can adjust a threshold to the desired alarm sensitivity.
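
The threshold types described above can be illustrated with a minimal Python sketch. This is not Quadrant's implementation or API; the function names (`range_threshold`, `stddev_threshold`, `alarm_fraction`) and the sample latencies are hypothetical, chosen only to show how a dynamic threshold and the "time in alarm" feedback could be computed from historical samples.

```python
from statistics import mean, pstdev

# Hypothetical helpers for illustration only; not part of Quadrant.

def range_threshold(history, fraction):
    """Dynamic threshold at a given fraction of the historic range."""
    lowest, highest = min(history), max(history)
    return lowest + fraction * (highest - lowest)

def stddev_threshold(baseline_samples, k):
    """Dynamic threshold k standard deviations above the baseline mean.

    Pass a negative k for a threshold below the baseline.
    """
    return mean(baseline_samples) + k * pstdev(baseline_samples)

def alarm_fraction(history, threshold):
    """Fraction of historical samples that would have been in alarm."""
    return sum(1 for value in history if value > threshold) / len(history)

# Made-up response times in milliseconds.
latencies_ms = [100, 200, 300, 400, 500]

static_limit = 300                                   # static: greater than 300 ms
dynamic_limit = range_threshold(latencies_ms, 0.9)   # dynamic: above 90% of the range

print(alarm_fraction(latencies_ms, static_limit))
print(alarm_fraction(latencies_ms, dynamic_limit))
```

Comparing the two printed fractions shows the trade-off the feedback is meant to expose: a lower threshold spends more of the recent past in alarm, while a higher one is quieter but reacts later.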

Next, a user specifies the enforcement details for a Health Rule. These include the schedule on which the rule is active (the hours of the day and the days of the week) and the services to which the rule applies.
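
As a rough illustration of enforcement details, the sketch below models a schedule plus a set of affected services. The `Enforcement` class, its field names, and the service names are assumptions made for this example and do not reflect how Quadrant stores or evaluates rules.

```python
from dataclasses import dataclass
from datetime import datetime, time

@dataclass(frozen=True)
class Enforcement:
    """Hypothetical enforcement details for a Health Rule."""
    days: frozenset       # active weekdays, Monday=0 .. Sunday=6
    start: time           # start of the active window (inclusive)
    end: time             # end of the active window (exclusive)
    services: frozenset   # names of the services the rule affects

    def applies(self, service, when):
        """Return True if the rule is enforced for this service at this time."""
        return (
            service in self.services
            and when.weekday() in self.days
            and self.start <= when.time() < self.end
        )

# Example: enforce on weekdays, 08:00 to 18:00, for two made-up services.
business_hours = Enforcement(
    days=frozenset(range(5)),
    start=time(8, 0),
    end=time(18, 0),
    services=frozenset({"checkout", "search"}),
)

print(business_hours.applies("checkout", datetime(2018, 7, 31, 10, 0)))  # a Tuesday morning
print(business_hours.applies("checkout", datetime(2018, 8, 4, 10, 0)))   # a Saturday morning
```

This sketch assumes the active window does not cross midnight; an overnight schedule would need an extra case.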

Finally, a user configures notification settings to manage how, when, and to whom alerts are sent when an alarm condition is met. Users may choose which lifecycle events trigger a notification, as well as the conditions under which notifications are suppressed. Suppression is useful for avoiding "alarm storms", in which DevOps engineers are quickly overwhelmed by a large volume of alarms in a short time. Users may also elect to be notified before a condition is met, allowing early detection before a critical value is actually reached. Notifications can be targeted at the teams responsible for a particular domain or service, ensuring the right people respond when an incident occurs. In addition to email notifications, Quadrant integrates with Slack and PagerDuty, so a company can send notifications through its preferred channel.
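
One simple way to picture notification suppression is a per-rule cooldown combined with a filter on lifecycle events. The sketch below is an assumption about how such a policy might look, not Quadrant's actual suppression logic; the `NotificationPolicy` class and the event names (`opened`, `updated`, `resolved`) are hypothetical.

```python
class NotificationPolicy:
    """Hypothetical policy: notify only on chosen events, with a cooldown per rule."""

    def __init__(self, events, cooldown_seconds):
        self.events = set(events)          # lifecycle events that trigger a notification
        self.cooldown = cooldown_seconds   # minimum gap between notifications for one rule
        self._last_sent = {}               # rule name -> timestamp of the last notification

    def should_notify(self, rule, event, now):
        """Decide whether to send a notification for this rule and event at time `now`."""
        if event not in self.events:
            return False                   # the user did not subscribe to this event
        last = self._last_sent.get(rule)
        if last is not None and now - last < self.cooldown:
            return False                   # suppressed: still inside the cooldown window
        self._last_sent[rule] = now
        return True

# Example: notify on open and resolve, at most once every 5 minutes per rule.
policy = NotificationPolicy(events={"opened", "resolved"}, cooldown_seconds=300)

print(policy.should_notify("latency", "opened", now=0))      # first alarm for this rule
print(policy.should_notify("latency", "opened", now=60))     # repeat inside the cooldown
print(policy.should_notify("latency", "updated", now=1000))  # event not subscribed
print(policy.should_notify("latency", "resolved", now=400))  # cooldown has elapsed
```

A cooldown is only one possible suppression condition; grouping related alarms or limiting the total number of notifications per time window are other common choices.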

When a rule is violated, a summary of the incident, with information about the violating metric and affected service, is sent to the notified parties, along with a description of the most likely causal path, based on NMLStream's backward chaining technology. Engineers responding to an incident can then quickly investigate the problem, understand the root cause, and remediate the issue with little to no downtime.
