Technology

Learn how NMLStream uses AI to anticipate problems
and guide DevOps to solutions.

Uptime and system performance are critical to a successful digital business. Managing cloud- and microservices-based systems is a continuous challenge in an era of growing complexity and rapid change.


"AI on duty" at Scale

NMLStream’s AI solution, Quadrant, anticipates problems and guides users to solutions even in large-scale distributed systems. Consequently, Quadrant’s architecture is designed end-to-end to handle this scale: everything from data ingestion and compute to user interaction supports responsive AI interaction for even the most demanding business-critical enterprise applications.


Efficient Data Management

Quadrant’s AI operates on application telemetry. Data can come, for example, from a time-series database such as InfluxDB or from an APM solution such as Microsoft’s Application Insights. Quadrant leverages the scalability of the underlying data platform to preprocess data. Data storage and routine transformations are already handled effectively by these monitoring solutions, so Quadrant does not duplicate them; it only needs to reference and query data from these sources.
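As a rough sketch of this reference-and-query pattern, the snippet below keeps only a handle to the backing store and builds on-demand window queries. The class, bucket, and measurement names are hypothetical; the query string follows InfluxDB's Flux syntax as one example of a backing platform.

```python
class TelemetrySource:
    """Hypothetical thin wrapper: Quadrant keeps only a reference to
    the backing store and queries windows of data on demand instead
    of copying telemetry into its own storage."""

    def __init__(self, bucket):
        self.bucket = bucket  # bucket name in the backing platform

    def window_query(self, measurement, minutes):
        # Build a Flux-style range query; storage and routine
        # transformations stay in the underlying data platform.
        return (
            f'from(bucket: "{self.bucket}") '
            f'|> range(start: -{minutes}m) '
            f'|> filter(fn: (r) => r._measurement == "{measurement}")'
        )

source = TelemetrySource("app-telemetry")
q = source.window_query("latency_ms", 30)
print(q)
```

The point of the design is that the heavy lifting (retention, downsampling, indexing) stays where it is already solved; the AI layer just addresses data by reference.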

Scalable Algorithms

Quadrant’s AI algorithms are inherently scalable. Many modern AI techniques, such as deep learning, are compute- and memory-intensive and require GPUs for real-time performance; scaling them often requires highly skilled AI and data experts. NMLStream’s proprietary AI algorithms fold in application topology information to dramatically reduce the total compute requirement: Quadrant operates on subsets of related data without any loss in overall performance, and these subsets can be further distributed to reach arbitrary scale. Quadrant’s compute layer runs with a standard scale-out architecture on commodity hardware.
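NMLStream's algorithms are proprietary, so purely as a generic illustration of the partition-by-topology idea: services can be grouped into connected components of the dependency graph, and each component becomes an independent unit of work that can be placed on a separate commodity node. All names below are hypothetical.

```python
from collections import defaultdict

def topology_partitions(edges):
    """Group services into connected components of the dependency
    graph; each component can be analyzed on a separate worker."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    seen, parts = set(), []
    for node in graph:
        if node in seen:
            continue
        # Depth-first walk to collect one connected component.
        stack, comp = [node], set()
        while stack:
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(graph[n] - comp)
        seen |= comp
        parts.append(sorted(comp))
    return parts

# Two independent service clusters -> two independent work units.
edges = [("web", "api"), ("api", "db"), ("etl", "warehouse")]
print(topology_partitions(edges))  # [['api', 'db', 'web'], ['etl', 'warehouse']]
```

Because unrelated services never influence each other's analysis, each partition can be scheduled wherever capacity is available, which is what makes a standard scale-out deployment possible.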

Responsive User Interaction

Quadrant supports real-time user interaction and integrations with NMLStream’s AI. Quadrant exposes a web application with bi-directional communication; using this mechanism, it updates users’ screens immediately when an alarm occurs or even when the AI’s interpretation of system behavior changes. Seeing system changes as they occur means that users never miss important interactions between system components. Quadrant also supports standard notification mechanisms, such as email, PagerDuty, and Slack, as well as a comprehensive REST API.
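Quadrant's actual push channel lives inside its web application; as a generic illustration of push-on-change (rather than client polling), here is a minimal asyncio fan-out hub with hypothetical names. A real deployment would sit behind WebSockets or a similar bi-directional transport.

```python
import asyncio

class AlarmBroadcaster:
    """Minimal fan-out hub: each connected screen gets its own queue,
    and an alarm (or a change in the AI's interpretation of system
    behavior) is pushed to every subscriber immediately."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self):
        q = asyncio.Queue()
        self.subscribers.append(q)
        return q

    async def publish(self, event):
        # Unbounded queues, so put() never blocks the publisher.
        for q in self.subscribers:
            await q.put(event)

async def main():
    hub = AlarmBroadcaster()
    screen = hub.subscribe()
    await hub.publish({"type": "alarm", "kpi": "latency_p99"})
    return await screen.get()

event = asyncio.run(main())
print(event)
```

The same fan-out point is a natural place to hang email, PagerDuty, or Slack notifiers as additional subscribers.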

Proven in Production

Quadrant’s scale has been demonstrated at the AdTech company Beeswax, where it is successfully deployed in the operational workflow that manages Beeswax’s large-scale AdTech application. Beeswax runs over 135 services across 4 different data centers, handles over 2M requests per hour originating from 50 different ad exchanges around the world, and generates data for over 9,000 metrics every minute.

Hardware Requirements

Quadrant’s hardware requirements scale primarily with the number of services and the AI retention horizon. A machine with 4 CPU cores and 8 GB of memory can comfortably handle the computational requirements for 75 services. The AI retention horizon represents how much historical analysis NMLStream’s AI remembers; storing a 30-day retention horizon for 75 services requires less than 10 GB. In other words, you don’t need a large distributed system just to have Quadrant’s AI analyze a large distributed application.
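A back-of-envelope check of the stated footprint: 10 GB for 75 services over 30 days works out to roughly 4.5 MB per service per day. The extrapolation below simply scales that rate linearly to a hypothetical larger deployment; actual usage will vary with metric cardinality.

```python
# Stated figures: a 30-day retention horizon for 75 services < 10 GB.
services = 75
retention_days = 30
budget_gb = 10

per_service_day_mb = budget_gb * 1024 / (services * retention_days)
print(round(per_service_day_mb, 2))  # ~4.55 MB per service per day

# Linear extrapolation to a hypothetical 300-service deployment (GB):
big_gb = 300 * per_service_day_mb * retention_days / 1024
print(round(big_gb, 1))  # ~40.0 GB
```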

System Architecture


Performance of NMLStream’s Backward Chaining Algorithm

Quadrant ingests streaming metrics from system infrastructure and uses patented technology called backward chaining to build relationships between key performance indicators (KPIs) and potential causal metrics. Over a large dataset of alarms, Quadrant identified causal metrics for over 92% of the alarms, and the accuracy of the algorithm improves as alarm duration increases.

Abstract:

In this document, we study the performance of NMLStream’s backward chaining algorithm. Using a dataset of over 400 alarms, we find that backward chaining can identify causal metrics in over 92% of alarms.

Details:

NMLStream’s AI solution, called Quadrant, allows DevOps to quickly diagnose problems and take remediation action. Quadrant ingests streaming metrics from system infrastructure and uses patented technology called backward chaining to build relationships between key performance indicators (KPIs) and potential causal metrics. When a KPI triggers an alarm, the backward chaining algorithm identifies the offending service and metric over the entire topology of services and metrics. This allows the DevOps user responsible for uptime to quickly identify the cause and take action.
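Backward chaining is patented NMLStream technology whose internals are not described here. Purely to make the general shape concrete, the sketch below walks the dependency graph upstream from the alarming service and keeps metrics whose behavior tracks the KPI over the alarm window, using plain correlation as a stand-in relevance score. Every name and the scoring choice are assumptions, not NMLStream's method.

```python
def correlation(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if flat)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def backward_chain(kpi_series, alarm_service, deps, metrics, threshold=0.8):
    """From an alarming KPI, walk the dependency graph upstream and
    keep the metrics whose behavior tracks the KPI over the window."""
    causes, seen = [], set()
    frontier = [alarm_service]
    while frontier:
        svc = frontier.pop()
        if svc in seen:
            continue
        seen.add(svc)
        for name, series in metrics.get(svc, {}).items():
            r = correlation(kpi_series, series)
            if abs(r) >= threshold:
                causes.append((svc, name, round(r, 2)))
        frontier.extend(deps.get(svc, []))  # continue upstream
    return sorted(causes, key=lambda c: -abs(c[2]))

kpi = [1, 2, 4, 8, 9]                        # alarming end-to-end latency
deps = {"api": ["db", "cache"]}              # api depends on db and cache
metrics = {
    "db":    {"disk_io": [1, 2, 4, 8, 9]},   # tracks the KPI closely
    "cache": {"hit_rate": [5, 5, 5, 5, 5]},  # flat -> not causal
}
print(backward_chain(kpi, "api", deps, metrics))  # [('db', 'disk_io', 1.0)]
```

Even this toy version shows why short alarms are harder: with only a few samples in the window, relevance scores of candidate metrics become unreliable.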

To understand the performance of backward chaining, we analyze a dataset of over 400 alarms. These alarms were collected from a distributed system consisting of several hundred services across 4 geographically separate data centers, and they are associated with a number of KPIs, including end-to-end latency and error rates. Alarm durations vary from 6 minutes to 6 hours. Figure 1 shows the histogram of alarm durations over the dataset.

Figure 1. Histogram of alarm durations.

For each alarm in the dataset, we use backward chaining to determine the most causal service and metric for that alarm. The following table summarizes the performance of backward chaining:

Total number of alarms: 407
Alarms with no significant causal metric: 31
Alarms with at least one significant causal metric: 376
Performance: 92.4%

Table 1. Performance of the backward chaining algorithm.
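As a quick arithmetic check, the performance figure follows directly from the two counts in Table 1:

```python
# Sanity check of Table 1: the counts imply the 92.4% figure.
total_alarms = 407
no_causal = 31
with_causal = total_alarms - no_causal
performance = round(100 * with_causal / total_alarms, 1)
print(with_causal, performance)  # 376 92.4
```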

To further understand the properties of backward chaining, we look at the number of causal metrics for all alarms where the algorithm identified at least one causal metric. The following table gives that breakdown:

Alarms with exactly one significant causal metric: 160
Alarms with exactly two significant causal metrics: 81
Alarms with exactly three significant causal metrics: 65
Alarms with more than three significant causal metrics: 70
Total: 376

Table 2. Breakdown of alarms with multiple causal metrics.

To understand the alarms for which the backward chaining algorithm did not find any causal metric, we plot the histogram of the durations of such alarms. As the histogram below shows, alarms for which the algorithm found no causal metric tend to be of much shorter duration; in fact, the average duration of these alarms is just 15.8 minutes.

Figure 2. Histogram of alarm durations for alarms with no causal metrics.

Conclusion:

Using a dataset of over 400 alarms, we find that the performance of the backward chaining algorithm is over 92%. The alarms for which the algorithm did not find any causal metric tend to be of much shorter duration, which suggests that the performance of backward chaining improves as alarm duration increases.


© NMLStream. All rights reserved.