NMLStream’s Quadrant is AI on Duty. Quadrant uses AI to anticipate problems and guide users to solutions, even in large-scale distributed systems. Consequently, Quadrant’s architecture is built end-to-end to handle this scale: everything from data ingestion and compute to user interaction supports responsive AI interaction for even the most demanding business-critical enterprise applications.
Quadrant’s AI operates on application telemetry. Data can come from, for example, a database such as InfluxDB or from an APM solution such as Microsoft’s Application Insights. Quadrant leverages the scalability of the underlying data platform to preprocess data. Data storage and routine transformations are already handled effectively by these monitoring solutions, so Quadrant does not need to duplicate them; it only needs to reference and query data from these sources.
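As a rough illustration of this reference-and-query pattern, the sketch below defines a minimal source-agnostic query interface with an in-memory stand-in. The class and method names here are illustrative assumptions, not Quadrant’s actual integration API; a real adapter would translate the same call into a Flux query against InfluxDB or a Kusto query against Application Insights.

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class MetricPoint:
    timestamp: int   # epoch seconds
    value: float

class MetricSource(Protocol):
    """Anything the analysis layer can query: InfluxDB, App Insights, etc."""
    def query(self, service: str, metric: str,
              start: int, end: int) -> list[MetricPoint]: ...

class InMemorySource:
    """Stand-in source; a production adapter would issue the equivalent
    query to the underlying monitoring platform instead of filtering
    a local dict."""
    def __init__(self, data: dict[tuple[str, str], list[MetricPoint]]):
        self._data = data

    def query(self, service, metric, start, end):
        points = self._data.get((service, metric), [])
        return [p for p in points if start <= p.timestamp <= end]
```

Because analysis code depends only on the `query` interface, swapping the backing store does not change the AI layer.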
Quadrant’s AI algorithms are inherently scalable. Many modern AI algorithms, such as deep learning, are compute- and memory-intensive, and GPUs are often required for real-time performance. Furthermore, scaling them typically requires highly skilled AI and data experts. NMLStream’s proprietary AI algorithms fold in application topology information to dramatically reduce the total compute requirement: Quadrant operates on subsets of related data without any loss in overall performance, and these subsets can be further distributed to reach arbitrary scale. Quadrant’s compute layer runs as a standard scale-out architecture on commodity hardware.
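One way to see how topology information enables independent subsets: services with no dependency path between them cannot influence one another, so connected components of the dependency graph can be analyzed separately and scheduled on different nodes. The partitioning scheme below is a minimal sketch for illustration only, not NMLStream’s proprietary algorithm.

```python
from collections import defaultdict

def partition_services(services, edges):
    """Split services into independently analyzable groups using
    connected components of the (undirected) dependency graph.

    services: iterable of service names
    edges:    iterable of (service_a, service_b) dependency pairs
    """
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, groups = set(), []
    for s in services:
        if s in seen:
            continue
        # Depth-first flood fill to collect one component.
        stack, group = [s], set()
        while stack:
            cur = stack.pop()
            if cur in group:
                continue
            group.add(cur)
            stack.extend(adj[cur] - group)
        seen |= group
        groups.append(group)
    return groups
```

Each returned group can be processed by a separate worker, which is what makes a scale-out deployment on commodity hardware straightforward.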
Quadrant supports real-time user interaction and integrations with NMLStream’s AI. Quadrant exposes a web application with bi-directional communication; using this mechanism, Quadrant updates users’ screens immediately when an alarm occurs, or even when the AI’s interpretation of system behavior changes. Understanding system changes as they occur means users never miss important interactions between system components. Quadrant also supports standard notification mechanisms, such as email, PagerDuty, and Slack, as well as a comprehensive REST API.
Quadrant’s scale has been demonstrated at the AdTech company Beeswax, where it is currently deployed into the operational workflow to manage Beeswax’s large-scale AdTech application. Beeswax runs over 135 services across 4 different data centers and handles over 2M requests per hour originating from 50 different ad exchanges around the world. Beeswax’s infrastructure generates data for over 9,000 metrics every minute.
Quadrant’s hardware requirements scale primarily with the number of services and the AI retention horizon. A machine with 4 CPU cores and 8 GB of memory can comfortably handle the computational requirements for 75 services. The AI retention horizon represents how much historical analysis NMLStream’s AI remembers; storing a 30-day retention horizon for 75 services requires less than 10 GB. You don’t need a large distributed deployment to have Quadrant’s AI analyzing a large distributed application.
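Under the linear-scaling assumption implied by these figures (4 cores / 8 GB per 75 services, and under 10 GB of 30-day retention per 75 services), a back-of-the-envelope sizing helper might look like the following. The linearity in both service count and horizon length is an assumption for estimation, not a published guarantee.

```python
import math

SERVICES_PER_NODE = 75     # from the sizing guidance above (4 cores / 8 GB)
RETENTION_GB_PER_75 = 10   # upper bound for 30 days of retention, 75 services

def compute_nodes(n_services: int) -> int:
    """Number of 4-core / 8 GB nodes needed, assuming linear scaling."""
    return math.ceil(n_services / SERVICES_PER_NODE)

def retention_gb(n_services: int, days: int = 30) -> float:
    """Upper-bound storage estimate, assuming retention grows linearly
    with both service count and horizon length (an assumption)."""
    return RETENTION_GB_PER_75 * (n_services / 75) * (days / 30)
```

For example, a 135-service deployment like Beeswax’s would need two such nodes under this estimate.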
In this document, we study the performance of NMLStream’s backward chaining algorithm. Using a data set of over 400 alarms, we find that backward chaining can identify causal metrics in over 92% of alarms.
NMLStream’s AI solution, called Quadrant, allows DevOps to quickly diagnose problems and take remediation action. Quadrant ingests streaming metrics from system infrastructure and uses patented technology called backward chaining to build relationships between key performance indicators (KPIs) and potential causal metrics. When a KPI triggers an alarm, the backward chaining algorithm identifies the offending service and metric over the entire topology of services and metrics. This allows the DevOps user responsible for uptime to quickly identify the cause and take action.
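Backward chaining itself is patented and its details are not described here; the toy sketch below only illustrates the general idea of walking a dependency graph backwards from the service whose KPI alarmed and surfacing upstream metrics that were anomalous during the alarm window. The function names, data shapes, and anomaly input are all illustrative assumptions.

```python
from collections import deque

def find_causal_candidates(kpi_service, depends_on, anomalous):
    """Breadth-first walk upstream from the alarming KPI's service,
    collecting (service, metric) pairs that were anomalous in the
    alarm window.

    depends_on: dict mapping a service to the services it calls
    anomalous:  dict mapping a service to its anomalous metric names
    """
    candidates, seen = [], {kpi_service}
    queue = deque([kpi_service])
    while queue:
        svc = queue.popleft()
        for metric in anomalous.get(svc, []):
            candidates.append((svc, metric))
        for upstream in depends_on.get(svc, []):
            if upstream not in seen:
                seen.add(upstream)
                queue.append(upstream)
    return candidates
```

The real algorithm must additionally rank candidates to pick the most causal service and metric; this sketch stops at enumerating them.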
To understand the performance of backward chaining, we analyze a data set of over 400 alarms. These alarms were collected from a distributed system consisting of several hundred services across 4 geographically separate data centers, and are associated with a number of KPIs, including end-to-end latency and error rates. Alarm durations vary from 6 minutes to 6 hours. Figure 1 shows the histogram of alarm durations over the data set.
For each alarm in the data set, we use backward chaining to determine the most causal service and metric for that alarm. The following table summarizes the performance of backward chaining:
| | Count |
|---|---:|
| Total number of alarms | 407 |
| Alarms with no significant causal metric | 31 |
| Alarms with at least one significant causal metric | 376 |
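The headline success rate follows directly from these counts:

```python
total_alarms = 407
no_causal = 31
with_causal = total_alarms - no_causal   # 376
success_rate = with_causal / total_alarms
print(f"{success_rate:.1%}")   # 92.4%
```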
To further understand the properties of backward chaining, we look at the number of causal metrics for all alarms where the algorithm identified at least one causal metric. The following table gives that breakdown:
| | Count |
|---|---:|
| Alarms with exactly one significant causal metric | 160 |
| Alarms with exactly two significant causal metrics | 81 |
| Alarms with exactly three significant causal metrics | 65 |
| Alarms with more than three significant causal metrics | 70 |
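As a quick consistency check, these counts sum back to the 376 alarms with at least one causal metric, and alarms with a single causal metric account for roughly 43% of them:

```python
breakdown = {
    "exactly one": 160,
    "exactly two": 81,
    "exactly three": 65,
    "more than three": 70,
}
assert sum(breakdown.values()) == 376
single_share = breakdown["exactly one"] / 376
print(f"{single_share:.1%}")   # 42.6%
```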
To understand the alarms for which the backward chaining algorithm did not find any causal metric, we plot the histogram of the durations of such alarms. As the histogram below shows, alarms without an identified causal metric tend to be of much shorter duration; in fact, their average duration is just 15.8 minutes.
Using a data set of over 400 alarms, we find that the backward chaining algorithm identifies at least one significant causal metric for over 92% of alarms. The alarms for which the algorithm did not find any causal metric tend to be of much shorter duration, which suggests that the performance of backward chaining improves as alarm duration increases.