AI and RCA

August 24, 2018 Dan O'Neill, PhD

It really is getting harder to find the source of a problem in modern applications. Why? Modern architectural complexity makes it combinatorially difficult to consider possible causes, especially for new (unknown-unknown) problems. Traditional Root Cause Analysis (RCA) methods aren’t up to the challenge: they are time- and resource-consuming, and ultimately intractable for dynamic cloud-based applications. What is needed is an intelligent, query-free method to find the causal path from symptoms to cause. This quintessential AI problem is addressed by NMLStream’s specialized Causal Graph technology.

Today, finding a root cause involves investigating application (and system) dependencies and correlating metric (and system) telemetry. At each step, DevOps must decide which metrics to correlate and which paths to follow through the application. Correlating time series is easy to say but hard to do; units, scales, and complicated mathematical relationships create a substantial cognitive challenge. Further, the cardinality of candidate metrics is very high, making comparisons between metrics daunting. There are often a large number of potential causal paths to explore, most of which will not lead to the root cause. Compounding the problem, the relationships change as the system scales out, is updated, or is reconfigured.
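
To make the units-and-scales difficulty concrete, here is a minimal Python sketch of one manual correlation step. It is an illustration only, not a description of any particular tool: the function names (zscore, lagged_correlation, best_lag) and the sample CPU and latency series are hypothetical. Each series is standardized so that percent and milliseconds become comparable, and a small range of time lags is tried, because a cause may show up in its effect only a few samples later.

```python
from math import sqrt

def zscore(xs):
    """Rescale a series to mean 0 and standard deviation 1 so units drop out."""
    n = len(xs)
    mean = sum(xs) / n
    std = sqrt(sum((x - mean) ** 2 for x in xs) / n)
    if std == 0:
        raise ValueError("a constant series cannot be correlated")
    return [(x - mean) / std for x in xs]

def lagged_correlation(a, b, lag):
    """Pearson correlation between a[t] and b[t + lag], for lag >= 0."""
    if lag:
        a, b = a[:-lag], b[lag:]
    za, zb = zscore(a), zscore(b)
    return sum(x * y for x, y in zip(za, zb)) / len(za)

def best_lag(a, b, max_lag):
    """Lag in 0..max_lag with the strongest absolute correlation."""
    return max(range(max_lag + 1), key=lambda k: abs(lagged_correlation(a, b, k)))

# Hypothetical data: latency (ms) follows CPU load (%) two samples later.
cpu_percent = [10, 12, 11, 50, 90, 85, 40, 12, 11, 10, 12, 11]
latency_ms = [200, 200] + [5 * c + 150 for c in cpu_percent[:-2]]

lag = best_lag(cpu_percent, latency_ms, max_lag=4)
strength = lagged_correlation(cpu_percent, latency_ms, lag)
```

Even this toy version covers only one pair of metrics and one kind of (linear, lagged) relationship. Repeating it by hand across thousands of candidate metrics is where the effort goes.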

DevOps teams and SREs combat this challenge with intuition, experience, war rooms, NOCs, and just plain working harder. But fundamentally the problem is cognitive. Unfortunately, people’s thinking scales linearly while the problem of finding the cause scales combinatorially with system complexity: a losing proposition.

Observability tools can help by collecting more granular data and by making queries faster. More data potentially improves system visibility. Specialized query methods can save time, but cannot directly address such cognitive challenges as “What queries should I pose, how do metrics actually relate, and just which paths do I need to explore and which can I ignore?” Some advanced observability tools facilitate mathematical manipulation of metrics, a useful but manually driven capability.

NMLStream’s Causal Graph ML technology addresses these challenges. It transparently sifts through the complete set of causal hypotheses about what could have caused the problem and returns the most likely one. This approach lets DevOps “See What Matters” for each problem.

Our Causal Graph technology automatically learns the relationships between time series metrics and fully explores the set of all possible causal paths through the system, displaying for the user the most probable path from the service emitting the alert to its root cause. This calculation is done very quickly, with customers seeing the evolution of a causal path in near real-time. This speed also lets SREs see problems as they develop and see the effects of their remediation actions. This approach reduces DevOps or SRE cognitive workload dramatically, with customers reporting 300x improvements in MTTA.
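
The idea of a "most probable path" from an alerting service to a root cause can be sketched in a few lines of Python. This is a generic textbook illustration, not NMLStream’s algorithm, and it rests on several assumptions: the service names, the edge probabilities, and the function name most_probable_causal_path are all hypothetical. The dependency graph is assumed to be acyclic, and a "root cause" is taken to be any reachable node with no further upstream candidates. Maximizing a product of probabilities is the same as minimizing a sum of negative logs, so a standard shortest-path search (Dijkstra) applies.

```python
import heapq
from math import exp, log

def most_probable_causal_path(edges, alert):
    """Return (path, probability) for the likeliest chain from alert to a root.

    edges maps a node to {upstream_node: probability}, where each probability
    is in (0, 1] and is an assumed strength for "upstream caused this node".
    """
    best_cost = {alert: 0.0}   # cost = -log(path probability)
    parent = {}
    heap = [(0.0, alert)]
    while heap:
        cost, node = heapq.heappop(heap)
        if cost > best_cost[node]:
            continue  # stale heap entry, a cheaper route was already found
        for upstream, p in edges.get(node, {}).items():
            new_cost = cost - log(p)
            if new_cost < best_cost.get(upstream, float("inf")):
                best_cost[upstream] = new_cost
                parent[upstream] = node
                heapq.heappush(heap, (new_cost, upstream))
    # Candidate root causes: reached nodes with no upstream edges of their own.
    roots = [n for n in best_cost if not edges.get(n)]
    if not roots:
        raise ValueError("no terminal node reachable; graph may contain a cycle")
    root = min(roots, key=best_cost.get)
    path = [root]
    while path[-1] != alert:
        path.append(parent[path[-1]])
    path.reverse()
    return path, exp(-best_cost[root])

# Hypothetical causal strengths between services.
edges = {
    "checkout": {"payments": 0.7, "cart": 0.2},
    "payments": {"db": 0.9, "auth": 0.1},
    "cart": {"cache": 0.8},
}
path, probability = most_probable_causal_path(edges, "checkout")
```

With these made-up numbers the search returns the chain checkout, payments, db with probability 0.7 x 0.9 = 0.63, ahead of the cart and cache branch at 0.16. The hard part in practice is learning the edge strengths from telemetry, which this sketch simply takes as given.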

Causal Graph technology has a very large statistical representational capacity, which grows combinatorially and in lock-step with system complexity. This directly addresses the combinatorics problem with existing methods and tools, and allows our technology to work with distributed, multi-geo cloud systems at large scale.

