How Causal Graphs are different from dashboards

August 6, 2018 Alex Orlova

DevOps engineers have grown used to wading through huge volumes of metric data and manually manipulating time series. Although this is currently the only way to find the root cause of a problem from application telemetry, it is highly time- and labor-intensive: it can take many hours and several people to check every candidate metric and work out the root cause.

Classic DevOps dashboards, even customizable ones with a user-friendly UI, still require a person to investigate and manually check every possible path from the metric that shows the issue to the metric that caused it. When an alarm fires, DevOps must manually check and interpret hundreds or thousands of metrics (depending on the system load) and compare their values in real time to find the relationships between them. They then repeat this exercise until a potential causal metric is found, fix it, and check whether the fix returns the system to normal health. The effectiveness of this investigation depends more on the engineer's experience and luck than on the features of the dashboard product.

Grafana dashboard

NMLStream’s Causal Graph™ understands and anticipates issues in real time. The graph shows the current dynamic behavior of the system and gives DevOps an immediate, prioritized understanding of any open issues.

NMLStream Causal Graph

Each node of the Causal Graph represents the data returned by a query. The lines between nodes represent relations between the metrics, and each is a potential path that the user might follow. Importantly, these paths are data-driven and change over time.

When an alarm fires, the Causal Graph calculates and highlights the path that is most significant at that moment and leads to the causal metric. It also provides causality weights for the paths, so DevOps always have the option of following a different path if needed.
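To make the idea of a highlighted path concrete, here is a minimal sketch in Python. It is an illustration only, not NMLStream's algorithm: the metric names, the weights, and the scoring rule (a path's score is the product of its edge weights, with weights assumed to lie between 0 and 1) are all assumptions made for this example.

```python
import heapq
import math

# Hypothetical causality weights in (0, 1]: GRAPH[a][b] is how strongly
# metric b is believed to drive the behavior observed in metric a.
GRAPH = {
    "api_latency": {"db_query_time": 0.8, "cpu_usage": 0.3},
    "db_query_time": {"disk_io_wait": 0.9, "connection_pool": 0.4},
    "cpu_usage": {"gc_pauses": 0.5},
    "disk_io_wait": {},
    "connection_pool": {},
    "gc_pauses": {},
}


def most_causal_path(graph, alarmed_metric):
    """Return (score, path) for the strongest path from the alarmed metric
    to a metric with no further upstream cause."""
    # Dijkstra on -log(weight): minimizing the summed cost is the same as
    # maximizing the product of the weights along the path.
    heap = [(0.0, [alarmed_metric])]
    settled = set()
    while heap:
        cost, path = heapq.heappop(heap)
        node = path[-1]
        if node in settled:
            continue
        settled.add(node)
        if not graph[node]:
            # No outgoing edges: treat this node as the candidate root cause.
            return math.exp(-cost), path
        for upstream, weight in graph[node].items():
            if upstream not in settled:
                heapq.heappush(heap, (cost - math.log(weight), path + [upstream]))
    return 0.0, [alarmed_metric]


score, path = most_causal_path(GRAPH, "api_latency")
print(" -> ".join(path), round(score, 2))
```

With these made-up weights, the sketch follows api_latency to db_query_time to disk_io_wait, because 0.8 × 0.9 beats every alternative. Since the weights are data-driven and change over time, rerunning the search on updated weights could highlight a different path.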

DevOps can still access the time series for a specific metric or group of metrics by clicking on the Causal Graph nodes, but they no longer need to find the relations between metrics manually and investigate every possible path. The Causal Graph, which is based on an AI algorithm, provides all the current, issue-related information: the offending metrics, the most causal path, possible alternative paths, and overall system health.

Alarm response using a dashboard vs. alarm response using the NMLStream Causal Graph

In other words, the Causal Graph is a powerful AI navigation tool that leads DevOps directly to the root of the problem. This helps to significantly reduce MTTR and increase uptime.
