There’s a lot of talk about the applicability of artificial intelligence (AI) and deep learning to taming the vast quantities of data that modern Operations teams and their tools deal with. Analyst reports frequently tout AI capabilities, no matter how minor, as a strength of a product, and the lack of them as a weakness. Yet no effective use of AI seems to have emerged and claimed wide adoption in Network Operations or Server Monitoring. Why not? (Disclaimer: LogicMonitor does not currently have deep learning or other AI capabilities).
Part of the issue is that AI has a soft definition. As Rodney Brooks, the former director of MIT’s Artificial Intelligence Laboratory, says, “Every time we figure out a piece of it, it stops being magical; we say, ‘Oh, that’s just a computation.’” So by definition we never really reach AI. LogicMonitor has a feature that performs numerical correlation across vast amounts of data, looking for patterns similar to an identified oddity (e.g. if disk latency on a database increases, it can identify which other metrics showed a similar pattern. Web site requests? Network retransmissions? Database queries from a QA system? This narrows down the candidates for the root cause). We don’t think of such a system as AI, because it’s “just” the application of well-understood statistical methods. To an operations person from 20 years ago, however, it would certainly look like intelligence.
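That kind of correlation search can be sketched in a few lines of Python. This is an illustrative toy, with invented metric names and synthetic data, not LogicMonitor’s actual implementation: it ranks candidate metrics by how strongly they correlate with the anomalous signal.

```python
# Illustrative sketch only: rank metrics by similarity to an anomalous
# signal using Pearson correlation. Names and data are invented.
import numpy as np

def rank_candidates(anomaly, metrics):
    """Return metric names sorted by |Pearson correlation| with the anomaly."""
    scores = {
        name: abs(np.corrcoef(anomaly, series)[0, 1])
        for name, series in metrics.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

rng = np.random.default_rng(0)
t = np.linspace(0, 10, 200)
disk_latency = np.sin(t) + rng.normal(0, 0.1, t.size)  # the observed oddity
metrics = {
    "web_requests": 2 * np.sin(t) + 1,          # moves with the latency spike
    "retransmissions": np.cos(t),               # out of phase
    "qa_db_queries": rng.normal(0, 1, t.size),  # unrelated noise
}
print(rank_candidates(disk_latency, metrics))   # "web_requests" ranks first
```

Nothing here is mysterious: it is exactly the “just a computation” that Brooks describes, yet it does real diagnostic work.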
Some techniques are (for the moment, at least) universally recognized as AI. Machine learning is one of them, and it is finding applicability in all sorts of fields that were thought to be the province of “real” intelligence. It can beat humans at chess and at Go, and compose symphonies and haiku as good as those written by human composers and poets. So why not operational troubleshooting?
Well, one reason is that supervised machine learning needs to learn, and then follow, rules. It has to be trained on a set of data, such as completed games. From that training set it generates a model, and uses the model to apply what it has learned to new games or compositions.
One problem with supervised learning in the Ops world is that you can’t tell what rules the AI extracted. It may be getting the correct result, but for the wrong reasons.
The most fun example of this I know of is Clever Hans. Hans was a horse that could do arithmetic. Addition, subtraction, division. Ask the horse a question, and he would indicate when a person counting out loud had reached the correct answer. He would do this reliably, and regularly, and repeatedly. The only problem was that the horse wasn’t solving mathematical problems and then listening for the correct number to be said. He was just looking at the people around him, and when their body language indicated the right moment, he’d stomp his foot. He was effectively a neural net that had been trained to give the answer to mathematical puzzles — but Hans was extracting the answers not in the way that was expected. And not in a way that was useful if you actually wanted to rely on Hans’ computational abilities, and didn’t know the answer in advance.
Similarly, AI vision recognition systems can be trained to distinguish leopards from cheetahs from ocelots, very accurately. But they also identify couches as leopards. (One could imagine problems if one was relying on these algorithms to defend against leopards…)
You can check the work of an AI when you already know the answer. But relying on it when you don’t know the answer means depending on whatever cues the system deemed applicable from its training data, and you don’t know what those cues were. A device might be flagged as a likely root cause because it has more than the usual incidence of the letter “P” in its hostname, an unnoticed artifact of the training data set.
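The hostname scenario can be made concrete with a toy sketch. Everything here is invented for illustration: a one-feature “decision stump” is trained on four made-up incidents where, by accident of the data, every root-cause host contains the letter “p”.

```python
# Toy illustration of a spurious cue: in this invented training data,
# every root-cause host happens to contain the letter "p".
train = [
    ("app-db-p01", True),     # was the root cause
    ("prod-web-p02", True),   # was the root cause
    ("cache-n01", False),
    ("lb-node-02", False),
]

def learn_rule(examples):
    """Pick the single character whose presence best predicts the label
    (a one-feature "decision stump")."""
    chars = {c for name, _ in examples for c in name}
    def accuracy(c):
        return sum((c in name) == label for name, label in examples)
    return max(chars, key=accuracy)

cue = learn_rule(train)
assert cue == "p"  # 100% accurate on the training set...
# ...but on a new host the rule fires for entirely the wrong reason:
print(cue in "primary-cache-3")  # True: flags an innocent host
```

Like Clever Hans, the model is perfectly reliable on the data it was shown, and useless the moment the accidental cue stops tracking the real cause.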
Of course, humans are also frequently unable to explain why they make their decisions or take the actions they do. I was unaware, until I read a book about bicycle racing, that a turn to the left is initiated by a slight steer to the right, to shift the center of balance. I’d been riding bikes for decades, and would have denied that I did that. For those who want many more examples of unaware behavior, demonstrating that we often construct a conscious narrative for why we made a decision only after we’ve decided, I highly recommend “Thinking, Fast and Slow.” Most operational troubleshooting, however, is definitely slow thinking, where we consciously investigate step by step, starting from the symptomatic system, often taking shortcuts from our knowledge of the application, but investigating and testing explicit hypotheses.
I would also acknowledge that the inability to explain your reasoning doesn’t matter if you get the right answer. Professional bike racers can tell you exactly how fast they can go around a given corner before they lose traction and slide out. So could a physicist. The physicist can show his work, and explain the coefficient of friction and the lateral and vertical force vectors. The cyclist will not be able to explain how he knows — but he will know.
So the lack of insight into an AI’s processes isn’t necessarily an obstacle to its use in operations. Rather, the obstacle is that an AI is limited by its training set. An AI can write a symphony that we enjoy, because it conforms to our current expectations of what an artistic and pleasant symphony should sound like. But it cannot push the boundaries of creativity the way a composer can, writing symphonies that break the rules and carry enough emotional impact to cause riots, yet are soon afterwards regarded as genteel art.
In IT Operations, good operational practice dictates that every issue and problem should be unique. All prior issues should have been addressed in a way that ensures they won’t recur, or, at the very least, the monitoring should have been configured so that it clearly warns of the situation. If neither of those conditions is true, the incident should still be regarded as open. So if the issues are always unique, the training sets will not have covered them, and the insights from an AI are unlikely to be terribly insightful or helpful. They may in fact be distractions and wild goose chases — like asking Clever Hans to calculate your tax return.
One way around this would be to mine information across a diverse set of customers’ operational data. The fact that I had an issue with ZooKeeper that had quorum configuration as the root cause, and resolved it, may mean the issue shouldn’t recur for me — but that knowledge may be useful to other companies running ZooKeeper. Of course, there are problems here not only with data privacy, but also with data training: if I identify my ZooKeeper nodes as z1.prod, with tags #zookeeper and #prod, and you call yours n34.lax.us.west, with tags #quorum and #live, how is commonality to be established so that the training lessons can be applied?
Technology can certainly add value to monitoring systems now: it can cluster alerts from different systems into single incidents, based on commonality in time and (ideally) knowledge of the topology and dependencies of the system. This can reduce alert overload and simplify root cause detection. It is arguable whether this is AI, however. (Note that the ideal state of ‘knowing’ dependencies may not be achievable, as dependencies can change with load; with whether data is cached or not; whenever new code is deployed; or whenever nodes or containers are added or removed. Note also that changes in dependencies that cause incidents may not be known by the monitoring system, as the communication mechanisms may themselves be disrupted by the incident, so the monitoring may misidentify some alerts as related to the incident based on stale data. As the Google SRE team puts it, “Few teams at Google maintain complex dependency hierarchies because our infrastructure has a steady rate of continuous refactoring.”)
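The time-proximity part of such clustering can be sketched in a few lines. This is illustrative only, with invented alert names and an arbitrary two-minute window; a real system would also weight topology and dependency information:

```python
# Minimal sketch: group alerts that arrive close together in time into
# candidate incidents. Window size and alert data are illustrative.
from datetime import datetime, timedelta

def cluster_alerts(alerts, window=timedelta(minutes=2)):
    """Group time-sorted alerts: each alert joins the current incident
    if it falls within `window` of the incident's latest alert."""
    incidents = []
    for ts, source in sorted(alerts):
        if incidents and ts - incidents[-1][-1][0] <= window:
            incidents[-1].append((ts, source))
        else:
            incidents.append([(ts, source)])
    return incidents

alerts = [
    (datetime(2018, 1, 1, 9, 0, 0), "db-latency"),
    (datetime(2018, 1, 1, 9, 1, 0), "web-errors"),
    (datetime(2018, 1, 1, 9, 1, 30), "retransmits"),
    (datetime(2018, 1, 1, 13, 0, 0), "disk-full"),
]
print(cluster_alerts(alerts))  # two incidents: the 9:00 burst, then disk-full
```

Even this naive version turns four pages into two, which is much of the practical value, whatever we choose to call the technique.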
AI is probably better suited to pointing out discrepancies and anomalies in performance over time. It can alert you to issues caused by releases — for example, that rendering a page used to take 2 database requests and now takes 10. Identifying deviations in performance between versions of code, and whether they are significant, is going to be an increasingly important role of monitoring as development agility increases.
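A simple baseline comparison can flag that kind of regression without any deep learning at all. The numbers and the three-sigma threshold below are made up for illustration:

```python
# Sketch: flag a release whose per-page database-query count deviates
# sharply from the previous version's baseline. Thresholds and data
# are illustrative.
import statistics

def release_regressed(baseline, candidate, sigmas=3.0):
    """True if the candidate release's mean sits more than `sigmas`
    standard deviations above the baseline's mean."""
    mean = statistics.mean(baseline)
    sd = statistics.stdev(baseline) or 1e-9  # guard a zero-variance baseline
    return statistics.mean(candidate) > mean + sigmas * sd

v1_queries = [2, 2, 3, 2, 2, 3, 2]   # DB queries per page render, v1
v2_queries = [10, 9, 11, 10, 10]     # after the new deploy
print(release_regressed(v1_queries, v2_queries))  # True: worth an alert
```

More sophisticated anomaly detection would model seasonality and trend, but the core question — is this release’s behavior outside the envelope of the last one? — stays the same.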
Of course, progress is continuous in this field. Since I started writing this article, a team from LinkedIn and Stanford has published a paper that shows promise in automating the identification of root cause, using unsupervised machine learning to cluster performance anomalies, along with snapshots of the call graph of data flows. This shows promise — however, it is hard to generalize the work to situations where you do not control the code that generates the metrics, and so cannot automate the collection of call graph data. (Think routers, commercial software, storage arrays, etc. — all of which may be the root cause of issues, but sit outside the call graph.)
Regardless, it is heartening to see progress here. Speaking as a former network and systems administrator who was often on call — the easier we can make life for the people on the front lines of monitoring and alert response, the better for everyone. (And of course, that’s our whole focus at LogicMonitor, so we’ll be paying close attention to this space.)
Originally posted at Breathe Publication