In this blog we look at a Major Incident, thought to be a network issue, and identify some key actions needed to give us a fighting chance of finding the root cause.
Introduction
We specialize in problem diagnosis rather than incident recovery, but recently we have been helping with some Root Cause Analysis relating to one-off Major Incidents. Although RPR is a problem diagnosis method, we've found a way to adapt it to determine the root cause of one-off incidents. I'll cover this adaptation on the RPR Practitioners LinkedIn forum. Here, I cover more general observations that could be useful to a network engineer dialed into an Incident Bridge, being bombarded with information and theories. I've based this article on a particular example.
The Problem
At about three o'clock one Saturday afternoon a critical production system failed, stopping all manufacture of the company's products. Circumstances pointed to a network issue; cluster nodes seemed to lose sight of each other and NFS mounts failed. Earlier in the day there had been some security testing conducted using a sophisticated system scanner, and the suspicion was that this was now causing the problem, even though the scanning had been stopped for some time.
This was a very big incident, the biggest for 25 years according to senior members of the team, and recovery had taken several hours. The issue had board-level visibility, and so, when the technical support teams began to struggle to find the root cause, we were asked to assist.
Investigation Review
To get things started, we met with around 10 people who were involved in the incident recovery and the subsequent investigation. We asked them to describe the investigation and findings so far.
Things started well: the Network Manager had produced a detailed timeline of events, which was very useful. As the meeting progressed, we realized that the investigation had focused on the actions taken to recover the system (disabling/enabling components, forcing failovers, etc.) and on how the system had responded to these actions.
We drew a timeline on a whiteboard similar to the one you see above, and that enabled everyone to realize that there was little point in discussing the Manual Intervention actions and responses because:
Auto-recovery mechanisms would have changed the environment (cluster failover, network route changes, Spanning Tree Topology changes, etc.)
In the Manual Intervention phase, each technical support team was "prodding" its own technology, which clouded the picture
If the original cause was transient it may have disappeared already
Setbacks
We then set about trying to collect event information for a timeframe spanning from just before the failure (normal operation) to just after it (the auto-recovery period). Unfortunately, key pieces of information, in particular switch log data, had been lost because the logs had wrapped.
Commands that would give us an insight into the state of the system (netstat, ps, arp -a, etc.), albeit after auto-recovery, had been executed but the output had not been saved.
The bottom line was that we had insufficient data to conclusively prove that the problem was or was not a network issue, although we were able to prove that it wasn't an issue with the scanning software.
Lessons We Can Learn
We've looked at one case study here, but our Problem Analysts have seen exactly the same type of discussion at many other customers.
The first mistake seems to be the assumption that when we send a command into a system component (network or otherwise), its response will reliably tell us about its state; unfortunately, we sometimes get unexpected responses, and these can send the investigation down the wrong path.
I recommend that the investigating team get very focused on collecting and analysing event information for the timeframe between just before and just after the failure. Don't waste time looking at the way the system responds to manual intervention; by that point we are way too far down the line, auto-recovery mechanisms will have changed things, and it will be impossible to deduce the original cause at such a late stage.
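To make that concrete, here is a minimal sketch in Python of the kind of time-window filter we apply when reviewing collected logs, so the analysis stays focused on the period around the failure. The failure time, window sizes, log file name, and timestamp format are all hypothetical placeholders for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical values for illustration only.
FAILURE_TIME = datetime(2024, 6, 1, 15, 0)   # approximate time of the failure
WINDOW_BEFORE = timedelta(minutes=30)        # normal operation, just before
WINDOW_AFTER = timedelta(minutes=30)         # auto-recovery period, just after
LOG_FILE = "switch01-syslog.txt"             # a collected log (hypothetical name)

def in_window(timestamp: datetime) -> bool:
    """True if an event falls between just before and just after the failure."""
    return FAILURE_TIME - WINDOW_BEFORE <= timestamp <= FAILURE_TIME + WINDOW_AFTER

with open(LOG_FILE, encoding="utf-8", errors="replace") as log:
    for line in log:
        # Assumes each line starts with an ISO-style timestamp,
        # e.g. "2024-06-01 14:55:02 ..." - adjust to the real log format.
        try:
            timestamp = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # skip lines we cannot parse
        if in_window(timestamp):
            print(line.rstrip())
```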
Logs wrap and data is lost. Make sure log data is saved as soon as possible. Create a plan today that lists the logs that should be saved for Tier 1 systems.
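As a sketch of what such a plan might drive, the snippet below copies a named set of logs into a timestamped archive directory so nothing is lost when logs wrap. The log paths and archive location are assumptions for illustration, not a recommendation of what to collect.

```python
import shutil
import time
from pathlib import Path

# Hypothetical list of Tier 1 logs to preserve; the real plan would name
# the switch, cluster, and application logs for each critical system.
LOGS_TO_SAVE = [
    "/var/log/messages",
    "/var/log/cluster/corosync.log",
]

# Archive directory stamped with the capture time so repeated runs don't overwrite.
archive_dir = Path("/incident") / time.strftime("%Y%m%d-%H%M%S")
archive_dir.mkdir(parents=True, exist_ok=True)

for log_path in LOGS_TO_SAVE:
    source = Path(log_path)
    if source.exists():
        # Flatten the path into the file name so nothing collides in the archive.
        target = archive_dir / source.as_posix().lstrip("/").replace("/", "_")
        shutil.copy2(source, target)
        print(f"Saved {source} -> {target}")
    else:
        print(f"WARNING: {source} not found")
```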
Make sure that commands are executed to show key state information and that the output is saved. You won't have time on the day to decide which "show" commands should be executed prior to manual intervention, so write a Diagnostic Capture Plan for Tier 1 systems right now.
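A Diagnostic Capture Plan can be as simple as a script that runs the agreed state commands and saves their output before anyone intervenes. The sketch below is one possible shape in Python; the command list and output directory are assumptions and would need to be agreed per system.

```python
import subprocess
import time
from pathlib import Path

# Hypothetical command list; a real Diagnostic Capture Plan would be agreed
# per Tier 1 system with each technical support team.
STATE_COMMANDS = [
    ["netstat", "-an"],
    ["ps", "-ef"],
    ["arp", "-a"],
]

# Output directory stamped with the capture time.
output_dir = Path("/incident/state") / time.strftime("%Y%m%d-%H%M%S")
output_dir.mkdir(parents=True, exist_ok=True)

for command in STATE_COMMANDS:
    # Name the output file after the command so it is obvious what was captured.
    outfile = output_dir / ("_".join(command).replace("/", "_") + ".txt")
    try:
        result = subprocess.run(command, capture_output=True, text=True)
    except FileNotFoundError:
        print(f"WARNING: '{' '.join(command)}' not available on this host")
        continue
    outfile.write_text(result.stdout + result.stderr)
    print(f"Captured '{' '.join(command)}' -> {outfile}")
```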
Conclusions
None of us should think we are immune to these mistakes. In the heat of a Major Incident, getting the system back online is obviously the priority, and there's little time to think about anything else. With a little planning we might just capture the data we need to find the root cause and prevent a future crash.