Insufficient Evidence When Problems Occur
In the middle of an IT disaster, it’s all hands on deck. Quick thinking—assisted by a healthy rush of adrenalin—can get you through the crisis. Once the system is up again, hopefully you’ll have time to conduct a post-mortem investigation to uncover the root cause and prevent future problems.
When disaster strikes, our natural inclination is to identify a culprit to blame: the system, the vendor, the network, the storage-area network (SAN), the DBAs, etc. In many cases, however, you uncover a few inconsistencies but can’t identify the real cause: “Case dismissed due to insufficient evidence.” If the systems are your responsibility, the words “no root cause found” hang over you like a dark cloud.
No Smoking Gun
But without a root cause to focus on, how do you prevent a recurrence of “The Great IT Disaster”? Even if you don’t know exactly what went wrong, you can take steps to stop those gremlins from returning. Here are a few ideas:
Keep a level head. The more objective you can be in looking at the symptoms and the possible factors that brought the system down, the better chance you have of unraveling the mystery and preventing “Disaster Part II.” Level heads don’t roll.
Get a second opinion. You might not want to hear that advice after being swamped with opinions on what went wrong and whose fault it is. But going to someone who isn’t a stakeholder can provide new insights. Check forums and blogs, log support calls, chat with someone who has seen this—or something even worse—before. Combine others’ experience with local knowledge.
Apply change management. After a disaster, ask what processes might have changed. Maybe the change was perfectly legitimate, but did it really have to be done 30 minutes before the payroll run? Proper change-management procedures stop people from working in silos and give them a better feel for business and user needs. They also ensure you have a back-out plan and help break configuration changes into bite-sized elements. You’ll have more confidence knowing what works and what doesn’t (as well as what to do about it).
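The payroll example can be turned into a simple scheduling check. The following is a minimal sketch, not a change-management system: the function name `in_freeze_window`, the four-hour default and the sample times are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def in_freeze_window(change_time, critical_job_time, freeze=timedelta(hours=4)):
    """Return True if the change would land within `freeze` before the critical job.

    The four-hour default is an arbitrary illustrative value, not a recommendation.
    """
    return critical_job_time - freeze <= change_time <= critical_job_time


# Hypothetical schedule: payroll runs at 18:00, a change is proposed for 17:30.
payroll_run = datetime(2024, 1, 31, 18, 0)
proposed_change = datetime(2024, 1, 31, 17, 30)

if in_freeze_window(proposed_change, payroll_run):
    print("Reschedule: the change falls inside the freeze window before payroll.")
```

A check like this only flags timing conflicts. The back-out plan and the review of what the change touches still need a human.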
Don’t play the hero. When a system goes down, don’t make it worse by having all of the recovery procedures, workarounds, system names and support contact details in your head. You can’t anticipate every disaster, but you can equip others so that the whole world doesn’t center on a single superhero (i.e., you), who is also a single point of failure.
Introduce standards and procedures. Much as we like to protect our turf, you actually become more indispensable by implementing and documenting standards for how systems are configured and managed. If others know they can rely on your methodical approach and your documentation is easy to find and follow, then they’ll see the good job you’re doing.
Fix secondary problems anyway. If you find other inconsistencies during your search for a root cause, take the opportunity to tidy them up. Oftentimes, you can’t get to a root cause because too many other problems contaminated the evidence. It’s easier to deal with weeds while they’re still small.
Protect log files and prepare diagnostic tools. Messages, error logs, performance data and other diagnostic data will help retrace the crime. If you suspect a problem on the SAN, for example, you might be able to enable more detailed logging. System dump space needs to be large enough that data won’t be lost in the event of another disaster. Enable automatic monitoring for serious problems so you learn about them from the system as soon as they begin. It’s better to discover a problem from an automated alert than from 1,000 unhappy users.
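As a rough illustration of both ideas, the sketch below copies a log file to an archive directory, records a checksum so the evidence can later be shown to be unaltered, and picks out lines that look serious. It is generic Python rather than any platform's native tooling. The keywords in `SERIOUS` and the function names are assumptions; substitute whatever your logs actually emit.

```python
import hashlib
import shutil
from pathlib import Path

# Hypothetical severity keywords; adjust to match your own log format.
SERIOUS = ("CRITICAL", "FATAL", "PANIC")


def preserve_log(log_path: Path, archive_dir: Path) -> str:
    """Copy a log into archive_dir and return the SHA-256 digest of the copy."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    copy_path = archive_dir / log_path.name
    shutil.copy2(log_path, copy_path)  # copy2 also preserves file timestamps
    return hashlib.sha256(copy_path.read_bytes()).hexdigest()


def serious_lines(lines):
    """Return the log lines that contain any of the SERIOUS keywords."""
    return [line for line in lines if any(word in line for word in SERIOUS)]
```

In practice the output of `serious_lines` would feed whatever alerting channel you already use, and the archive would live on storage that does not share a failure domain with the system being diagnosed.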
Consider external causes. Look upstream to find the trickle that began the flood. Is it the network, the SAN or some external server that your system depends on? I recently saw corrupt file systems on some virtual servers shortly after a disk failure. The disk was in a RAID set. The hot spare disk took over correctly, so why would the data be damaged? It was a controller firmware bug on the disk subsystem, which caused a controller reboot, and some of the disks didn’t fail over to their alternate path correctly. The failed disk was the trigger, but it wasn’t the root cause. Even after the disk was replaced and the damage repaired, the problem could happen again because of that bug. With updated controller firmware, the single-disk failure in the RAID set wouldn’t have the same impact.
Establish performance baselines. Everyone knows what the system looks like when it’s running slowly. What does it normally run like? A baseline provides a point of comparison when performance heads south. On IBM i systems, Collection Services data lets you see what a normally functioning system looks like. You should save Collection Services data from your important business timeframes to have a reference point to compare against when disaster strikes.
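The comparison itself can be very simple. The sketch below is a generic illustration, not Collection Services: it summarizes samples taken during a known-good period and flags later readings that stray far from them. The sample values, the function names and the three-standard-deviation threshold are all assumptions.

```python
from statistics import mean, stdev


def build_baseline(samples):
    """Summarize normal-period samples as (mean, standard deviation)."""
    return mean(samples), stdev(samples)


def is_abnormal(value, baseline, threshold=3.0):
    """Flag a value more than `threshold` standard deviations from the mean."""
    avg, spread = baseline
    if spread == 0:
        return value != avg
    return abs(value - avg) / spread > threshold


# Hypothetical CPU utilization percentages from a normal month-end run.
normal_cpu = [40, 42, 38, 41, 39]
baseline = build_baseline(normal_cpu)

print(is_abnormal(41, baseline))  # within the normal range
print(is_abnormal(95, baseline))  # far outside the normal range
```

A fixed threshold is a crude test. Its value is that "slow" becomes a comparison against recorded numbers rather than against memory.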
Describe the problem history. Human memory can play tricks after the event. Reports can conflict wildly on what fixed the problem, and even the main symptoms might be forgotten in the scramble to get things running. The cure might even be mistaken for the disease. That’s why a detailed problem history can help clear the air and prevent a replay of the disaster. It should answer questions such as: Is the problem confined to a single system or platform? What configuration changes have occurred? What applications are involved? Can you reproduce the problem in a test case? The IBM i Troubleshooting documentation describes diagnostic techniques, and the Diagnostic Tools Redbooks publication is a great resource for all of the IBM i diagnostic tools.
Arrange upgrades as needed. It might be a bug that brought down your SAN or shut down your server. Patch the systems and find workarounds if you can pinpoint the issue. Regularly scheduled firmware updates and OS patching are proactive rather than reactive. And if the organization is running on dated hardware, the word “exposure” has a remarkable way of opening up change windows and hardware purses.
Manage expectations. These words ought to be scratched onto every overworked systems administrator’s screen. If you’re trying to get blood from a stone, what do you expect when people are tired, problems are accruing and no one recognizes that people are only human, after all? Visibility is important: if others grasp a little of the pressure you’re under, they’re more likely to understand why you can’t meet unrealistic deadlines. The same goes for the hardware. If the demands on the technology outweigh its capacity to respond, it’s time to draw a line in the sand.
Ultimately, without a root cause of a catastrophic systems failure, you can’t guarantee it’ll never happen again. But if you follow these steps, you’ll be better prepared for a similar event and have recovery procedures in place ahead of time.
Thanks to Dawn May for assisting with this article.