Logical Repair Practices
Tuesday, 29. May 2012
My God, can it all be the same?
It seems like most of my job nowadays is looking at large systems and isolating problem areas: performance problems, data corruption, or even failure analysis. Many of these systems have several independently managed processes, all tied together in a single forward-facing application. Over the years, I’ve developed some methods of approaching system failures that give me a better chance of quickly evaluating and repairing the issues that plague these systems. I used to believe that these methods were only valid on larger systems. Then one day, a colleague and I were sitting in a small coffee house discussing a problem they were having with one of the desktops they manage. While we exchanged ideas, I suddenly realized that I was using the same mental process on this little desktop as I did with the large cluster systems.
Size Really Doesn’t Matter.
It really is true! At least in this context. The size of the system (or the problem) is irrelevant to the method used to evaluate it. This is true of anything. The only requirement is that you understand how the system works. If you are having problems with your washing machine and you don’t have a basic idea of what its components are and what each one does, you probably don’t want to try to fix it when it stops working. That’s not to say that you can’t fix a problem with your washing machine, even with limited knowledge of how all the subsystems work. If it doesn’t turn on, I would hope that everyone reading this would verify that it was plugged in, and that the outlet it was plugged into works. I’d also guess that if it was leaking all over the floor, most of you would look to see if the drain hose is in the drain pipe. You might even verify that none of the hoses on the outside of the washing machine were leaking. This simple idea, breaking things down into a testable size, is the foundation of logical diagnostics and repair.
Logical Diagnostics.
What is ‘Logical Diagnostics’? It’s the act of looking at a problem and logically validating subsystems until you locate a failure point. Wow, although well stated, that’s a real mouthful… Let’s see if we can break this down into digestible pieces.
- Ignore the impossible.
- Divide and conquer.
- Verify to eliminate.
- Confirm all repairs.
Ignore the impossible.
If you are trying to figure out why a system can’t get onto the internet, defragging the hard drive probably won’t help. Rebooting once is probably a good idea, but rebooting a second time because you still didn’t find anything wrong is probably a waste of time. Let’s say you have a monitor with no power light on. Would it make sense to replace the VGA cable? NO! This is not to say that you don’t have a bad VGA cable, but you can’t test that until the monitor is on. Ignoring the impossible saves you valuable time in finding the real problem.
Divide and conquer.
Let’s take our no-display problem. The power indicator light on the monitor is off. We can safely assume that the problem is one of the following:
- Bad monitor
- Bad power cord
- Power unplugged
So, where do we start? I would divide my testing into two parts. I would first verify that the power is plugged in (on both ends) and that there is power at the wall. If both of those things were true, I would then install a different monitor to test whether the original monitor was bad. By dividing the problem in half, I can easily confirm which things are working properly.
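The two-part test above can be sketched as a small function. This is only an illustration: the function name and the three check callables are hypothetical stand-ins for things you would verify by hand, and it assumes the spare monitor is tried on the same power cord, so that a spare that also stays dark points at the cord.

```python
from typing import Callable

# Each check returns True when that part of the system is working.
Check = Callable[[], bool]


def diagnose_no_power(
    plugged_in_both_ends: Check,
    wall_outlet_live: Check,
    spare_monitor_powers_on: Check,
) -> str:
    """Split the no-power-light problem in two: power supply first, then monitor."""
    # First half: is power actually reaching the monitor?
    if not plugged_in_both_ends():
        return "power unplugged"
    if not wall_outlet_live():
        return "no power at the wall"
    # Second half: the supply side checks out, so swap in a different monitor.
    # Assumption: the spare is connected with the same power cord.
    if spare_monitor_powers_on():
        return "bad monitor"
    return "bad power cord"


# Simulated case: everything on the supply side is fine and the spare lights up.
result = diagnose_no_power(lambda: True, lambda: True, lambda: True)
print(result)  # bad monitor
```

Notice that the monitor swap only runs after the supply side has been confirmed good, which mirrors the order of the two halves.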
Verify to eliminate.
Did you notice that I said ‘confirm what things are working properly’? I didn’t say ‘find what was broken’. Why? Because when you are looking for the source of a problem, you do it by process of elimination: testing to verify that all the things that might be causing the problem are working properly. And just because you find a problem doesn’t mean it’s the only one! I have seen cases where a blown monitor took out a circuit breaker. Never assume that the first problem you find is the only one you need to fix to correct the problem you’re working on.
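In code terms, the difference is between stopping at the first failed check and running them all. The sketch below takes the second approach; `verify_all` and the check names are hypothetical, and the lambdas stand in for tests you would really perform by hand.

```python
from typing import Callable, Dict, List


def verify_all(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run every check and return the names of all that failed, not just the first."""
    failures = []
    for name, check in checks.items():
        if not check():
            failures.append(name)
    return failures


# Simulated case: a blown monitor that also tripped the circuit breaker.
checks = {
    "monitor powers on": lambda: False,
    "circuit breaker closed": lambda: False,
    "power cord seated": lambda: True,
}

failures = verify_all(checks)
print(failures)  # ['monitor powers on', 'circuit breaker closed']
```

Running the same function again after a repair and getting back an empty list is one cheap way to confirm that nothing was missed.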
Confirm all repairs.
I’ve often said that the difference between a good technician and a bad one is five minutes. Always check and verify your work to ensure that everything is working correctly. Verifying that you haven’t forgotten something will keep you from having to come back to fix the same problem.
Always remember, your basic knowledge is your best asset.
Most importantly, your knowledge of the different subsystems that make up the total system you are working on should be considered paramount in correcting a problem. Any tech worth his salt knows that you only need to understand 20% of how a device functions to fix a problem… Thing is, you never know which 20% you need for the particular problem you are working on. Understanding how to set up and check systems is important, but not nearly as important as understanding why you are doing it that way in the first place!
— Stu