Troubleshooting eProducts

Not everybody knows that the most difficult part of Diagnostic Troubleshooting is often simply describing the problem accurately.

In our eProduct world this can be particularly vexing, and it sometimes requires formal test engineering practices just to evaluate the nature of a problem report before troubleshooting can even begin. In our Public Safety business, errors can be life-critical, and these long-term, generational deployments demand enhanced troubleshooting skills and techniques to observe, diagnose, and cure problems.

Sporadic, intermittent problems can test the mettle of any team, and this is where Shared Solutions can provide particular value.

Shared Solutions can Help

If you cannot adequately describe a problem, Shared Solutions can help. We often use simple statistical methods to qualify problems, and for particularly difficult or intermittent errors, we can create or specify test software and/or hardware to exercise systems and gather performance metrics to detect and abate problems.
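As one illustration of what "simple statistical methods" can look like, the Python sketch below qualifies an intermittent failure by computing the observed failure rate and a Wilson score confidence interval from pass/fail trial counts. The trial counts and the 95% confidence level are assumptions for the example, not data from any actual deployment.

    # Minimal sketch: qualify an intermittent failure with a simple binomial
    # confidence interval (Wilson score). Trial counts here are hypothetical.
    import math

    def wilson_interval(failures: int, trials: int, z: float = 1.96) -> tuple[float, float]:
        """Return an approximate 95% confidence interval for the true failure rate."""
        if trials == 0:
            return (0.0, 1.0)
        p = failures / trials
        denom = 1 + z**2 / trials
        center = (p + z**2 / (2 * trials)) / denom
        half = (z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))) / denom
        return (max(0.0, center - half), min(1.0, center + half))

    # Example: 7 failures observed in 500 scripted power-cycle trials.
    low, high = wilson_interval(failures=7, trials=500)
    print(f"Observed failure rate: {7/500:.2%}, 95% CI: {low:.2%} .. {high:.2%}")

A report of "it fails sometimes" becomes far more actionable when it can be restated as "roughly 1.4% of power cycles fail, with a defensible interval around that estimate."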

If describing a problem is the most important part of troubleshooting, and it often is, reproducing the problem in isolation is usually the next step. It’s all well and good to tell a vendor that something happens “sometimes,” but don’t hold your breath waiting for a return call with a solution. If we instead provide the vendor with a means to reproduce the problem independently, even off-site, we empower them to succeed. We all swim together in these murky waters, and by approaching problems as one team, all seeking the same solution sets, we all win together too.

When we approach a problem, we generally start at the Rising Edge of Reset. This usually means from power-up, or from the moment the reset button is released and the system begins processing. We look for absolute consistency, and when it is not there, we backtrack as necessary, sometimes early and often.
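One way to make that consistency check concrete is to capture the console output of several power-ups and diff the captures against one another. The Python sketch below assumes the captures have already been collected into a hypothetical boot_captures/ directory, one file per reset; the directory and file names are illustrative only.

    # Minimal sketch: check boot-to-boot consistency by diffing captured
    # console logs from repeated power cycles. Paths are hypothetical.
    import difflib
    from pathlib import Path

    logs = sorted(Path("boot_captures").glob("boot_*.log"))  # one file per reset
    assert logs, "no boot captures found"
    baseline = logs[0].read_text().splitlines()

    for capture in logs[1:]:
        diff = list(difflib.unified_diff(
            baseline,
            capture.read_text().splitlines(),
            fromfile=logs[0].name,
            tofile=capture.name,
            lineterm="",
        ))
        if diff:
            print(f"Inconsistency between {logs[0].name} and {capture.name}:")
            print("\n".join(diff[:20]))  # show only the first differing lines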

For complex systems that require ongoing diagnostics, we often develop in-house test procedures or code as adjunct processes to normal operations, to fully understand and smooth the ongoing deployment. A simple example is running a low-rate background ping on a link. An initial test may show “perfect” connectivity, but run that ping for a week, timestamp and log every response, and compare the results to other observations.
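A minimal sketch of such a background ping is shown below. The target address, probe interval, log file name, and the Linux-style ping flags are all assumptions for illustration; any equivalent probe that timestamps and logs each result would serve just as well.

    # Minimal sketch: low-rate background ping with timestamped logging.
    # Target host, interval, and log path are assumptions for illustration.
    import subprocess
    import time
    from datetime import datetime, timezone

    TARGET = "192.0.2.10"   # hypothetical link endpoint (TEST-NET address)
    INTERVAL_S = 60         # one probe per minute keeps the added load negligible
    LOGFILE = "link_ping.log"

    def probe(host: str, timeout_s: int = 5) -> tuple[bool, float]:
        """Send one ICMP echo via the system ping (Linux iputils flags);
        return (success, elapsed seconds)."""
        start = time.monotonic()
        result = subprocess.run(
            ["ping", "-c", "1", "-W", str(timeout_s), host],
            capture_output=True,
        )
        return result.returncode == 0, time.monotonic() - start

    # Runs until interrupted; review the log after a week of operation.
    while True:
        ok, elapsed = probe(TARGET)
        stamp = datetime.now(timezone.utc).isoformat()
        with open(LOGFILE, "a") as log:
            log.write(f"{stamp}\t{'OK' if ok else 'FAIL'}\t{elapsed:.3f}\n")
        time.sleep(INTERVAL_S)

A week of timestamped results turns a vague sense of "perfect connectivity" into a record that can be correlated against other events on the system.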

We always seek the Root Cause of every problem. This is not always easy in the proprietary systems world, where we are often integrating closed third-party designs, but it must be our stated and shared goal from the outset, else we will ALL bear the brunt of the same problems again. And again, and again…

Every project (and indeed every product) is really just one large set of problems until they are all discovered and addressed. Field Acceptance Testing rarely uncovers problems; no vendor worthy of your business would commence an ATP they hadn’t rehearsed well in advance, and any failures that do occur can often be traced to logistics, configuration, or even source of supply. We look for uncontrolled variables using consistent and readily reproducible testing (and regression re-testing).

The unfortunate truth is that issues that cannot be seen or measured with some form of statistical validity cannot be solved. We always want those issues resolved before a project or product release, but the world is sometimes a technical onion.

Shared Solutions can help you peel the layers from your problems, without tears, and together we can identify and abate any technical issue.

Sample problem resolution graph: Six Weeks to Success