next up previous
Next: Limitations Up: No Title Previous: Assuring homogeneity

The myth of causality

Can the Maelstrom approach be applied to the problem of inferring the `causes' of network problems? Probably not. The question of what `caused' a specific problem usually turns out to be ill-formed. This is mainly because the action that seemingly repaired a problem need not be directly related to the true cause of the problem.

Proposition 8: Causality is an abstract ideal that is meaningless in practice.
The foremost reason is that any claim the Maelstrom process can make about causality holds only for the computing environment in which it was observed. That environment changes constantly, and many of its changes are not observable, so no prior inference can reliably predict future behavior. Theoretical `cause/effect' relationships can be used to guide the troubleshooting process, but they cannot reliably be inferred from observations made during that process.

The concept of causality is plagued by `latent variables': factors that affect the function of a system without being recognized as having any effect. Most recently, months of carefully checking DHCP and router configurations for a DHCP problem ended when we observed a DHCP server answering an ARP request with an incorrect IP address. The `latent variable' turned out to be a defective network driver on the DHCP server - something we had never considered. The observation that troubleshooting is often a search for `latent variables' justifies the `phenomenological modularity' that allows one to add tests to a decision tree easily.
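The idea of `phenomenological modularity' can be sketched as follows. This is a hypothetical illustration, not Maelstrom's actual implementation: each diagnostic test is a self-contained function that reports only an observed symptom, so a test for a newly discovered latent variable (such as the defective driver above) can be appended without touching any other test.

```python
# Hypothetical sketch of phenomenological modularity: each test inspects
# observable state independently and returns a symptom name or None.
# Function and symptom names here are invented for illustration.

def check_dhcp_config(state):
    return "bad-dhcp-config" if not state.get("dhcp_config_ok", True) else None

def check_router_config(state):
    return "bad-router-config" if not state.get("router_config_ok", True) else None

TESTS = [check_dhcp_config, check_router_config]

def diagnose(state):
    """Run every registered test. Tests are independent of one another,
    so the membership of TESTS can change without breaking the process."""
    return [symptom for test in TESTS
            if (symptom := test(state)) is not None]

# Once the defective driver is discovered, a new test is simply appended;
# no existing test needs to change.
def check_nic_driver(state):
    return "defective-nic-driver" if state.get("driver_defective") else None

TESTS.append(check_nic_driver)
```

The point of the sketch is that the list of tests is open: discovering a latent variable adds one function to the registry and leaves the global debugging process intact.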

Eells[12] points out that the latent variable problem is much more severe than this simple case study illustrates. He shows that by an appropriate ignorance of latent variables one can `prove' that smoking prevents cancer. In his example, one latent variable is the location of the test subjects, either in large cities or on rural farms.
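Eells' reversal is an instance of Simpson's paradox, and a small numerical example makes the mechanism concrete. The numbers below are invented for illustration (they are not Eells' data): within each location smokers fare worse, yet pooling the data makes smoking look protective, because the smokers are concentrated in rural areas where the base cancer rate is low.

```python
# Invented numbers illustrating Eells' point: ignoring the latent
# variable (location) inverts the apparent effect of smoking.

data = {
    # location: {group: (cancer_cases, population)}
    "city":  {"smokers": (95, 100),  "nonsmokers": (810, 900)},
    "rural": {"smokers": (270, 900), "nonsmokers": (20, 100)},
}

def rate(cases, pop):
    return cases / pop

# Within every location, smokers have the higher cancer rate.
for loc, groups in data.items():
    assert rate(*groups["smokers"]) > rate(*groups["nonsmokers"])

# Pooling across locations (ignoring the latent variable) reverses this:
pooled = {g: tuple(map(sum, zip(*(data[loc][g] for loc in data))))
          for g in ("smokers", "nonsmokers")}
print(rate(*pooled["smokers"]))     # 0.365
print(rate(*pooled["nonsmokers"]))  # 0.83
```

Pooled, smokers show a 36.5% cancer rate against 83% for nonsmokers, `proving' that smoking prevents cancer - exactly the fallacy an unrecognized latent variable invites.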

While network debugging by lower-level staff can be represented by decision trees, persistent problems referred to the highest level of triage almost always involve latent variables that are not part of the decision tree at all. Thus it becomes very important to be able to add tests to the debugging process, as latent variables are discovered, without affecting the global integrity of that process. This need justifies the use of `phenomenological modularity' to assure interoperability between scripts; and the complete intractability of determining precedences among the various tests by hand justifies the execution time required to infer those precedences automatically.
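One way to infer precedences automatically is by trial execution, sketched below under invented names (this is not Maelstrom's actual algorithm): for each ordered pair of scripts, run the pair against a fresh copy of the system state and record an edge A -> B whenever B succeeds only if A has run first. The pairwise trials are what make the inference expensive in execution time.

```python
from itertools import permutations

# Hypothetical sketch of automatic precedence inference between scripts.
# Each script takes a mutable state dict and returns True on success.

def infer_precedences(scripts, make_state):
    edges = set()
    for a, b in permutations(scripts, 2):
        s1 = make_state()
        a(s1)
        ok_after = b(s1)          # does b succeed once a has run?
        s2 = make_state()
        ok_alone = b(s2)          # does b succeed on its own?
        if ok_after and not ok_alone:
            edges.add((a.__name__, b.__name__))
    return edges

# Toy example: mounting a network filesystem only succeeds after
# the network is up.
def net_up(state):
    state["net"] = True
    return True

def mount_fs(state):
    if state.get("net"):
        state["fs"] = True
        return True
    return False

print(infer_precedences([net_up, mount_fs], dict))
# {('net_up', 'mount_fs')}
```

The quadratic number of trial runs is the price alluded to in the text; paying it in machine time is justified precisely because determining the precedences by hand is intractable.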

The idea of causality cannot even be exhibited within scripts that are appropriately homogeneous and convergent. In fact:

Proposition 9: Script convergence obscures causal relationships.
A causal relationship takes the form of scripts (or actions) A, B, and C, where C always succeeds after A and always fails after B. In this case, A and B cannot be homogeneously convergent with one another, because they diverge sufficiently in what they do to either assure or break C. To add to this quandary, it is seldom true that one script enables or disables another under all possible conditions. Causality, too, is a mathematical ideal that never appears in practice.
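Proposition 9 can be illustrated with a minimal concrete triple (invented for this sketch, not from the paper): if C always succeeds after A and always fails after B, then A and B must disagree about some piece of system state, so running them in different orders leaves the system in different final states and the pair cannot be homogeneously convergent.

```python
# Hypothetical A/B/C triple exhibiting a causal relationship.
# A enables C, B breaks C; each is idempotent on its own.

def A(state):
    state["service_enabled"] = True

def B(state):
    state["service_enabled"] = False

def C(state):
    return state.get("service_enabled", False)   # succeeds only after A

def final_state(*scripts):
    state = {}
    for script in scripts:
        script(state)
    return state

# The pair {A, B} does not converge to a common fixed point:
print(final_state(A, B))  # {'service_enabled': False}
print(final_state(B, A))  # {'service_enabled': True}
```

The order-dependence of the final state is exactly the divergence the proposition describes: any pair of scripts exhibiting a strict causal relationship with respect to some C must differ in this way.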


Alva L. Couch
2001-10-02