As a `best practice', many sites pragmatically describe and standardize network troubleshooting procedures as ``decision trees'' that describe tests to make and corresponding actions to take. Usually decision trees are prepared by senior staff (or vendors) to aid junior staff in dealing effectively and autonomously with common problems. A decision tree is nothing more than a flow chart(in actuality a directed acyclic graph (DAG)) of actions, where the result of each action determines the next action to try.
For an example, consider the (very much oversimplified) decision tree in Figure 1. According to the tree, to check whether ftp service is running, one must first check whether one can make an ftp connection from a client machine. If so, everything is working. If not, one must then check whether the console of the ftp server responds to a keypress. If this succeeds, the server is running; otherwise it has probably crashed and needs to be power-cycled. The final measure is to check whether the program inetd is running on the (now perhaps rebooted) ftp server, and to start it if not. This is far too simple to be realistic; it is contrived only to illustrate concepts rather than as a practical application.
Two special actions determine when to give up. The `done' action indicates success, while the `fail' action indicates that the decision tree failed to correct the problem. In a realistic situation, failure of a procedure would escalate the problem's priority and refer it to the next level of technical staff.
One could perform most of the steps in this simple tree by convergent administration in Cfengine[2,3,4], but for the purposes of this discussion we consider the steps in the tree to be Babble scripts that potentially interact not just with a local host but also with remote serial consoles of devices such as routers, switches, and power interrupters. While Babble suits our needs for some scripting purposes, these scripts could be arbitrary programs in any desired language.
We started this work with the goal of automating the process of interpreting troubleshooting ``decision trees'' like the above to automatically detect and repair network disruptions. We abstracted each `decision' in the tree into a script with multiple exit values 0,1,2,..., where the exit value of a script indicates the next script to invoke. Simple actions not involving a decision are considered to be decisions with only one exit code and outcome. The scripts are plumbed together by declaring which ones should be called as a result of the return codes of others. We implemented a tree traversal algorithm in Perl that ``executes the decision tree'' by running scripts and taking decision branches as indicated.
Within our scheme, one would implement the above example as scripts A, B, C, D, E, with exit codes determining branches as indicated in Figure 1. For example, if B's exit code is 1, we invoke C; if the code is 0, we invoke D. Using algorithms from Babble, we thought we could represent this tree structure in XML. We endeavored to convert this XML tree into a nested hash and execute the results as a kind of script.