Safety

Even in the simple examples of this paper, there are conditions in which a troubleshooting script can make things worse by its actions. This can happen if a corrective action is too extreme or depends upon external resources that are themselves down at the moment. If our scripts are convergent in the Cfengine sense and there are no hidden constraints, Maelstrom is relatively safe. If, e.g., a reboot depends upon hidden constraints that have not been assured, such as a service required for reboot, Maelstrom may reboot a server even though this makes the network state worse, and may well make future troubleshooting impossible without operator intervention.

Maelstrom is currently relatively naive about the limitations of its environment. It can be made safer by giving it more understanding of the imperfections within its scripts, and the hidden couplings between scripts and Maelstrom's environment.

Maelstrom cannot currently compensate for inhomogeneity or lack of convergence in scripts. In the future, there will be stronger precedence operators to control Maelstrom's actions in the presence of imperfect scripts. Recall that ``c2 : c1'' means ``c1 might theoretically precede c2''. We plan other precedence operations whose main purpose is to compensate for script deficiencies:

``c2 :: c1'' will mean ``the last success of c1 must precede the last invocation of c2''.
``c2 ::: c1'' will mean ``the first success of c1 must precede the first invocation of c2''.

Both of these are still weaker conditions than the ``:'' in make. In Maelstrom, we could notate make's concept of strong precedence as follows:

``c2 :::: c1'' could mean that every success of c1 must be followed by an invocation (and success) of c2.

We use the colons to limit the number of characters one must escape inside shell commands in the configuration file (currently `:', `;', `[', and `]').

All of these syntactic mechanisms are attempts to compensate for non-homogeneous or non-convergent behavior in scripts.

The declaration c2 :: c1 means that c1 and c2 are convergent but inhomogeneous, so that c2 must be tried after c1 in order to make the execution result deterministic. This rule means ``always clean up after c1 with c2''.
The declaration c2 ::: c1 declares a hard-coded precedence, and means ``it is impractical to execute c2 without at least one success of c1''. We will use this when there are unavoidable physical dependencies, such as when an intervening router must be tested before the equipment behind it.
The declaration c2 :::: c1 (which we may not implement in the immediate future) compensates for non-convergent behavior of c1, by always following it as soon as possible with a cleanup routine c2.
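
The three proposed operators can be read as conditions on an execution history. The following is a minimal sketch of that reading, not Maelstrom's implementation: the flat log format, the function names, and the treatment of a c2 that never ran (taken here as satisfying the constraint) are all assumptions made for illustration.

```python
from typing import List, Tuple

# One record per script run: (script name, whether the run succeeded).
# This log format is hypothetical; Maelstrom's internal state may differ.
History = List[Tuple[str, bool]]


def _runs(history: History, script: str) -> List[int]:
    """Positions in the log at which `script` was invoked."""
    return [i for i, (name, _) in enumerate(history) if name == script]


def _successes(history: History, script: str) -> List[int]:
    """Positions in the log at which `script` was invoked and succeeded."""
    return [i for i, (name, ok) in enumerate(history) if name == script and ok]


def holds_double(history: History, c2: str, c1: str) -> bool:
    """c2 :: c1 -- the last success of c1 precedes the last invocation of c2."""
    runs, wins = _runs(history, c2), _successes(history, c1)
    if not runs:  # assumption: a c2 that never ran violates nothing
        return True
    return bool(wins) and wins[-1] < runs[-1]


def holds_triple(history: History, c2: str, c1: str) -> bool:
    """c2 ::: c1 -- the first success of c1 precedes the first invocation of c2."""
    runs, wins = _runs(history, c2), _successes(history, c1)
    if not runs:  # same assumption as above
        return True
    return bool(wins) and wins[0] < runs[0]


def holds_quadruple(history: History, c2: str, c1: str) -> bool:
    """c2 :::: c1 -- every success of c1 is followed by a later success of c2."""
    c2_wins = _successes(history, c2)
    last_c2 = c2_wins[-1] if c2_wins else -1
    return all(i < last_c2 for i in _successes(history, c1))
```

For the history c1 (success), c2 (failure), c1 (success), c2 (success), all three conditions hold. Appending one more successful run of c1 breaks ``::'' and ``::::'', because c2 has not yet been run again to clean up, while ``:::'' remains satisfied.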

To understand the importance of adding these precedence operators to Maelstrom, note that with even the first one (::) we can simulate make with Maelstrom without resorting to more script intelligence. If script c1 is:

# foo is already newer than both object files: nothing to link
if [ foo -nt foo.o -a foo -nt bar.o ]; then
  exit 1
fi
g++ -o foo foo.o bar.o
exit 0

and script c2 is:

# foo.o is newer than its source: no recompilation needed
if [ foo.o -nt foo.c ]; then
  exit 0
fi
g++ -c foo.c
exit 0

and script c3 is:

# bar.o is newer than its source: no recompilation needed
if [ bar.o -nt bar.c ]; then
  exit 0
fi
g++ -c bar.c
exit 0

then the Maelstrom declarations:

c1 :: c2 
c1 :: c3

would accomplish the same effect as the Makefile above. Even the relatively weak double-colon precedence operator avoids the need to have script c1 know all the dependencies between its files, as in the former example. This arrangement might do redundant compilations, but in the end it will accomplish exactly the same result as the Makefile.

Although we discuss the possibility of `rebooting' as a result of a script, we are not happy with the prospect of automated power-cycling of servers. We are currently developing a tool that allows that kind of dangerous action to be controlled by an electronic mail or two-way pager transaction. The script that wishes to reboot a server asks us whether it should do so, and an operator can mail back a `yes' or `no' response.
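
The operator-confirmation idea can be sketched as a small gate around the dangerous action. This is an illustration only and does not describe the tool under development: the names `operator_approves` and `guarded_reboot` are hypothetical, the mail or pager transport is abstracted into a caller-supplied function, and the choice to treat a missing or unrecognized reply as `no' is an assumption.

```python
from typing import Callable, Optional


def operator_approves(reply: Optional[str]) -> bool:
    """Return True only for a reply whose first word is 'yes'.

    Assumption: silence (None), an empty message, or anything other than
    'yes' counts as a refusal, so the dangerous action is never the default.
    """
    if reply is None:
        return False
    words = reply.strip().lower().split()
    return bool(words) and words[0] == "yes"


def guarded_reboot(
    host: str,
    ask_operator: Callable[[str], Optional[str]],
    reboot: Callable[[str], None],
) -> bool:
    """Ask the operator before rebooting `host`; report whether it was rebooted.

    `ask_operator` stands in for the mail or pager exchange and returns the
    operator's reply text, or None if no reply arrived in time.
    `reboot` stands in for whatever actually power-cycles the server.
    """
    question = f"Maelstrom wants to reboot {host}. Reply yes or no."
    if operator_approves(ask_operator(question)):
        reboot(host)
        return True
    return False
```

Passing the transport and the reboot action in as functions keeps the decision logic testable without sending mail or touching a server.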

One weakness of Maelstrom's scheduling is its simplicity. Many colleagues have suggested that Maelstrom should allow one to declare not just precedences, but also ``costs'' as a measure of how disruptive a particular action will be. One could then try solutions in order of increasing cost. But this would require an even more complex syntax in the configuration file, and theoretical precedences have the same overall effect (through different kinds of declarations).
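
One plausible way to see how precedences can stand in for costs is to derive ``:'' declarations from a cost table, so that every cheaper action may precede every costlier one. The sketch below is an assumption-laden illustration, not a Maelstrom feature: the function name, the script names, the numeric costs, and the decision to leave equal-cost scripts unordered are all hypothetical.

```python
from itertools import groupby
from typing import Dict, List


def precedences_from_costs(costs: Dict[str, float]) -> List[str]:
    """Turn a cost table into ``costlier : cheaper'' declarations.

    Scripts are grouped into tiers of equal cost. Each script is declared
    after every script in the next-cheaper tier; scripts within one tier
    are left unordered (an assumption of this sketch).
    """
    ordered = sorted(costs.items(), key=lambda item: item[1])
    tiers = [
        [name for name, _ in group]
        for _, group in groupby(ordered, key=lambda item: item[1])
    ]
    declarations = []
    for cheaper, costlier in zip(tiers, tiers[1:]):
        for high in costlier:
            for low in cheaper:
                declarations.append(f"{high} : {low}")
    return declarations


# Hypothetical cost table: larger numbers mean more disruptive actions.
example = {"restart_daemon": 2, "ping": 1, "reboot": 10, "flush_cache": 2}
for line in precedences_from_costs(example):
    print(line)
```

The cost table is consumed once, when the configuration is generated, so the configuration syntax itself needs no notion of cost.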

Alva L. Couch
2001-10-02