next up previous
Next: Convergence Up: No Title Previous: No Title

Introduction

In maintaining complicated service networks, one pressing problem is to determine and actively eliminate causes of network service disruptions. It is currently easy to automatically detect network problems through a variety of monitoring techniques[14,16,28]. But taking the next step of automatically remedying network problems has so far proven impractical. A human troubleshooter must often engage in involved scientific inquiry to infer `causes' from observed `effects'. A complex problem can take weeks to solve, and may be solved without ever revealing the true `cause' of the problem.

While it may be argued that a well-designed network and well-chosen hardware do not fail, this is certainly not true in an environment where one teaches hands-on Computer Science. We face problems almost daily with runaway processes, latent bugs in web scripts, and other disruptions based upon student (or faculty) error. Vendor-supplied servers crash due to latent bugs `discovered' by users toying with new programming techniques. Cables are stepped upon and equipment is abused. We cannot limit the capabilities of users to prevent such failures without compromising our educational mission, so that we must react to failures on an ongoing basis.

One problem with automating troubleshooting of mission-critical network services such as http, ftp, imap, ssh, etc, is that these services typically depend upon other base services (such as NFS, LDAP, database servers, etc) that must function before the externally visible services will function. Network devices that connect internal servers to one another must also be examined during the troubleshooting process. Ideally, to automate service troubleshooting, one must integrate and coordinate procedures that analyze and repair almost every device in the network, from server processes down to switches. While it may be argued that a network is poorly designed unless the precedences between services are clearly defined and unvarying, problems persist largely because we do not fully recognize or understand the dependencies between services.



 
next up previous
Next: Convergence Up: No Title Previous: No Title
Alva L. Couch
2001-10-02