Noweb 3: What and Why

Norman Ramsey

Don Knuth coined the term "literate programming" to describe the art of programming primarily for the human reader, and only secondarily for the machine. Literate programming is supported by many tools, all of which provide some way for authors to interleave program source code with well-typeset documentation. Most tools also support automatic or semi-automatic cross-referencing of source code. Only four or five literate-programming tools are widely used, and noweb may be the most widely used of all. It is certainly the most widely used literate-programming tool that is independent of the target programming language, and it was the first such tool.

Noweb emphasizes simplicity, extensibility, and language-independence. Noweb has the simplest markup of any literate-programming tool, making it easy for authors to understand the tool and to create literate programs. Noweb uses a pipelined architecture, which makes it possible for expert users to extend the system without recompiling, using the programming language of their choice. Users write extensions as Unix programs and use command-line options to insert them into the noweb pipeline. Users of noweb have written extensions for prettyprinting, conditional compilation, language-dependent cross-reference, and other purposes. The pipelined architecture also makes it easy to support multiple styles of documentation; noweb is unique in supporting plain TeX, LaTeX, HTML, and troff.
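To make the idea of an extension concrete, here is a minimal sketch of a pipeline stage in Python. It assumes a line-oriented intermediate form in which every line begins with an @-keyword (such as "@begin code", "@text", "@nl", and "@end code"); the text above does not spell out that format, so treat the keywords as an assumption. The function name expand_code_tabs and the tabsize parameter are hypothetical.

```python
def expand_code_tabs(lines, tabsize=8):
    """Yield pipeline lines, replacing tabs with spaces in the text of code chunks.

    Assumes (not confirmed by the surrounding text) that chunks are bracketed
    by '@begin code' / '@end code' lines and that chunk text travels on
    '@text ' lines. Every other line is passed through unchanged.
    """
    prefix = "@text "
    in_code = False
    for line in lines:
        if line.startswith("@begin code"):
            in_code = True
        elif line.startswith("@end code"):
            in_code = False
        elif in_code and line.startswith(prefix):
            line = prefix + line[len(prefix):].replace("\t", " " * tabsize)
        yield line


# Small in-memory demonstration; a real stage would read standard input.
demo = ["@begin code 1\n", "@text \tx = 1\n", "@nl\n", "@end code 1\n"]
print("".join(expand_code_tabs(demo, tabsize=4)), end="")
```

Calling sys.stdout.writelines(expand_code_tabs(sys.stdin)) from a script's entry point would turn the function into a Unix filter that reads the pipeline on standard input and writes it to standard output, which is the shape of program the paragraph above describes.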

Noweb is structured as a collection of C programs, shell scripts, awk scripts, and Icon programs, connected by Unix pipelines. Noweb can be difficult to install; installers may have to work around bugs in vendors' implementations of awk, and installers must get Icon (available for free from the University of Arizona) to exploit all of the capabilities of the system. Porting noweb to the DOS or Windows platform requires either some effort to replace shell scripts or the purchase of a commercial shell.

Noweb's main competitor in the market for language-independent literate-programming tools is nuweb, whose design was inspired by noweb, but which is structured as a monolithic C program. As a result, nuweb is not extensible, but it is easy to port, and it runs quickly. Noweb can run slowly when it is necessary to fork many pipeline stages, some of which run in interpreted languages. Noweb can process nuweb files, but nuweb users continue to prefer nuweb because of its speed and ease of installation.

Noweb's cross-referencing capability extends to HTML; a reader of a literate program can use a Web browser to click on an identifier and jump to the identifier's definition (and documentation). This capability has proven very useful, but it is limited to single documents. When large programs are composed of many separately compiled modules, it is awkward, to say the least, to process the entire program as a single document. (Such documents may run to hundreds of pages, even for a program of modest size, say 10,000 lines.) Users would much prefer to browse one document per module, and to be able to follow references between documents, but noweb does not currently support this model.

In sum, the three improvements that noweb's users would most like to see implemented are

Ability to make cross-references between documents.
Easier porting and installation.
Improved performance.

I would like to make these improvements, and I see three possible paths.

Simple programming improvements. Rewrite the elements of noweb as components of a monolithic C program, solving the portability and performance problems. One would need a little language to control the pipeline and to enable the insertion of external stages, to retain the ability to extend the system without recompiling anything. This path has no research content.
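One way to picture such a little language is sketched below in Python. A pipeline is written as stage names separated by "|"; a name found in a table of built-in stages runs inside the process, and any other name is treated as an external command, so the system can still be extended without recompiling. The syntax, the stage names strip-blank and number, and all function names are hypothetical and are not part of any noweb design.

```python
import shlex
import subprocess


# Hypothetical built-in stages: each maps a list of lines to a list of lines.
def strip_blank(lines):
    return [line for line in lines if line.strip()]


def number(lines):
    return [f"{i}: {line}" for i, line in enumerate(lines, 1)]


BUILTINS = {"strip-blank": strip_blank, "number": number}


def external(command):
    """Wrap an external command as a stage that filters the text through it."""
    def stage(lines):
        result = subprocess.run(
            shlex.split(command),
            input="".join(lines),
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout.splitlines(keepends=True)
    return stage


def parse_pipeline(spec):
    """Turn 'stage | stage | ...' into a list of callables."""
    stages = []
    for part in spec.split("|"):
        name = part.strip()
        # Built-in stages run in-process; anything else is forked.
        stages.append(BUILTINS.get(name) or external(name))
    return stages


def run_pipeline(spec, lines):
    for stage in parse_pipeline(spec):
        lines = stage(lines)
    return lines


print(run_pipeline("strip-blank | number", ["a\n", "\n", "b\n"]))
```

In this sketch only the stages that are not built in pay the cost of forking a process, which is the performance argument for the monolithic design.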

Case study of embedded languages. There is already a slew of embedded languages on the market, including Tcl, Perl, Python, Lua, S-Lang, Visual Basic, and several flavors of Scheme. I'm not aware of any comparative studies among these languages. I would love to use the modifications to noweb as a vehicle for undertaking such a study. The study would address such questions as:

How big is the embedded language relative to the application? (Is the tail wagging the dog?)
What's the effect on portability?
How hard is it to integrate the same functionality in different languages?
Which languages can support a new native type to represent the information transmitted down the noweb pipeline?
What are the bug rates in different implementations?

I'm not exactly sure how to do a good job with such a study, but I think the results would be interesting to a broad segment of the research community.

Approximate programming environments. Neither of the paths above addresses the issue of better cross-reference. Doing a good job with cross-reference would involve, among other problems, something like smart recompilation for documents. What is more interesting is to see how to build a system that makes a smooth transition from approximate to complete cross-reference information. Noweb version 2 can provide language-dependent cross-reference information without giving up language-independence by using one of two mechanisms:

Have users mark definitions by hand, and find uses using a variant of an Aho-Corasick recognizer. The variant uses a language-independent algorithm to recognize identifiers.
Write a language-dependent pipeline stage that approximately identifies definitions, and use the same algorithm to recognize uses.
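The use-finding step shared by both mechanisms can be sketched as follows. This Python sketch is not noweb's implementation: it substitutes a regular-expression alternation for the Aho-Corasick variant, and its rule for delimiting identifiers (a match may not touch a letter, digit, or underscore) is an assumption made for illustration. The function name find_uses is hypothetical.

```python
import re


def find_uses(defined, code_lines):
    """Return (line_number, identifier) pairs in order of occurrence.

    'defined' is the set of identifiers whose definitions were marked by hand
    or found by an approximate language-dependent stage.
    """
    if not defined:
        return []
    # Longest names first, so a longer identifier wins over its own prefix.
    names = sorted(defined, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in names)
    # Assumed delimiter rule: no identifier character on either side.
    pattern = re.compile(
        r"(?<![A-Za-z0-9_])(?:" + alternatives + r")(?![A-Za-z0-9_])"
    )
    uses = []
    for lineno, line in enumerate(code_lines, 1):
        for match in pattern.finditer(line):
            uses.append((lineno, match.group(0)))
    return uses


code = ["emit(x); emit_all(xs);", "remit(y)", "/* emit later */"]
print(find_uses({"emit", "emit_all"}, code))
```

The third line of the sample shows why the result is only approximate: because the recognizer knows nothing about the language, an identifier that appears inside a comment is reported as a use.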

Both these methods are approximate. A third method would be to use proper language-dependent analysis to compute exact def-use information, but this would require essentially a compiler front end, which is about two orders of magnitude more work than an approximate tool that identifies definitions. This third method has the additional attraction that it can recognize declarations, so uses can be connected to declarations, and from there to definitions.

To follow this path I would

Develop a structure that can support all three of these cross-reference methods.
Compare the accuracy of the three methods. We could get access to the source code for the Fraser-Hanson book to get 5,000 to 7,000 lines of code that have definitions carefully marked by hand.

It should also be possible to find an industrial partner that would help us discover which cross-reference links are actually used in practice, so we can find out how consequential the failures of approximate cross-reference are. Knowing the value of approximate cross-reference, and developing better techniques for it, would be helpful for building code browsers and other tools that go beyond simple tools for literate programming.