Building cool stuff

One of the most fun, exciting, and educational opportunities for students is the chance to build interesting artifacts that may be seen and used by people all over the world! Many professors have laundry lists of such projects; here's mine. All the projects more or less relate to my research interests in programming languages and environments; related material is covered in CS 152.

Projects include:

OCR Dasher
Machine arithmetic
Abstractions for ML pattern matching
Dueling run-time systems
Soft typing for object-oriented languages
Explicit polymorphism
Refactoring tools
Machine-independent object code
Machine-independent assembly language
Source-level debugging
Debugging support
Build client of a portable assembly language
Concurrency in a portable assembly language
Read and write linked data structures --- in different languages
Engineering fast fingerprints
Idiomatic generation of high-level language code
Hypertext indexing for literate programs
Accurate cross-reference for literate programs
Prettyprinter generation
Smart compilation, portably
Machine descriptions for encoding, decoding, and simulating instructions
Dataflow optimization frameworks
Efficient conditional breakpoints
Support for bytecode interpretation
Parser generation
An Interactive, Incremental Compiler for Game Design
Autonomous Agents in Interactive Fiction
A procedure-oriented virtual machine
PostScript pictures for LaTeX documents

See something that looks cool? Write nr@cs.tufts.edu.

OCR Dasher

We have tools that are pretty good at converting scanned documents to texts, but there are many small errors. The project is to adapt Dasher to enable a person to, at high speed, correct all the errors in an OCR'ed document.

Machine arithmetic

Many people want to run 32-bit codes on 64-bit hardware or vice versa. We've recently discovered systematic way to run 32-bit codes on a 64-bit machine. The investigation turned up some surprisingly regular properties of machine arithmetic.

The new project is to investigate techniques for running 64-bit codes on 32-bit machines. (We're interested in compilation techniques, not simulation techniques like the Gnu Multiprecision Library.) We're hoping either to discover similar regular properties or figure out why they don't exist.

Abstractions for ML pattern matching

Any ML programmer can write a function to abstract over a certain pattern of values. For example, if you want triples in which the first and last elements are equal and the middle element is NONE, you can write

fun myTriple x = (x, NONE, x)

But tragically, there is no way to write something similar to get a pattern that matches only these triples. The project is to develop an extension to the ML dialect Objective Caml to support abstraction in pattern matching. This is not just a language-design project---we are looking for practical results we can apply to our C-- compiler.

Dueling run-time systems

Our community has gotten pretty good at getting different languages to work together, but it's another story when it comes to run-time systems. Usually any language with a garbage collector things it should control the world. But there is an exception: the Lua language is designed to coexist peacefully with other implementations. The project is to push that design to try to get Lua to interoperate with Perl. (This might be the scripting version of ``Calling Hell from Heaven and Heaven from Hell.'') This project would have two benefits:

It would be great community service if Lua programs could call everything in the CPAN library.
We are bound to learn something interesting about memory management and run-time systems.

Soft typing for object-oriented languages

Soft typing is an idea that allows you to get useful error messages from a static type checker while still being able to run your code if type checking fails. Most of the work on soft typing has been done on functional languages. It would be interesting to see how well the idea applies to object-oriented languages. I can imagine two interesting settings:

In a classroom setting, one could try to apply soft typing to the micro-Smalltalk language used in CS 152.
In a practical setting, one could try to apply soft typing to Lua.

Explicit polymorphism

System F has explicit polymorphism, but the explicit abstractions and applications are a bit hard to program with. One way to cope is to use Hindley-Milner type inference, which eliminates explicit annotations by restricting polymorphism. But another is to find a notation that makes explicit polymorphism usable. The programming language CLU offers one such notation, and it also has some interesting mechanisms for constraining the specialization of polymorphic types. The project would be to formalize the CLU type system and develop a precise characterization of its relationship to System F. There are also connections with a recent study of different languages' support for polymorphism, done at Indiana's Open Systems Lab (there is an OOPSLA paper).

Refactoring tools

Kent Beck's discipline of Extreme Programming advocates continuous refactoring of systems under construction. Refactoring by hand requires that code be carefully edited by hand and the system test suite be re-run. But many refactoring transformations could be standardized:

Pull out a chunk of code and convert it into a method.
Change the relationship between objects from inheritance to delegation, or vice versa.
Adjust private methods so they touch fields only through accessor methods---or not.

The project is to identify an interesting set of refactoring transformations (e.g., based on Martin Fowler's book) and to develop a refactoring tool for a modern programming language like Java or Standard ML. This project would make a good honors thesis.

Machine-independent object code

The New Jersey Machine-Code Toolkit generates an API for assembler-like functions, but it can't conveniently write the results to disk until all references have been resolved. The project is to develop a form of object code usable with the toolkit. Possible tasks are:

Devise a machine-independent form of object code. This should include performance measurements of alternative ways of representing relocation information.
Develop a machine-independent assembler, using the toolkit to generate the machine-dependent parts.
Develop a machine-independent linker usable with your object code. Develop a loader that can turn an object file without dangling references into an executable file.

I expect this work to lead to publication and to provide a foundation for further research.

Machine-independent assembly language

A companion to the previous project. Here the idea is to dispose of distinctions between assemblers once and for all, by having the assembler written in terms of the semantics of the instructions. No more remembering opcode names or the order of operands! This work would be a chance to apply some of the Computer Systems Description Languages being developed for the Zephyr project. Automatic generation of processor emulators h2 Emulators are used not only for the high purpose of playing NES games on your PC, but also to build systems software for new hardware even before it's built or to study the performance of computer systems. The project is to figure out how to generate an emulator, given a precise description of the target machine's instruction set. The long-term goal is to develop several strategies, ranging from simple strategies that produce slow emulators easily, to the most aggressive binary-translation techniques.

I expect this work to lead to publication. -->

Source-level debugging

Pick your favorite compiler and add support for source-level debugging, drawing on the ideas in the Debugging Everywhere project.

I expect this work to lead to publication.

Debugging support

ldb is a retargetable debugger for ANSI C. ldb works in part by exploiting a novel representation of debugging information that is entirely executable. One problem is that this representation is ten times larger than conventional representations The project is to devise and implement a new representation that mixes data and executable parts, such that it retains the retargeting advantages of the current representation but has the same efficiency as more conventional representations.

I expect this work to lead to publication.

Build client of a portable assembly language

Simon Peyton Jones and I are developing a new code generator for C--, a language to be used as a target for compilation. The project is to choose a compiler for a high-level or very-high-level language, and to retarget it to C--.

This project would make an excellent honors thesis and might lead to publication.

Concurrency in a portable assembly language

We have some preliminary designs for C-- to support concurrency. There are several possible avenues, ranging from deep development of one or two ideas to pulling all the ideas together into a coherent design.

This project should lead to publication.

Read and write linked data structures --- in different languages

It's quite useful to be able to write linked data structures to disk and read them back later (sometimes called ``pickling'' and ``unpickling''), but to do it by hand is tedious. The project is to find a good way to automate the process, first for trees, then for general linked data structures. The devil here is in the details:

generating code with the right idioms
finding both a readable ASCII representation and also a compact binary representation
exploring XML compatibility
making pickles self-identifying
managing the software engineering of generating code for different programming languages in a way that programs written in different languages can exchanging data transparently.

This project, tentatively named ``Pickle Master,'' could lead to software that might be very widely used.

Engineering fast fingerprints

A ``fingerprinter'' is a function F that can be applied to very long strings (e.g., disk files), such that for any string A, F(A) is of fixed size (typical sizes are 64-128 bits), and if A and B are two different strings, then F(A) and F(B) are also different with very high probability. Fingerprints have applications in security as well as filesystem replication. The engineering tradeoffs involved in the implementation of fingerprints are not well understood. The project is to study alternatives experimentally in order to engineer a very fast fingerprinter.

Idiomatic generation of high-level language code

Application generators (e.g., parser generators) need to emit code in high-level languages so the emitted code can be linked with applications written in high-level languages. The idioms used to express the code may vary depending on the language or the application. This project is to explore idiom choices for the New Jersey Machine-Code Toolkit, which generates an API for assembler-like functions. Even restricting our attention to C, we have several ways to consider emitting instructions:

invoke a macro for each instruction
call an external procedure for each instruction
call through a function pointer for each instruction

Each of these approaches has advantages in different situations. The project is to use the Standard ML modules system to devise a suitable way of choosing an idiom for a particular application. Familiarity with Standard ML is a plus but is not required.

I expect this work to lead to publication.

Hypertext indexing for literate programs

Even simple hypertext browsers like Netscape or Internet Explorer can make a literate program dramatically easier to understand. (See examples.) Unfortunately, all existing literate-programming tools assume they are indexing and cross-referencing a single document. This project is to develop interdocument cross-reference tools for the noweb literate-programming tool.

I expect this work to lead to publication.

Accurate cross-reference for literate programs

The Noweb literate-programming tool currently uses an approximate algorithm to compute cross-reference information. The project is to figure out how the Noweb framework can be changed to provide for completely accurate cross-reference information, and to develop one or more accurate cross-reference tools. There are many front ends that might be adapted for cross reference, supporting languages like Java (Sun javacc), C (lcc front end), C++ (Edison Design Group front end), Standard ML (Moscow ML front end or ML Kit), and Modula-3 (Compaq SRC front end).

Prettyprinter generation

Prettyprinting technology is well developed, but it is not well realized in flexible software. This project is to adapt existing prettyprinting technology so it can work with any programming language and in a variety of contexts, including standalone ASCII prettyprinter, literate-programming tool, and PostScript or FrameMaker.

As a bonus, one might consider extending TeX with support for prettyprinting.

This work could possibly lead to publication.

Smart compilation, portably

Make rebuilds object files whenever source files have changed. In a big system, this can cause slow builds. Lots of people have build ``smart compilation'' systems, which understand the changes and build new binaries only when necessary. But these systems have to be built over and over again for each new language. The project is to build a replacement for Make that can support smart compilation for more than one language.

I expect this work to lead to publication.

Machine descriptions for encoding, decoding, and simulating instructions

The project is to develop SLED and Lambda-RTL descriptions for a small selection of popular machines. These descriptions, which will be used in the C-- project, can be used to generate machine-dependent tools. Candidate machines include the PowerPC, the StrongARM, the HP PA RISC, the Motorola 68k family, Hitachi HS-1, and other processors.

Dataflow optimization frameworks

Sorin Lerner, Craig Chambers, and Todd Millstein have recently published some interesting work on provably sound optimizations, combined automatically. At present, these optimizations are written in a special-purpose domain-specific language, which is then interpreted in the Whirlwind compiler. This project has two parts:

Identify what has to be done to adapt the ideas to other compilers
Look for opportunities to translate the domain-specific optimizations to native code

This work would take place in the context of the Quick C-- project.

This work would make a good senior project and might lead to publication.

Efficient conditional breakpoints

The ldb debugger evaluates the conditions associated with conditional breakpoints by translating high-level expressions into a low-level intermediate code and interpreting the intermediate code. The language of the intermediate code is a subset of PostScript, so there are two levels of interpretation in the system. A better solution would be to use machine code to evaluate the conditions. If the machine code can go in the target address space, we can avoid significant context-switching overhead in the case where the breakpoint is not taken. It would be interesting to see if the PostScript could be reused for this purpose. The project would include

adding runtime support for managing code in the target address space
writing new postscript to emit machine code into the target space
possibly adding the appropriate Modula-3 support to the New Jersey Machine-Code Toolkit

I expect this work to lead to publication.

Support for bytecode interpretation

Todd Proebsting's recent work on superoperators looks like a nice technique for building bytecode interpreters, but it isn't very conveniently packaged. The project is to combine this work and my earlier work on specifying PostScript operators to form an environment for building very efficient interpreters and code emitters.

This work may lead to publication.

Parser generation

EBNF is a parser generator written in Icon and emitting Icon that accepts EBNF and generates LL(1) or SLR(1) parsers. Possible projects include:

Implement the DeRemer and Penello LALR(1) algorithm.
Add support for emitting Modula-3 or Standard ML.
Implement an error-correction algorithm.

The project should result in useful software that can be distributed on the net.

An Interactive, Incremental Compiler for Game Design

The implementation of text adventure games is greatly simplified by the use of special-purpose languages like Inform and TADS. Still, designing and debugging these games is awkward, because it's difficult to recover the game state once a change is made. The project is to develop an interactive, incremental compiler so you can change the state of a running game, then resume play. This would make a good undergraduate thesis project.

Autonomous Agents in Interactive Fiction

One goal of interactive fiction is to allow players to interact with seemingly autonomous agents or ``non-player characters'' (NPCs). NPCs are currently very difficult to program using interactive-fiction languages like Inform and TADS. The purpose of this project is to develop a programming-language extension for implementing NPCs. This will involve

Reading the literature on interactive NPCs/agents and developing a simple programming model based either on that literature or on other well-known techniques (e.g., state machines or goal-directed evaluation).
Designing a defining a notation for creating NPCs.
Implementing the design as a preprocessor for a popular compiler like TADS or Inform.

This would make a good undergraduate thesis project.

A procedure-oriented virtual machine

Most text adventure games use bytecode interpreters for portability. This project is to design and implement a procedure-oriented bytecode interpreter that will enable an implementation based on binary translation. This in turn should lead to excellent playing times on such handheld platforms as the Palm Pilot. The interpreter alone might make an interesting undergraduate thesis project. A full demonstration, complete with back end and translator, would make an excellent masters' project.

PostScript pictures for LaTeX documents

It is currently possible to use the PSTricks package to create figures for use in TeX documents, but PSTricks is very complex and has a steep learning curve. The project is to explore established user-interface principles and techniques, including two-view editing, to identify and correct the weaknesses in PSTricks.

Back to Norman Ramsey's home page.