scanner.nw

* In this assignment I print intermediate code.

<grammatical C functions>= (U->)
void print6 (Exp e) {
    extern int yylineno;
    printf("line %d:\n",yylineno);
    if (e.type==error_type)
        printf("# ERROR\n");
    else
        printTree(e.tree,stdout);
}

Defines print6 (links are to index).

<grammatical C declarations>= (U->) [D->]
void print6(Exp e);

The `yacc` value stack

Here are all the objects I use as synthesized attributes:

<grammatical declarations>= (U->) [D->]
%union {
    char *string;
    Type type;
    Product product;
    Symbol symbol;
    Exp exp;
    Explist explist;
    int bit;
}

3. Vocabulary and representation

Representation of terminal symbols

yacc and lex use integers to represent terminal symbols (tokens). Single-character tokens are represented by their ASCII codes. Longer tokens are declared using yacc's %token declaration; yacc chooses an integer representation of each such token. These representations are made available to the lexer via y.tab.h. I use the standard trick from Kernighan and Pike (x.tab.h instead of y.tab.h) to avoid remaking the lexer unecessarily.

<lexical include files>= (U->)
#include "x.tab.h"

1. Identifiers

Since lex is notoriously slow at using patterns to recognize reserved words, I look up every identifier in a table of reserved words. idcategory(id) returns the category of the reserved word id when id is in fact a reserved word. When id is not a reserved word, idcategory(id) returns IDENT.

<lexical definitions>= (U->) [D->]
letter  [A-Za-z]
digit   [0-9]

The offensive <INITIAL> below has to do with handling comments (q.v.). Both identifiers and reserved words are saved in the string table and put on the yacc value stack.

<lexical rules>= (U->) [D->]
<INITIAL>{letter}({letter}|{digit})*    { yylval.string = strsave(yytext);
                                          return idcategory(yytext); }

<lexical C declarations>= (U->) [D->]
extern char *strsave(char *s);

<grammatical declarations>+= (U->) [<-D->]
%token IDENT
%type <string> IDENT

<grammatical rules>= (U->) [D->]
ident   : IDENT
        ;

Defines ident (links are to index).

<grammatical declarations>+= (U->) [<-D->]
%type <string> ident

<grammatical C declarations>+= (U->) [<-D]
extern char yytext[];

2. Numbers

I have to use two different ScaleFactor definitions so I can tell the difference between long and short reals. Notice that the scale factor is not optional for the long real.

<lexical rules>+= (U->) [<-D->]
<INITIAL>{digit}+|{digit}({hexDigit}*)[Hh]      { yylval.string = strsave(yytext); 
                                                  return INTEGER; }
<INITIAL>{digit}+"."{digit}*{EScaleFactor}?     { yylval.string = strsave(yytext); 
                                                  return REAL; }
<INITIAL>{digit}+"."{digit}*{DScaleFactor}      { yylval.string = strsave(yytext); 
                                                  return LONGREAL; }

<grammatical declarations>+= (U->) [<-D->]
%token <string> INTEGER REAL LONGREAL

I permit lower case where Wirth insists on upper case. This will be convenient later on. Besides, Hanson does it.

I need parentheses around things like EScaleFactor because I'n not really defining a regular expression---I'm using a macro facility.

<lexical definitions>+= (U->) [<-D->]
hexDigit        [0-9A-Fa-f]
EScaleFactor    ([eE][+\-]?{digit}+)
DScaleFactor    ([dD][+\-]?{digit}+)

3,4. Strings and character constants

Single character strings "x" become character constants, not strings, thanks to the lex disambiguation rules.

<lexical rules>+= (U->) [<-D->]
<INITIAL>"\'"{nonquote}"\'"         { yylval.string=strsave(yytext); return CHARCONST; }
<INITIAL>{digit}{hexDigit}*[Xx]     { yylval.string=strsave(yytext); return CHARCONST; }
<INITIAL>"\""{nondoublequote}*"\""  { yylval.string=strsave(yytext); return STRING; }

<grammatical declarations>+= (U->) [<-D->]
%token <string> CHARCONST STRING

The character set for string literals is awkward, because I want to include backslash escapes. I use the ANSI standard backslash escapes from section A2.5.2 of the second edition of Kernighan and Ritchie. Because the lexical analyzer is probably not the right place to handle illegal backslash escapes, I allow any reasonable character to follow the backslash. I define nonoctalx to be those characters that can't start an octal or hexadecimal constant (when following a backslash). Then I can recognize octal and hexdecimal character constants like "\012". I don't insist that at least one hexDigit follow \x, because again that should be handled downstream of the lexical analyzer.

<lexical definitions>+= (U->) [<-D->]
plainnonquote        [ \t\]\"!#$%&()*+,\-./0-9:;<=>?@A-Z[^_`a-z{|}~]
plainnondoublequote  [ \t\]\'!#$%&()*+,\-./0-9:;<=>?@A-Z[^_`a-z{|}~]
escapedchar          (\\({nonoctalx}|{octal}{octal}?{octal}?|x{hexDigit}*))
nonoctalx            [ \]\'\"!#$%&()*+,\-./89:;<=>?@A-Z[\\^_`a-wyz{|}~]
nonquote             ({plainnonquote}|{escapedchar})
nondoublequote       ({plainnondoublequote}|{escapedchar})
octal                [0-7]

I also need to handle strings that don't terminate. When I see one, I gobble up the whole line on which it sits---that should make it easier for the parser to recover. (The alternative is trying to tokenize the characters following the open quote.)

<lexical rules>+= (U->) [<-D->]
<INITIAL>("\""{nondoublequote}*|"\'"{nonquote}*) <complain; return bad string>

I print the first few characters of a nonterminated string, followed by an ellipsis. I return the string anyway because that way there's a chance that the parser can just ignore the error.

<complain; return bad string>= (<-U)
{   yyerror("Unterminated string %.8s%s",yytext,yyleng>8?"...":"");
    yylval.string=strsave("");
    return STRING;
}

I include prototypes for the string functions, to keep lcc -A from complaining about missing prototypes.

<common include files>= (U-> U->) [D->]
#include <string.h>

5. Operators and delimiters

Here are %token declarations for all the multicharacter tokens, including the reserved words. I use yyBEGIN because BEGIN means something special to lex.

<grammatical declarations>+= (U->) [<-D->]
%token ARROW INC DEC LSHIFT RSHIFT LEQ GEQ EQ NEQ AND OR 
                /* -> ++ -- << >> <= >= == != && || */

I make sure the lexer always returns strings for identifiers and reserved words.

Recognizing the operators and delimiters is straightforward:

<lexical rules>+= (U->) [<-D->]
<INITIAL>"->"                           return ARROW;
<INITIAL>"++"                           return INC;
<INITIAL>"--"                           return DEC;
<INITIAL>"<<"                           return LSHIFT;
<INITIAL>">>"                           return RSHIFT;
<INITIAL>"<="                           return LEQ;
<INITIAL>">="                           return GEQ;
<INITIAL>"=="                           return EQ;
<INITIAL>"!="                           return NEQ;
<INITIAL>"&&"                           return AND;
<INITIAL>"||"                           return OR;
<INITIAL>[\]+!\-*/~&.,;|()[{}^=#<>:]    return *yytext;

Reserved word search

Recall that, instead of having the lex-generated automaton recognize reserved words, I wanted to look up each identifier to see if it is a reserved word. I put the reserved words into an array and use binary search to find their categories. A word that isn't in the list has category IDENT.

The list itself is excruciating to read. I use a trick I got from Dave Hanson---I put the list in a header file as calls to the kw (keyword) macro. Then I include the header with appropriate macros wherever I need it.

<list of reserved words>= (U->)
kw(INT, "int")

A binary search table of reserved words:

<reserved word data structures>= (U->)
static 
struct reserved {char *s; int category;} 
reservedwords[] = {
#define kw(VAL,STRING) {STRING,VAL},
<list of reserved words>
#undef kw
};
static int numreservedwords = (sizeof(reservedwords)/sizeof(struct reserved));

Defines kw, numreservedwords (links are to index).

idcategory is just binary search.

<lexical C functions>= (U->)
<reserved word data structures>
static int idcategory (char *id) {
    int low=0, high=numreservedwords-1, mid;
    int compare;
    while (low <= high) {
        /* Invariant: if id is in the initial range low...high,
                      it is in the current range low...high */
        mid = (low+high)/2;             /* note low <= mid <= high */
        compare = strcmp(id,reservedwords[mid].s);
        if (compare>0) low = mid + 1;
        else if (compare<0) high = mid - 1;
        else return reservedwords[mid].category;
    }
    return IDENT;               /* id is not a reserved word */
}

Defines idcategory (links are to index).

<lexical C declarations>+= (U->) [<-D]
static int idcategory(char *);

Comments

I use the standard trick of defining a special state just for comments. A begin comment sends the lexer into state <COMMENT>, and an end comment returns it to state <INITIAL>. All tokens that aren't comments are recognized only in state <INITIAL>, which explains the offensive <INITIAL> prepended to every rule.

<lexical definitions>+= (U->) [<-D]
%S COMMENT

<lexical rules>+= (U->) [<-D->]
<INITIAL>"/*"   BEGIN COMMENT;
<COMMENT>"*/"   BEGIN INITIAL;
<COMMENT>"\n"   ;
<COMMENT>.      ;

Whitespace and bad characters

<lexical rules>+= (U->) [<-D]
<INITIAL>[ \t\n]+               ; /* ignore whitespace */
<INITIAL>.                      <complain about a bad character>

The error message we print is different for printable and nonprintable characters.

<complain about a bad character>= (<-U)
{   if (*yytext >= ' ' && *yytext < 0177)
         yyerror("bad character `%c'", *yytext);
    else yyerror("bad character `\\%03o'", *yytext);
}

8. Expressions

8.1 Operands (designators and constants)

There are no qualified identifiers, so this simplifies the parsing of designators. It is a bit awkward to distinguish variables and parameters from constant identifiers. There is also an awkwardness with rvalues---boolean expressions have to be converted from ``test'' to values using BOOL. Following a suggestion of Hanson's, I use three nonterminals: exp is an expression (possibly a test); rvalue is an rvalue (never a test), and lvalue is an lvalue. I introduce complex_lvalue because I need to distinguish identifiers from all other lvalues (otherwise I get a reduce-reduce conflict when converting lvalues to expressions).

<grammatical declarations>+= (U->) [<-D->]
%type <exp> arraydes lvalue complex_lvalue rvalue exp

Here are productions for all the C literals. I use make_constval(type,string) to convert a string to a value of the type desired.

I issue warnings for long reals, since they aren't supported in Oberon/320.

<grammatical rules>+= (U->) [<-D->]
exp     : INTEGER       { $$ = make_constval(integer_type,$1); }
        | REAL          { $$ = make_constval(real_type,$1); }
        | LONGREAL      { warning("Long reals not supported (replaced with zero)");
                          $$ = make_constval(real_type,strsave("0.0")); }
        | CHARCONST     { $$ = make_constval(char_type,$1); }
        | STRING        { $$ = make_constval(build_array(0,stringchar_type),$1); }
        ;

Defines exp (links are to index).

8.2 Operators

I use yacc precedence declarations.

These declarations define precedence. My task is much simplified because unary and binary - have exactly the same precedence.

<grammatical declarations>+= (U->) [<-D->]
%left OR
%left AND
%left '|'
%left '^'
%left '&'
%left EQ NEQ
%left '<' '>' LEQ GEQ
%left LSHIFT RSHIFT
%left '+' '-' 
%left '*' '/' '%'
%right '!' '~' INC DEC                          /* unary operators */
%left ARROW '.'

binop checks types and generates intermediate code. Consult the chapter on typechecking for the description of the correct operation of binop and the meanings of various permissions p_xxx.

<grammatical rules>+= (U->) [<-D->]
complex_lvalue  : lvalue '.' ident              { $$ = find_field($1,$3); }
                | exp ARROW ident               { $$ = find_field(deref($1),$3) }
                | '*' rvalue                    { $$ = deref($1); }
                | arraydes ']'                  { $$ = $1; }
                ;

arraydes        : rvalue '[' exp                { $$ = subscript($1,$3); }
                ;

lvalue          : complex_lvalue                { $$ = $1; }
                | ident                         { $$ = lookup_lvalue($1); }
                ;

exp             : complex_lvalue                { $$.type=$1.type; 
                                                  $$.tree=tMEM($1.type->size,$1.tree); }
                | ident                         { $$ = lookup_exp($1); }
                ;

rvalue          : exp                           { $$ = boolval($1); }
                ;

Defines arraydes, exp, lvalue, rvalue (links are to index).

<grammatical rules>+= (U->) [<-D->]
exp             : exp EQ exp    { $$ = binop(OSeq, $1,$3,boolean_type,p_equality); }
                | exp NEQ exp   { $$ = binop(OSneq,$1,$3,boolean_type,p_equality); }
                | exp '<' exp   { $$ = binop(OSlt, $1,$3,boolean_type,p_relational); }
                | exp LEQ exp   { $$ = binop(OSleq,$1,$3,boolean_type,p_relational); }
                | exp '>' exp   { $$ = binop(OSgt, $1,$3,boolean_type,p_relational); }
                | exp GEQ exp   { $$ = binop(OSgeq,$1,$3,boolean_type,p_relational); }
                ;
exp             : exp '+' exp   { $$ = binop(OSplus, $1,$3,0,p_numeric); }
                | exp '-' exp   { $$ = binop(OSminus,$1,$3,0,p_numeric); }
                | exp OR exp    { $$ = binop(OSor,   $1,$3,0,p_boolean); }
                ;
exp             : exp '*' exp   { $$ = binop(OSmul,  $1,$3,0,p_numeric); }
                | exp AND exp   { $$ = binop(OSand,$1,$3,0,p_boolean); }
                ;
exp             : '(' exp ')'   { $$ = $2; }
                ;

Defines exp (links are to index).

Function calls

Calls to functions (procedures having a return type) may occur only as factors in the production given below. The bottom of p. 678 of the Oberon report makes it clear that the () are required even if the function has no parameters.

<not yet grammatical rules>=
exp             : ident ActualParameters        { $$ = fcall($1,$2); }
                ;

ActualParameters: '(' ExpList ')'               { $$ = $2; }
                | '(' ')'                       { $$ = 0; }
                ;

ExpList         : rvalue ',' ExpList            { $$ = explist($1,$3); }
                | rvalue                        { $$ = explist($1,0); }
                ;

Defines ActualParameters, exp, ExpList (links are to index).

<not yet grammatical declarations>=
%type <explist> ActualParameters ExpList

11. Compilation units

<grammatical rules>+= (U->) [<-D->]
module  : exp                                   { compile($1); }

Defines module (links are to index).

<grammatical declarations>+= (U->) [<-D]
%start module

Error recovery

Here are some simple error productions that might help the parser continue. The first four gobble up mangled declarations. The last two are stabs in the dark; I hope they give the parser a chance to recover from errors in statements and expressions.

<grammatical rules>+= (U->) [<-D]
exp             : error                 { $$.type = error_type; }

Defines exp (links are to index).

Putting it all together

Here are the necessary #include files:

<common include files>+= (U-> U->) [<-D]
#include <assert.h>
#include <stdio.h>
#include "types.h"
#include "predef.h"
#include "tree.h"
#include "typecheck.h"
#include "codegen.h"
#include "symbol.h"
#include "declarations.h"
#include "constants.h"
#include "errors.h"

There are no include files used exclusively by the parser. This is because the lexer sees y.tab.h, and so has to know everything the parser knows.

<grammatical include files>= (U->)

This is boilerplate for every lex specification ever written:

<lexer>=
%{
<common include files>
<lexical include files>
<lexical C declarations>
%}
<lexical definitions>
%%
<lexical rules>
%%
<lexical C functions>

And this is boilerplate for every yacc specification ever written:

<parser>=
%{
<common include files>
<grammatical include files>
<grammatical C declarations>
extern int yylex(void);
%}
<grammatical declarations>
%%
<grammatical rules>
%%
#define lint  /* keep lcc from barking about no reference to yaccpar_sccsid */
<grammatical C functions>

Defines lint (links are to index).

Indices

Chunks

<common include files>: D1, D2, U3, U4
<complain about a bad character>: U1, D2
<complain; return bad string>: U1, D2
<grammatical C declarations>: D1, D2, U3
<grammatical C functions>: D1, U2
<grammatical declarations>: D1, D2, D3, D4, D5, D6, D7, D8, D9, U10
<grammatical include files>: D1, U2
<grammatical rules>: D1, D2, D3, D4, D5, D6, U7
<lexer>: D1
<lexical C declarations>: D1, D2, U3
<lexical C functions>: D1, U2
<lexical definitions>: D1, D2, D3, D4, U5
<lexical include files>: D1, U2
<lexical rules>: D1, D2, D3, D4, D5, D6, D7, U8
<list of reserved words>: D1, U2
<not yet grammatical declarations>: D1
<not yet grammatical rules>: D1
<parser>: D1
<reserved word data structures>: D1, U2

Identifiers

ActualParameters: D1, U2
arraydes: U1, D2
exp: U1, U2, D3, D4, D5, D6, U7, D8
ExpList: D1, U2
idcategory: U1, D2, U3
ident: D1, U2, U3, U4
kw: U1, D2
lint: D1
lvalue: U1, D2
module: D1, U2
numreservedwords: D1, U2
print6: D1, U2
rvalue: U1, D2, U3

The yacc value stack

3. Vocabulary and representation

Reserved word search

8. Expressions

8.1 Operands (designators and constants)

Chunks

Identifiers

The `yacc` value stack