* In this assignment I print intermediate code.
<grammatical C functions>= (U->) void print6 (Exp e) { extern int yylineno; printf("line %d:\n",yylineno); if (e.type==error_type) printf("# ERROR\n"); else printTree(e.tree,stdout); }
Definesprint6
(links are to index).
<grammatical C declarations>= (U->) [D->] void print6(Exp e);
<grammatical declarations>= (U->) [D->] %union { char *string; Type type; Product product; Symbol symbol; Exp exp; Explist explist; int bit; }
%token
declaration;
yacc chooses an integer representation of each such token.
These representations are made available to the lexer via y.tab.h
.
I use the standard trick from Kernighan and Pike (x.tab.h
instead of
y.tab.h
) to avoid remaking the lexer unecessarily.
<lexical include files>= (U->) #include "x.tab.h"
idcategory(id)
returns the category of the reserved word id
when id
is in fact a reserved word.
When id
is not a reserved word, idcategory(id)
returns IDENT
.
<lexical definitions>= (U->) [D->] letter [A-Za-z] digit [0-9]
The offensive <INITIAL>
below has to do with handling comments (q.v.).
Both identifiers and reserved words
are saved in the string table and put on the yacc value stack.
<lexical rules>= (U->) [D->] <INITIAL>{letter}({letter}|{digit})* { yylval.string = strsave(yytext); return idcategory(yytext); }
<lexical C declarations>= (U->) [D->] extern char *strsave(char *s);
<grammatical declarations>+= (U->) [<-D->] %token IDENT %type <string> IDENT
<grammatical rules>= (U->) [D->] ident : IDENT ;
Definesident
(links are to index).
<grammatical declarations>+= (U->) [<-D->] %type <string> ident
<grammatical C declarations>+= (U->) [<-D] extern char yytext[];
ScaleFactor
definitions
so I can tell the difference between long and short reals.
Notice that the scale factor is not optional for the long real.
<lexical rules>+= (U->) [<-D->] <INITIAL>{digit}+|{digit}({hexDigit}*)[Hh] { yylval.string = strsave(yytext); return INTEGER; } <INITIAL>{digit}+"."{digit}*{EScaleFactor}? { yylval.string = strsave(yytext); return REAL; } <INITIAL>{digit}+"."{digit}*{DScaleFactor} { yylval.string = strsave(yytext); return LONGREAL; }
<grammatical declarations>+= (U->) [<-D->] %token <string> INTEGER REAL LONGREAL
I permit lower case where Wirth insists on upper case. This will be convenient later on. Besides, Hanson does it.
I need parentheses around things like EScaleFactor
because I'n not
really defining a regular expression---I'm using a macro facility.
<lexical definitions>+= (U->) [<-D->] hexDigit [0-9A-Fa-f] EScaleFactor ([eE][+\-]?{digit}+) DScaleFactor ([dD][+\-]?{digit}+)
"x"
become character constants,
not strings, thanks to the lex disambiguation rules.
<lexical rules>+= (U->) [<-D->] <INITIAL>"\'"{nonquote}"\'" { yylval.string=strsave(yytext); return CHARCONST; } <INITIAL>{digit}{hexDigit}*[Xx] { yylval.string=strsave(yytext); return CHARCONST; } <INITIAL>"\""{nondoublequote}*"\"" { yylval.string=strsave(yytext); return STRING; }
<grammatical declarations>+= (U->) [<-D->] %token <string> CHARCONST STRING
The character set for string literals is awkward,
because I want to include backslash escapes.
I use the ANSI standard backslash escapes from section A2.5.2 of
the second edition of Kernighan and Ritchie.
Because the lexical analyzer is probably not the right place to
handle illegal backslash escapes, I allow any reasonable
character to follow the backslash.
I define nonoctalx
to be those characters
that can't start an octal or hexadecimal constant (when following
a backslash).
Then I can recognize octal and hexdecimal character constants like "\012"
.
I don't insist that at least one hexDigit
follow \x
,
because again that should be handled downstream of the lexical analyzer.
<lexical definitions>+= (U->) [<-D->] plainnonquote [ \t\]\"!#$%&()*+,\-./0-9:;<=>?@A-Z[^_`a-z{|}~] plainnondoublequote [ \t\]\'!#$%&()*+,\-./0-9:;<=>?@A-Z[^_`a-z{|}~] escapedchar (\\({nonoctalx}|{octal}{octal}?{octal}?|x{hexDigit}*)) nonoctalx [ \]\'\"!#$%&()*+,\-./89:;<=>?@A-Z[\\^_`a-wyz{|}~] nonquote ({plainnonquote}|{escapedchar}) nondoublequote ({plainnondoublequote}|{escapedchar}) octal [0-7]
I also need to handle strings that don't terminate. When I see one, I gobble up the whole line on which it sits---that should make it easier for the parser to recover. (The alternative is trying to tokenize the characters following the open quote.)
<lexical rules>+= (U->) [<-D->] <INITIAL>("\""{nondoublequote}*|"\'"{nonquote}*) <complain; return bad string>
I print the first few characters of a nonterminated string, followed by an ellipsis. I return the string anyway because that way there's a chance that the parser can just ignore the error.
<complain; return bad string>= (<-U) { yyerror("Unterminated string %.8s%s",yytext,yyleng>8?"...":""); yylval.string=strsave(""); return STRING; }
I include prototypes for the string functions, to keep lcc -A from complaining about missing prototypes.
<common include files>= (U-> U->) [D->] #include <string.h>
%token
declarations for all the multicharacter tokens,
including the reserved words.
I use yyBEGIN
because BEGIN
means something special to lex.
<grammatical declarations>+= (U->) [<-D->] %token ARROW INC DEC LSHIFT RSHIFT LEQ GEQ EQ NEQ AND OR /* -> ++ -- << >> <= >= == != && || */
I make sure the lexer always returns strings for identifiers and reserved words.
Recognizing the operators and delimiters is straightforward:
<lexical rules>+= (U->) [<-D->] <INITIAL>"->" return ARROW; <INITIAL>"++" return INC; <INITIAL>"--" return DEC; <INITIAL>"<<" return LSHIFT; <INITIAL>">>" return RSHIFT; <INITIAL>"<=" return LEQ; <INITIAL>">=" return GEQ; <INITIAL>"==" return EQ; <INITIAL>"!=" return NEQ; <INITIAL>"&&" return AND; <INITIAL>"||" return OR; <INITIAL>[\]+!\-*/~&.,;|()[{}^=#<>:] return *yytext;
IDENT
.
The list itself is excruciating to read.
I use a trick I got from Dave Hanson---I put the list in a header
file as calls to the kw
(keyword) macro.
Then I include the header with appropriate macros wherever I need it.
<list of reserved words>= (U->) kw(INT, "int")
A binary search table of reserved words:
<reserved word data structures>= (U->) static struct reserved {char *s; int category;} reservedwords[] = { #define kw(VAL,STRING) {STRING,VAL}, <list of reserved words> #undef kw }; static int numreservedwords = (sizeof(reservedwords)/sizeof(struct reserved));
Defineskw
,numreservedwords
(links are to index).
idcategory
is just binary search.
<lexical C functions>= (U->) <reserved word data structures> static int idcategory (char *id) { int low=0, high=numreservedwords-1, mid; int compare; while (low <= high) { /* Invariant: if id is in the initial range low...high, it is in the current range low...high */ mid = (low+high)/2; /* note low <= mid <= high */ compare = strcmp(id,reservedwords[mid].s); if (compare>0) low = mid + 1; else if (compare<0) high = mid - 1; else return reservedwords[mid].category; } return IDENT; /* id is not a reserved word */ }
Definesidcategory
(links are to index).
<lexical C declarations>+= (U->) [<-D] static int idcategory(char *);
<COMMENT>
, and an
end comment returns it to state <INITIAL>
.
All tokens that aren't comments are recognized only in state <INITIAL>
,
which explains the offensive <INITIAL>
prepended to every rule.
<lexical definitions>+= (U->) [<-D] %S COMMENT
<lexical rules>+= (U->) [<-D->] <INITIAL>"/*" BEGIN COMMENT; <COMMENT>"*/" BEGIN INITIAL; <COMMENT>"\n" ; <COMMENT>. ;
<lexical rules>+= (U->) [<-D] <INITIAL>[ \t\n]+ ; /* ignore whitespace */ <INITIAL>. <complain about a bad character>
The error message we print is different for printable and nonprintable characters.
<complain about a bad character>= (<-U) { if (*yytext >= ' ' && *yytext < 0177) yyerror("bad character `%c'", *yytext); else yyerror("bad character `\\%03o'", *yytext); }
BOOL
.
Following a suggestion of Hanson's, I use
three nonterminals: exp
is an expression (possibly a test);
rvalue
is an rvalue (never a test), and
lvalue
is an lvalue.
I introduce complex_lvalue
because I need to distinguish identifiers from
all other lvalues (otherwise I get a reduce-reduce conflict when converting
lvalues to expressions).
<grammatical declarations>+= (U->) [<-D->] %type <exp> arraydes lvalue complex_lvalue rvalue exp
Here are productions for all the C literals.
I use make_constval(type,string)
to convert a string to a value of
the type desired.
I issue warnings for long reals, since they aren't supported in Oberon/320.
<grammatical rules>+= (U->) [<-D->] exp : INTEGER { $$ = make_constval(integer_type,$1); } | REAL { $$ = make_constval(real_type,$1); } | LONGREAL { warning("Long reals not supported (replaced with zero)"); $$ = make_constval(real_type,strsave("0.0")); } | CHARCONST { $$ = make_constval(char_type,$1); } | STRING { $$ = make_constval(build_array(0,stringchar_type),$1); } ;
Definesexp
(links are to index).
These declarations define precedence.
My task is much simplified because unary and binary -
have
exactly the same precedence.
<grammatical declarations>+= (U->) [<-D->] %left OR %left AND %left '|' %left '^' %left '&' %left EQ NEQ %left '<' '>' LEQ GEQ %left LSHIFT RSHIFT %left '+' '-' %left '*' '/' '%' %right '!' '~' INC DEC /* unary operators */ %left ARROW '.'
binop
checks types and generates intermediate code.
Consult the chapter on typechecking for the description of the
correct operation of binop
and the meanings of
various permissions p_xxx
.
<grammatical rules>+= (U->) [<-D->] complex_lvalue : lvalue '.' ident { $$ = find_field($1,$3); } | exp ARROW ident { $$ = find_field(deref($1),$3) } | '*' rvalue { $$ = deref($1); } | arraydes ']' { $$ = $1; } ; arraydes : rvalue '[' exp { $$ = subscript($1,$3); } ; lvalue : complex_lvalue { $$ = $1; } | ident { $$ = lookup_lvalue($1); } ; exp : complex_lvalue { $$.type=$1.type; $$.tree=tMEM($1.type->size,$1.tree); } | ident { $$ = lookup_exp($1); } ; rvalue : exp { $$ = boolval($1); } ;
Definesarraydes
,exp
,lvalue
,rvalue
(links are to index).
<grammatical rules>+= (U->) [<-D->] exp : exp EQ exp { $$ = binop(OSeq, $1,$3,boolean_type,p_equality); } | exp NEQ exp { $$ = binop(OSneq,$1,$3,boolean_type,p_equality); } | exp '<' exp { $$ = binop(OSlt, $1,$3,boolean_type,p_relational); } | exp LEQ exp { $$ = binop(OSleq,$1,$3,boolean_type,p_relational); } | exp '>' exp { $$ = binop(OSgt, $1,$3,boolean_type,p_relational); } | exp GEQ exp { $$ = binop(OSgeq,$1,$3,boolean_type,p_relational); } ; exp : exp '+' exp { $$ = binop(OSplus, $1,$3,0,p_numeric); } | exp '-' exp { $$ = binop(OSminus,$1,$3,0,p_numeric); } | exp OR exp { $$ = binop(OSor, $1,$3,0,p_boolean); } ; exp : exp '*' exp { $$ = binop(OSmul, $1,$3,0,p_numeric); } | exp AND exp { $$ = binop(OSand,$1,$3,0,p_boolean); } ; exp : '(' exp ')' { $$ = $2; } ;
Definesexp
(links are to index).
()
are
required even if the function has no parameters.
<not yet grammatical rules>= exp : ident ActualParameters { $$ = fcall($1,$2); } ; ActualParameters: '(' ExpList ')' { $$ = $2; } | '(' ')' { $$ = 0; } ; ExpList : rvalue ',' ExpList { $$ = explist($1,$3); } | rvalue { $$ = explist($1,0); } ;
DefinesActualParameters
,exp
,ExpList
(links are to index).
<not yet grammatical declarations>= %type <explist> ActualParameters ExpList
<grammatical rules>+= (U->) [<-D->] module : exp { compile($1); }
Definesmodule
(links are to index).
<grammatical declarations>+= (U->) [<-D] %start module
<grammatical rules>+= (U->) [<-D] exp : error { $$.type = error_type; }
Definesexp
(links are to index).
<common include files>+= (U-> U->) [<-D] #include <assert.h> #include <stdio.h> #include "types.h" #include "predef.h" #include "tree.h" #include "typecheck.h" #include "codegen.h" #include "symbol.h" #include "declarations.h" #include "constants.h" #include "errors.h"
There are no include files used exclusively by the parser. This is because the lexer sees y.tab.h, and so has to know everything the parser knows.
<grammatical include files>= (U->)
This is boilerplate for every lex specification ever written:
<lexer>= %{ <common include files> <lexical include files> <lexical C declarations> %} <lexical definitions> %% <lexical rules> %% <lexical C functions>
And this is boilerplate for every yacc specification ever written:
<parser>= %{ <common include files> <grammatical include files> <grammatical C declarations> extern int yylex(void); %} <grammatical declarations> %% <grammatical rules> %% #define lint /* keep lcc from barking about no reference to yaccpar_sccsid */ <grammatical C functions>
Defineslint
(links are to index).