NCU CSIE: Compiler

SEMESTER PROJECT

The compiler design course is closely associated with a project. Three checkpoints are set up to help you make progress. At the final demonstration, you are required to show off a complete compiler. We will provide you with a grading sheet that details the grading policy. We urge to you to make continual progress in your project since this is one of the best ways to learn the course thoroughly.

First Checkpoint: The scanner. Due date:Apr. 12, beginning of class

At the first checkpoint, you will need to finish the scanner, using the scangen tool. You need to demonstrate your scanner with all the reserved words and symbols plus 10 correct names and comments. In addition, you need to use 10 incorrect symbols to demonstrate that your scanner can handle the incorrect names properly (e.g., issue some error messages). The output of your scanner should list each token in the input, together with the line number and the position of the first character of the token. In case the token is an identifier or a number, the string representing the token must also be printed. You do NOT need to print a copy of the output nor the program in order to save some paper. We will use on-line demonstration in the first checkpoint.
There are several test cases with which you may try your scanner.

You may use scangen to implement the scanner. There is a file for token definitions. If you choose to use lex or flex, you will have to modify the token definitions. A flex manual is also included in this web page.
The following issues should be addressed:

How does the scanner handle the string 10..20 ?

How does the scanner handle the string "abc""def" ?

How does the scanner handle run-away strings (or cross-line strings) and run-away comments?

How does the scanner handle the string 10.3E+5, 10.3E+BC, 10.3E+2.34, and 10.3EBC ? An acceptable solution to 10.3EBC could be two tokens: a real number 10.3 and an identifier EBC. But you may make your own decision.

How does the scanner handle a very long identifier? How long can an identifier be? What if the maximum is exceeded?

How does the scanner handle a very long string constant? How long can a string be? What if the maximum is exceeded?

How does the scanner handle reserved words? How do you use a table for this purpose?

How does the scanner handle illegal characters, such as !, [ and ], which do not appear in Ada/CS?

How does the scanner handle the string '''' ?

How does the scanner handle end-of-file?

How does the scanner recover from errors? What characters are deleted when errors occur?

A lex example file.

Second Checkpoint: The parser. Due date: May 25-31.

At the second checkpoint, you will need to finish the parser. You need to demonstrate your scanner with the test programs we will provide. As in the first checkpoint, you do NOT need to print a copy of the output nor the program. We will use on-line demonstration.
You may use llgen to implement the parser. There is a file for syntax definitions. If you choose to use yacc or bison, you will have to modify the syntax definitions. A bison manual is also included in this web page.
For this checkpoint, your parser should answer accept for correct test file and reject for incorrect test files. For incorrect input, the positions of the errors should be indicated. The parser should try to parse to the end of the input and catch as many errors as possible.
You need to address the following issues:

Does the parser issue good error messages, such as line numbers and character positions of errors and explanations of errors?

How does the parser recover from errors?

Are there any optimizations to the parse table, size reduction or speed improvement?

Third Checkpoint: The symbol table. Due date: June 14-15, beginning of class

At the third checkpoint, you will need to finish the symbol table and related routines. You need to show that your symbol table can correctly handle block entry and exit, procedure names, and parameters. As usual, do not print a copy of the program or the output. On-line demonstration is enough. You need to address the following issues:

Explain how the symbol table is organized, a hashed table, binary search list, or a linear list, etc.
What does the compiler do when it encounters an undeclared variable, an undeclared type name, an undeclared function or procedure, an undeclared enumeration literal, an undeclared constant, etc.? The compiler should try to avoid spurious error messages.
Do you use action-controlled semantic stack or parser-controlled semantic stack?
You need to add action routines to print a message whenever a new scope is created or closed.
You need to add action routines to print a message whenever a global simple variable (that is, variables of type integer or real) is declared.
You need to add action routines to print a message whenever a local simple variable (that is, variables of type integer or real) is declared in a function.
You need to add action routines to print a message whenever a loop index variable is implicitly declared.

Final Checkpoint: A complete compiler. June 14-15, beginning of class

At the last checkpoint, you will have the chance to show off your complete compiler. We will schedule an interview with each students and test the features that you have implemented. You need to bring your grade sheet to the interview. Please do not print your program or output. We will give you a grade for the project during the interview. In the project, you will need to implement a minimum set of features, which will be announced later in the class. If you implement extra features, you will earn extra points.
Here is a list of semantic checks:

All identifiers are declared.

No identifiers are multiply-defined.

Types in expressions are correct with respect to the type rules.

Imported names are handled properly. Name conflicts are resolved correctly.

Inheritance relation among classes are properly formed in OO languages.

Methods in a class are not multiply defined.

All declared methods are implemented.

Abstract methods are not implemented in abstract classes.

Public/protected/private rules are observed.

The target machine may be the pseudo machine defined in the appendix of the textbook. There is a simulator called testit in the distribution directory. You may also generate code for the Java virtual machine, a manual of which is included in the next section, and use any web browser to run it.