The compiler I referenced is from Chapter 17 of the textbook and the PL/0 simple compilation system structure in the experimental website attachments.
PL/0 is a compile-and-interpret system: the compilation part generates PCODE instructions as target code. Since I had not decided on a target code early on, when choosing a reference compiler I mainly looked at the front-end architecture (lexical analysis, syntax analysis, and semantic analysis generating intermediate code). PL/0 uses one-pass scanning with syntax analysis at its core: during syntax analysis it calls the lexical analyzer to fetch tokens, performs semantic analysis at the same time, and finally generates target code. It also performs semantic checking and transfers to the error-handling procedure when an error is detected.
In lexical analysis, the `getsym` subroutine skips all spaces, reads the next word, and classifies it via the `wsym` or `ssym` symbol tables;
Syntax analysis is the core program: PL/0 calls the other subroutines from it, and the syntax analyzer itself consists of multiple recursive subroutines;
In error handling, `error(n : integer)` is used to output error messages;
In symbol table management, the following global variable serves as the symbol table (note: `prosedure` is spelled this way in the original Pascal source because `procedure` is a reserved word):

```pascal
table: array[0..txmax] of record
         name: alfa;
         case kind: objecttyp of
           constant: (val: integer);
           variable, prosedure: (level, adr: integer)
       end;
```
The compiler files are organized as follows (only showing C++ source code files .cpp, .h):
.
├── _debug.h
├── ErrorHandler.cpp
├── ErrorHandler.h
├── exceptions
│ ├── FileIOError.h
│ └── ParseEndError.h
├── ICTranslator.cpp
├── ICTranslator.h
...
The compiler is divided into four stages: lexical analysis; syntax analysis and AST construction; error handling and intermediate code generation; and MIPS target-code generation:
```cpp
int main() {
    if (!input.is_open())
        throw FileIOError("ERROR IN OPENING FILE 'testfile.txt'!");
    if (!normalOutput.is_open())
        throw FileIOError("ERROR IN OPENING FILE 'printAll.txt'");
    // lexical analyzer
    auto *lexer = new Lexer();
    std::vector<Token *> &tokens = lexer->parse();
    // syntax analysis, generate abstract syntax tree
    auto *parser = new Parser(tokens);
    Node *root = parser->parse();
    delete lexer;
    delete parser;
#ifdef STAGE_GRAMMAR_ANALYSIS
    grammarItemOutput(root);
    normalOutput << std::flush;
#endif
    // error handler, generate intermediate code
    auto *errorHandler = new ErrorHandler(root);
    errorHandler->check();
    ICTranslator *icTranslator = errorHandler->icTranslator;
#ifdef STAGE_ERROR_HANDLING
    auto it = errorLog.begin();
    while (it != errorLog.end()) {
        errorOutput << it->first << " " << it->second << "\n";
        ++it;
    }
    errorOutput << std::flush;
#endif
#ifdef STAGE_INTERMEDIATE_CODE
    icTranslator->output();
#endif
    // generate MIPS assembly code
    auto *mipsTranslator = new MipsTranslator(icTranslator);
#ifdef STAGE_MIPS
    mipsTranslator->translate();
#endif
    input.close();
    normalOutput.close();
    errorOutput.close();
    delete errorHandler;
    delete root;
    return 0;
}
```

The compiler is designed using an object-oriented approach:
- Lexical analysis calls the `parse()` method of the `Lexer` object, returning the token string `tokens`
- Syntax analysis calls the `parse()` method of the `Parser` object, taking the token string `tokens` as input and returning the root node `root` of the abstract syntax tree
- Error handling and intermediate code generation call the `check()` method of the `ErrorHandler` object; after it runs, the intermediate code is stored in its attribute `errorHandler->icTranslator`
- MIPS assembly generation calls the `translate()` method of the `MipsTranslator` object, which outputs MIPS code as it translates
Lexer.cpp/Lexer.h
- Lexical analyzer: reads in the source program's character string, extracts tokens, and returns the token string
- Attributes:
  - `std::ifstream &input`: input file
  - `std::ofstream &output`: output file
  - `std::string token`: the symbol string currently being read
  - `bool inMultLineComment`: whether inside a multi-line comment; if so, all characters read are skipped
  - `Symbol symbol`: category code of the most recently read token
- Methods:
  - `parse()`: start lexical analysis
  - `getSym()`: read one token
  - `print()`: output the current token and its category code (per the lexical-analysis assignment requirements)
  - `parseLine()`: perform lexical analysis on one line; called from `parse()`
ReservedWord.h
- Reserved word enumeration class
Symbol.h
- Category code enumeration class; some elements coincide with elements of the reserved-word enumeration class, and the mapping between them is established through a `const std::map`
- Forgot to clear the cached `token` after reading each token
- `/`, `/*` and `//` need further look-ahead to determine the token
- C++ syntax: `enum class` has no built-in `to_string` method; output requires a C++ STL `std::map<Symbol, std::string> symbol2outputString`
Parser.cpp/Parser.h
- Parser: takes the token string obtained from lexical analysis as input, performs recursive-descent syntax analysis, and returns the abstract syntax tree
- Attributes:
  - `std::vector<Token *> &tokens`: token string obtained from lexical analysis
  - `int tokenPos`: index of the token currently being parsed
  - `const int tokenLength`: number of tokens
  - `Token *curToken`: analogous to `std::string token` in lexical analysis; the token currently being read
- Methods:
  - `nextItem()`: increment `tokenPos` and read the next token into `curToken`
  - `parse()`: recursive-descent entry point
  - `parse_XXX()`: recursive-descent subroutine for syntax component XXX, e.g. `parse_CompUnit()`, `parse_Decl()`
  - `semicn_before_assign()`: determine whether a semicolon or an equals sign appears first; used while parsing `Stmt` to decide between `[Exp] ';'` and `LVal '=' Exp ';' | LVal '=' 'getint' '(' ')' ';'`
GrammarItem.h
- Syntax component enumeration class; likewise uses a `std::map<GrammarItem, std::string>` to convert enumeration values to strings for output
Token.h
- Stores the tokens in the token string obtained from lexical analysis. This class wasn't added during the lexical-analysis stage; for the convenience of syntax analysis and subsequent error handling, `Token` stores not only the token string but also its category code `Symbol` and its line number `lineNumber` in the source program.
Node.cpp/Node.h
- Node class in abstract syntax tree
- Attributes:
  - `std::vector<Node *> children`: child nodes of the multi-way tree
  - `Node *parent`: parent node (not actually used)
  - `int depth`: depth of the current node (not actually used)
  - `bool isLeaf`: whether this is a leaf node (could in fact be replaced by `token != nullptr`)
  - `GrammarItem grammarItem`: syntax component of the current node
  - `Token *token`: the corresponding token
Note: `grammarItem` is used by interior nodes (non-terminals) to store the non-terminal's syntax component; `token` is used by leaf nodes (terminals) to store the terminal's value, category, line number and other information.
- Methods: Similar to basic tree operations in data structures
Since we need to build the abstract syntax tree, the current syntax component must be recorded during recursive descent. Therefore, on entering each recursive subroutine, we first create the current node via `Node *xxx = new Node(GrammarItem::XXX, depth)`, then build up the syntax tree iteratively.
- Problem: building the syntax tree for left-recursive productions. For example, for `AddExp → MulExp | AddExp ('+' | '−') MulExp`, directly converting it to the EBNF form `AddExp → MulExp {('+' | '−') MulExp}` is inconsistent with the required output format; moreover, left recursion makes tree construction more complex. Using the same example: after reading the current syntax component, whenever components remain on the right-hand side, first package the left side (the current component) up one layer, then repeat the check. This builds a syntax tree consistent with the original grammar:

```cpp
/* AddExp → MulExp | AddExp ('+' | '−') MulExp   FIXME: left recursion */
/* After rewriting: AddExp → MulExp {('+' | '−') MulExp} */
Node *Parser::parse_AddExp(int depth) {
    Node *current = new Node(GrammarItem::AddExp, depth);
    Node *child = this->parse_MulExp(depth + 1);
    current->addChild(child);
    child->setParent(current);
    while (this->curToken->symbol == Symbol::PLUS ||
           this->curToken->symbol == Symbol::MINU) {
        // On finding components on the right side, first package the left
        // side up one layer.  FIXME: depth is wrong here
        Node *temp = new Node(GrammarItem::AddExp, depth);
        temp->addChild(current);
        current->setParent(temp);
        current = temp;
        // ('+' | '−')
        current->addChild(new Node(this->curToken, current, depth + 1));
        this->nextItem();
        // MulExp
        child = this->parse_MulExp(depth + 1);
        current->addChild(child);
    }
    // TODO: recursively fix current's depth
    return current;
}
```
Under this approach, the value stored in `depth` isn't necessarily accurate, but since it isn't used later, it was left unfixed.
- Problem with the token-reading function `nextItem()`: it initially threw a `ParseEndError` exception on reaching the end of the token string. This turned out to be very inconvenient, so it was changed to simply return when the end is reached.
The code architecture started getting messy from this point on (×)
Parser.cpp/Parser.h
Some errors must be handled during syntax analysis, otherwise the structure of the syntax tree will be wrong. For example, for "missing right parenthesis"-type errors, an "error node" is inserted at the position where the right parenthesis should have been; later, when the ErrorHandler encounters this error node, it outputs the error message. For example, handling a missing `)`:
```cpp
// ')'
if (this->curToken->symbol == Symbol::RPARENT) {
    funcDef->addChild(new Node(this->curToken, funcDef, depth + 1));
    this->nextItem();
} else { // FIXME: j => ErrorType::MissingRPARENT )
    int formerLineNum = this->tokens[tokenPos - 1]->lineNumber;
    funcDef->addChild(new ErrorNode(ErrorType::MissingRPARENT, formerLineNum));
}
```

ErrorHandler.cpp/ErrorHandler.h
- Error handler: handles all errors except those dealt with during syntax analysis above
- Attributes:
  - `Node *root`: root node of the syntax tree, i.e. `CompUnit`
  - `SymbolTable *currentTable`: current symbol table
- Methods:
  - `check_XXX()`: perform recursive descent again, like `Parser`, to find the errors in each syntax component
  - `findParamError()`: find mismatches between a function's formal parameters and actual arguments (two kinds: wrong number, wrong type)

```cpp
// Check the following two errors between the function definition
// definedEntry and the function call calledEntry:
//   FIXME: d => ErrorType::ParamNumNotMatch
//   FIXME: e => ErrorType::ParamTypeNotMatch
// The return value indicates whether an error was found
bool ErrorHandler::findParamError(SymbolTableEntry *definedEntry,
                                  std::vector<SymbolTableEntry *> *calledEntry,
                                  int lineNum) {
    auto size = calledEntry->size();
    if (definedEntry->funcParamsNum() != size) {
        errorLog.insert({lineNum,
                errorType2string.find(ErrorType::ParamNumNotMatch)->second});
        return true;
    }
    std::vector<FuncParam *> *definedFuncParams = definedEntry->getFuncParams();
    for (auto i = 0; i < size; ++i) {
        // hasSameType(SymbolTableEntry *realParam, FuncParam *funcParam)
        bool typeSame = SymbolTableEntry::hasSameType((*calledEntry)[i],
                                                      (*definedFuncParams)[i]);
        if (!typeSame) {
            errorLog.insert({lineNum,
                    errorType2string.find(ErrorType::ParamTypeNotMatch)->second});
            return true;
        }
    }
    return false;
}
```
  - `check_FormatString()`: check format strings:

```cpp
bool ErrorHandler::check_FormatString(Node *node, int *formatNum) {
    *formatNum = 0;
    std::string s = node->getToken()->value;
    bool hasIllegalChar = false;
    for (int i = 1; i < s.size() - 1; ++i) {  // check <FormatChar>
        auto now = s[i];
        auto next = s[i + 1];  // i < length - 1 ensures no out-of-bounds access
        if (now == '%' && (i >= s.size() - 2 || next != 'd')) {
            hasIllegalChar = true;
        } else if (now == '%' && next == 'd') {
            *formatNum = *formatNum + 1;
            ++i;  // resume checking after the two-character sequence
        } else if (now == '\\' && next == 'n') {
            ++i;
        } else if (now == '\\' && next != 'n') {
            hasIllegalChar = true;
        } else {
            int ascii = (int) ((unsigned char) now);
            if (!(ascii == 32 || ascii == 33 || (ascii >= 40 && ascii <= 126))) {
                hasIllegalChar = true;
            }
        }
    }
    return hasIllegalChar;
}
```
SymbolTable.cpp/SymbolTable.h
- Symbol table
- Attributes:
  - `bool isRoot`: whether this is the top-level symbol table
  - `SymbolTable *parent`: the enclosing symbol table
  - `std::vector<SymbolTable *> children`: nested symbol tables
  - `std::map<std::string, SymbolTableEntry *> name2symbolTableEntry`: all entries in the current symbol table
- Methods:
  - `nameExistedInCurrentTable`: whether a name is already defined in the current symbol table
  - `nameExistedInAllTables`: whether a name is already defined in any enclosing scope
  - `getEntryByNameFromAllTables`: look up a defined variable by name
SymbolTableEntry.cpp/SymbolTableEntry.h
- Symbol table entry
- Attributes:
  - `SymbolTableEntryType type`: entry type, divided into variables, constants, one-dimensional arrays, one-dimensional array constants, two-dimensional arrays...

```cpp
Node *node;
const bool isFuncFParam;  // function formal parameter
unsigned int defLineNum;
Var *var{nullptr};
VarConst *varConst{nullptr};
Array1 *array1{nullptr};
Array1Const *array1Const{nullptr};
Array2 *array2{nullptr};
Array2Const *array2Const{nullptr};
FunctionOfInt *functionOfInt{nullptr};
FunctionOfVoid *functionOfVoid{nullptr};
// Referenced entry
ReferencedEntry *tempEntry{nullptr};  // This type should not be saved in the symbol table!!!
// Its corresponding defining entry
SymbolTableEntry *definedEntry{nullptr};
```
For convenience, the classes corresponding to all these data types have been added to the symbol-table-entry class as pointer attributes. Here, `ReferencedEntry *tempEntry` represents a referenced variable rather than an actually defined one; in that case its `SymbolTableEntry *definedEntry` must be assigned, ensuring the entry can locate the actual definition point. Each of these data types also has its own class definition.
- Due to extensive use of pointers, numerous segmentation faults occurred. Here is the updated constructor and destructor for the two-dimensional array class `Array2`:
```cpp
class Array2 {
public:
    const int d1;  // -1 indicates a function parameter
    const int d2;
    int **values;
    Array2(int d1, int d2) : d1(d1), d2(d2) {
        values = new int *[d1 + 1];
        for (int i = 0; i < d1 + 1; i++) {
            values[i] = new int[d2 + 1]();
        }
    }
    ~Array2() {
        for (int i = 0; i < d1 + 1; i++) {
            delete[] values[i];
        }
        delete[] values;
        values = nullptr;
    }
};
```

This avoids null-pointer errors while also preventing memory leaks.
ICTranslator.cpp/ICTranslator.h
- Intermediate code translator
- Attributes:
  - `std::vector<ICEntry *> *mainEntries`: intermediate-code entries for the `main` function (global definitions and function definitions are stored separately)
  - `std::map<int, std::string *> *id2allPureString`: maps IDs to all "pure strings" (those without `%d`), simplifying MIPS code generation
  - `std::map<std::string *, ICItemFunc *> *name2icItemFunc`: maps function names to function definitions
- Methods:
  - `translate_XXX`: translates a specific construct, such as `translate_ConstVarDef`, `translate_printf`, `translate_FuncCall`, etc.
  - When translating, it must be determined whether the current function is `main` or a user-defined function:

```cpp
if (currentFunc != nullptr) {
    currentFunc->entries->push_back(new ICEntry(xxx));
} else {
    mainEntries->push_back(new ICEntry(xxx));
}
```
MipsTranslator.cpp/MipsTranslator.h
- MIPS code generator
- Attributes:

Below are the maps used to locate the memory location or starting address of each kind of data by ID:

```cpp
// Main function only: local variables
std::map<int, int> localVarId2mem;       // ID is negative
std::map<int, int> tempVarId2mem;        // ID is positive
std::map<int, int> localArrayId2mem;     // ID is positive
std::map<int, int> tempArrayId2mem;      // ID is negative
// Custom functions only: local variables, offsets relative to the
// current function stack's $sp
std::map<int, int> localVarId2offset;    // ID is negative
std::map<int, int> tempVarId2offset;     // ID is positive
std::map<int, int> localArrayId2offset;  // ID is positive
std::map<int, int> tempArrayId2offset;   // ID is negative
//---------------------
std::map<Reg, bool> regUsage;
std::map<Reg, int> reg2id;
std::map<int, int> funcFVarParamId2offset;
std::map<int, int> funcFArrayParamId2offset;
```
Generation Process:
Errors are handled directly in the ErrorHandler class, which also generates intermediate code, so error handling and code generation are tightly coupled: while checking for errors, the handler emits intermediate code built from the various ICxxx data types, such as ICItem, ICItemFunc, ICItemArray, etc.
Once the intermediate code is generated, the MipsTranslator class is used to produce the target code. The process begins by translating global values (including strings) in the .data segment:
```cpp
// Global variables and constants
while (mainStream->at(i)->entryType != ICEntryType::MainFuncStart) {
    ICEntry *defEntry = mainStream->at(i);
    // assert(defEntry->isGlobalVarOrConstDef());
    translate_GlobalVarOrArrayDef(defEntry);
    i++;
}
// Define string segments (pure string portions)
mipsOutput << "\n# string tokens: \n";
for (const auto &item : *id2allPureString) {
    const int id = item.first;
    const std::string *str = item.second;
    mipsOutput << strId2label(id) << ": .asciiz \"" << *str << "\"\n";
}
```

Next, the main function is translated:
```cpp
// Main function section
assert(mainStream->at(i)->entryType == ICEntryType::MainFuncStart);
i++;
mipsOutput << "\n\n.text 0x00400000\n\n# main function\n";
while (i < mainEntryNum) {
#ifdef MIPS_DEBUG
    mipsOutput << std::flush;
#endif
    ICEntry *entry = mainStream->at(i);
    ICItem *op1 = entry->operator1, *op2 = entry->operator2, *op3 = entry->operator3;
    const int opNum = entry->opNum;
    switch (entry->entryType) {
        case ICEntryType::VarDefine: {  // local variables
            // ...
        }
        // ...
    }
}
```

Numerous bugs appeared, particularly with arrays in the various parameter-passing cases:
- An array defined in the main function, pushed onto the stack within the main function
- An array defined in the main function, pushed onto the stack within a custom function
- An array defined in a custom function, pushed onto the stack within that function
- A globally defined array, pushed onto the stack within the main function
- A globally defined array, pushed onto the stack within a custom function
In each case, different elements are stored, base addresses vary, and the meaning of offsets relative to $sp differs.
Due to time constraints, only constant folding was implemented as an optimization. For example, the `lw` instruction is replaced with `li` when the value is a compile-time constant:
```cpp
// Non-LVal
if (var->isGlobal) {
    la(reg, var->toString());
    lw(reg, 0, reg);
    return;
}
if (var->isConst) {
    li(reg, var->value);
    return;
}
if (isFuncFParam(var)) {
    addr = funcFVarParamId2offset.find(var->varId)->second;
    if (whenPushingParamsRecursively) {
        addr += 30000;
    }
    lw(reg, addr, Reg::$sp);
    return;
}
if (inSelfDefinedFunc) {
    if (var->isTemp) {
        addr = tempVarId2offset.find(var->tempVarId)->second;
    } else {
        addr = localVarId2offset.find(var->varId)->second;
    }
    if (whenPushingParamsRecursively) {
        addr += 30000;
    }
    lw(reg, addr, Reg::$sp);
} else {
    if (var->isTemp) {
        addr = tempVarId2mem.find(var->tempVarId)->second;
    } else {
        addr = localVarId2mem.find(var->varId)->second;
    }
    lw(reg, addr, Reg::$zero);
}
```