The compiler I referenced is from Chapter 17 of the textbook and the PL/0 simple compilation system structure in the experimental website attachments.
PL/0 is a compile-and-interpret system: the compilation part generates PCODE instructions as target code. Since I had not decided on a target code early on, when choosing a reference compiler I mainly looked at the front-end architecture (lexical analysis, syntax analysis, and semantic analysis generating intermediate code). PL/0 uses one-pass scanning with syntax analysis at its core: during syntax analysis it calls the lexical analyzer to fetch tokens, performs semantic analysis at the same time, and finally generates target code. It also performs semantic checking and transfers to the error-handling procedure when an error is detected.
In lexical analysis, the `getsym` subroutine skips all spaces, reads the next word, and classifies it via the `wsym` or `ssym` symbol tables;
Syntax analysis is the core program: PL/0 calls the other subroutines from it, and the syntax analyzer itself consists of multiple recursive subroutines;
In error handling, `error(n : integer)` is used to output error messages;
In symbol table management, the following global variable serves as the symbol table (note: `prosedure` is spelled this way in the original Pascal source because `procedure` is a reserved word):

```pascal
table: array[0..txmax] of record
         name: alfa;
         case kind: objecttyp of
           constant: (val: integer);
           variable, prosedure: (level, adr: integer)
       end;
```
The compiler files are organized as follows (only showing C++ source code files .cpp, .h):
.
├── _debug.h
├── ErrorHandler.cpp
├── ErrorHandler.h
├── exceptions
│ ├── FileIOError.h
│ └── ParseEndError.h
├── ICTranslator.cpp
├── ICTranslator.h
...
The compiler is divided into four stages: lexical analysis; syntax analysis and AST construction; error handling and intermediate code generation; and MIPS target-code generation:
```cpp
int main() {
    if (!input.is_open())
        throw FileIOError("ERROR IN OPENING FILE 'testfile.txt'!");
    if (!normalOutput.is_open())
        throw FileIOError("ERROR IN OPENING FILE 'printAll.txt'");
    // lexical analyzer
    auto *lexer = new Lexer();
    std::vector<Token *> &tokens = lexer->parse();
    // syntax analysis, generate abstract syntax tree
    auto *parser = new Parser(tokens);
    Node *root = parser->parse();
    delete lexer;
    delete parser;
#ifdef STAGE_GRAMMAR_ANALYSIS
    grammarItemOutput(root);
    normalOutput << std::flush;
#endif
    // error handler, generate intermediate code
    auto *errorHandler = new ErrorHandler(root);
    errorHandler->check();
    ICTranslator *icTranslator = errorHandler->icTranslator;
#ifdef STAGE_ERROR_HANDLING
    auto it = errorLog.begin();
    while (it != errorLog.end()) {
        errorOutput << it->first << " " << it->second << "\n";
        ++it;
    }
    errorOutput << std::flush;
#endif
#ifdef STAGE_INTERMEDIATE_CODE
    icTranslator->output();
#endif
    // generate MIPS assembly code
    auto *mipsTranslator = new MipsTranslator(icTranslator);
#ifdef STAGE_MIPS
    mipsTranslator->translate();
#endif
    input.close();
    normalOutput.close();
    errorOutput.close();
    delete errorHandler;
    delete root;
    return 0;
}
```

The compiler is designed using an object-oriented approach:
- Lexical analysis calls the `parse()` method of the `Lexer` object, returning the token string `tokens`
- Syntax analysis calls the `parse()` method of the `Parser` object, taking the token string `tokens` as input and returning the root node `root` of the abstract syntax tree
- Error handling and intermediate code generation call the `check()` method of the `ErrorHandler` object; after it runs, the intermediate code is stored in its attribute `errorHandler->icTranslator`
- MIPS assembly generation calls the `translate()` method of the `MipsTranslator` object, which outputs MIPS code as it translates
Lexer.cpp/Lexer.h
- Lexical analyzer: reads in the source program's character string, extracts tokens, and returns the token string
- Attributes:
  - `std::ifstream &input`: input file
  - `std::ofstream &output`: output file
  - `std::string token`: the symbol string currently being read
  - `bool inMultLineComment`: whether inside a multi-line comment; if so, all characters read are skipped
  - `Symbol symbol`: category code of the most recently read token
- Methods:
  - `parse()`: start lexical analysis
  - `getSym()`: read one token
  - `print()`: output the current token and its category code (per the lexical-analysis assignment requirements)
  - `parseLine()`: perform lexical analysis on one line; called from `parse()`
ReservedWord.h
- Reserved word enumeration class
Symbol.h
- Category code enumeration class; some elements coincide with elements of the reserved-word enumeration class, and the mapping between them is established through a `const std::map`
- Forgot to clear the cached `token` after reading each token
- `/`, `/*` and `//` need further look-ahead to determine the token
- C++ syntax: `enum class` has no built-in `to_string` method; output requires a C++ STL `std::map<Symbol, std::string> symbol2outputString`
Parser.cpp/Parser.h
- Parser: takes the token string obtained from lexical analysis as input, performs recursive-descent syntax analysis, and returns the abstract syntax tree
- Attributes:
  - `std::vector<Token *> &tokens`: token string obtained from lexical analysis
  - `int tokenPos`: index of the token currently being parsed
  - `const int tokenLength`: number of tokens
  - `Token *curToken`: analogous to `std::string token` in lexical analysis; the token currently being read
- Methods:
  - `nextItem()`: increment `tokenPos` and read the next token into `curToken`
  - `parse()`: recursive-descent entry point
  - `parse_XXX()`: recursive-descent subroutine for syntax component XXX, e.g. `parse_CompUnit()`, `parse_Decl()`
  - `semicn_before_assign()`: determine whether a semicolon or an equals sign appears first; used while parsing `Stmt` to decide between `[Exp] ';'` and `LVal '=' Exp ';' | LVal '=' 'getint' '(' ')' ';'`
GrammarItem.h
- Syntax component enumeration class; likewise uses a `std::map<GrammarItem, std::string>` to convert enumeration values to strings for output
Token.h
- Stores the tokens in the token string obtained from lexical analysis. This class wasn't added during the lexical-analysis stage; for the convenience of syntax analysis and subsequent error handling, `Token` stores not only the token string but also its category code `Symbol` and its line number `lineNumber` in the source program.
Node.cpp/Node.h
- Node class in abstract syntax tree
- Attributes:
  - `std::vector<Node *> children`: child nodes of the multi-way tree
  - `Node *parent`: parent node (not actually used)
  - `int depth`: depth of the current node (not actually used)
  - `bool isLeaf`: whether this is a leaf node (could in fact be replaced by `token != nullptr`)
  - `GrammarItem grammarItem`: syntax component of the current node
  - `Token *token`: the corresponding token
Note: `grammarItem` is used by interior nodes (non-terminals) to store the non-terminal's syntax component; `token` is used by leaf nodes (terminals) to store the terminal's value, category, line number and other information.
- Methods: Similar to basic tree operations in data structures
Since we need to build the abstract syntax tree, the current syntax component must be recorded during recursive descent. Therefore, on entering each recursive subroutine, we first create the current node via `Node *xxx = new Node(GrammarItem::XXX, depth)`, then build up the syntax tree iteratively.
- Problem: building the syntax tree for left-recursive productions. For example, for `AddExp → MulExp | AddExp ('+' | '−') MulExp`, directly converting it to the EBNF form `AddExp → MulExp {('+' | '−') MulExp}` is inconsistent with the required output format; moreover, left recursion makes tree construction more complex. Using the same example: after reading the current syntax component, whenever components remain on the right-hand side, first package the left side (the current component) up one layer, then repeat the check. This builds a syntax tree consistent with the original grammar:

```cpp
/* AddExp → MulExp | AddExp ('+' | '−') MulExp   FIXME: left recursion */
/* After rewriting: AddExp → MulExp {('+' | '−') MulExp} */
Node *Parser::parse_AddExp(int depth) {
    Node *current = new Node(GrammarItem::AddExp, depth);
    Node *child = this->parse_MulExp(depth + 1);
    current->addChild(child);
    child->setParent(current);
    while (this->curToken->symbol == Symbol::PLUS ||
           this->curToken->symbol == Symbol::MINU) {
        // On finding components on the right side, first package the left
        // side up one layer.  FIXME: depth is wrong here
        Node *temp = new Node(GrammarItem::AddExp, depth);
        temp->addChild(current);
        current->setParent(temp);
        current = temp;
        // ('+' | '−')
        current->addChild(new Node(this->curToken, current, depth + 1));
        this->nextItem();
        // MulExp
        child = this->parse_MulExp(depth + 1);
        current->addChild(child);
    }
    // TODO: recursively fix current's depth
    return current;
}
```
Under this approach, the value stored in `depth` isn't necessarily accurate, but since it isn't used later, it was left unfixed.
- Problem with the token-reading function `nextItem()`: it initially threw a `ParseEndError` exception on reaching the end of the token string. This turned out to be very inconvenient, so it was changed to simply return when the end is reached.
The code architecture started getting messy from this point on (×)
Parser.cpp/Parser.h
Some errors must be handled during syntax analysis, otherwise the structure of the syntax tree will be wrong. For example, for "missing right parenthesis"-type errors, an "error node" is inserted at the position where the right parenthesis should have been; later, when the ErrorHandler encounters this error node, it outputs the error message. For example, handling a missing `)`:
```cpp
// ')'
if (this->curToken->symbol == Symbol::RPARENT) {
    funcDef->addChild(new Node(this->curToken, funcDef, depth + 1));
    this->nextItem();
} else { // FIXME: j => ErrorType::MissingRPARENT )
    int formerLineNum = this->tokens[tokenPos - 1]->lineNumber;
    funcDef->addChild(new ErrorNode(ErrorType::MissingRPARENT, formerLineNum));
}
```

ErrorHandler.cpp/ErrorHandler.h
- Error handler: handles all errors except those dealt with during syntax analysis above
- Attributes:
  - `Node *root`: root node of the syntax tree, i.e. `CompUnit`
  - `SymbolTable *currentTable`: current symbol table
- Methods:
  - `check_XXX()`: perform recursive descent again, like `Parser`, to find the errors in each syntax component
  - `findParamError()`: find mismatches between a function's formal parameters and actual arguments (two kinds: wrong number, wrong type)

```cpp
// Check the following two errors between the function definition
// definedEntry and the function call calledEntry:
//   FIXME: d => ErrorType::ParamNumNotMatch
//   FIXME: e => ErrorType::ParamTypeNotMatch
// The return value indicates whether an error was found
bool ErrorHandler::findParamError(SymbolTableEntry *definedEntry,
                                  std::vector<SymbolTableEntry *> *calledEntry,
                                  int lineNum) {
    auto size = calledEntry->size();
    if (definedEntry->funcParamsNum() != size) {
        errorLog.insert({lineNum,
                errorType2string.find(ErrorType::ParamNumNotMatch)->second});
        return true;
    }
    std::vector<FuncParam *> *definedFuncParams = definedEntry->getFuncParams();
    for (auto i = 0; i < size; ++i) {
        // hasSameType(SymbolTableEntry *realParam, FuncParam *funcParam)
        bool typeSame = SymbolTableEntry::hasSameType((*calledEntry)[i],
                                                      (*definedFuncParams)[i]);
        if (!typeSame) {
            errorLog.insert({lineNum,
                    errorType2string.find(ErrorType::ParamTypeNotMatch)->second});
            return true;
        }
    }
    return false;
}
```
  - `check_FormatString()`: check format strings:

```cpp
bool ErrorHandler::check_FormatString(Node *node, int *formatNum) {
    *formatNum = 0;
    std::string s = node->getToken()->value;
    bool hasIllegalChar = false;
    for (int i = 1; i < s.size() - 1; ++i) {  // check <FormatChar>
        auto now = s[i];
        auto next = s[i + 1];  // i < length - 1 ensures no out-of-bounds access
        if (now == '%' && (i >= s.size() - 2 || next != 'd')) {
            hasIllegalChar = true;
        } else if (now == '%' && next == 'd') {
            *formatNum = *formatNum + 1;
            ++i;  // resume checking after the two-character sequence
        } else if (now == '\\' && next == 'n') {
            ++i;
        } else if (now == '\\' && next != 'n') {
            hasIllegalChar = true;
        } else {
            int ascii = (int) ((unsigned char) now);
            if (!(ascii == 32 || ascii == 33 || (ascii >= 40 && ascii <= 126))) {
                hasIllegalChar = true;
            }
        }
    }
    return hasIllegalChar;
}
```
SymbolTable.cpp/SymbolTable.h
- Symbol table
- Attributes:
  - `bool isRoot`: whether this is the top-level symbol table
  - `SymbolTable *parent`: the enclosing symbol table
  - `std::vector<SymbolTable *> children`: nested symbol tables
  - `std::map<std::string, SymbolTableEntry *> name2symbolTableEntry`: all entries in the current symbol table
- Methods:
  - `nameExistedInCurrentTable`: whether a name is already defined in the current symbol table
  - `nameExistedInAllTables`: whether a name is already defined in any enclosing scope
  - `getEntryByNameFromAllTables`: look up a defined variable by name
SymbolTableEntry.cpp/SymbolTableEntry.h
- Symbol table entry
- Attributes:
  - `SymbolTableEntryType type`: entry type, divided into variables, constants, one-dimensional arrays, one-dimensional array constants, two-dimensional arrays...

```cpp
Node *node;
const bool isFuncFParam;  // function formal parameter
unsigned int defLineNum;
Var *var{nullptr};
VarConst *varConst{nullptr};
Array1 *array1{nullptr};
Array1Const *array1Const{nullptr};
Array2 *array2{nullptr};
Array2Const *array2Const{nullptr};
FunctionOfInt *functionOfInt{nullptr};
FunctionOfVoid *functionOfVoid{nullptr};
// Referenced entry
ReferencedEntry *tempEntry{nullptr};  // This type should not be saved in the symbol table!!!
// Its corresponding defining entry
SymbolTableEntry *definedEntry{nullptr};
```
For convenience, the classes corresponding to all these data types have been added to the symbol-table-entry class as pointer attributes. Here, `ReferencedEntry *tempEntry` represents a referenced variable rather than an actually defined one; in that case its `SymbolTableEntry *definedEntry` must be assigned, ensuring the entry can locate the actual definition point. Each of these data types also has its own class definition.
- Due to extensive use of pointers, numerous segmentation faults occurred. Here is the updated constructor and destructor for the two-dimensional array class `Array2`:
```cpp
class Array2 {
public:
    const int d1;  // -1 indicates a function parameter
    const int d2;
    int **values;
    Array2(int d1, int d2) : d1(d1), d2(d2) {
        values = new int *[d1 + 1];
        for (int i = 0; i < d1 + 1; i++) {
            values[i] = new int[d2 + 1]();
        }
    }
    ~Array2() {
        for (int i = 0; i < d1 + 1; i++) {
            delete[] values[i];
        }
        delete[] values;
        values = nullptr;
    }
};
```

This avoids null-pointer errors while also preventing memory leaks.
ICTranslator.cpp/ICTranslator.h
- Intermediate code translator
- Attributes:
  - `std::vector<ICEntry *> *mainEntries`: intermediate-code entries for the `main` function (global definitions and function definitions are stored separately)
  - `std::map<int, std::string *> *id2allPureString`: maps IDs to all "pure strings" (those without `%d`), simplifying MIPS code generation
  - `std::map<std::string *, ICItemFunc *> *name2icItemFunc`: maps function names to function definitions
- Methods:
  - `translate_XXX`: translates a specific construct, such as `translate_ConstVarDef`, `translate_printf`, `translate_FuncCall`, etc.
  - When translating, it must be determined whether the current function is `main` or a user-defined function:

```cpp
if (currentFunc != nullptr) {
    currentFunc->entries->push_back(new ICEntry(xxx));
} else {
    mainEntries->push_back(new ICEntry(xxx));
}
```
MipsTranslator.cpp/MipsTranslator.h
- MIPS code generator
- Attributes:

Below are the maps used to locate the memory location or starting address of each kind of data by ID:

```cpp
// Main function only: local variables
std::map<int, int> localVarId2mem;       // ID is negative
std::map<int, int> tempVarId2mem;        // ID is positive
std::map<int, int> localArrayId2mem;     // ID is positive
std::map<int, int> tempArrayId2mem;      // ID is negative
// Custom functions only: local variables, offsets relative to the
// current function stack's $sp
std::map<int, int> localVarId2offset;    // ID is negative
std::map<int, int> tempVarId2offset;     // ID is positive
std::map<int, int> localArrayId2offset;  // ID is positive
std::map<int, int> tempArrayId2offset;   // ID is negative
//---------------------
std::map<Reg, bool> regUsage;
std::map<Reg, int> reg2id;
std::map<int, int> funcFVarParamId2offset;
std::map<int, int> funcFArrayParamId2offset;
```
Generation Process:
Errors are handled directly in the ErrorHandler class, which also generates intermediate code, so error handling and code generation are tightly coupled: while checking for errors, the handler emits intermediate code built from the various ICxxx data types, such as ICItem, ICItemFunc, ICItemArray, etc.
Once the intermediate code is generated, the MipsTranslator class is used to produce the target code. The process begins by translating global values (including strings) in the .data segment:
```cpp
// Global variables and constants
while (mainStream->at(i)->entryType != ICEntryType::MainFuncStart) {
    ICEntry *defEntry = mainStream->at(i);
    // assert(defEntry->isGlobalVarOrConstDef());
    translate_GlobalVarOrArrayDef(defEntry);
    i++;
}
// Define string segments (pure string portions)
mipsOutput << "\n# string tokens: \n";
for (const auto &item : *id2allPureString) {
    const int id = item.first;
    const std::string *str = item.second;
    mipsOutput << strId2label(id) << ": .asciiz \"" << *str << "\"\n";
}
```

Next, the main function is translated:
```cpp
// Main function section
assert(mainStream->at(i)->entryType == ICEntryType::MainFuncStart);
i++;
mipsOutput << "\n\n.text 0x00400000\n\n# main function\n";
while (i < mainEntryNum) {
#ifdef MIPS_DEBUG
    mipsOutput << std::flush;
#endif
    ICEntry *entry = mainStream->at(i);
    ICItem *op1 = entry->operator1, *op2 = entry->operator2, *op3 = entry->operator3;
    const int opNum = entry->opNum;
    switch (entry->entryType) {
        case ICEntryType::VarDefine: {  // local variables
            // ...
        }
        // ...
    }
}
```

Numerous bugs appeared, particularly with arrays in the various parameter-passing cases:
- An array defined in the main function, pushed onto the stack within the main function
- An array defined in the main function, pushed onto the stack within a custom function
- An array defined in a custom function, pushed onto the stack within that function
- A globally defined array, pushed onto the stack within the main function
- A globally defined array, pushed onto the stack within a custom function
In each case, different elements are stored, base addresses vary, and the meaning of offsets relative to $sp differs.
Due to time constraints, only constant folding was implemented as an optimization. For example, the `lw` instruction is replaced with `li` when the value is a compile-time constant:
```cpp
// Non-LVal
if (var->isGlobal) {
    la(reg, var->toString());
    lw(reg, 0, reg);
    return;
}
if (var->isConst) {
    li(reg, var->value);
    return;
}
if (isFuncFParam(var)) {
    addr = funcFVarParamId2offset.find(var->varId)->second;
    if (whenPushingParamsRecursively) {
        addr += 30000;
    }
    lw(reg, addr, Reg::$sp);
    return;
}
if (inSelfDefinedFunc) {
    if (var->isTemp) {
        addr = tempVarId2offset.find(var->tempVarId)->second;
    } else {
        addr = localVarId2offset.find(var->varId)->second;
    }
    if (whenPushingParamsRecursively) {
        addr += 30000;
    }
    lw(reg, addr, Reg::$sp);
} else {
    if (var->isTemp) {
        addr = tempVarId2mem.find(var->tempVarId)->second;
    } else {
        addr = localVarId2mem.find(var->varId)->second;
    }
    lw(reg, addr, Reg::$zero);
}
```