I've had several requests for a detailed look at how the Epoch
compiler parser works, so I figured I'd write up a summary of how everything fits together.
All Epoch programs begin their life in the entrypoint function. The compiler's entrypoint is fairly simple, but it contains a lot of extra stuff related to parsing the command line, opening files, and so on; we'll skip those details as they're fairly boring.
The relevant bit is a call to the function Parse, which is passed in a string containing the source code, and the length of that string. So let's look at Parse!
Parse : string code, integer len -> boolean success = false
// Discard the dummy token
string token = PeekToken(0)
while(token != "")
print("Error: code could not be parsed: " ; token ; " " ; PeekToken(1) ; " " ; PeekToken(2))
token = PeekToken(0)
success = true
The first thing we do is call Lex to perform lexical analysis
on the input. I'll ignore lexing for now; what's interesting about this function is its general shape, as seen inside the while loop.
Epoch's current compiler uses a fairly rudimentary recursive descent
parser. Instead of fancier techniques involving parser generators, grammars, and other trappings of academic parser work, it's just a hand-written set of functions that divide and conquer the input until everything is consumed.
The while loop in the Parse function is a pretty standard driver
for such a parser. As you can see, all it does is repeatedly call other functions, and if all of those calls return false, it will barf an error message. Note that error handling is basically non-existent in Epoch's compiler right now, and building out a good diagnostic system would make the code considerably lengthier.
One important thing to note about this loop is that the order in which functions are called is actually important. There are places in the Epoch grammar where very similar sequences of tokens might mean very different things in the code; the order in which we try to match the input token stream is critical to making sure we pick the right variation. In general, more specific forms should precede more general forms.
This function only really shows one major example, which is resolving conflicts between algebraic sum type
definitions and strong alias
definitions. In Epoch, we can write the following code:
type Example : integer | string
type Demo : integer
The first one is an algebraic sum type definition, and the second is a strong alias. The first basically means "anything of Example type can contain either an integer or a string, but not both at the same time." The second means "anything of Demo type acts like an integer but cannot be mixed with anything else that is integer-typed."
The parser needs to check for the more specific
form first, i.e. sum types. If we checked for the more generic
form first, i.e. strong aliases, we would create a problem: we could never correctly parse sum type definitions! This is because the general form would consume
the tokens for the first half of the sum type definition, look at them, see that they are a legal strong alias, and then continue forward. But the next thing it would encounter would be half
of a sum type definition, and it wouldn't know what to do with that.
So now we have some basic structure and ground rules for writing our parser. How does the rest work?
Essentially, it's all more of the same! Recursive descent has the nice property that almost all the code looks very similar, and it's easy to deduce what things are going on once you're used to the patterns. It's also easier to make very intelligent error messages, because you can inject hand-written code for every possible parsing scenario. Unfortunately I don't have any great examples of error diagnostics in the current code, so you'll have to use your imagination.
To see a concrete example, let's look at the code that handles strong type aliases:
ParseStrongAlias : -> boolean matched = false
string aliasname = PeekToken(0)
string basename = PeekToken(2)
OnCodeGenRegisterAlias((++GlobalAliasCounter), PoolString(aliasname), PoolString(basename))
matched = true
This is pretty short and sweet. Note that the default return value of the function is false
, meaning that it didn't find a match for what it was looking for.
The first thing we do is look for the token "type", since all strong aliases begin with that keyword. If that keyword isn't the current token, we are obviously not looking at a valid strong alias, so we can abort parsing and let someone else try and match.
Secondly, we look for the colon which is required to appear after
the name of the type being defined (see above for an example of a strong alias definition). Simple enough.
Once that's done, we throw away a token (which we know has to be the "type" keyword). Then we store off two tokens: one for the type's name, and one for the type's base
. Then we consume three tokens (remember the colon that appears in there!). Lastly, we call some external code that isn't part of the parser. This code is what rigs up the type alias in the program's AST, so that the alias can be used as part of type-checking the program later on in the compilation process.
And that's really about it. The entire parser is just a handful of these functions, that break down the grammatical rules of the language into smaller and smaller nested parts. So let's run through a simple Epoch program and see what parses what! (You can follow along here
if you like.)
We start life in the Parse function, as above. We then repeatedly try
to parse each of the code constructs that are allowed to live at the top level of a file. Think of it as asking a question for each function call.
Is it a global block? No.
Is it a sum type definition? No.
Is it a weak alias definition? No.
Is it a strong alias definition? No.
Is it a structure definition? No.
Is it a function?
Aha! Finally we get somewhere! This looks like a function. So let's walk into the ParseFunction() code and see what happens next:
Does it have parameters? No.
Does it have a return value? No.
Does it have any tags? No.
Does it have a code body?
Yes! So let's parse a code block! Now we're in the ParseCodeBlock() function.
ParseCodeBlock is similar to Parse in that it contains a while loop; it will continue parsing until it finds the } that ends the current code block.
Is it a control entity, such as an if statement or while loop? No.
Is it a preincrement/decrement statement? No.
Is it a postincrement/decrement statement? No.
Is it a regular statement, i.e. a function call? Yes!
And therefore we can skip checking if it is a variable initialization or assignment.
ParseStatement() is where we are now. Ignoring some unrelated complexity for templates and whatnot, this is basically the same pattern yet again: a while loop, parsing until we reach the closing ) that represents the end of the function call.
Is it a valid expression? Yes!
So let's move on to ParseExpression(). This one gets a bit hairy, so let's ignore the details and just focus on the general principle: ParseExpression finds "Hello world!" as a string literal, and knows that this is the parameter to the function we're trying to call.
In this case, the function we're calling is print. So we pair up the function name with its parameters, hand those off to the AST construction logic, and wander back up the call stack.
We return from ParseStatement() and return to ParseCodeBlock(). There's only one statement, so ParseCodeBlock() returns. There's the end of the function, so ParseFunction() returns. There's no more code to look at, so Parse() returns.
And there you have it!