• Announcements

    • khawk

      Download the Game Design and Indie Game Marketing Freebook   07/19/17

      GameDev.net and CRC Press have teamed up to bring a free ebook of content curated from top titles published by CRC Press. The freebook, Practices of Game Design & Indie Game Marketing, includes chapters from The Art of Game Design: A Book of Lenses, A Practical Guide to Indie Game Marketing, and An Architectural Approach to Level Design. The GameDev.net FreeBook is relevant to game designers, developers, and those interested in learning more about the challenges in game development. We know game development can be a tough discipline and business, so we picked several chapters from CRC Press titles that we thought would be of interest to you, the GameDev.net audience, in your journey to design, develop, and market your next game. The free ebook is available through CRC Press by clicking here. The Curated Books The Art of Game Design: A Book of Lenses, Second Edition, by Jesse Schell Presents 100+ sets of questions, or different lenses, for viewing a game’s design, encompassing diverse fields such as psychology, architecture, music, film, software engineering, theme park design, mathematics, anthropology, and more. Written by one of the world's top game designers, this book describes the deepest and most fundamental principles of game design, demonstrating how tactics used in board, card, and athletic games also work in video games. It provides practical instruction on creating world-class games that will be played again and again. View it here. A Practical Guide to Indie Game Marketing, by Joel Dreskin Marketing is an essential but too frequently overlooked or minimized component of the release plan for indie games. A Practical Guide to Indie Game Marketing provides you with the tools needed to build visibility and sell your indie games. With special focus on those developers with small budgets and limited staff and resources, this book is packed with tangible recommendations and techniques that you can put to use immediately. As a seasoned professional of the indie game arena, author Joel Dreskin gives you insight into practical, real-world experiences of marketing numerous successful games and also provides stories of the failures. View it here. An Architectural Approach to Level Design This is one of the first books to integrate architectural and spatial design theory with the field of level design. The book presents architectural techniques and theories for level designers to use in their own work. It connects architecture and level design in different ways that address the practical elements of how designers construct space and the experiential elements of how and why humans interact with this space. Throughout the text, readers learn skills for spatial layout, evoking emotion through gamespaces, and creating better levels through architectural theory. View it here. Learn more and download the ebook by clicking here. Did you know? GameDev.net and CRC Press also recently teamed up to bring GDNet+ Members up to a 20% discount on all CRC Press books. Learn more about this and other benefits here.
  • entries
    625
  • comments
    1446
  • views
    1006392

Breaking down the Epoch parser

Sign in to follow this  
Followers 0
ApochPiQ

793 views

I've had several requests for a detailed look at how the Epoch compiler parser works, so I figured I'd write up a summary of how everything fits together.

All Epoch programs begin their life in the entrypoint function. The compiler's entrypoint is fairly simple, but it contains a lot of extra stuff related to parsing the command line, opening files, and so on; we'll skip those details as they're fairly boring.

The relevant bit is a call to the function Parse, which is passed in a string containing the source code, and the length of that string. So let's look at Parse!
Parse : string code, integer len -> boolean success = false{ Lex(code, len) // Discard the dummy token PopToken() IREnterProgram() string token = PeekToken(0) while(token != "") { if(ParseGlobalBlock()) { } elseif(ParseSumType()) { } elseif(ParseWeakAlias()) { } elseif(ParseStrongAlias()) { } elseif(ParseStructure()) { } elseif(!ParseFunction()) { print("Error: code could not be parsed: " ; token ; " " ; PeekToken(1) ; " " ; PeekToken(2)) return() } token = PeekToken(0) } IRExitProgram() success = true}
The first thing we do is call Lex to perform lexical analysis on the input. I'll ignore lexing for now; what's interesting about this function is its general shape, as seen inside the while loop.

Epoch's current compiler uses a fairly rudimentary recursive descent parser. Instead of fancier techniques involving parser generators, grammars, and other trappings of academic parser work, it's just a hand-written set of functions that divide and conquer the input until everything is consumed.

The while loop in the Parse function is a pretty standard driver for such a parser. As you can see, all it does is repeatedly call other functions, and if all of those calls return false, it will barf an error message. Note that error handling is basically non-existent in Epoch's compiler right now, and building out a good diagnostic system would make the code considerably lengthier.

One important thing to note about this loop is that the order in which functions are called is actually important. There are places in the Epoch grammar where very similar sequences of tokens might mean very different things in the code; the order in which we try to match the input token stream is critical to making sure we pick the right variation. In general, more specific forms should precede more general forms.

This function only really shows one major example, which is resolving conflicts between algebraic sum type definitions and strong alias definitions. In Epoch, we can write the following code:
type Example : integer | stringtype Demo : integer
The first one is an algebraic sum type definition, and the second is a strong alias. The first basically means "anything of Example type can contain either an integer or a string, but not both at the same time." The second means "anything of Demo type acts like an integer but cannot be mixed with anything else that is integer-typed."

The parser needs to check for the more specific form first, i.e. sum types. If we checked for the more generic form first, i.e. strong aliases, we would create a problem: we could never correctly parse sum type definitions! This is because the general form would consume the tokens for the first half of the sum type definition, look at them, see that they are a legal strong alias, and then continue forward. But the next thing it would encounter would be half of a sum type definition, and it wouldn't know what to do with that.


So now we have some basic structure and ground rules for writing our parser. How does the rest work?

Essentially, it's all more of the same! Recursive descent has the nice property that almost all the code looks very similar, and it's easy to deduce what things are going on once you're used to the patterns. It's also easier to make very intelligent error messages, because you can inject hand-written code for every possible parsing scenario. Unfortunately I don't have any great examples of error diagnostics in the current code, so you'll have to use your imagination.

To see a concrete example, let's look at the code that handles strong type aliases:
ParseStrongAlias : -> boolean matched = false{ if(!PeekWithExpectation(0, "type")) { return() } if(!PeekWithExpectation(2, ":")) { return() } PopToken() string aliasname = PeekToken(0) string basename = PeekToken(2) PopTokens(3) OnCodeGenRegisterAlias((++GlobalAliasCounter), PoolString(aliasname), PoolString(basename)) matched = true}
This is pretty short and sweet. Note that the default return value of the function is false, meaning that it didn't find a match for what it was looking for.

The first thing we do is look for the token "type", since all strong aliases begin with that keyword. If that keyword isn't the current token, we are obviously not looking at a valid strong alias, so we can abort parsing and let someone else try and match.

Secondly, we look for the colon which is required to appear after the name of the type being defined (see above for an example of a strong alias definition). Simple enough.

Once that's done, we throw away a token (which we know has to be the "type" keyword). Then we store off two tokens: one for the type's name, and one for the type's base. Then we consume three tokens (remember the colon that appears in there!). Lastly, we call some external code that isn't part of the parser. This code is what rigs up the type alias in the program's AST, so that the alias can be used as part of type-checking the program later on in the compilation process.


And that's really about it. The entire parser is just a handful of these functions, that break down the grammatical rules of the language into smaller and smaller nested parts. So let's run through a simple Epoch program and see what parses what! (You can follow along here if you like.)

entrypoint :{ print("Hello world!")}
We start life in the Parse function, as above. We then repeatedly try to parse each of the code constructs that are allowed to live at the top level of a file. Think of it as asking a question for each function call.

Is it a global block? No.
Is it a sum type definition? No.
Is it a weak alias definition? No.
Is it a strong alias definition? No.
Is it a structure definition? No.
Is it a function?

Aha! Finally we get somewhere! This looks like a function. So let's walk into the ParseFunction() code and see what happens next:

Does it have parameters? No.
Does it have a return value? No.
Does it have any tags? No.
Does it have a code body?

Yes! So let's parse a code block! Now we're in the ParseCodeBlock() function.

ParseCodeBlock is similar to Parse in that it contains a while loop; it will continue parsing until it finds the } that ends the current code block.

Is it a control entity, such as an if statement or while loop? No.
Is it a preincrement/decrement statement? No.
Is it a postincrement/decrement statement? No.
Is it a regular statement, i.e. a function call? Yes!
And therefore we can skip checking if it is a variable initialization or assignment.

ParseStatement() is where we are now. Ignoring some unrelated complexity for templates and whatnot, this is basically the same pattern yet again: a while loop, parsing until we reach the closing ) that represents the end of the function call.

Is it a valid expression? Yes!

So let's move on to ParseExpression(). This one gets a bit hairy, so let's ignore the details and just focus on the general principle: ParseExpression finds "Hello world!" as a string literal, and knows that this is the parameter to the function we're trying to call.

In this case, the function we're calling is print. So we pair up the function name with its parameters, hand those off to the AST construction logic, and wander back up the call stack.

We return from ParseStatement() and return to ParseCodeBlock(). There's only one statement, so ParseCodeBlock() returns. There's the end of the function, so ParseFunction() returns. There's no more code to look at, so Parse() returns.


And there you have it!

8
Sign in to follow this  
Followers 0


1 Comment


Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!


Register a new account

Sign in

Already have an account? Sign in here.


Sign In Now