The Bag of Holding

Self-hosting the Epoch Compiler: Day Four

Posted 13 December 2013 · 643 views
Epoch, language design
Mostly hammering away on a variety of miscompiles and other tiny bugs tonight; I've squished a ton of problems but they all blur together and I can't clearly remember what all they were.

Almost everything at this stage manifests as the JIT engine vomiting when trying to turn the compiled Epoch program into LLVM bitcode (for eventual translation to machine code). Sadly, this means it takes a disproportionate amount of digging around in various bits of code to pin down the exact reason for a miscompile.

Something that's scaring me a lot is that many of the bugs I find in the compiled program are not reproducible in smaller test cases. That tends to suggest a nasty problem in the compiler itself, since it should theoretically always be possible to recreate a miscompile in a controlled experiment.

Without small test cases to validate that compiler bugs are really fixed, it's extremely hard to know if I'm actually making progress or just pushing the bugs around into different areas. Since it takes a couple of minutes to produce a compiler binary, there's a lot of downtime and context switching.

My brain is pretty tired today and I'm honestly not sure how much dedication I have tonight.

Some of the bugs are just so bizarre that I'm deciding to punt on them when possible; there'll be plenty of time in the future to hunt them down and hopefully get regression tests into the test suite. I'm pretty much only fixing things that reflect large-scale issues or that would take too much code to work around.

Unfortunately, it looks like one of my attempted fixes has actually had me going in circles, because it introduced a different bug that broke in a totally different way. Trying to fix the new bug led to undoing the fix, which in turn created the old bug again. Argh!

[Ed. Note: I fell asleep shortly after this.]

Self-hosting the Epoch Compiler: Day Three

Posted 12 December 2013 · 687 views
Epoch, language design
Sat down and figured out some of the annoying bugs that were left in the compiler, mostly surrounding higher order functions and templates. Rigged up a half-dozen more intrinsics and such, and fired off the compiler for yet another pass on itself.

At 8:02PM PST the compiler successfully completed its first semantic analysis on itself, and began attempting code generation. However, there are some bugs in code generation that cause assertion failures for the moment. That will have to be dealt with.

I pinned down the assertion failure to some kind of expression type inference bug in templated higher-order functions. Since I only use that idiom once in the compiler, I decided to punt. I tweaked the code to not use a templated function, and made a note to come back and fix the bug some other day.

The next hiccup was a mistake in the way parenthetical expressions were handled by the code generator; easy enough to fix.

At 8:22PM PST the compiler successfully emitted an executable binary of... something. The binary itself appears mangled and disappointingly does not actually run. But still, the compiler *did* do a complete pass on itself and emit something vaguely resembling a runnable .EXE, so that's pretty cool.

Turns out I cut some corners in the .exe generation code, and it doesn't accurately represent the size of code emitted onto disk. Since the compiler is an order of magnitude (or two!) larger than any other program that it has generated to date, the numbers are wonky and cause only a subset of the binary to be loaded by Windows, which then (correctly) complains that the file is borked.
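The Epoch emitter's actual PE-writing code isn't shown here, but the class of bug is easy to illustrate: a header records a size that doesn't match the bytes actually written, and a loader that trusts the header only reads part of the image. A minimal Python sketch using a toy (non-PE) header format, purely for illustration:

```python
import struct

def write_image(payload: bytes, declared_size: int) -> bytes:
    """Pack a toy header (one 4-byte little-endian size field) plus payload.

    A real PE header records several sizes (e.g. each section's raw data
    size); this toy format has just the one field.
    """
    return struct.pack("<I", declared_size) + payload

def load_image(blob: bytes) -> bytes:
    """A loader trusts the header: it reads only as many bytes as declared."""
    (size,) = struct.unpack_from("<I", blob, 0)
    return blob[4:4 + size]

payload = b"\x90" * 1000                      # stand-in for emitted machine code
buggy = write_image(payload, 256)             # size computed before all code was emitted
fixed = write_image(payload, len(payload))    # size matches what's on disk

assert len(load_image(buggy)) == 256          # loader truncates: "file is borked"
assert load_image(fixed) == payload           # sizes agree: the image loads fully
```

The larger the program grows relative to the stale size estimate, the bigger the truncated chunk, which is why the compiler-sized binary was the first to expose it.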

This is fairly straightforward to fix, so I went ahead and hacked in some changes to emit a saner binary.

At 8:33PM PST the compiler emitted its second attempt at a runnable binary. The .EXE runs and starts up, but then immediately crashes because the JIT layer finds some quirks in the emitted bytecode. The first quirk that shows up is that "nothing" typed parameters are being added to lexical scopes as if they had variables associated with them - which of course is patently silly since they have the type of "nothing." Another easy-enough fix.

8:37PM PST - attempt number three. This still crashes in the JITter, but at least it crashes later on.

It quickly becomes apparent that I'm going to need to come up with a way to view the bytecode generated by the compiler between when it is emitted and when the JIT gets ahold of it.

Once I have that sorted out, the problem looks pretty simple: global variables are stomping on each other and the JIT is getting confused. This is a code-gen bug so I rig up a test case and start trying to pin down a fix.

After fixing globals, I found another crash bug, this time to do with parenthetical expressions in code generation. Ironically, it turns out to be on my "to do" list of bugs to fix already, so now's as good a time as any!

9:54PM PST - I've lost track of what attempt this is, but we're getting closer - slowly but surely. Next bug on the list relates to "nothing" parameters again - this time because they're throwing off stack slot indexing in the bytecode layer. Not hard to fix, but annoying. Either way, though, we're plodding ever closer to the goal...

10:35PM PST - That's about enough for one night. We're still generating several bytecode miscompiles, particularly around anonymous temporary structures. Sadly, my brain is too mushy to figure out the exact problem at this point, so it'll have to wait until Day Four!

Self-hosting the Epoch Compiler: Day Two

Posted 11 December 2013 · 745 views
Epoch, language design
A large number of the errors emitted by attempting to self-host the compiler have turned out to be caused by a relatively small number of bugs.

Hex literals had no support at all in the compiler, so I added that, and crushed a bunch of errors. I forgot to special-case 0 so anything that evaluated to 0 would not be treated as a number (the compiler assumed that it was a string since it couldn't be "cast" into a non-zero number). That was easy enough to fix.
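The compiler's literal-classification code isn't reproduced here, but the shape of the bug is a classic: using the converted value 0 as a "conversion failed" sentinel. A hedged Python sketch (the classify_* functions are made-up names, not Epoch compiler code):

```python
def classify_buggy(token: str):
    """Buggy version: treats a converted value of 0 as 'not a number'."""
    try:
        value = int(token, 0)   # base 0 accepts decimal and 0x-prefixed hex
    except ValueError:
        value = 0
    if value == 0:              # BUG: the legitimate literal 0 lands here too
        return ("string", token)
    return ("integer", value)

def classify_fixed(token: str):
    """Fixed version: distinguish 'conversion failed' from 'converted to 0'."""
    try:
        return ("integer", int(token, 0))
    except ValueError:
        return ("string", token)

assert classify_buggy("0x10") == ("integer", 16)
assert classify_buggy("0") == ("string", "0")       # misclassified!
assert classify_fixed("0") == ("integer", 0)
assert classify_fixed("hello") == ("string", "hello")
```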

The next big bug was a pretty painful face-palm moment. The type checker would permit certain combinations of types when algebraic sum types are involved (correctly) but it wouldn't consider any particular sum type to be compatible with itself. Derp. Another easy fix.
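A minimal sketch of that type-checker hole, in Python rather than the actual Epoch/C++ code: the membership check accepts each member type of a sum type but forgets the identity case.

```python
# A toy sum type, represented as the tuple of its member base types.
IntOrString = ("integer", "string")

def compatible_buggy(target, source):
    # Accepts each member type of a sum type, but never says a sum type
    # is compatible with ITSELF.
    return isinstance(target, tuple) and source in target

def compatible_fixed(target, source):
    # Any type is compatible with itself; otherwise fall back to membership.
    if target == source:
        return True
    return isinstance(target, tuple) and source in target

assert compatible_buggy(IntOrString, "integer")         # members are fine
assert not compatible_buggy(IntOrString, IntOrString)   # the face-palm bug
assert compatible_fixed(IntOrString, IntOrString)       # one identity check fixes it
```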

Running the compiler is a major chore. Parsing alone takes 40 seconds or so, and semantic analysis takes many minutes each time I want to attempt a self-hosting pass. Progress is slowed primarily by this, and also by my incessant distractibility, which leads me to do other things while waiting on the compiler... and then I forget where I was... and so on. It's exactly the annoyance of long compile times that made me hate C++ in the first place.

I still suspect there's a bug in templated sum types, but it's hard to pin down. I may need to write some new test cases to figure out exactly what causes the type checker to barf on template instances.

Just for the hell of it, I attached a profiler to the running compiler. Turns out that garbage collection is still dominating execution time, by a huge margin. I suspect that each invocation of the garbage collector is taking many seconds to complete, and the semantic analyzer generates a lot of garbage, so it stands to reason that eventually the compilation process would become painfully slow. I remain hopeful that I can get the compiler significantly faster just by tuning the garbage collector implementation.

I borked the case where hex literals could be greater than INT_MAX (i.e. if they're unsigned), so that took another re-run to get fixed. Thankfully it's all easy stuff to tweak at this point.

I feel like an idiot at the moment; a couple of times I've "mysteriously" lost work from the Era IDE. Originally I suspected some deep bug, but it turns out that I'm just stupid. I was trying to save a file that was open in another process, and the Era implementation doesn't report failed saves at all yet, so I blindly assumed the save succeeded and closed the IDE, clobbering all my work. This should be easy enough to avoid, though, as long as I pay attention.

Or maybe I could just fix the damn IDE bug.

Getting impatient with the slow compiles, so I'm building a version of the runtime that totally disables the garbage collector, just to see how much difference it makes...

... aaaaand OUCH.

Without GC, the compiler parses in 2.8 seconds (versus 40!) and then goes on to do more semantic validation in a couple of seconds than it used to do in several minutes.

This is revealing, and important. I'll have to do more work on the GC soon.

For the meantime, though, it looks like there are still some miscompiles to deal with. More interestingly, there's a crash in the compiler when it gets far enough along; something doesn't like what it's being fed. The crash also occurs with GC enabled (it just takes 10 minutes to get there), so I'm confident that it's not a red herring. It'll take time to pin down though, because the sheer size of the compiler source makes it hard to accurately determine where things are blowing up.

I had a couple leads on possible explanations for the crash, but they all turned out to be dead ends. So I'm pretty stumped, although prior error messages might be informative (there are some type checker failures that are bizarre).

And naturally it turns out that fixing the type checker causes the crash to stop. I suppose I should be happy.

Now I'm getting an assertion failure because the type "0" doesn't exist. Well, duh. But what's asking for something of type 0?

Setting that aside for a moment, I wanted to start at the top of the compilation error list and start looking for other bugs to fix. The first one was interesting: it's not actually a bug in the Epoch compiler. It's a bug in the C++ version of the compiler, which happens to break type safety, but it silently works by sheer accident!

So it seems that rewriting the compiler is already a win, because it caught some type checking bugs that I probably never would have found otherwise.

After a handful of other little fixes and improvements, I'm seeing a dramatically smaller set of error spam, which is encouraging. For now, though, I think it's time to call it a night.

Self-hosting the Epoch Compiler: Day One

Posted 10 December 2013 · 648 views
Epoch, language design
As I've written about here previously, I have a personal goal of self-hosting the Epoch language compiler by the end of 2013. The other night I actually ran the first attempt at passing the compiler source code through itself; the results were underwhelming, to say the least.

My main enemy out of the gate was the garbage collector. I've had a very naive heap traversal algorithm implemented since ages ago, but it turns out that O(n!) algorithms fall down hard when presented with values of n in the millions. Who knew?
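The collector's actual traversal isn't shown here, but the failure mode is easy to demonstrate: a reachability walk with no visited set re-traverses shared objects along every path to them, which blows up combinatorially, while a standard mark phase touches each object exactly once. A small Python illustration:

```python
def reachable_naive(obj, graph, counter):
    """No visited set: every object is revisited along every path to it."""
    counter[0] += 1
    for child in graph.get(obj, ()):
        reachable_naive(child, graph, counter)

def reachable_marked(obj, graph, visited):
    """Classic mark phase: each object is traversed at most once."""
    if obj in visited:
        return
    visited.add(obj)
    for child in graph.get(obj, ()):
        reachable_marked(child, graph, visited)

# A chain of "diamonds": object i holds two references to object i + 1.
graph = {i: (i + 1, i + 1) for i in range(18)}

naive_visits = [0]
reachable_naive(0, graph, naive_visits)
marked = set()
reachable_marked(0, graph, marked)

assert naive_visits[0] == 2 ** 19 - 1   # ~half a million visits for 19 objects
assert len(marked) == 19                # the mark phase touches each object once
```

Real heaps with millions of objects and plenty of sharing turn the naive walk into exactly the "interminable" behavior described above.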

Once I fixed up the garbage collector implementation, parsing went from "interminable" to about 40 seconds per invocation of the compiler. This is depressing because the C++ implementation can do a complete parse in about 40 milliseconds. But optimization can come later; for now I want to focus on getting the compiler to just build cleanly.

The next problem I ran into was stack space. Epoch programs like to use recursion a lot, and because of some awkward implementation artifacts, tail call elimination doesn't always work nicely (actually it almost never works). So compiling a 310+ KB source file when all you have to traverse data structures is recursion... yeah, that was painful.

Thankfully it's easy enough to increase the stack space for a process, so I did that, and fired up the compiler yet again. This time it barfed someplace deep in the code, because I forgot to implement support for some operator or other; after a few rounds of this kind of nonsense, I managed to get it to actually start doing heavy-weight semantic analysis of the code. Progress!
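The stack-space workaround generalizes to most runtimes: run the deeply recursive work on a thread that was granted a bigger stack. A hedged Python sketch of the idea (the depths and sizes here are arbitrary, not Epoch's actual numbers):

```python
import sys
import threading

def depth(node):
    """A deliberately recursive walk, standing in for the compiler's
    recursion-heavy traversals (no tail-call elimination to save us)."""
    if node is None:
        return 0
    return 1 + depth(node["next"])

# Build a 10,000-deep linked structure -- deeper than Python's default
# recursion limit of 1,000 allows us to walk.
head = None
for _ in range(10_000):
    head = {"next": head}

result = []

def run():
    sys.setrecursionlimit(20_000)   # lift the interpreter's guard rail
    result.append(depth(head))

threading.stack_size(64 * 1024 * 1024)   # roomier stack, like bumping a process's stack reserve
worker = threading.Thread(target=run)
worker.start()
worker.join()

assert result == [10_000]
```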

Semantic analysis has always been one of the slowest parts of compilation in Epoch, and it wasn't surprising to see the compiler chug for a couple of solid minutes trying to analyze itself. Unfortunately, after a little while, it crapped out with no stack space again.

So I bumped the stack space to 8MB, in hopes that such a gratuitous amount of space would be enough to get things to complete cleanly.

Meanwhile, the compiler is actually emitting a ton of errors - mostly about type checking failures. It seems that there are some rules about the Epoch type system that aren't fully implemented; I'll have to go back and write some more test cases to pin that down.

And of course there's a handful of barfs because of things like hex literals which for some reason I never finished implementing support for... or built-in/standard-library functions that aren't wired into the compiler yet... and miscellaneous junk like that.

The compiler is steadily getting further along in its attempts to analyze itself, but once again crashed out due to a missing operator implementation. I'm trying to fix the little stuff as I go, not just because I'm impatient to see a complete compile, but also because it saves me having to write a massive to-do list of junk to clean up later.

One thing is for sure: this compiler is devilishly slow. It'll take some major work to get it fast, I suspect, but I've been down that road before with the C++ compiler and the results were pretty encouraging, so I think I can do it again.

Most of the errors coming out now are related to type checking failures in template instantiations. I'm willing to bet there's a bug or three in the template implementation, since that was one of the last things to be added and has had the least time to bake. It's also easily the most intricate part of the compiler.

So plenty of work left to do, but I'm confident that a few solid hours of plugging away at each category of bugs will lead to a successful self-host within the next few weeks.

Two steps forward, one step sideways

Posted 09 December 2013 · 630 views
Epoch, language design
Over the weekend, I finished the last of the Epoch compiler support for templates. This means that, in theory, the Epoch-implemented compiler is capable of passing every test in the compiler test suite that I use for the C++ version of the compiler.

Unfortunately, I introduced two regressions along the way, which will require some tweaking to get fixed. No more than a day or so of work, though.

Given that the majority of the compiler is done, I went ahead and tried self-hosting last night, just for shiggles. The results were very informative.
  • Garbage collection is happening too frequently and creates ridiculous stalls when collections occur during parsing and semantic analysis.
  • Disabling the garbage collector causes the parser to chew up a huge amount of memory, but it makes parsing actually fast enough to bother with on large projects (like the compiler itself).
  • There are still a lot of edge cases that aren't handled by the compiler - mostly bits of syntax that are strictly legal in Epoch but haven't been totally shored up yet.
  • There are also a lot of built-in functions and operators that aren't recognized by the compiler yet; this is easy to fix, though.
My biggest conclusion is that the compiler will be driving a very heavy reimagining of how memory management works in Epoch. I already knew I wanted to fix some of the annoyances with how it works, such as heap-allocating all aggregate types all the time, but this is going to take more than just refinements of the existing systems.

It also brings me to a major decision point that I have been avoiding deliberately for a long time: how to handle manual memory management strategies in a type-safe way. The fundamental difficulty is that it's currently impossible to ask the language for a chunk of memory that isn't directly bound to a type - I can't even make an array of a type and do some hacky kind of object pooling.

Fixing this will take a significant amount of design and planning, to ensure I don't wind up crippling the type system or some such.

There's also going to be a lot of optimization necessary in the compiler implementation itself, and that's going to require richer language features and standard containers and all that jazz.

So while I might actually get self-hosting done by the end of the year (which was my original goal), I've unearthed a huge backlog of new stuff to work on in 2014 :-)

Breaking down the Epoch parser

Posted 02 December 2013 · 691 views
Epoch, language design
I've had several requests for a detailed look at how the Epoch compiler parser works, so I figured I'd write up a summary of how everything fits together.

All Epoch programs begin their life in the entrypoint function. The compiler's entrypoint is fairly simple, but it contains a lot of extra stuff related to parsing the command line, opening files, and so on; we'll skip those details as they're fairly boring.

The relevant bit is a call to the function Parse, which is passed a string containing the source code and the length of that string. So let's look at Parse!
Parse : string code, integer len -> boolean success = false
{
	Lex(code, len)

	// Discard the dummy token emitted by the lexer
	// (the consuming call was lost in extraction; PopTokens is a stand-in name)
	PopTokens(1)

	string token = PeekToken(0)
	while(token != "")
	{
		// Try each top-level construct, most specific first; the exact chain of
		// Parse* calls was lost in extraction and is reconstructed from the prose
		if(!ParseGlobalBlock() && !ParseSumType() && !ParseWeakAlias() && !ParseStrongAlias() && !ParseStructure() && !ParseFunction())
		{
			print("Error: code could not be parsed: " ; token ; " " ; PeekToken(1) ; " " ; PeekToken(2))
			return()
		}

		token = PeekToken(0)
	}

	success = true
}
The first thing we do is call Lex to perform lexical analysis on the input. I'll ignore lexing for now; what's interesting about this function is its general shape, as seen inside the while loop.

Epoch's current compiler uses a fairly rudimentary recursive descent parser. Instead of fancier techniques involving parser generators, grammars, and other trappings of academic parser work, it's just a hand-written set of functions that divide and conquer the input until everything is consumed.

The while loop in the Parse function is a pretty standard driver for such a parser. As you can see, all it does is repeatedly call other functions, and if all of those calls return false, it will barf an error message. Note that error handling is basically non-existent in Epoch's compiler right now, and building out a good diagnostic system would make the code considerably lengthier.

One important thing to note about this loop is that the order in which functions are called is actually important. There are places in the Epoch grammar where very similar sequences of tokens might mean very different things in the code; the order in which we try to match the input token stream is critical to making sure we pick the right variation. In general, more specific forms should precede more general forms.

This function only really shows one major example, which is resolving conflicts between algebraic sum type definitions and strong alias definitions. In Epoch, we can write the following code:
type Example : integer | string
type Demo : integer
The first one is an algebraic sum type definition, and the second is a strong alias. The first basically means "anything of Example type can contain either an integer or a string, but not both at the same time." The second means "anything of Demo type acts like an integer but cannot be mixed with anything else that is integer-typed."

The parser needs to check for the more specific form first, i.e. sum types. If we checked for the more generic form first, i.e. strong aliases, we would create a problem: we could never correctly parse sum type definitions! This is because the general form would consume the tokens for the first half of the sum type definition, look at them, see that they are a legal strong alias, and then continue forward. But the next thing it would encounter would be half of a sum type definition, and it wouldn't know what to do with that.
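A toy Python version of that ordering rule makes the failure concrete (the parse_* functions here are illustrative stand-ins for the Epoch compiler's, and the toy sum-type rule handles only two members):

```python
def parse_sum_type(toks, i):
    """The more specific form: 'type Name : base | base'. Must be tried FIRST."""
    if (len(toks) >= i + 6 and toks[i] == "type"
            and toks[i + 2] == ":" and toks[i + 4] == "|"):
        return ("sum", toks[i + 1], [toks[i + 3], toks[i + 5]]), i + 6
    return None

def parse_strong_alias(toks, i):
    """The more general form: 'type Name : base'."""
    if len(toks) >= i + 4 and toks[i] == "type" and toks[i + 2] == ":":
        return ("alias", toks[i + 1], toks[i + 3]), i + 4
    return None

def parse(toks, rules):
    """Standard recursive-descent driver: try each rule in order, barf otherwise."""
    i, decls = 0, []
    while i < len(toks):
        for rule in rules:
            result = rule(toks, i)
            if result is not None:
                node, i = result
                decls.append(node)
                break
        else:
            decls.append(("error", toks[i]))   # no rule matched this token
            i += 1
    return decls

toks = ["type", "Example", ":", "integer", "|", "string",
        "type", "Demo", ":", "integer"]

# Specific-before-general: both declarations parse correctly.
good = parse(toks, [parse_sum_type, parse_strong_alias])
assert good == [("sum", "Example", ["integer", "string"]),
                ("alias", "Demo", "integer")]

# General-before-specific: the alias rule eats the first half of the
# sum type, then the parser chokes on the leftover '| string'.
bad = parse(toks, [parse_strong_alias, parse_sum_type])
assert bad[0] == ("alias", "Example", "integer")
assert ("error", "|") in bad
```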

So now we have some basic structure and ground rules for writing our parser. How does the rest work?

Essentially, it's all more of the same! Recursive descent has the nice property that almost all the code looks very similar, and it's easy to deduce what things are going on once you're used to the patterns. It's also easier to make very intelligent error messages, because you can inject hand-written code for every possible parsing scenario. Unfortunately I don't have any great examples of error diagnostics in the current code, so you'll have to use your imagination.

To see a concrete example, let's look at the code that handles strong type aliases:
ParseStrongAlias : -> boolean matched = false
{
	if(!PeekWithExpectation(0, "type")) { return() }
	if(!PeekWithExpectation(2, ":")) { return() }

	PopTokens(1)		// throw away the "type" keyword (PopTokens is a stand-in name)

	string aliasname = PeekToken(0)
	string basename = PeekToken(2)

	PopTokens(3)		// consume the alias name, the colon, and the base type name

	OnCodeGenRegisterAlias((++GlobalAliasCounter), PoolString(aliasname), PoolString(basename))
	matched = true
}
This is pretty short and sweet. Note that the default return value of the function is false, meaning that it didn't find a match for what it was looking for.

The first thing we do is look for the token "type", since all strong aliases begin with that keyword. If that keyword isn't the current token, we are obviously not looking at a valid strong alias, so we can abort parsing and let someone else try and match.

Secondly, we look for the colon which is required to appear after the name of the type being defined (see above for an example of a strong alias definition). Simple enough.

Once that's done, we throw away a token (which we know has to be the "type" keyword). Then we store off two tokens: one for the type's name, and one for the type's base. Then we consume three tokens (remember the colon that appears in there!). Lastly, we call some external code that isn't part of the parser. This code is what rigs up the type alias in the program's AST, so that the alias can be used as part of type-checking the program later on in the compilation process.

And that's really about it. The entire parser is just a handful of these functions, which break down the grammatical rules of the language into smaller and smaller nested parts. So let's run through a simple Epoch program and see what parses what!

entrypoint :
{
	print("Hello world!")
}
We start life in the Parse function, as above. We then repeatedly try to parse each of the code constructs that are allowed to live at the top level of a file. Think of it as asking a question for each function call.

Is it a global block? No.
Is it a sum type definition? No.
Is it a weak alias definition? No.
Is it a strong alias definition? No.
Is it a structure definition? No.
Is it a function?

Aha! Finally we get somewhere! This looks like a function. So let's walk into the ParseFunction() code and see what happens next:

Does it have parameters? No.
Does it have a return value? No.
Does it have any tags? No.
Does it have a code body?

Yes! So let's parse a code block! Now we're in the ParseCodeBlock() function.

ParseCodeBlock is similar to Parse in that it contains a while loop; it will continue parsing until it finds the } that ends the current code block.

Is it a control entity, such as an if statement or while loop? No.
Is it a preincrement/decrement statement? No.
Is it a postincrement/decrement statement? No.
Is it a regular statement, i.e. a function call? Yes!
And therefore we can skip checking if it is a variable initialization or assignment.

ParseStatement() is where we are now. Ignoring some unrelated complexity for templates and whatnot, this is basically the same pattern yet again: a while loop, parsing until we reach the closing ) that represents the end of the function call.

Is it a valid expression? Yes!

So let's move on to ParseExpression(). This one gets a bit hairy, so let's ignore the details and just focus on the general principle: ParseExpression finds "Hello world!" as a string literal, and knows that this is the parameter to the function we're trying to call.

In this case, the function we're calling is print. So we pair up the function name with its parameters, hand those off to the AST construction logic, and wander back up the call stack.

We return from ParseStatement() back to ParseCodeBlock(). There's only one statement, so ParseCodeBlock() returns. That's the end of the function, so ParseFunction() returns. There's no more code to look at, so Parse() returns.

And there you have it!

Zeroing in on self-hosting

Posted 30 November 2013 · 595 views
Epoch, language design
After a short break, I'm back to hacking on Epoch again.

Today was pretty productive; made several additions to the parser, added support for a few lingering language features that have been neglected up until now, and fixed a couple minor bugs. The compiler is getting richer, the test suite is expanding at a decent pace, and we're very close to being able to self-host this bad boy.

I have a list of stuff that needs to get done, and it's short:
  • Finish implementing unary operators
  • Fix the few remaining template bugs
  • Write the .EXE generation logic
  • Run a final pass over the code for any remaining TODOs that are blocking self-hosting
There's also a handful of "bonus features" that would be cool to get in place but aren't strictly necessary for self-hosting. I'm planning on leaving those for later, since my goal was to self-host by the end of this year, and time is running short for hitting that milestone.

Depending on how motivated I am for the next few days, unary operator support shouldn't be too hard to finish off. Once that's out of the way, there's a handful of issues and missing support bits for templates, which will take a bit more effort but ultimately not a ton of time. Rebuilding the .EXE generation logic in Epoch will be kind of finicky, but it's really just a matter of screwing around with byte buffers and such, so it won't be too awful, especially since I have a working reference implementation in C++ to compare with.

And of course there remains the very real potential for the compiler to fail self-hosting on the first attempt, which wouldn't surprise me at all, and will add to the time required.

So all in all, December's gonna be a hell of a month.

Speculation on semantically-aware version control

Posted 17 October 2013 · 878 views

Working on a massive code base comes with some interesting challenges.

Consider this scenario, loosely based on a real-world problem we recently encountered at work:
  • Create a multi-million line code base
  • Divide this into over two dozen branches for various independent development efforts
  • On some given branch, find some popularly-edited code that needs reorganization
  • Move some functions around, create a few new files, delete some old crufty files
Given these changes, how do you cleanly merge the reorganization into the rest of the code tree? Between the time of the reorg and the merge, other people have made edits to the very functions that are being moved around. This means that you can't just do a branch integration and call it good; everyone has to manually recreate the changes in one way or another.

Either the programmer who did the reorg is burdened with manually integrating all the smaller changes made from other branches into his code prior to shipping his branch upstream, or every programmer across every branch must recreate the organizational changes with their own changes intact.

Obviously part of this is just due to the underlying issue that one file is popular for edits across many branches. But the clear solutions to the problem are all annoying:
  • Option One: Force one person to manage many changes. This is gross because it overburdens an individual.
  • Option Two: Force all people to manage one change. This is even worse because it requires everyone to fiddle with code they may not even care about.
  • Option Three: Never write code which becomes popular for edit. This is obviously kind of dumb.
  • Option Four: Never reorganize code. Even more dumb.
At the root of the problem is the way version control on source code currently works. We track textual differences on a line-by-line (or token-by-token) basis. The tools for diffing, merging, history tracking, and integration are all fundamentally ignorant of what the code means.

What if this were not true?

Suppose we built some language that did not store code in flat text files. Instead, it divides code into a strict hierarchy:

Project -> Module -> Function -> Code

You could insert classes or other units of organization between "modules" and "functions" if the language is so designed, of course.

Now, suppose that instead of having all code in a module go into a folder, and all functions in the module going into text files in that folder, we just treat a module as a data unit on disk.

Within this unit, we have arbitrary freedom to stash code however we want. The IDE/editor/other tools would understand how to open this blob of data and break it up into classes, functions, and maybe even individual statements or expressions.

So here comes the interesting bit. Strap on your helmet, kiddies, we're going to go fast.
  • Assign each atomic unit of code (say, a function, or maybe even a statement if you want to get crazy detailed, but that's probably a bad idea) a GUID.
  • Store the data as a GUID followed by the textual code associated with that GUID.
  • Store alongside the data a presentation metadata model which describes how to show these units of code to the programmer. This should be fully configurable and rearrangeable via the editor UI.
  • Each of the smallest level-of-detail objects gets stored in a separate file, identified by its GUID.
Given a set of code attached to a GUID, we no longer show revisions in the version control system as edits to a file. Instead, we show them more granularly, following the organizational hierarchy defined by the language: project, module, function, code. The project as a whole is grouped as a tree, allowing easy visualization of the hierarchy. Open a single node and you can see all changes relevant to that node and its children, on any level of granularity you like.

This sidesteps our original problem in two interesting ways. First, if we just want to reorganize code in a module without changing its functionality, we can do so by modifying the presentation metadata alone. This allows even an old dumb text-based merge utility to preserve our changes across arbitrary branches.

Second, and more fascinating, what if we want to reorganize code across modules? All we have to do is record that the code from one list of GUIDs moved from one module to another. If we don't store modules as folders, but instead as another layer of metadata, we can make another simple textual change that records the GUIDs belonging to one module GUID in revision A, and another module GUID in revision B.

Why GUIDs? Easy: it allows us to rename any atomic unit of code, or any larger chunk of code units, arbitrarily without breaking any of the system. Delete a function? No problem! Just remove its GUID from version control history like you would have deleted a file in the old approach. Add a function? Also no problem; it just becomes a new file in source control. Move a function around to another file or module or even project? Who cares?! It just changes metadata. Not the code itself.
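A minimal Python sketch of the bookkeeping (the structures here are illustrative, not a real tool): code text lives under stable GUIDs, modules are just membership metadata, and moving a function between modules touches only that metadata.

```python
import uuid

# Each atomic unit of code gets a stable GUID; its text lives under that key.
code_units = {}
fn_guid = str(uuid.uuid4())
code_units[fn_guid] = "Frobnicate : integer x -> integer y = x * 2"  # made-up Epoch snippet

# Modules are just metadata: a module GUID mapped to the unit GUIDs it owns.
mod_a, mod_b = str(uuid.uuid4()), str(uuid.uuid4())
modules = {mod_a: [fn_guid], mod_b: []}

def move_unit(unit, src, dst):
    """Moving code between modules edits only the membership metadata;
    the unit's text (and thus its revision history) is untouched."""
    modules[src].remove(unit)
    modules[dst].append(unit)

before = code_units[fn_guid]
move_unit(fn_guid, mod_a, mod_b)

assert code_units[fn_guid] == before               # the code itself never changed
assert modules == {mod_a: [], mod_b: [fn_guid]}    # only metadata moved
```

Under this scheme, a reorg diff and an implementation diff land in disjoint files, so even a dumb text-based merge tool can integrate them without conflicts.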

So let's bring it all together. What if we want to have the exact original scenario? Programmer A makes several changes to organization of some module Foo. Meanwhile, programmers B through Z are making changes to the implementation of Foo, on different code branches.

Merging all of this becomes trivial even under existing integration tool paradigms, because of how we decided to store our code on disk. Even cooler, anyone can engage in reorganizations and/or functionality changes without stomping on anyone else when it comes time to merge a branch upstream.

All this requires is a little lateral thinking and some tool support for the actual code editor/IDE. If you want to support things like browsing the code repo from the web, all you need to do is add a tool that flattens the current project -> module -> function -> code hierarchy into an arbitrary set of text files - again, trivial to do with a little metadata.

The more I think about this approach to version control, the more convinced I am that Epoch is going to try it out. Flat text files are dumb and outmoded; it's time we used computers to do our work for us, the way it was always meant to be.

Advice to a Young Programmer

Posted by , 01 October 2013 - - - - - - · 10,035 views

One of the awesome things about ArenaNet is that we run a programming internship program that actually does a phenomenal job of preparing people to work in the games industry. This is accomplished by focusing on three primary principles:
  • Everything you do will matter. There is no pointless busy-work, no useless coffee-and-bagel fetching type nonsense, and every project contributes directly to something that impacts the studio and/or the game as a whole.
  • Everything you do will be reviewed. We have at least one senior programmer per intern dedicated to helping make sure that the work is top-notch. This entails exhaustive code reviews and extended design/analysis discussions before a line of code is even written.
  • Whether we end up hiring you or not, we're committed to making sure you leave the program as a good hire. The program is not a guaranteed-hire affair. However, with extremely few exceptions, we ensure that upon completion of the internship you're well prepared and ready to tackle game industry jobs.
One of the interns going through the program right now is assigned to me, and I've found it an awesome opportunity not just to mentor a more junior developer, but to force myself to crystallize and refine my own thinking so I can communicate it clearly to someone who doesn't have the benefit of many years of experience on the job.

Last week I had the chance to write down some thoughts about his performance so far in the program, and offer some feedback. After sitting down to re-read my letter, it struck me that there's a lot of stuff in there that might be useful for anyone who is just learning to work on large projects with large teams.

You may not be working on the next great MMO (or maybe you are!) but I think there's some value in this advice for anyone who is early on in their programming career.

Think Twice, Commit Once
This is my variant of the old carpenter's rule of "measure twice, cut once." In general, one of the biggest challenges of working on large-scale projects is keeping in mind all the ramifications of your decisions. Sometimes those implications are easier to see, and sometimes there's just no way to know ahead of time. But either way, it pays to take some time when writing code (and after writing code) to think very carefully about it.

One thing I personally like to do is let my non-trivial changelists sit for a day or so and then come back to them with a fresh mind, and re-read the code. I try to approach it as if I'd never seen the code before and was responsible for a code review on it. There are two directions that need to be considered: the tiny details, and the big-picture implications. Usually I will do two passes, one for each frame of mind.

I'll cover some of the small details later; the big things are generally harder anyways. It takes a lot of practice and experience to intuitively spot the consequences of design decisions; this has two important facets. First, it means that it won't be immediately obvious most of the time when you make a decision that has large-scale effects. Second, it means that it will take effort and conscious deliberation to train yourself to recognize those situations. The best suggestion I can offer is to pause often and try and envision the future - which I'll tackle next.

Be Nice to Future You
I could also say "be nice to everyone else who will ever read your code" but that doesn't make for as nice of a section heading. The general idea here is that code is written once and then lives a very, very, very long time. The natural impact of this is that people will have to read the code many times while it survives. Again there are two directions this can go in: tiny details, and large-scale impacts, and again, the details are easier to spot - especially at first.

Some concrete examples are probably in order at this point. For details, one of the things that goes a long way is simple formatting. It may seem almost overbearingly anal-retentive to complain about whitespace and which line your braces go on, but it has an impact. After twenty-odd years of reading code, you get to a point where you can recognize common patterns very easily. Especially in a workplace like ArenaNet with a strict and consistently-followed formatting standard, this is a major time-saver; if everyone writes code that looks similar, it's easier to spot "weird" things. If my brain is distracted while reading a piece of code by the formatting, it gets harder to focus on the meaning and intent of the code itself.

On the larger scale, there are things like comments. Code commenting is a religious warfare issue in most of the world, and even ArenaNet has lots of diverse viewpoints on how it should be done. However, we have some basic philosophical common ground that is very helpful.

Sometimes comments need to go away, and sometimes they need to be added. Comments that need to go away are generally guilty of at least one of the following crimes:
  • Repeating what the code already says
  • Being out of date or in imminent danger of becoming out of date
  • Being outright false (often as a result of accidents with copy/paste)
Things that are good to comment are largely architectural: what modules do, how they fit together, what the intent of each major section of code is. The details ("this line of code adds this column to this CSV file") are already pretty obvious from the code itself - or at least, they should be, if the names used in the code are clear. Which leads to...

A Rose By Any Other Name Smells Like Shit
This is a lot more detail-oriented stuff. There are a number of conventions that any team will typically have that are important and help readability and clarity of intent.

There are many specific examples in ArenaNet's code standards documentation, but there are other conventions that are more general and likely to be in use in almost any environment. For example, pluralization is important. Iterator variables should be singular ("item" or "currency" or "character"). Containers should be plural ("items" or "currencies" or "characters"). The same goes for function names; if a function does one thing, name it singularly. If it does multiple things, or does one thing to multiple things, pluralize appropriately.

In general names are hard to select and should be chosen with great care. If a function's purpose changes, change its name as well. Always make sure a name corresponds to the best description of intent that you can manage. (Just don't use extremelyVerboseVariableNamesWithLotsOfExtraneousDetails.)

Wastefulness Will Bite You Sooner Rather Than Later
Extra parameters that aren't used by a function should be removed. (There are compiler warnings for this, by the way - you should always build with warnings as errors and the maximum warning level that you can manage.) Similarly, make sure that all variables are actually used and do something. Make sure that function calls are useful. For example, initializing a variable and then immediately changing its value just creates extra noise that confuses the reader. Passing nothing but a string literal to sprintf() and then printf()ing the resulting buffer is confusing as well, and a little wasteful.

This is partly about readability and partly about real efficiency. In large code bases like ours, the death isn't from a single massively wasteful piece of code - it's from thousands of tiny decisions that add up over time... both in terms of reading code, and in terms of how it performs at runtime. Both memory and processing power are something to keep in mind. They may seem cheap (or even free) at times, especially in environments with immense resources. But don't forget that everything has a cost and those costs accumulate a lot faster than we might wish sometimes.

A corollary to this is cleanup practices. If you remove a piece of functionality, make sure all vestiges of it are gone - comments, preparatory code, cleanup code, etc. It's easy to forget pieces of logic when removing things, and this just leads to more noise and confusion for the next reader. Once again, re-reading your own code with a critical eye helps a lot here.

Give It A Nice Home
Where things live in code as well as data is always an important consideration. Some things don't need to be wrapped into classes - if there's no state being carried around, or no shared interface that must be used, favor free functions instead of building classes that are just methods. On the data side, make sure to scope things as tightly as you can, and declare things as close as possible to their first point of use. Some things don't need to be file-level variables (let alone globals), and can just live as locals in some function. Some things don't even need to live the entire lifetime of a function. RAII is a big deal in C++, and should be used liberally.

File organization is also a big thing to keep in mind. Keeping related bits of code in self-contained files is a good habit; but it takes some careful thought to decide the taxonomy for what is "related." Think of programs like a pipeline. Each segment of pipe (module) should do one very specific and very contained thing. To build the whole program, you link together segments of pipe (modules/classes/etc.) and compose them into a more sophisticated machine.

Follow the Leader...
You should always try and emulate the code you're working in if someone else owns it. Follow the style, the naming patterns, the architectural decisions, and so on. Often you can make your life harder by departing from established convention, and you will definitely make the lives of everyone else harder at the same time.

Don't be shy to ask questions if any of those decisions or patterns are unclear. I know that a lot of this stuff seems a bit vague and mystical right now; much of the reasoning for why things are the way they are may not be immediately apparent. Asking questions is always good - if there are reasons for why things are a certain way, then you get to learn something; and if there are no reasons, or bad reasons, you open up an opportunity to make things better.

...But Clean Up What You Find
One of the highest aspirations we should have as team programmers is to leave code better than we found it. This ranges from fixing minor details to cleaning up design decisions and adding documentation. This is something you should try for even if you're not in your "own" code - often if there is objective room for improvement, the owner will actually appreciate you taking the time to make his area nicer. Obviously this doesn't mean you should go on a rampage to change every code file just for the sake of it, and there's always value in consulting with an owner before making changes - but you get the idea.

Learning Never Stops
It can be very tempting in life to plateau. Sometimes we just want to feel like we have "arrived" and now we "get it." And indeed there will be milestones in your career where you can notice a profound change in your skills and perspective.

The key is to never stop chasing the next boost. Even after more than twenty years of writing computer programs, I learn new things all the time. Your learning process is either constantly running, or you're effectively dead.

And that, of course, leads to the final piece of advice I could ever offer anyone on life: "don't be dead."

So close, and yet so far

Posted by , 29 September 2013 - - - - - - · 827 views
After implementing a bit of value-based pattern matching support, the Epoch compiler is now down to 7 tests remaining before it's ready for self-hosting.

Of those 7, four are related to templates, which will be the last major endeavor required to get self-hosting viable. Templates are scary and will probably eat up a lot of time.

One of those tests is simply pattern matching not working on built-in functions, notably the "cast" function. I really hate the cast function and want to replace it with a template form instead, but for now it works the way it does, and I'll have to support the current syntax to get the compiler bootstrapped anyways.

Another test is related to value semantics of structures, which has been broken for ages, because of some half-baked optimizations I never finished ironing out. That one may take some time to get right because it requires being very fiddly in how I invoke copies of structures versus handing off references behind the scenes.

The final test in the list is for shadowed variables; unlike most languages, Epoch is actively hostile to identifier shadowing, and forbids it outright. However, the compiler in its current implementation is a bit too zealous with this, and will in fact flag a shadowing error even if you define variables of the same name in totally unrelated scopes. That's kind of annoying and needs to be fixed, but it's low priority.

The three fiddly little tests shouldn't be more than a day apiece to get done. Templates, as I keep observing here, are going to be the nasty heavy lifting bit of the project.

So self hosting remains a visible but still distant target. Another weekend or so should get me down to nothing but templates to fix, and then I can ponder how hideously painful that will be for a while before I drag up the nerve to actually tackle it.

Once self-hosting is done, it's time to take a careful look at what comes next. I'm torn between working on a debugger and working on a profiler, although the two overlap pretty heavily and I might wind up just doing common bits of work for each and seeing what falls out.

Profiling is going to be very important because the new compiler is a few orders of magnitude slower than the C++ one, which is kind of sad but also not entirely unexpected given that I've spent literally zero effort on optimizing the Epoch implementation of the compiler. I can't be bothered waiting more than a second for my programs to compile, so it'll be fun to get back into the performance arena and start hacking away on bottlenecks for a while.

It certainly will be a welcome alternative to all this self-hosting gibberish. I'm seriously bored of watching compiler test suites run...
