Posted 15 October 2013 - 12:02 AM
I seriously thought about writing a combiner script, but it honestly turns out that having everything in a monolithic file isn't all that bad. 10KLOC is large but not unmanageable, and it's actually easier to remember how to find things in a single file than you might expect. Of course, it helps that the Epoch implementation of the compiler is about 1/4 the size of the C++ implementation, so remembering things is 4 times easier to begin with ;-)
The grammar should be deterministic-context-free. I think. I'm honestly not strong enough in parser theory to prove it for sure, but that feels about right. The parser itself is deterministic and written in classic recursive-descent style (strictly speaking that makes it closer to a deterministic pushdown automaton than a DFA, since the call stack does the bookkeeping), so based on my understanding of the parser/grammar categories, the grammar is DCF. I could be utterly wrong, though.
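For readers who haven't seen the style, here is a minimal sketch of a deterministic recursive-descent parser in Python. It is not the Epoch parser and the toy arithmetic grammar is purely an assumption for illustration; the names `tokenize`, `Parser`, and `parse` are hypothetical. The point is only that each grammar rule becomes one function, and one token of lookahead is enough to decide what to do next, with no backtracking.

```python
import re

# Hypothetical toy grammar (NOT Epoch's grammar):
#   expr   := term (('+' | '-') term)*
#   term   := factor (('*' | '/') factor)*
#   factor := NUMBER | '(' expr ')'

TOKEN = re.compile(r"\s*(\d+|[-+*/()])")


def tokenize(text):
    """Split the input into number and operator tokens."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        """Look at the next token without consuming it (None at end)."""
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        """Consume and return the next token (None at end)."""
        tok = self.peek()
        self.pos += 1
        return tok

    def expr(self):
        node = self.term()
        # One token of lookahead decides whether the rule continues.
        while self.peek() in ("+", "-"):
            op = self.take()
            node = (op, node, self.term())
        return node

    def term(self):
        node = self.factor()
        while self.peek() in ("*", "/"):
            op = self.take()
            node = (op, node, self.factor())
        return node

    def factor(self):
        tok = self.take()
        if tok is None:
            raise SyntaxError("unexpected end of input")
        if tok.isdigit():
            return int(tok)
        if tok == "(":
            node = self.expr()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return node
        raise SyntaxError(f"unexpected token {tok!r}")


def parse(text):
    """Parse a whole string into a nested-tuple syntax tree."""
    parser = Parser(tokenize(text))
    tree = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"trailing token {parser.peek()!r}")
    return tree
```

Calling `parse("1 + 2 * (3 - 4)")` builds a nested tuple with the multiplication below the addition, because `term` binds tighter than `expr`. The nesting of Python calls is doing the work a parser stack would otherwise do.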
This is actually only a chunk of the compilation process - a significant chunk, given that it produces bytecode for an abstract machine, but not everything. The remaining code (which is all C++) translates the bytecode emitted from this layer into LLVM bitcode, which is then turned into native machine code more or less on the fly. I decided to do it this way because interfacing with LLVM from non-C++ languages is slightly less than a total nightmare, and this is already a big enough project.
Static analysis is a big part of the language. The code above implements a (mostly) full type checker as well as limited type inference support. It also permits function overloading and pattern matching. (In fact, using pattern matching is why the code is so compact compared to the C++ version; certain forms of code become a lot more succinct when you can express them as pattern matches. As the pattern matching support gets richer it'll get even more compact.)
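As a rough illustration of how type checking, limited inference, and overload resolution fit together, here is a small Python sketch. It is not taken from the Epoch compiler: the expression encoding, the `OVERLOADS` table, the type names, and the `infer` function are all assumptions made up for the example.

```python
# Hypothetical overload table: name -> list of (parameter types, result type).
OVERLOADS = {
    "+": [
        (("integer", "integer"), "integer"),
        (("real", "real"), "real"),
        (("string", "string"), "string"),
    ],
    "length": [(("string",), "integer")],
}


def infer(expr, env):
    """Return the type name of expr, or raise TypeError if it does not check.

    Assumed encoding: Python literals stand for themselves,
    ("var", name) is a variable reference, and
    ("call", name, [args]) is a call to a possibly overloaded function.
    """
    # bool must be tested before int, since bool is a subclass of int.
    if isinstance(expr, bool):
        return "boolean"
    if isinstance(expr, int):
        return "integer"
    if isinstance(expr, float):
        return "real"
    if isinstance(expr, str):
        return "string"

    kind = expr[0]
    if kind == "var":
        name = expr[1]
        if name not in env:
            raise TypeError(f"undefined variable {name}")
        return env[name]
    if kind == "call":
        _, name, args = expr
        # Infer the argument types first, then pick the overload that fits.
        arg_types = tuple(infer(arg, env) for arg in args)
        for params, result in OVERLOADS.get(name, []):
            if params == arg_types:
                return result
        raise TypeError(f"no overload of {name} accepts {arg_types}")
    raise TypeError(f"unknown expression kind {kind!r}")
```

With an environment of `{"x": "real"}`, the call `infer(("call", "+", [("var", "x"), 2.5]), {"x": "real"})` selects the real-valued overload, while mixing an integer with a string raises `TypeError` because no entry in the table matches. The chain of `if kind == ...` tests is exactly the sort of dispatch that collapses into a handful of one-line cases in a language with pattern matching.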
Optimization is basically left entirely to LLVM at this point, mainly because once you hit LLVM bitcode there are very few categories of optimization that LLVM can't already do. A few high-level things are done by the compiler based on type information, but that's about it. There will probably be a lot more type-informed optimization done later as things get richer in the language.
Right now the parser outputs everything into a single IR on the Epoch side of things. This IR is traversed and decorated by the compiler itself, using a rather ad hoc traversal pattern. In essence, every node of the IR is processed exactly once, even though it may strictly be "visited" many times; the decorations act as a kind of cache, so that type information (say) is never recomputed on a later visit. From there a final traversal of the IR outputs the abstract machine bytecode, which goes through a simple mechanical conversion to LLVM bitcode before being handed off to a suite of dozens of LLVM passes for optimization.
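The decorate-once idea can be sketched in a few lines of Python. This is a toy under stated assumptions, not the Epoch IR: the `Node` class, the `type_of` and `emit` functions, the opcode names, and the `stats` counter are all hypothetical. The counter exists only to show that repeated visits hit the cached decoration instead of redoing the work.

```python
class Node:
    """A toy IR node; `type` is the decoration, filled in lazily."""

    def __init__(self, op, children=(), value=None):
        self.op = op
        self.children = list(children)
        self.value = value
        self.type = None


def type_of(node, stats):
    """Return the node's type, computing and caching it on the first visit."""
    if node.type is not None:
        return node.type  # later visits reuse the decoration
    stats["processed"] += 1
    if node.op == "int":
        node.type = "integer"
    elif node.op == "real":
        node.type = "real"
    elif node.op == "add":
        left, right = (type_of(child, stats) for child in node.children)
        if left != right:
            raise TypeError(f"cannot add {left} and {right}")
        node.type = left
    else:
        raise ValueError(f"unknown op {node.op!r}")
    return node.type


def emit(node, out, stats):
    """Final traversal: append stack-machine instructions to `out`."""
    for child in node.children:
        emit(child, out, stats)
    node_type = type_of(node, stats)  # another visit, served from the cache
    if node.op in ("int", "real"):
        out.append(("PUSH", node.value))
    else:  # "add"; type_of has already rejected any other op
        out.append(("ADD_" + node_type.upper(),))
```

For a tree such as `Node("add", [Node("int", value=1), Node("int", value=2)])`, calling `type_of` on the root any number of times and then running `emit` leaves the counter at 3, one per node, and the emitted list holds two pushes followed by an integer add.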
There are probably many other questions worth asking, although I'd be happy to ramble about my baby for a million years, so you're probably better off not asking them all unless you want an earful :-P