Behold! The compiler which compiles itself(*)

Started by
13 comments, last by tufflax 10 years, 5 months ago
One of my big time consumers lately has been the Epoch language compiler. This is a beast which is intended to self-host, i.e. compile itself.

It weighs in at just under 10,000 lines right now or I'd post it inline. You can see the evil monstrosity here and gawk in awe at its massive hideousness. Unfortunately Google Code does not know how to highlight Epoch code yet so it doesn't look as pretty as it does in my (fully home-grown) IDE.


Bestow upon me your undying praise expressions of vomitous revulsion.



OK, so I lied. Technically it's very close to being able to self-host, but it's missing a couple features.

Wielder of the Sacred Wands
[Work - ArenaNet] [Epoch Language] [Scribblings]

Advertisement

Please, please, please tell me you have a script that combines all that code into one file and that you don't edit the 10k monstrosity as a whole!

But very cool. What kind of grammar is Epoch? Context-free? I ask because I'm studying compilers right now and am curious in these things. I tried to tell from your parsing code, but it's... too verbose for me :) Actually, shoot, I'm gonna ask a bunch of compiler-related questions:

Do you do much static analysis or optimizing? How many passes over the program tree do you do, and how many different intermediate representations do you go through? What kind of code do you generate? x86? LLVM? And are there any other questions I should have asked but didn't?

[size=2][ I was ninja'd 71 times before I stopped counting a long time ago ] [ f.k.a. MikeTacular ] [ My Blog ] [ SWFer: Gaplessly looped MP3s in your Flash games ]

Please, please, please tell me you have a script that combines all that code into one file and that you don't edit the 10k monstrosity as a whole!

But very cool. What kind of grammar is Epoch? Context-free? I ask because I'm studying compilers right now and am curious in these things. I tried to tell from your parsing code, but it's... too verbose for me smile.png Actually, shoot, I'm gonna ask a bunch of compiler-related questions:

Do you do much static analysis or optimizing? How many passes over the program tree do you do, and how many different intermediate representations do you go through? What kind of code do you generate? x86? LLVM? And are there any other questions I should have asked but didn't?

Yes compiling itself is a fine stunt (and shows a modicum of language versatility), but if your compiler turns out inefficient code and its self-compiled runtime IS used normally --- and is intended in for use with projects with lots of code (to me 100K actual code lines is not overly large) then thats a negative.

"Compiles to native code for maximum performance" does NOT mean its efficient anywhere as much a a well developed optimizing compiler will do (for the compile runtime)

Your library generation likewise.

--------------------------------------------[size="1"]Ratings are Opinion, not Fact
I seriously thought about writing a combiner script, but it honestly turns out that having everything in a monolithic file isn't all that bad. 10KLOC is large but not unmanageable, and it's actually easier to remember how to find things in a single file than you might expect. Of course, it helps that the Epoch implementation of the compiler is about 1/4 the size of the C++ implementation, so remembering things is 4 times easier to begin with ;-)


The grammar should be deterministic-context-free. I think. I'm honestly not strong enough in parser theory to prove it for sure, but that feels about right. The parser itself is just a DFA written in classic recursive-descent style, so based on my understanding of the parser/grammar categories, the grammar is DCF. I could be utterly wrong, though.

This is actually only a chunk of the compilation process - a significant chunk, given that it produces bytecode for an abstract machine, but not everything. The remaining code (which is all C++) translates the bytecode emitted from this layer into LLVM bitcode which is then turned into native machine code more or less on the fly. I decided to do it this way because interfacing with LLVM from non-C++ languages is slightly less than a total nightmare, and this is already a big enough project.

Static analysis is a big part of the language. The code above implements a (mostly) full type checker as well as limited type inference support. It also permits function overloading and pattern matching. (In fact, using pattern matching is why the code is so compact compared to the C++ version; certain forms of code become a lot more succinct when you can express them as pattern matches. As the pattern matching support gets richer it'll get even more compact.)

Optimization is basically left entirely to LLVM at this point, mainly because once you hit LLVM bitcode there's very few categories of optimizations that LLVM can't already do. A few high-level things are done by the compiler based on type information but that's about it. There will probably be a lot more type-informed optimization done later as things get richer in the language.

Right now the parser outputs everything in a single IR in the Epoch side of things. This IR is traversed and decorated by the compiler itself, using a rather ad hoc traversal pattern, but in essence every node of the IR is visited and processed exactly once, even though it may strictly be "visited" many times (and then the decorations used as a kind of cache to avoid recomputing type information, say). From there a final traversal of the IR outputs the abstract machine bytecode, which goes through a simple mechanical conversion to LLVM bitcode before being handed off to a suite of dozens of LLVM passes for optimization.


There are probably many other questions worth asking, although I'd be happy to ramble about my baby for a million years, so you're probably better off not asking them all unless you want an earful :-P

Wielder of the Sacred Wands
[Work - ArenaNet] [Epoch Language] [Scribblings]

Yes compiling itself is a fine stunt (and shows a modicum of language versatility), but if your compiler turns out inefficient code and its self-compiled runtime IS used normally --- and is intended in for use with projects with lots of code (to me 100K actual code lines is not overly large) then thats a negative.

"Compiles to native code for maximum performance" does NOT mean its efficient anywhere as much a a well developed optimizing compiler will do (for the compile runtime)

Your library generation likewise.



I'm not entirely sure what most of this means, but it should be pointed out that I'm not writing my own native code generation or optimizations. I'm using LLVM for all that business.

For what it's worth, I wrote a raytracer in Epoch that marginally edges out a comparable C++ implementation for raw performance on my laptop.

Wielder of the Sacred Wands
[Work - ArenaNet] [Epoch Language] [Scribblings]

I seriously thought about writing a combiner script, but it honestly turns out that having everything in a monolithic file isn't all that bad. 10KLOC is large but not unmanageable, and it's actually easier to remember how to find things in a single file than you might expect. Of course, it helps that the Epoch implementation of the compiler is about 1/4 the size of the C++ implementation, so remembering things is 4 times easier to begin with ;-)

How were you doing it in C++? Were you using Flex/Lex and Yacc/Bison at all?

The grammar should be deterministic-context-free. I think. I'm honestly not strong enough in parser theory to prove it for sure, but that feels about right. The parser itself is just a DFA written in classic recursive-descent style, so based on my understanding of the parser/grammar categories, the grammar is DCF. I could be utterly wrong, though.

Cool. Sounds context-free to me. I'm working on an implementation of parsing with derivatives in C++ for my undergraduate thesis, and assuming I actually finish it maybe I can use Epoch's grammar (or a subset of it) as a demo and case study.

All very cool. I'll see if I have more questions that come up during the week.
[size=2][ I was ninja'd 71 times before I stopped counting a long time ago ] [ f.k.a. MikeTacular ] [ My Blog ] [ SWFer: Gaplessly looped MP3s in your Flash games ]
I used boost::spirit, which was a great rapid prototyping tool, but has long since become a liability. A while back I did some performance tuning to get it to suck less (about a 1000x speedup in the parser alone) but it's still pretty slow, clunky, and takes forever to compile.

For comparison, it takes about 20 minutes to do a full rebuild of the project in C++. It takes less than 200 milliseconds to build the entire Epoch implementation of the compiler.

Wielder of the Sacred Wands
[Work - ArenaNet] [Epoch Language] [Scribblings]

It weighs in at just under 10,000 lines right now or I'd post it inline. You can see the evil monstrosity here and gawk in awe at its massive hideousness.

That's pretty damn sweet.

I really need to find the time to complete my little language project. You would not believe how much I regret writing the compiler in a dynamically-typed language...

Tristam MacDonald. Ex-BigTech Software Engineer. Future farmer. [https://trist.am]

Yes compiling itself is a fine stunt (and shows a modicum of language versatility), but if your compiler turns out inefficient code and its self-compiled runtime IS used normally --- and is intended in for use with projects with lots of code (to me 100K actual code lines is not overly large) then thats a negative.
"Compiles to native code for maximum performance" does NOT mean its efficient anywhere as much a a well developed optimizing compiler will do (for the compile runtime)
Your library generation likewise.


I'm not entirely sure what most of this means, but it should be pointed out that I'm not writing my own native code generation or optimizations. I'm using LLVM for all that business.
For what it's worth, I wrote a raytracer in Epoch that marginally edges out a comparable C++ implementation for raw performance on my laptop.

Ok, so your language is a front-end using a relatively mature backend to do the code generation ....

--------------------------------------------[size="1"]Ratings are Opinion, not Fact

You are a hero. I've long given up self-hosting.

And to be honest, I'm considering giving up context-free grammars as well. I'm not sure they buy me something in the real world.

Previously "Krohm"

This topic is closed to new replies.

Advertisement