
Behold! The compiler which compiles itself(*)


Please, please, please tell me you have a script that combines all that code into one file and that you don't edit the 10k monstrosity as a whole!

 

But very cool. What kind of grammar is Epoch? Context-free? I ask because I'm studying compilers right now and am curious about these things. I tried to tell from your parsing code, but it's... too verbose for me :) Actually, shoot, I'm gonna ask a bunch of compiler-related questions:

 

Do you do much static analysis or optimizing? How many passes over the program tree do you do, and how many different intermediate representations do you go through? What kind of code do you generate? x86? LLVM? And are there any other questions I should have asked but didn't?


Please, please, please tell me you have a script that combines all that code into one file and that you don't edit the 10k monstrosity as a whole!

 

But very cool. What kind of grammar is Epoch? Context-free? I ask because I'm studying compilers right now and am curious about these things. I tried to tell from your parsing code, but it's... too verbose for me :) Actually, shoot, I'm gonna ask a bunch of compiler-related questions:

 

Do you do much static analysis or optimizing? How many passes over the program tree do you do, and how many different intermediate representations do you go through? What kind of code do you generate? x86? LLVM? And are there any other questions I should have asked but didn't?

 

Yes, compiling itself is a fine stunt (and shows a modicum of language versatility), but if your compiler turns out inefficient code, and its self-compiled runtime IS used normally and is intended for use with projects with lots of code (to me, 100K actual code lines is not overly large), then that's a negative.

 

"Compiles to native code for maximum performance" does NOT mean it's anywhere near as efficient as what a well-developed optimizing compiler will produce (for the compiled runtime).

 

Your library generation likewise.

I seriously thought about writing a combiner script, but it honestly turns out that having everything in a monolithic file isn't all that bad. 10KLOC is large but not unmanageable, and it's actually easier to remember how to find things in a single file than you might expect. Of course, it helps that the Epoch implementation of the compiler is about 1/4 the size of the C++ implementation, so remembering things is 4 times easier to begin with ;-)


The grammar should be deterministic-context-free. I think. I'm honestly not strong enough in parser theory to prove it for sure, but that feels about right. The parser itself is just a DFA written in classic recursive-descent style, so based on my understanding of the parser/grammar categories, the grammar is DCF. I could be utterly wrong, though.
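For readers following along, the style described above is easy to sketch in miniature: a deterministic recursive-descent parser uses one token of lookahead and never backtracks. This is a generic toy expression grammar, not Epoch's actual grammar; every name here is illustrative only.

```python
import re

def tokenize(src):
    """Split input into numbers and the operators/parens of the toy grammar."""
    return re.findall(r"\d+|[()+*]", src)

class Parser:
    """Deterministic recursive descent: one token of lookahead, no backtracking."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def eat(self, expected=None):
        tok = self.peek()
        if expected is not None and tok != expected:
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        self.pos += 1
        return tok

    # expr := term ('+' term)*
    def expr(self):
        node = self.term()
        while self.peek() == "+":
            self.eat("+")
            node = ("+", node, self.term())
        return node

    # term := atom ('*' atom)*
    def term(self):
        node = self.atom()
        while self.peek() == "*":
            self.eat("*")
            node = ("*", node, self.atom())
        return node

    # atom := NUMBER | '(' expr ')'
    def atom(self):
        if self.peek() == "(":
            self.eat("(")
            node = self.expr()
            self.eat(")")
            return node
        return ("num", int(self.eat()))

def parse(src):
    return Parser(tokenize(src)).expr()
```

The single-token lookahead in `peek()` is what makes the parse deterministic: at every decision point, exactly one production applies.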

This is actually only a chunk of the compilation process - a significant chunk, given that it produces bytecode for an abstract machine, but not everything. The remaining code (which is all C++) translates the bytecode emitted from this layer into LLVM bitcode which is then turned into native machine code more or less on the fly. I decided to do it this way because interfacing with LLVM from non-C++ languages is slightly less than a total nightmare, and this is already a big enough project.
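The bytecode-to-bitcode step described above is, at its heart, a mechanical lowering from a stack machine to register-style instructions. Here is a toy illustration of that kind of translation; the opcodes and output format are invented for the example and bear no relation to Epoch's real bytecode or to actual LLVM IR.

```python
def lower(bytecode):
    """Translate stack-machine bytecode into register-style three-address code.

    Each arithmetic op pops its operands off a simulated stack and emits one
    instruction writing a fresh temporary, mimicking how a stack IR maps onto
    an SSA-flavored form.
    """
    stack, out, tmp = [], [], 0
    for op, *args in bytecode:
        if op == "push":
            stack.append(str(args[0]))
        elif op in ("add", "mul"):
            b, a = stack.pop(), stack.pop()
            name = f"%t{tmp}"
            tmp += 1
            out.append(f"{name} = {op} {a}, {b}")
            stack.append(name)
        elif op == "ret":
            out.append(f"ret {stack.pop()}")
        else:
            raise ValueError(f"unknown opcode {op!r}")
    return out
```

Because the mapping is purely local (each stack op becomes at most one instruction), the conversion can run "more or less on the fly," as described above.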

Static analysis is a big part of the language. The code above implements a (mostly) full type checker as well as limited type inference support. It also permits function overloading and pattern matching. (In fact, using pattern matching is why the code is so compact compared to the C++ version; certain forms of code become a lot more succinct when you can express them as pattern matches. As the pattern matching support gets richer it'll get even more compact.)

Optimization is basically left entirely to LLVM at this point, mainly because once you hit LLVM bitcode there are very few categories of optimization that LLVM can't already do. A few high-level things are done by the compiler based on type information, but that's about it. There will probably be a lot more type-informed optimization done later as things get richer in the language.

Right now the parser outputs everything in a single IR on the Epoch side of things. This IR is traversed and decorated by the compiler itself, using a rather ad hoc traversal pattern, but in essence every node of the IR is visited and processed exactly once, even though it may strictly be "visited" many times (with the decorations used as a kind of cache to avoid recomputing type information, say). From there a final traversal of the IR outputs the abstract machine bytecode, which goes through a simple mechanical conversion to LLVM bitcode before being handed off to a suite of dozens of LLVM passes for optimization.
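The "processed once, visited many times" idea above amounts to memoizing decorations directly on the tree. A minimal sketch, with invented node shapes and type rules (nothing here reflects Epoch's real IR):

```python
class Node:
    """A hypothetical IR node; `inferred_type` is the decoration slot."""

    def __init__(self, kind, children=(), value=None):
        self.kind = kind
        self.children = list(children)
        self.value = value
        self.inferred_type = None  # filled in on first visit

def infer_type(node):
    # Second and later visits are cache hits: the real work happens once.
    if node.inferred_type is not None:
        return node.inferred_type
    if node.kind == "int_literal":
        result = "integer"
    elif node.kind == "add":
        left, right = (infer_type(c) for c in node.children)
        if left != right:
            raise TypeError(f"cannot add {left} to {right}")
        result = left
    else:
        raise NotImplementedError(node.kind)
    node.inferred_type = result  # decorate the node
    return result
```

Because the decoration lives on the node itself, any later pass that revisits a subtree (overload resolution, code generation, and so on) gets the type for free.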


There are probably many other questions worth asking, although I'd be happy to ramble about my baby for a million years, so you're probably better off not asking them all unless you want an earful :-P


Yes, compiling itself is a fine stunt (and shows a modicum of language versatility), but if your compiler turns out inefficient code, and its self-compiled runtime IS used normally and is intended for use with projects with lots of code (to me, 100K actual code lines is not overly large), then that's a negative.
 
"Compiles to native code for maximum performance" does NOT mean it's anywhere near as efficient as what a well-developed optimizing compiler will produce (for the compiled runtime).
 
Your library generation likewise.



I'm not entirely sure what most of this means, but it should be pointed out that I'm not writing my own native code generation or optimizations. I'm using LLVM for all that business.

For what it's worth, I wrote a raytracer in Epoch that marginally edges out a comparable C++ implementation for raw performance on my laptop.


I seriously thought about writing a combiner script, but it honestly turns out that having everything in a monolithic file isn't all that bad. 10KLOC is large but not unmanageable, and it's actually easier to remember how to find things in a single file than you might expect. Of course, it helps that the Epoch implementation of the compiler is about 1/4 the size of the C++ implementation, so remembering things is 4 times easier to begin with ;-)

How were you doing it in C++? Were you using Flex/Lex and Yacc/Bison at all?

The grammar should be deterministic-context-free. I think. I'm honestly not strong enough in parser theory to prove it for sure, but that feels about right. The parser itself is just a DFA written in classic recursive-descent style, so based on my understanding of the parser/grammar categories, the grammar is DCF. I could be utterly wrong, though.

Cool. Sounds context-free to me. I'm working on an implementation of parsing with derivatives in C++ for my undergraduate thesis, and assuming I actually finish it maybe I can use Epoch's grammar (or a subset of it) as a demo and case study.
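For the curious, the core of parsing with derivatives is easy to demo on plain regular expressions: Brzozowski's derivative of a language with respect to a character, plus a nullability test. The tuple encoding below is invented for the sketch; the full technique extends this (with laziness, memoization, and fixed points) to context-free grammars.

```python
EMPTY = ("empty",)  # matches nothing
EPS = ("eps",)      # matches only the empty string

def char(c):
    return ("char", c)

def seq(a, b):
    return ("seq", a, b)

def alt(a, b):
    return ("alt", a, b)

def star(a):
    return ("star", a)

def nullable(r):
    """Does r accept the empty string?"""
    tag = r[0]
    if tag in ("eps", "star"):
        return True
    if tag in ("empty", "char"):
        return False
    if tag == "alt":
        return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])  # seq

def derive(r, c):
    """Brzozowski derivative: the language of suffixes of r after reading c."""
    tag = r[0]
    if tag in ("empty", "eps"):
        return EMPTY
    if tag == "char":
        return EPS if r[1] == c else EMPTY
    if tag == "alt":
        return alt(derive(r[1], c), derive(r[2], c))
    if tag == "star":
        return seq(derive(r[1], c), r)
    # seq: derive the head; if the head is nullable, also derive past it
    headed = seq(derive(r[1], c), r[2])
    return alt(headed, derive(r[2], c)) if nullable(r[1]) else headed

def matches(r, s):
    for c in s:
        r = derive(r, c)
    return nullable(r)
```

Matching is just repeated differentiation followed by a nullability check, which is what makes the approach so pleasant to implement and a nice fit for a thesis case study.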

All very cool. I'll see if I have more questions that come up during the week.

I used boost::spirit, which was a great rapid prototyping tool, but has long since become a liability. A while back I did some performance tuning to get it to suck less (about a 1000x speedup in the parser alone) but it's still pretty slow, clunky, and takes forever to compile.

For comparison, it takes about 20 minutes to do a full rebuild of the project in C++. It takes less than 200 milliseconds to build the entire Epoch implementation of the compiler.


It weighs in at just under 10,000 lines right now or I'd post it inline. You can see the evil monstrosity here and gawk in awe at its massive hideousness.

That's pretty damn sweet.

I really need to find the time to complete my little language project. You would not believe how much I regret writing the compiler in a dynamically-typed language...


 

Yes, compiling itself is a fine stunt (and shows a modicum of language versatility), but if your compiler turns out inefficient code, and its self-compiled runtime IS used normally and is intended for use with projects with lots of code (to me, 100K actual code lines is not overly large), then that's a negative.
"Compiles to native code for maximum performance" does NOT mean it's anywhere near as efficient as what a well-developed optimizing compiler will produce (for the compiled runtime).
Your library generation likewise.


I'm not entirely sure what most of this means, but it should be pointed out that I'm not writing my own native code generation or optimizations. I'm using LLVM for all that business.
For what it's worth, I wrote a raytracer in Epoch that marginally edges out a comparable C++ implementation for raw performance on my laptop.

 

OK, so your language is a front end using a relatively mature backend to do the code generation...


You are a hero. I've long given up self-hosting.

And to be honest, I'm considering giving up on context-free grammars as well. I'm not sure they buy me anything in the real world.



It weighs in at just under 10,000 lines right now or I'd post it inline. You can see the evil monstrosity here and gawk in awe at its massive hideousness.

 

 


You keep harping on this 9,000-lines-of-code business. You do realize that's considered trivially small by most professional standards, right?
 
But I do appreciate the work you have done here, because for many of us it takes a huge effort to write almost 10,000 lines of code; many times more lines get written and removed along the way to reach an optimal piece of code, counting older versions, experiments with possible solutions, optimizations, and so on. Keep up the good work :)


From the website, emphasis mine:

  • Programming shouldn't suck. This underlies all other principles of the language.
  • Make it easy to do the right thing, and hard to do the wrong thing.
  • Get out of the programmer's way as much as possible.
  • Repeating yourself and stating the obvious are both equally grave sins.
  • Conventional modes of thought only deserve to live insofar as they enhance productivity.
  • Provide escape hatches to permit things that are gross but necessary.

And yet, the language is not a Lisp! Something is wrong here...

Edited by tufflax
