
Behold! The compiler which compiles itself(*)


#1 ApochPiQ   Moderators   -  Reputation: 15108


Posted 14 October 2013 - 09:32 PM

One of my big time consumers lately has been the Epoch language compiler. This is a beast which is intended to self-host, i.e. compile itself.

It weighs in at just under 10,000 lines right now, or I'd post it inline. You can see the evil monstrosity here and gawk in awe at its massive hideousness. Unfortunately, Google Code does not know how to highlight Epoch code yet, so it doesn't look as pretty as it does in my (fully home-grown) IDE.


Bestow upon me your undying praise... er, expressions of vomitous revulsion.



OK, so I lied. Technically it's very close to being able to self-host, but it's missing a couple of features.


#2 Cornstalks   Crossbones+   -  Reputation: 6974


Posted 14 October 2013 - 09:53 PM

Please, please, please tell me you have a script that combines all that code into one file and that you don't edit the 10k monstrosity as a whole!

 

But very cool. What kind of grammar is Epoch? Context-free? I ask because I'm studying compilers right now and am curious about these things. I tried to tell from your parsing code, but it's... too verbose for me :) Actually, shoot, I'm gonna ask a bunch of compiler-related questions:

 

Do you do much static analysis or optimizing? How many passes over the program tree do you do, and how many different intermediate representations do you go through? What kind of code do you generate? x86? LLVM? And are there any other questions I should have asked but didn't?


[ I was ninja'd 71 times before I stopped counting a long time ago ] [ f.k.a. MikeTacular ] [ My Blog ] [ SWFer: Gaplessly looped MP3s in your Flash games ]

#3 wodinoneeye   Members   -  Reputation: 757


Posted 15 October 2013 - 12:01 AM

Cornstalks, on 14 October 2013 - 09:53 PM, said:

Please, please, please tell me you have a script that combines all that code into one file and that you don't edit the 10k monstrosity as a whole!

But very cool. What kind of grammar is Epoch? Context-free? I ask because I'm studying compilers right now and am curious about these things. I tried to tell from your parsing code, but it's... too verbose for me :) Actually, shoot, I'm gonna ask a bunch of compiler-related questions:

Do you do much static analysis or optimizing? How many passes over the program tree do you do, and how many different intermediate representations do you go through? What kind of code do you generate? x86? LLVM? And are there any other questions I should have asked but didn't?

 

Yes, compiling itself is a fine stunt (and it shows a modicum of language versatility), but if your compiler turns out inefficient code, its self-compiled runtime is the one used normally, and the language is intended for use with projects with lots of code (to me, 100K actual code lines is not overly large), then that's a negative.

"Compiles to native code for maximum performance" does NOT mean the output is anywhere near as efficient as what a well-developed optimizing compiler will produce (for the compiled runtime).

The same goes for your library generation.


--------------------------------------------Ratings are Opinion, not Fact

#4 ApochPiQ   Moderators   -  Reputation: 15108


Posted 15 October 2013 - 12:02 AM

I seriously thought about writing a combiner script, but it honestly turns out that having everything in a monolithic file isn't all that bad. 10KLOC is large but not unmanageable, and it's actually easier to remember how to find things in a single file than you might expect. Of course, it helps that the Epoch implementation of the compiler is about 1/4 the size of the C++ implementation, so remembering things is 4 times easier to begin with ;-)


The grammar should be deterministic-context-free. I think. I'm honestly not strong enough in parser theory to prove it for sure, but that feels about right. The parser itself is just a DFA written in classic recursive-descent style, so based on my understanding of the parser/grammar categories, the grammar is DCF. I could be utterly wrong, though.
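
As an aside for anyone unfamiliar with the term: "recursive descent" just means one function per grammar rule, with a token or so of lookahead deciding which branch to take. A minimal sketch in C++, using an invented toy grammar rather than Epoch's actual one:

```cpp
// Minimal recursive-descent sketch for a toy grammar (NOT Epoch's parser):
//   expr := number (('+' | '-') number)*
#include <cctype>
#include <string>

struct Parser {
    std::string src;
    size_t pos = 0;

    char peek() const { return pos < src.size() ? src[pos] : '\0'; }
    bool accept(char c) { if (peek() == c) { ++pos; return true; } return false; }

    // number := digit+
    bool number() {
        if (!std::isdigit(static_cast<unsigned char>(peek()))) return false;
        while (std::isdigit(static_cast<unsigned char>(peek()))) ++pos;
        return true;
    }

    // expr := number (('+' | '-') number)*
    // Each nonterminal is one function; a single character of lookahead picks
    // the branch, which is what keeps the parse deterministic.
    bool expr() {
        if (!number()) return false;
        while (accept('+') || accept('-')) {
            if (!number()) return false;
        }
        return true;
    }
};
```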

This is actually only a chunk of the compilation process - a significant chunk, given that it produces bytecode for an abstract machine, but not everything. The remaining code (which is all C++) translates the bytecode emitted from this layer into LLVM bitcode which is then turned into native machine code more or less on the fly. I decided to do it this way because interfacing with LLVM from non-C++ languages is slightly less than a total nightmare, and this is already a big enough project.
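
To make that LLVM step concrete, here is a rough C++ sketch of lowering a single, entirely hypothetical "ADD" bytecode operation to LLVM IR. Only the LLVM calls reflect the real API; the bytecode op, the function name, and the surrounding structure are invented for illustration and are not the actual Epoch translation layer:

```cpp
// Lower a hypothetical "ADD" bytecode op into an i32 add(i32, i32) function.
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"

llvm::Function *EmitAddFunction(llvm::Module &module) {
    llvm::LLVMContext &ctx = module.getContext();
    llvm::IRBuilder<> builder(ctx);

    // i32 add(i32, i32) -- stands in for a function lowered from bytecode.
    llvm::FunctionType *fnType = llvm::FunctionType::get(
        builder.getInt32Ty(), {builder.getInt32Ty(), builder.getInt32Ty()}, false);
    llvm::Function *fn = llvm::Function::Create(
        fnType, llvm::Function::ExternalLinkage, "add", &module);

    llvm::BasicBlock *entry = llvm::BasicBlock::Create(ctx, "entry", fn);
    builder.SetInsertPoint(entry);

    auto args = fn->arg_begin();
    llvm::Value *lhs = &*args++;
    llvm::Value *rhs = &*args;

    // The "ADD" bytecode op becomes a single IR instruction; LLVM's optimizer
    // and JIT take it from here.
    builder.CreateRet(builder.CreateAdd(lhs, rhs, "sum"));
    return fn;
}
```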

Static analysis is a big part of the language. The code above implements a (mostly) full type checker as well as limited type inference support. It also permits function overloading and pattern matching. (In fact, using pattern matching is why the code is so compact compared to the C++ version; certain forms of code become a lot more succinct when you can express them as pattern matches. As the pattern matching support gets richer it'll get even more compact.)
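
For a rough sense of why pattern matching compacts this kind of code: dispatching on the shape of a node is one clause per case in a pattern-matching language, whereas the closest idiomatic C++ equivalent needs a variant plus a visitor. The node names below are invented for illustration and are not Epoch's actual IR:

```cpp
// Dispatch on node shape with std::variant/std::visit (C++17).
#include <string>
#include <variant>

struct IntegerLiteral { int value; };
struct Identifier     { std::string name; };
struct BinaryOp       { char op; /* operands omitted for brevity */ };

using ExprNode = std::variant<IntegerLiteral, Identifier, BinaryOp>;

// Helper for building an overloaded visitor out of lambdas.
template <class... Ts> struct Overloaded : Ts... { using Ts::operator()...; };
template <class... Ts> Overloaded(Ts...) -> Overloaded<Ts...>;

std::string Describe(const ExprNode &node) {
    return std::visit(Overloaded{
        [](const IntegerLiteral &n) { return std::string("literal ") + std::to_string(n.value); },
        [](const Identifier &n)     { return "identifier " + n.name; },
        [](const BinaryOp &n)       { return std::string("binary op ") + n.op; },
    }, node);
}
```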

Optimization is basically left entirely to LLVM at this point, mainly because once you hit LLVM bitcode there are very few categories of optimization that LLVM can't already do. A few high-level things are done by the compiler based on type information, but that's about it. There will probably be a lot more type-informed optimization done later as things get richer in the language.
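
For the curious, handing a module to LLVM's standard optimization pipeline is only a few lines of C++. This is a generic sketch using the legacy PassManagerBuilder API (present in the LLVM versions of that era), not necessarily how the Epoch compiler actually configures its passes:

```cpp
// Run LLVM's standard, roughly -O2, pipeline over a module.
#include "llvm/IR/LegacyPassManager.h"
#include "llvm/IR/Module.h"
#include "llvm/Transforms/IPO/PassManagerBuilder.h"

void OptimizeModule(llvm::Module &module) {
    llvm::PassManagerBuilder builder;
    builder.OptLevel = 2;                     // pick the canned -O2 pass set

    llvm::legacy::PassManager pm;
    builder.populateModulePassManager(pm);    // fill in the "dozens of passes"
    pm.run(module);
}
```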

Right now the parser outputs everything into a single IR on the Epoch side of things. This IR is traversed and decorated by the compiler itself, using a rather ad hoc traversal pattern, but in essence every node of the IR is processed exactly once, even though it may strictly be "visited" many times (the decorations are then used as a kind of cache to avoid recomputing type information, say). From there a final traversal of the IR outputs the abstract machine bytecode, which goes through a simple mechanical conversion to LLVM bitcode before being handed off to a suite of dozens of LLVM passes for optimization.
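
A stripped-down sketch of that decorate-once idea, with hypothetical stand-in structures rather than Epoch's actual IR: each node caches its computed type, so later visits reuse the decoration instead of recomputing it.

```cpp
// "Decorate once, reuse later": a node's inferred type is cached on the node.
#include <memory>
#include <optional>
#include <string>
#include <vector>

struct IRNode {
    std::string kind;
    std::vector<std::unique_ptr<IRNode>> children;
    std::optional<std::string> cachedType;   // the "decoration"
};

std::string InferType(IRNode &node) {
    if (node.cachedType)                     // already processed: reuse the cache
        return *node.cachedType;

    for (auto &child : node.children)        // process the children once each
        InferType(*child);

    // Real inference would consult the children's types; this just stubs it.
    node.cachedType = node.kind == "integer_literal" ? "integer" : "unknown";
    return *node.cachedType;
}
```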


There are probably many other questions worth asking, but I'd happily ramble about my baby for a million years, so you're probably better off not asking them all unless you want an earful :-P

#5 ApochPiQ   Moderators   -  Reputation: 15108


Posted 15 October 2013 - 12:05 AM

wodinoneeye, on 15 October 2013 - 12:01 AM, said:

Yes, compiling itself is a fine stunt (and it shows a modicum of language versatility), but if your compiler turns out inefficient code, its self-compiled runtime is the one used normally, and the language is intended for use with projects with lots of code (to me, 100K actual code lines is not overly large), then that's a negative.

"Compiles to native code for maximum performance" does NOT mean the output is anywhere near as efficient as what a well-developed optimizing compiler will produce (for the compiled runtime).

The same goes for your library generation.



I'm not entirely sure what most of this means, but it should be pointed out that I'm not writing my own native code generation or optimizations. I'm using LLVM for all that business.

For what it's worth, I wrote a raytracer in Epoch that marginally edges out a comparable C++ implementation for raw performance on my laptop.

#6 Cornstalks   Crossbones+   -  Reputation: 6974


Posted 15 October 2013 - 12:58 AM

ApochPiQ, on 15 October 2013 - 12:02 AM, said:

I seriously thought about writing a combiner script, but it honestly turns out that having everything in a monolithic file isn't all that bad. 10KLOC is large but not unmanageable, and it's actually easier to remember how to find things in a single file than you might expect. Of course, it helps that the Epoch implementation of the compiler is about 1/4 the size of the C++ implementation, so remembering things is 4 times easier to begin with ;-)

How were you doing it in C++? Were you using Flex/Lex and Yacc/Bison at all?

ApochPiQ, on 15 October 2013 - 12:02 AM, said:

The grammar should be deterministic-context-free. I think. I'm honestly not strong enough in parser theory to prove it for sure, but that feels about right. The parser itself is just a DFA written in classic recursive-descent style, so based on my understanding of the parser/grammar categories, the grammar is DCF. I could be utterly wrong, though.

Cool. Sounds context-free to me. I'm working on an implementation of parsing with derivatives in C++ for my undergraduate thesis, and, assuming I actually finish it, maybe I can use Epoch's grammar (or a subset of it) as a demo and case study.
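
(For readers unfamiliar with the technique mentioned here: below is a from-scratch C++ sketch of the derivative idea, restricted to plain regular expressions. The full parsing-with-derivatives approach extends this same operation to context-free grammars; this toy is just an illustration, not anyone's thesis code.)

```cpp
// Brzozowski derivatives for regular expressions, as a tiny matcher.
#include <memory>
#include <string>

struct Re;
using ReP = std::shared_ptr<Re>;

struct Re {
    enum Kind { Empty, Epsilon, Chr, Alt, Seq, Star } kind;
    char c = 0;
    ReP a, b;
};

ReP mk(Re::Kind k, char c = 0, ReP a = nullptr, ReP b = nullptr) {
    return std::make_shared<Re>(Re{k, c, a, b});
}

// Does the language of r contain the empty string?
bool nullable(const ReP &r) {
    switch (r->kind) {
        case Re::Empty:   return false;
        case Re::Epsilon: return true;
        case Re::Chr:     return false;
        case Re::Alt:     return nullable(r->a) || nullable(r->b);
        case Re::Seq:     return nullable(r->a) && nullable(r->b);
        case Re::Star:    return true;
    }
    return false;
}

// Derivative of r with respect to character x: the language of what may
// follow after consuming x.
ReP derive(const ReP &r, char x) {
    switch (r->kind) {
        case Re::Empty:
        case Re::Epsilon: return mk(Re::Empty);
        case Re::Chr:     return r->c == x ? mk(Re::Epsilon) : mk(Re::Empty);
        case Re::Alt:     return mk(Re::Alt, 0, derive(r->a, x), derive(r->b, x));
        case Re::Seq: {
            ReP d = mk(Re::Seq, 0, derive(r->a, x), r->b);
            if (!nullable(r->a)) return d;
            return mk(Re::Alt, 0, d, derive(r->b, x));
        }
        case Re::Star:    return mk(Re::Seq, 0, derive(r->a, x), r);
    }
    return mk(Re::Empty);
}

// Matching is repeated derivation followed by a nullability check.
bool matches(ReP r, const std::string &input) {
    for (char ch : input) r = derive(r, ch);
    return nullable(r);
}
```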

All very cool. I'll see if I have more questions that come up during the week.
[ I was ninja'd 71 times before I stopped counting a long time ago ] [ f.k.a. MikeTacular ] [ My Blog ] [ SWFer: Gaplessly looped MP3s in your Flash games ]

#7 ApochPiQ   Moderators   -  Reputation: 15108


Posted 15 October 2013 - 01:34 AM

I used boost::spirit, which was a great rapid-prototyping tool but has long since become a liability. A while back I did some performance tuning to get it to suck less (about a 1000x speedup in the parser alone), but it's still pretty slow and clunky, and it takes forever to compile.
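
For anyone who hasn't used it: boost::spirit builds the grammar out of C++ template expressions, which is exactly why it is pleasant to prototype with and painful to compile. A toy Spirit Qi example follows; it parses a generic comma-separated integer list and has nothing to do with Epoch's actual grammar:

```cpp
// Toy boost::spirit (Qi) parse: a comma-separated list of integers.
#include <boost/spirit/include/qi.hpp>
#include <iostream>
#include <string>
#include <vector>

namespace qi = boost::spirit::qi;

int main() {
    std::string input = "1, 2, 3, 42";
    std::vector<int> values;

    auto first = input.begin();
    bool ok = qi::phrase_parse(
        first, input.end(),
        qi::int_ % ',',          // grammar: int (',' int)*
        qi::space,               // skipper: ignore whitespace
        values);

    if (ok && first == input.end())
        std::cout << "parsed " << values.size() << " integers\n";
}
```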

For comparison, it takes about 20 minutes to do a full rebuild of the project in C++. It takes less than 200 milliseconds to build the entire Epoch implementation of the compiler.

#8 swiftcoder   Senior Moderators   -  Reputation: 9865


Posted 15 October 2013 - 07:29 PM

ApochPiQ, on 14 October 2013 - 09:32 PM, said:

It weighs in at just under 10,000 lines right now, or I'd post it inline. You can see the evil monstrosity here and gawk in awe at its massive hideousness.

That's pretty damn sweet.

I really need to find the time to complete my little language project. You would not believe how much I regret writing the compiler in a dynamically-typed language...

Tristam MacDonald - Software Engineer @Amazon - [swiftcoding]


#9 wodinoneeye   Members   -  Reputation: 757


Posted 15 October 2013 - 08:00 PM

 

wodinoneeye, on 15 October 2013 - 12:01 AM, said:

Yes, compiling itself is a fine stunt (and it shows a modicum of language versatility), but if your compiler turns out inefficient code, its self-compiled runtime is the one used normally, and the language is intended for use with projects with lots of code (to me, 100K actual code lines is not overly large), then that's a negative.

"Compiles to native code for maximum performance" does NOT mean the output is anywhere near as efficient as what a well-developed optimizing compiler will produce (for the compiled runtime).

The same goes for your library generation.

ApochPiQ, on 15 October 2013 - 12:05 AM, said:

I'm not entirely sure what most of this means, but it should be pointed out that I'm not writing my own native code generation or optimizations. I'm using LLVM for all that business.

For what it's worth, I wrote a raytracer in Epoch that marginally edges out a comparable C++ implementation for raw performance on my laptop.

 

OK, so your language is a front end using a relatively mature backend to do the code generation...


--------------------------------------------Ratings are Opinion, not Fact

#10 Krohm   Crossbones+   -  Reputation: 3052


Posted 21 October 2013 - 01:23 AM

You are a hero. I gave up on self-hosting long ago.

And to be honest, I'm considering giving up on context-free grammars as well. I'm not sure they buy me anything in the real world.



#11 rip-off   Moderators   -  Reputation: 8121


Posted 21 October 2013 - 02:48 PM

Very nice, keep up the good work.

#12 cronocr   Members   -  Reputation: 752


Posted 21 October 2013 - 06:56 PM


ApochPiQ, on 14 October 2013 - 09:32 PM, said:

It weighs in at just under 10,000 lines right now, or I'd post it inline. You can see the evil monstrosity here and gawk in awe at its massive hideousness.

You keep harping on this 9,000-lines-of-code business. You do realize that's considered trivially small by most professional standards, right?

But I do appreciate the work you have done here, because for many of us it takes a huge effort to write almost 10,000 lines of code; many more lines than that get removed along the way in search of an optimal piece of code. That means older versions, tests of possible solutions, optimizations, etc. Keep up the good work :)

 

 


#13 tufflax   Members   -  Reputation: 498


Posted 10 November 2013 - 05:37 PM

From the website, emphasis mine:

  • Programming shouldn't suck. This underlies all other principles of the language.
  • Make it easy to do the right thing, and hard to do the wrong thing.
  • Get out of the programmer's way as much as possible.
  • Repeating yourself and stating the obvious are both equally grave sins.
  • Conventional modes of thought only deserve to live insofar as they enhance productivity.
  • Provide escape hatches to permit things that are gross but necessary.

And yet, the language is not a Lisp! Something is wrong here...


Edited by tufflax, 10 November 2013 - 05:48 PM.


#14 ApochPiQ   Moderators   -  Reputation: 15108


Posted 11 November 2013 - 10:47 AM

Heh, it actually began life as a Lisp, many years ago. It's evolved considerably since, obviously.



#15 tufflax   Members   -  Reputation: 498


Posted 11 November 2013 - 10:58 AM

How did that happen? No one in their right mind would go from Lisp to non-Lisp. :P






