Creating a simple programming language/interpreter/compiler

Started by
8 comments, last by klems 13 years, 3 months ago
I've been interested in writing my own language for a while. I've decided that I now have enough programming experience to be able to write one - albeit slowly, with a lot of resource help.
I've scribbled up some example code of how I want it to look, which seemed like a useful first step. I want to take it as progressively as possible - it currently looks like a procedural watered down Python in example.

I think the first thing is do I write a compiler or interpreter? I don't have much of a clue of how this works. Obviously a compiler takes the code down to assembly or C then existing compilers compile that. I'm not entirely sure how an interpreter would work - instead of converting the entire file to assembly, it would just process the line at a time?
As such, which would be simpler to write? From my searching I've seen numerous references to virtual machines with interpreters..which sound pretty complex to do.

Second, I know the general steps are:
Code->Lexer tokenizes code->Parser parses the code->Code is interpreted/intermediate language then assembled or somesuch.

Again, I've seen most people say 'use lex/flex and bison/yacc or ANTLR.' to save a lot of time. I take it these generate the lexer/parser parts for you..but how much do they actually do? I'd like to implement as much as this from scratch as possible for the reason that I want to say I've done most of it, not just using an existing library. The 'reinventing the wheel' doesn't really apply since this is for fun, learning, and as a hobby project to do.

Thirdly, is the Jack Crenshaw compiler tutorial still a good read? It's 20 years old, I think.
I'm an unemployed student so I can't exactly afford the £50 for the dragon book, however good that's supposed to be.

tl;dr
I want to write a simple language as a hobby project.
1)Would a compiler or an interpreter be easier to do?
2)How much of the work do lex/yacc do for you? Would it be a huge part that I might want to write myself?
3)Is the Jack Crenshaw 'Lets write a compiler!' tutorial still relevant?

Thanks!
Advertisement
What's with all the programming language threads recently? I'm certainly not complaining, but the timing seems... odd.


I think the first thing is do I write a compiler or interpreter?


I wouldn't start there. A programming language is a software product like any other. Start with requirements. What does it need to do? What do you want to achieve?

Then plan out what you need to do that. For my hobby language, I started with the type system. It worked well for me, since that was the kind of core of what I wanted to do. A different language will require a different approach. The interpreter/compiler is often the last step, because you often don't care about that plumbing.


I'm not entirely sure how an interpreter would work - instead of converting the entire file to assembly, it would just process the line at a time?
As such, which would be simpler to write? From my searching I've seen numerous references to virtual machines with interpreters..which sound pretty complex to do.


Some do one line at a time. A more common middle ground these days is the virtual machine route. The parser spits out an in-memory version of the program which is a series of operations to do. Your code then can do those operations. A (simple) virtual machine isn't that complex once you start toying around with the stuff.

An interpreter is easiest to write, but your language requirements might prevent you from going that route.


Again, I've seen most people say 'use lex/flex and bison/yacc or ANTLR.' to save a lot of time. I take it these generate the lexer/parser parts for you..but how much do they actually do? I'd like to implement as much as this from scratch as possible for the reason that I want to say I've done most of it, not just using an existing library. The 'reinventing the wheel' doesn't really apply since this is for fun, learning, and as a hobby project to do.


They do quite a bit from what I understand. I wrote my own parser, but I do not recommend that route. Learning how to use the tools is more valuable than learning how to do it yourself; especially for tools like these that are standard, unlikely to change much, and pretty hard to use correctly in the first place.


Thirdly, is the Jack Crenshaw compiler tutorial still a good read? It's 20 years old, I think.
I'm an unemployed student so I can't exactly afford the £50 for the dragon book, however good that's supposed to be.


It should be fine. Pretty much every tutorial in the world starts with the calculator example, and it's so well known that it doesn't change much. I've heard varied things about the dragon book, and it will probably be overkill for your first hobby language.
The absolute first thing you need to do, as Telastyn mentioned, is do some requirements gathering. What will your language do? How will it be different from other, existing languages? Will it be for general-purpose development or more domain-specific? What will its primary strength be? Speed of execution? Rapid development? Easy reading for beginners? Something more interesting?

Once you have a good picture of what exactly you're trying to accomplish, then you can start worrying about things like execution model. Interpreters are often easier to write but require a lot more thought in how you will structure your language, because you have to explicitly handle things that come more or less for free on a VM/native code compilation model. Interpreting code is also very slow by comparison, so if you're interested in any kind of high-performance applications, that goes out the window. Compilation to native code is incredibly difficult and complex due to the realities of how modern CPUs work. Compilation to an intermediate form like C is pretty easy once you understand basic AST transformation principles; compilation to virtual machine code is slightly harder but again gives you a tremendous amount of flexibility and control over the mechanics of your language. You get more for "free" in the VM model than any other approach, IMHO; because you can control both the language mechanics and the machine's mechanics.

Deciding on execution model really depends more on your language requirements than anything else. It's also more or less arbitrary unless you're trying to make a really successful, widely-used language; for a hobby project, pick the right balance between ease of implementation and learning/educational opportunities, and go from there.

I second Telastyn's recommendation against rolling your own parser. Unless your syntax is painfully simple and consistent (think Lisp) you will spend far more time trying to get your parser right than actually doing fun work on the language. Ad hoc parsers are evil, complex, and hideously hard to debug. You are far better off using a parser-generator tool.

You will still have plenty of work to do, so don't worry about it taking all the fun out of the project. For instance, even with a parser generator at your disposal, you have to think about how to generate your AST and decorate it and then transform it for various compilation purposes and then actually emit it into an executable format, blah blah blah. The vast bulk of the trickiness of making a language lies outside of the parser, even for pathologically stupid syntaxes like C++.


Try learning some assembly language if you can, for any architecture, as this will make your entire project a lot easier. Look especially for the mapping between, say, C and x86 ASM, and think about how you would do something similar. (Hint: use your C compiler with optimizations turned off and look at the generated assembly. Make sure not to try this with optimizations on unless you want to cry in despair a lot.)

Wielder of the Sacred Wands
[Work - ArenaNet] [Epoch Language] [Scribblings]

You might want to write the parser yourself, but once you get started you're going to change your mind quickly. Like, VERY quickly. If you're only making a toy language, you could try out the BNF Converter, an even more high-level tool than for example Bison. I wouldn't recommend it for larger projects though, as its generality locks your AST to your concrete syntax in pretty awkward ways.

Also, you might want to avoid using languages like C++, C#, Java, etc. when writing a compiler and instead take a look at some languages that are better suited to manipulating symbolic data, which is what you're going to be doing a LOT of. Haskell is great for this sort of work, and if you're into .NET there's F# for you.
It's not that hard to write a parser, actually, it depends on what you need. For example, I've wrote my own INI/CFG and XML parsers quite a few times. The method I use is(web apparently) the easiest to use and quickest ti implement. It's called Recursive Decent Parsing. Might I suggest you research that topic? Okay, I'll suggest it. I suggest you should research Recursive Decent parsing because it's the easiest I've found so far to implement. Others.... are more complicated. Here's a link to <a href="http://en.wikipedia.org/wiki/Recursive_descent_parser">wiki</a>

Not to reiterate what was said before, but it all comes down to requirements. If you looking for something along the lines of a Basic or C dialect, such as Javascript, then this type of parser might not work for you. One thing you will have to watch out for is recursive dead-lock! that's what I call it, but the technical name, I believe, is "Non Left-Recursion".The link I provided should give more information on that. But please keep in mind that a parser is, very much so, only half the deal here. What are you going to do with the data once it's parsed? What about security? Since this is a game development site, I'm making and assumption here, this language of yours will be used to develop games, correct? How will you run it? Will you use a virtual machine similar to Java, or produce and actual executable? What about the runtime or platform you are planning on writing for? Linux? Mac? Typical Windows? Lot's of questions...

Sorry if I sound disheartening. This sounds like a typical something that I would do, or be into. In fact, currently I kinda am so... If you need help with anything, don't hesitate to email or something, because I wouldn't mind writing parts of it for you. Cuz I'm cool like that.
Recursive descent is extremely limited in the types of grammars it can cleanly handle. Yes, you can (mechanically) transform any LR grammar into an LL grammar and therefore into one suitable for an RDP, but that's a lot of excess overhead.

Moreover, recursive descent parsers are very difficult to debug when you get past a minimum threshold of complexity. And rest assured, you'll spend a lot of time debugging your parser, even if you use a generator. I use boost::spirit for Epoch, which is an LL(inf) parser generator; it does an unlimited number of token lookahead steps to match grammar productions in an LL grammar using recursive descent. Although it's incredibly more productive than rolling my own parser (especially considering the relative sophistication of Epoch's grammar) it's still a total nightmare to debug. Anything that goes wrong means a couple of hours minimum slogging through the generated code looking for exactly how a production failed, or a semantic action triggered incorrectly, or whatever.

Although I'm satisfied with spirit for my own uses, I plan to escape it at some point and lean towards an LALR parser generated from a more customary generator tool. This is something that will likely happen once the grammar stabilizes a bit and the biggest bits of ambiguity and open-ended extension are taken care of. LALR parsers are, generally speaking, more compact in terms of generated code, faster, and use less runtime storage; they can be beastly to write by hand but are generally fairly straightforward to understand and debug, at least in my limited experience.

Using a generator has one final advantage that cannot be understated: it obviates the need to wonder if your bugs are due to the grammar/semantic actions, or the parsing process itself. Generally you can rely on generators to produce correct output; the only thing that can go wrong is user error unless you do something truly pathological with your grammar (in which case it is almost always the case that simple mechanical transformations of the grammar can eliminate the problem). In terms of wasted time, this drastically cuts down on the cost to produce a working, reliable parser.

Lastly, if you plan on having any kind of mutable or extensible syntax, use a generator. Hand-coding a parser that correctly deals with dynamic grammars is a bitch to get right.

Wielder of the Sacred Wands
[Work - ArenaNet] [Epoch Language] [Scribblings]

Well, here's another reference to my post about a simple interpreted language I wrote for the heck of it, maybe you can get some ideas:

My Post


Good Luck

My Gamedev Journal: 2D Game Making, the Easy Way

---(Old Blog, still has good info): 2dGameMaking
-----
"No one ever posts on that message board; it's too crowded." - Yoga Berra (sorta)

Thanks for all the feedback, guys.

As for purpose, I don't think there is one at this stage past general development. Though I doubt it'll really be used as anything other than 'I made this' and a toy project. As for the design, it's very simple Pythonic like, with indents for scopes and using capital letters/full stops as delims (which probably isn't the best idea, but I think it looks neat and tidy).

I'm writing it in Python..as a side point, why would it be more difficult in a lower language like C++?
I've got the general lexer design set out in my head (wow, regexes seem to be built for this) - I'm guessing this is the easiest bit.

As for parsing goes, I shall give recursive descent parsers a read up on, then give that a go..I think I probably will go off them and go for bison or something similar, but the challenge is there for me.

I just hope considerations for each part linearly isn't going to be an issue. That is, I've spent the last couple of days thinking about just the lexer, then I'll just be thinking about the parser - so currently I have no idea whether it'll be compiled to an executable or onto a VM.
Once I tried to make my own compiler for the microcontroller PIC16F84. The ASM of the device was about 30 instructions, and I tried to make my own C-like language so I can write programs in C, instead of ASM. It basically translated the C code to ASM. It was kind of fun. I remember I tokenized the input, then I tried to guess what kind of operation is taking place - (+,-,++ , a loop, conditional statement, etc..) have some kind of "begin block, end block" for functions, do / for / while loops, "if's-else" and such. I also remember I made heavy use of labels in the ASM output. It was kinda of fun to write. smile.gif
Here is an early picture of it.


picc194.th.jpg

Uploaded with ImageShack.us

I'm writing it in Python..as a side point, why would it be more difficult in a lower language like C++?
My favourite feature ever: pattern matching. The high-level view of a program as a transformation from input to output rather than a set of reads and writes to various memory locations, something which can get in the way when writing interactive programs, also is a huge boon for this type of program.

This topic is closed to new replies.

Advertisement