Back to General and Gameplay Programming

Compiler design

General and Gameplay Programming Programming

Started by Khaos August 03, 2005 11:11 PM

9 comments, last by Extrarius 18 years, 8 months ago

Khaos

170

Author

August 03, 2005 11:11 PM

Let's say I wanted to create a compiler for a little language of my design. Could someone explain the general process? Should the compiler take my code and turn it into assembly code, and then use an assembler, such as NASM to assemble it, or write my own assembler and assemble it? Either way, that doesn't seem like the right thing to do. Also, I don't want to use those compiler design libraries. But anyway, how do the current compilers do it? All I see, even with home-made compilers, is a source code to executable conversion.. and no middle step (assembly). I would like to do this, but I am not sure what they are doing. So, in general, what is the compiler converting my code into. Say for example: x=10; x=x+1 -> What might this be converted into, in general? I don't know how to ask this clearly. If you know what I mean, please help. I am not asking for specifics, just the general idea so I can get started. As simple as possible for now, nothing big, nothing fancy. Thank you.

Roboguy

794

August 03, 2005 11:23 PM

Probably the easiest thing to do would to either use a VM or compile into assembly (IIRC, GCC always compiles to assembly, then assembles it with the assembler specified in it's configuration file).

http://www.roboguy.net(WIP) - lisperati - SICP - Haskell - Python - OCaml - Lambda the Ultimate - Good Math, Bad Math - Wiki (not Wikipedia) - Pure - Term-Rewriting Functional Language

Oluseyi

2,123

August 04, 2005 12:03 AM

Read.

nprz

692

August 04, 2005 01:09 AM

In my compiler class, we didn't have enough time to get from interpretor code to assembly, but there are so steps that make the process a bit easier.

Scanner: Determine symbols or combinations or symbols that aren't allowed, like <> wouldn't be seen in C, but is used in VB. Change the code into tokens, so that your parser can read it easier.

Parser: Determines that your grammar is correct. If you have a const put into a byref function, the parser should reject this.

Symbol Generator: Symbols need to be generated for everything. (x + 1) will be converted to a temporary symbol.

Intermediate Code or Assembly Code: You can turn all of your source code into assembly code in this step.

You should also make sure that your grammar is in a determinate form, such that looking at one token at a time, you know which path to take. This part is hard for me to explain though.

sirGustav

588

August 04, 2005 01:23 AM

This might help or you could let the compile convert your code into C and let another compiler compile your C code.

my game | github | last.fm

timw

598

August 04, 2005 04:59 AM

in general a compiler will take code and convert it into intermediate representation like an AST or some sort of tree structure. some compilers will optomize this, it's like turning your code into a tree graph, really handy when you want to do optomization. that (optomized tree) is then traversed and assembly is generally produced. then it is usually fed into an assembler, writing your own is a huge waste of time, considering debuging a compiler is no easy task, youll have pleanty of work to do anyway. read up on using comiler compilers like bison or jcup. these tools are invaluable to learn if you want to write a compiler, as they will save you much time in the long run of the project.

Helter Skelter

332

August 04, 2005 08:44 PM

Quote:Original post by Khaos
Let's say I wanted to create a compiler for a little language of my design. Could someone explain the general process?

Traditional compilers are broken down into several parts starting with lexical analysis. This takes an input stream (memory buffer, file, etc) and tokenizes the input. In the case of operators that are represented by more than one symbols (say '==' or '!=') a single token is used to represent them. For languages such as C, C++, and Java the lexical parser usually skips over whitespaces, new line characters, and comments rather than returning them as tokens (which makes parsing easier).

The next component is the parser which takes the tokenized input and matches it against the grammar of the language. In many cases the parser and lexical tokenizer work in parallel with the lexer returning one token at a time and the parser checking, accepting, rejecting, or skipping tokens in the input stream based on the langauges grammar. A good number of compilers use this step to create an abstract syntax tree. This is a tree of data structures representing the tokenized input in such a way that represents their position in the input stream.

Next you might run the abstract syntax tree through an optimizer that would either modify or remove certain nodes in the tree. This might be done to accomplish common subexpression elimination, constant folding, or loop unrolling depending on the type of target (VM or CPU).

From here you have several options to decide upon. If you are using an abstract syntax tree you might convert it into intermediate code that represents the operations in a sequential fashion rather than that of the AST. You also might generate assembly code or directly emit a binary object or java like class file.

Quote:Should the compiler take my code and turn it into assembly code....Either way, that doesn't seem like the right thing to do.

This is a common approach and is by far one of the easier methods and quite acceptable. Compilers that don't use an external assembler to generate .obj or java like .class files generally generally have an internal component that does virtually the same thing.

Quote:Also, I don't want to use those compiler design libraries. But anyway, how do the current compilers do it?

While you can certainly create all the lexical and grammar parsing parts of the compiler by hand I don't recommend it. Tools such as flex, bison, antlr, etc. are extrmely helpful and make langauge development much easier - especially since you will likely end up duplicating code in hand made parsers.

Quote:So, in general, what is the compiler converting my code into. Say for example: x=10; x=x+1 -> What might this be converted into, in general?

Depending on the operations performed on x before and after the code given you might end up with something like this:

; unoptimized    mov    dword ptr [ebp - 4], 10     ; set x = 10    mov    eax, dword ptr [ebp - 4]    ; get x    add    eax, 1                      ; add 1    mov    dword ptr [ebp - 4], eax    ; set x; partial optimization    mov    dword ptr [ebp - 4], 10     ; set x = 10    inc    dword ptr [ebp - 4]         ; do x = x + 1

For code generation look into using a burg implementation which, like bison and flex, takes an input file describing the operations to generate code for and automatically creates a code generator.

For an example of a compiler containing hand coded lexer and parser along with a burg code generator take a look at LCC.