[C] Parsing C tokens

Started by
26 comments, last by Ectara 13 years, 7 months ago
I'm writing a C interpreter/compiler, written in C, and am stuck at the parsing step. I wrote my own lexing process, which works nicely and gives me an array of tokens, but I am not sure as to the next step. I would figure that I would utilize a tree to order and evaluate an expression, then convert that tree to code. Any example I try to find either mixes lexing and parsing, which I can't use, or immediately jumps to Yacc and Lex, which is also of no use to me. So far I have variable and function declaration completed. Abandoning the project is unacceptable, as is using some sort of tool, or using another's interpreter. Has anyone had any luck getting the parser started?
Advertisement
In order to parse C code you need two things: The grammar for the language and a parser.

Finding the grammar for C isn't hard (first thing I found on Google).

You have decided for some reason that you don't want to use yacc or any other tools for parsing (could you explain why?). What do you know about parsers? Have you written one for simple arithmetic expressions? You can parse C using yacc, so it can be done with a LALR parser.

I have written a couple parsers in the past, one for ASM, and another couple for simpler languages than C. I've been using C almost every day for years, so I have a pretty solid knowledge of its syntax and grammar.
Quote:Original post by Ectara
I have written a couple parsers in the past, one for ASM, and another couple for simpler languages than C. I've been using C almost every day for years, so I have a pretty solid knowledge of its syntax and grammar.


I am very familiar with C's syntax and grammar, but I don't think I could write down a formal specification from memory.

Again, why don't you want to use lex and yacc?
Licensing conflicts, another dependency, and it would also be nice to have the experience. Using them is unacceptable.
If you're coding a parser by hand, I find the easiest path to take is to go for a Recursive Descent Parser (RDP).

The wikipedia article on the topic (link) got some nice examples to get you started. What you should do is to first describe the grammar you want to parse with something like BNF, after which it's very easy to translate it directly to a RDP.

It's very helpful if you make your tokenizer/lexer in such a way it's easy to assign semantic values to your tokens (for example "124" would translate to an TOKEN_INT with a value of 124).

If you want to make sure your expressions solve properly with respect to presedence (/ before *), you'd probably want to look up something about tail recursion.


Thanks for the heads up, I'll look into it. I wrote a special string hashkeying function, which I use as an integer key before my hash functions in tables, but also I use the keys of certain strings as symbolic constants for comparisons, so I'd say I already have that down. I have the types and indirection, and dereferencing down; I wrote the entire virtual machine underneath it, and matching ASM instruction sets, so I just need a language to interpret, and what better choice than C.
Quote:Original post by Ectara
Thanks for the heads up, I'll look into it. I wrote a special string hashkeying function, which I use as an integer key before my hash functions in tables, but also I use the keys of certain strings as symbolic constants for comparisons, so I'd say I already have that down. I have the types and indirection, and dereferencing down; I wrote the entire virtual machine underneath it, and matching ASM instruction sets, so I just need a language to interpret, and what better choice than C.


How's your virtual machine like? If you make it completely stack based, it's very-very-very easy to get stuff running in no time since it's trivial to translate a syntax tree into something which runs directly on a stack.

For example this syntax tree:

AssignExpressionNode | *---StorageIdentifierNode(FooBar) | |  *---MultiplyExpressionNode       |       |       *----IntegerConstantNode(2)       |       |       *----IntegerConstantNode(3)


To generate code for it you'd simply recursively run through the tree, starting at the root. When you reach an IntegerConstantNode, you'd generate a "push unto stack" instruction, etc. The above could produce the following code:

PUSH_STORAGE_REFERENCE FooBarPUSH_INTEGER 2PUSH_INTEGER 3MULTIPLYASSIGN


Etc, etc.

Quote:Original post by Ectara
so I just need a language to interpret, and what better choice than C.
I am pretty sure that should be a question, and the answer is: just about any other language. C is designed to be a portable assembly language, for goodness sake, why would you want to interpret it? You can't really capitalize on any of the features interpreters bring, and C *already* runs everywhere...

Not to mention that it has been done at least a couple of times already, and nobody much seems to use it.

Tristam MacDonald. Ex-BigTech Software Engineer. Future farmer. [https://trist.am]

As for why I chose C, I like C. Not much good a new module or language will be for me if I hate using it, and eventually end up writing a new one. I also like the fact that it is portable assembly, giving it the ability to not only run on every household appliance in my kitchen, but it will also run on my VM. The interpretation is for code maintainability and patching; no need to recompile a script; though it could be compiled to bytecode, even that isn't hardcoded.

As for the VM, it's a flat memory model stack based "machine", with simulated threading and flexible executable format, allowing for dynamically linked script libraries. It also uses registers, and supports all common datatypes but 64+bit integers right now. That will be added later, since it isn't always intrinsic.

This topic is closed to new replies.

Advertisement