Attribute grammar implementation?


Hi everyone.

I have a theoretical question.

I am implementing a small compiler for a custom language. For some internal reasons I have to implement the compiler entirely myself (no lex, no yacc, etc. ... yeah, reinventing the wheel).

Anyway, so far I already have the lexer and the syntax parser.

For the lexer, I used a hand-written automaton which is quite fast.

For the syntax parser I did the following (which I'm not quite sure is the correct way):

I defined my grammar in a text file.

Then I implemented a "grammar parser" which builds the LR(1) parse table for that grammar. The table is saved to another text file.

Then the compiler loads all the data from this file (terminals, non-terminals, the table).

Then I apply the bottom-up LR(1) parsing algorithm.

From here on is where I get a bit lost.

I have read that I can generate code while doing the syntax parsing, but some semantic properties cannot be known until I have a full syntax tree.

So I decided to build the tree.

So now the syntax parser builds a tree (which is my first IR).

From there I made a simplified in-order tree.

By traversing the tree I think I can do the code generation. That's where I have stopped.

I read about attribute grammars. Although they seem useful, I don't know how they can be used.

The theory says that each grammar rule now has some pseudo-code attached which specifies what happens with the symbols that were parsed.

My idea and question is this:

Do I now have to create a new parser which reads not only the grammar but also its attributes, and then... what?

This feels like implementing a compiler in order to implement a compiler...

So, how are attribute grammars implemented or used in a compiler?

I know it may sound dumb, but there is no book, article, or tutorial I've found that actually says anything about implementing attribute grammars.

I do this kind of thing in C#, and instead of defining the grammar productions in a text file, I define them in the C# code itself. For example in an expression evaluator:


grammarBuilder.AddRule("Expression -> Expression * Expression", arguments => int.Parse(arguments[0].Text) * int.Parse(arguments[2].Text)); // arguments[1] is the * operator token.
The lambda expression "arguments => ..." is called when the LR parser executes the reduce action for this production, and the return value of the lambda expression is pushed onto the LR parser's token stack. "arguments" in this case is a Token array representing the bits of the LR stack that are being popped for this reduction ([0] = LHS expression, [1] = * operator, [2] = RHS expression in this case). I hadn't heard of attribute grammars before, but when I looked it up, it appears that this is the same thing as what I'm doing. My parser is actually a GLR parser, and this concept still works even with a graph-like LR stack.

You could do the same thing in text files, but you'd have to implement the parsing for it as well. That's the main reason I decided to put everything in C# - so I could just let the C# language handle most of the work for me.
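
In case it helps to see the same reduce-time-action idea outside of C#, here is a minimal C++ sketch (C++ to match the later example in this thread; StackItem, Production, and Reduce are invented names, not from any real parser generator). The value stack holds one item per symbol; a reduce pops the right-hand side, runs the production's action, and pushes the result:


#include <functional>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical item on the parser's value stack: the matched text plus the
// semantic value computed so far (an int, for a little expression evaluator).
struct StackItem {
    std::string text;
    int value = 0;
};

// A production pairs the length of its right-hand side with the action to
// run when the LR parser reduces by it.
struct Production {
    std::size_t rhsLength;
    std::function<StackItem(const std::vector<StackItem>&)> action;
};

// Called by the (not shown) LR driver on every reduce: pop the RHS items,
// run the action, push the result back as the left-hand side's value.
void Reduce(std::vector<StackItem>& stack, const Production& prod) {
    std::vector<StackItem> args(stack.end() - prod.rhsLength, stack.end());
    stack.erase(stack.end() - prod.rhsLength, stack.end());
    stack.push_back(prod.action(args));
}

int main() {
    // "Expression -> Expression * Expression"; args[1] is the '*' token.
    Production multiply{3, [](const std::vector<StackItem>& args) {
        return StackItem{"<expr>", args[0].value * args[2].value};
    }};

    // Pretend the parser has already shifted "6", "*" and "7".
    std::vector<StackItem> stack{{"6", 6}, {"*", 0}, {"7", 7}};
    Reduce(stack, multiply);
    std::cout << stack.back().value << "\n";  // prints 42
}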

For code generation, it might be better to just build up your syntax tree and then have a completely separate step read the syntax tree to build code, because generating code may need to make its decisions based on more than just one production at a time...

I'm not sure exactly what kind of code you want to generate, and the best implementation probably depends on exactly what you need.



If you want to use attribute grammars, and construct the entire syntax tree first instead of modifying the tree on the fly, all you REALLY need to do is give the tree nodes the unique production ID that was used when they were reduced. This ID is all you need to associate them with whatever your code generator uses to make its decisions. This decouples the parsing stage from the attribute application step.

However, you may not even need the production ID. In most programming-language grammars I've seen, once the syntax tree is formed, you have one function per major type of node which has handling for each of the possible arrangements of child nodes (i.e. exactly like what a recursive descent parser would do to make its parsing decisions, except you do it on the syntax tree instead of the token stream).
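
As a rough C++ illustration of that per-node-kind dispatch (the node layout and names below are invented for the sketch, not anyone's actual code):


#include <iostream>
#include <memory>
#include <string>
#include <vector>

// Hypothetical syntax-tree node: the production / node kind it was reduced
// with is all that later passes need to decide how to handle it.
struct Node {
    enum class Kind { Number, Add, Multiply } kind;
    std::string text;                          // token text, for leaves
    std::vector<std::unique_ptr<Node>> kids;   // children, for interior nodes
};

// One function per major node kind, dispatched on the stored kind; the same
// shape a recursive-descent parser has, but running over the tree instead
// of the token stream.
int Evaluate(const Node& n) {
    switch (n.kind) {
        case Node::Kind::Number:   return std::stoi(n.text);
        case Node::Kind::Add:      return Evaluate(*n.kids[0]) + Evaluate(*n.kids[1]);
        case Node::Kind::Multiply: return Evaluate(*n.kids[0]) * Evaluate(*n.kids[1]);
    }
    return 0;  // unreachable
}

int main() {
    // (1 + 2) * 3, built by hand here instead of by the parser.
    Node tree{Node::Kind::Multiply};
    auto sum = std::make_unique<Node>(Node{Node::Kind::Add});
    sum->kids.push_back(std::make_unique<Node>(Node{Node::Kind::Number, "1"}));
    sum->kids.push_back(std::make_unique<Node>(Node{Node::Kind::Number, "2"}));
    tree.kids.push_back(std::move(sum));
    tree.kids.push_back(std::make_unique<Node>(Node{Node::Kind::Number, "3"}));
    std::cout << Evaluate(tree) << "\n";  // prints 9
}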

The approach that I normally use is to build an AST (abstract syntax tree) during parsing, and use that for further processing.

I have read about attribute grammars, and you could probably implement them, but I have never used them. I think they are mostly useful for defining a computation over a tree (that is, I can write it down and give you the description, and you will understand how it is computed).
In practice, you are forced to store everything in the nodes and pass data around, which gets quite complicated because everything is confined to the nodes.

Programming one or more walks over the tree in a general-purpose programming language is much easier, as you can keep all kinds of additional tables and data around for your processing.
Don't be afraid to do several walks; they are cheap, and they make nice separate steps, keeping things easy.
The 'simple' approaches usually promote "do everything in one sweep", as that eliminates the need to build a tree. On the other hand, you may have to push yourself into complicated corners to get all the information at the right time, and there are a lot of things you cannot do.
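
For example, "keeping tables around outside the tree" can be as simple as this two-walk sketch (the Node fields and function names are made up for illustration):


#include <iostream>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical tree node: it either declares a name, uses a name, or is
// just a container for children. (Invented for this sketch.)
struct Node {
    std::string declares;  // non-empty if this node declares a variable
    std::string uses;      // non-empty if this node reads a variable
    std::vector<std::unique_ptr<Node>> kids;
};

// Walk 1: collect declarations into a table kept outside the tree, exactly
// the kind of side bookkeeping that is awkward to express as attributes
// stored in the nodes themselves.
void CollectDecls(const Node& n, std::map<std::string, bool>& table) {
    if (!n.declares.empty()) table[n.declares] = true;
    for (const auto& k : n.kids) CollectDecls(*k, table);
}

// Walk 2: check every use against the table built by the first walk.
void CheckUses(const Node& n, const std::map<std::string, bool>& table) {
    if (!n.uses.empty() && !table.count(n.uses))
        std::cout << "error: '" << n.uses << "' used but never declared\n";
    for (const auto& k : n.kids) CheckUses(*k, table);
}

int main() {
    Node root;
    root.kids.push_back(std::make_unique<Node>(Node{"x", "", {}}));  // declares x
    root.kids.push_back(std::make_unique<Node>(Node{"", "y", {}}));  // uses y

    std::map<std::string, bool> table;
    CollectDecls(root, table);
    CheckUses(root, table);  // prints: error: 'y' used but never declared
}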

If you really want to program attribute computations, IIRC the attributes are mathematical values, so it doesn't hurt to compute the same thing more than once. A trivial implementation would be to walk over the tree and compute every attribute every single time, until you don't get any changes any more.
In a more complicated setting you can add a "computed" boolean to each attribute to avoid computing the same thing more than once, or you can attach an increasing version number to each result, with the rule that if any of the sources of an attribute has a higher number than the result attribute, the latter is out of date and needs to be recomputed.
No doubt you can do smarter things here if you want.
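
A naive version of that "recompute until nothing changes" approach might look like this sketch, with a made-up attribute (the number of leaves below each node) used purely for illustration:


#include <memory>
#include <optional>
#include <vector>

// Hypothetical node whose "leafCount" attribute is 1 for a leaf and the sum
// of the children's leafCounts otherwise. Invented purely to show the loop.
struct Node {
    std::optional<int> leafCount;  // unset until it has been computed
    std::vector<std::unique_ptr<Node>> kids;
};

// One pass over the tree: compute every attribute that isn't known yet.
// Returns true if anything changed during this pass.
bool ComputePass(Node& n) {
    bool changed = false;
    for (auto& k : n.kids)
        changed |= ComputePass(*k);
    if (!n.leafCount) {
        if (n.kids.empty()) {
            n.leafCount = 1;
            changed = true;
        } else {
            int total = 0;
            bool allKnown = true;
            for (const auto& k : n.kids) {
                if (!k->leafCount) { allKnown = false; break; }
                total += *k->leafCount;
            }
            if (allKnown) {
                n.leafCount = total;
                changed = true;
            }
        }
    }
    return changed;
}

// Keep walking until a whole pass computes nothing new (the fixpoint).
void ComputeAllAttributes(Node& root) {
    while (ComputePass(root)) {}
}

int main() {
    Node root;
    root.kids.push_back(std::make_unique<Node>());
    root.kids.push_back(std::make_unique<Node>());
    ComputeAllAttributes(root);
    // root.leafCount is now 2 (two leaf children, each counting as 1).
}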

Did I already mention that just programming a walk over the tree in a normal programming language is much easier?

Thanks for the answers.

It seems I'm overcomplicating things.

Perhaps attribute grammars should be taken just as a model, not as a practical structure.

Well, let's make this compiler compile and then optimize.

Thanks.

Here's how the hand-written compilers I've built work, in the hopes that it may help you visualize the problem better. Note that for a hand-built compiler like this, I don't bother trying to define rules in an external file; I use the "recursive descent" mechanism, which works well (especially since I'm not making a compiler-generator tool, but an actual compiler).

Let's start with a simple expression:


A = B + C
Now, if we run this through our parser, we'll likely end up with something like this for our AST (bad ASCII art ahoy):


Assign
|  |
A Add
  | |
  B C
Now we just need to generate code by walking the tree and outputting assembly as we go (mythical VM assembly used for clarity):


void OutputAssignment(const TreeNode& aRoot)
{
  // Pushes the result onto the VM's stack
  OutputExpression(aRoot.Right);
  // Assembly: "POP <destination>"
  std::cout << "POP " << aRoot.Left.Name << "\n";
}

void OutputExpression(const TreeNode& aRoot)
{
  if (aRoot.Type == TokenType::ADD)
  {
    OutputAdd(aRoot);
  }
  // handle other expression types here
}

void OutputAdd(const TreeNode& aRoot)
{
  auto leftSideResult = OutputToken(aRoot.Left);
  auto rightSideResult = OutputToken(aRoot.Right);
  // Assembly: "PUSH <left>\nADD <right>"
  std::cout << "PUSH " << leftSideResult.Name << "\nADD " << rightSideResult.Name << "\n";
}

Variable OutputToken(const TreeNode& aRoot)
{
  // If we wanted to support parenthesis, we could detect them here and recurse back into OutputExpression, popping
  // the stack into a temporary variable and returning it as our result.
  // We wouldn't want to leave it on the stack for our caller because that makes the ADD code harder, since now it
  // has two values on the stack and our mythical VM assembly code wants a single argument for ADD, not two values
  // on the stack.
  // Nothing a good peephole optimizer on the assembler side can't fix though!
  return Variable(aRoot.Name);
}
Note how the structure of the functions and the call graph walk the tree and output the resulting assembly in a depth-first traversal.

In the end, the above code would give you:


PUSH B
ADD C
POP A
Of course, this is highly simplified, but the example carries over to other types of assembly and fancier code.

The only caveat is that your compiler is limited by stack space, since it's doing a recursive processing of the AST instead of a non-recursive traversal - but a little cleverness will fix that. :)
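
One common way to handle that (a general pattern, not something from the code above) is to drive the walk with an explicit stack instead of recursion; a minimal sketch with an invented node layout:


#include <iostream>
#include <stack>
#include <string>
#include <utility>
#include <vector>

// Minimal stand-in for a syntax-tree node (fields invented for this sketch).
struct Node {
    std::string name;
    std::vector<const Node*> children;
};

// Post-order walk driven by an explicit std::stack instead of the call
// stack, so tree depth is limited by heap memory rather than stack space.
void VisitPostOrder(const Node& root) {
    std::stack<std::pair<const Node*, bool>> work;
    work.push({&root, false});
    while (!work.empty()) {
        auto [node, childrenDone] = work.top();
        work.pop();
        if (childrenDone) {
            std::cout << node->name << "\n";  // emit code for this node here
        } else {
            work.push({node, true});  // revisit after the children are done
            // Push children in reverse so the leftmost one is processed first.
            for (auto it = node->children.rbegin(); it != node->children.rend(); ++it)
                work.push({*it, false});
        }
    }
}

int main() {
    Node b{"B", {}}, c{"C", {}};
    Node add{"Add", {&b, &c}};
    Node a{"A", {}};
    Node assign{"Assign", {&a, &add}};
    VisitPostOrder(assign);  // prints A, B, C, Add, Assign
}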

If the implementation language doesn't matter, there is a library for Scala called Kiama at https://bitbucket.org/inkytonik/kiama

I don't remember all the details since it has been a while since I used it, but it uses a "packrat parser": you basically write the grammar directly in Scala and specify the classes/parameters to be used to build the syntax tree.

Attributes are declared as recursive tree-like expressions. We used the attributes for things like calculating scope and constant expressions; I forget what else, but I remember it being really easy at the time.

Another option is to use Ragel for lexing and Lemon for parsing; both can generate C code.

http://www.colm.net/open-source/ragel/

http://www.hwaci.com/sw/lemon/

