Help me with my design

Started by
6 comments, last by King Mir 10 years, 4 months ago

Background

I'm writing a compiler, and I'm trying to come up with a good design for an abstract syntax tree, and it's conversion into code.

The Problem
The AST is built in depth first order, such that every node added will only ever refer to previously added nodes, and in most cases referenced nodes are the most recently added nodes. It is naturally a tree, so each node except the root node is referenced by one parent. For data, an AST node has a type, which can optionally be represented as an enum, a string value, and references to a usually small number of other nodes.

The above naturally leads to an implementation of a std::deque of nodes, acting like a stack. This takes advantage of the fact that elements in a deque tend to be adjacent in memory. It also allows the tree to grow without reallocation as it is built.

To convert the ast to code, I need to traverse the ast in essentially the same depth first traversal, except it may be convenient to start at the root node instead of the first node added. This would make make code generation a recursive algorithm, implementing depth first traversal. How to convert each node depends mostly on the logical type of the node. There may be a hierarchy to this conversion, that is it may make sense to say that converting a constructor is similar to converting a function. A hierarchy may organize code better.

So it strikes me as prudent to use virtual dispatch to implement the conversion. That is, each node type has an associated virtual convert/code_gen/compile function somewhere.

These two designs are at odds with each other, and I'd like your input on how to graft them together. There may also be other considerations that I've overlooked.

Summary

How to have a near continuous in memory and simply structured data, but use virtual dispatch to do operate on that data?


Thanks for your replies.

Advertisement

#1: Don’t traverse using recursion; you will stack-overflow if your language is decent enough. Use an explicit stack for safety. It is only a bit harder to implement.

#2: Don’t use virtual functions. They don’t really simplify any of the complexity while making it a pain to debug and overall slower at run-time. Each node will be the same structure but with a type identifier that matches an enumerated value determining what type of data the structure holds. All you really need from there is to have a bunch of unions of structure inside that structure and pass the correct one to a function/method specifically named, “ParseIfNode( const ML_IF_NODE &_inIfNode, ML_PREFIXOUTPUT &_poPreOut, ML_OUTPUT &_poOut, ML_POSTFIXOUTPUT &_poPostOut )”.

Virtuals will just make a mess of things; for the sake of understandability you should just use named functions to parse nodes.

L. Spiro

I restore Nintendo 64 video-game OST’s into HD! https://www.youtube.com/channel/UCCtX_wedtZ5BoyQBXEhnVZw/playlists?view=1&sort=lad&flow=grid

#1: Don’t traverse using recursion; you will stack-overflow if your language is decent enough. Use an explicit stack for safety. It is only a bit harder to implement.

Can't I just use a segmented stack?

#2: Don’t use virtual functions. They don’t really simplify any of the complexity while making it a pain to debug and overall slower at run-time. Each node will be the same structure but with a type identifier that matches an enumerated value determining what type of data the structure holds. All you really need from there is to have a bunch of unions of structure inside that structure and pass the correct one to a function/method specifically named, “ParseIfNode( const ML_IF_NODE &_inIfNode, ML_PREFIXOUTPUT &_poPreOut, ML_OUTPUT &_poOut, ML_POSTFIXOUTPUT &_poPostOut )”.
Virtuals will just make a mess of things; for the sake of understandability you should just use named functions to parse nodes.

L. Spiro

I was hoping virtual would naturally create a hierarchy and some order instead just having a bunch of named functions. I suppose I could do that some other way.

Also, having a switch or array that maps an enum to a function feels like I'm reinventing virtual dispatch.

Can't I just use a segmented stack?

Knock yourself out.


Also, having a switch or array that maps an enum to a function feels like I'm reinventing virtual dispatch.

You’d just be using it to decide what type of class to instantiate instead.
And I am still not sure how virtual functions will help to solve this problem, especially since parsing a syntax tree generally leaves you with fixed-sized structures of a single class. You could allocate classes of a specific type based on the type of nodes you encounter at syntax-tree-creation time but then you would have to allocate each node individually and you would likely exhaust your memory.

Unless you have something else in mind for how to use the virtuals.


L. Spiro

I restore Nintendo 64 video-game OST’s into HD! https://www.youtube.com/channel/UCCtX_wedtZ5BoyQBXEhnVZw/playlists?view=1&sort=lad&flow=grid

Also, having a switch or array that maps an enum to a function feels like I'm reinventing virtual dispatch.

You’d just be using it to decide what type of class to instantiate instead.
And I am still not sure how virtual functions will help to solve this problem, especially since parsing a syntax tree generally leaves you with fixed-sized structures of a single class. You could allocate classes of a specific type based on the type of nodes you encounter at syntax-tree-creation time but then you would have to allocate each node individually and you would likely exhaust your memory.

Unless you have something else in mind for how to use the virtuals.

L. Spiro

Yeah, creating the nodes with different types in near continuous memory is exactly why the two approaches don't mesh. I can think of a few ways to graft them, but they aren't very elegant. That's why I'm asking for input.

I don't see why nodes would be any more likely to exhaust memory if I allocate nodes individually though. The size of a node is still the same. I do want them to be in (near) continuous memory for cache reasons.

Another problem with using an array or switch is that there is likely to be more than one function to be called this way. For instance, depending on context, an expression may be compiled proper, or it may only be evaluated for the result type. But only a subset of nodes can be evaluated like that. So you wind up with multiple arrays or switches, which for most nodes don't actually map to a function.


Thanks for the feedback. It's appreciated.

I don't see why nodes would be any more likely to exhaust memory if I allocate nodes individually though. The size of a node is still the same.

Every allocation costs an extra few bytes (typically close to 16) for book-keeping and alignment.
Allocating a single byte means around 16 bytes extra for book-keeping, the 1 byte you allocated, plus 3 more bytes for alignment purposes.


Another problem with using an array or switch is that there is likely to be more than one function to be called this way.

You generally need only the compilation process to have the full switch set. Others, such as figuring out the type of an expression, don’t actually replicate the whole switch case, and in fact very little of it.


L. Spiro

I restore Nintendo 64 video-game OST’s into HD! https://www.youtube.com/channel/UCCtX_wedtZ5BoyQBXEhnVZw/playlists?view=1&sort=lad&flow=grid

I don't see why nodes would be any more likely to exhaust memory if I allocate nodes individually though. The size of a node is still the same.

Every allocation costs an extra few bytes (typically close to 16) for book-keeping and alignment.
Allocating a single byte means around 16 bytes extra for book-keeping, the 1 byte you allocated, plus 3 more bytes for alignment purposes.

I see.

Another problem with using an array or switch is that there is likely to be more than one function to be called this way.

You generally need only the compilation process to have the full switch set. Others, such as figuring out the type of an expression, don’t actually replicate the whole switch case, and in fact very little of it.


L. Spiro

True. It's currently about a quarter.
I fear I've framed this as too domain specific, and scared off people who don't know anything about compilers. This really is a general question about design, and hopefully I've given enough general information for anyone to respond.

This topic is closed to new replies.

Advertisement