thre3dee

advice for optimizations in my script parser


Recommended Posts

hi all, I'm working on a scripting language, more to see how far I can get than to actually take on the scripting world (since that takes a lot of perfecting and improving your language/API), but I've got a few questions about the parsing stage of the compiler. My tokeniser is finished, but I have one question about keywords/operators versus identifiers. My identifiers match the C++ style (letters, numbers and underscores, but only a letter or underscore as the first character). Each of my token structures has a 'group' enum (string? keyword? identifier? number?), a 'type' enum ('for' keyword? 'return' keyword? ++ operator?) and the actual token in its string form. So only an integer comparison is needed to determine which keyword a keyword token is (token->type == TOKEN_T_IF).
// holds a token
struct token_t
{
	token_t * next;
	token_t * prev;
	token_group group; // type of token such as operator or identifier or keyword
	token_type type; // specific type such as 'if' or 'return' keyword etc
	const char * token;
	char ch; // if its a character literal
	int line; // line for debugging

	// destructor
	~token_t (void);
};

My compiler first loads the file into a single-byte text buffer, removes comments (turning each one into a single space character), replaces any 'invisible' characters other than tabs and newlines with a space, and then begins tokenising. The tokeniser generates the entire list of tokens before returning control to the compiler, and obviously exits when any token error is found (irrespective of syntax/semantics). While tokenising an identifier, the identifier being added is looked up in the keyword list; if found, a token is created for that keyword with TOKEN_G_KEYWORD, otherwise a TOKEN_G_IDENTIFIER token is created. So when it comes to building a syntax tree and checking semantics, all the compiler has to do is compare an integer to determine which keyword a keyword token is, rather than compare the whole string. Should this be left until later on? As far as I can see, the syntax generator would otherwise still have to search a keyword list to determine whether an identifier is actually a keyword.

Also, is it better to generate the whole list of tokens before the syntax tree is built, or is generating syntax from the currently scanned token a better option? As far as I can see, if you hit a token error you then have to waste time destroying any syntax-related objects before returning to the program. Any suggestions/advice on 'tried and true' optimisations in this early stage of a compiler?
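To make that lookup concrete, a minimal sketch of the keyword check done while scanning an identifier might look like this (the table contents and the extra TOKEN_T_RETURN entry are illustrative, not the actual implementation):

#include <cstddef>
#include <cstring>

// illustrative subsets of the real enums
enum token_type  { TOKEN_T_IF, TOKEN_T_RETURN, TOKEN_T_CLASS /* etc */ };
enum token_group { TOKEN_G_KEYWORD, TOKEN_G_IDENTIFIER /* etc */ };

struct keyword_def { const char * text; token_type type; };

static const keyword_def keywords[] =
{
	{ "if",     TOKEN_T_IF     },
	{ "return", TOKEN_T_RETURN },
	{ "class",  TOKEN_T_CLASS  },
};

// returns the keyword definition for 'tok', or 0 if it is a plain identifier
const keyword_def * find_keyword (const char * tok)
{
	for (size_t i = 0; i < sizeof(keywords) / sizeof(keywords[0]); ++i)
		if (std::strcmp(keywords[i].text, tok) == 0)
			return &keywords[i];
	return 0;
}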

Don't store the tokens. Most parsers call a "get next token" function which simply reads the next token and returns it; the caller processes that token and then discards it. This is the most important optimization, as it tends to save a lot of memory and cache overhead. Even if you don't do it this way, you'll still improve your running time if you drop the handmade list implementation and use std::vector instead (both in terms of speed and memory usage).
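To make the pull-based approach concrete, a "get next token" interface could look roughly like the sketch below; the names (lexer, next_token) are hypothetical and only the shape of the interface matters.

// Illustrative pull-based lexer: the parser asks for one token at a time
// and never holds the whole token list in memory.
enum token_type { TOK_IDENTIFIER, TOK_KEYWORD_IF, TOK_OPERATOR_PLUS, TOK_EOF /* etc */ };

struct token
{
	token_type type;
	int begin, end; // half-open range into the source buffer
};

class lexer
{
public:
	explicit lexer (const char * text) : _text(text), _pos(0) {}

	// skips whitespace/comments, fills 'out' with the next token and
	// returns false once the end of the input is reached
	bool next_token (token & out);

private:
	const char * _text;
	int _pos;
};

// typical parser loop: consume one token, act on it, discard it
// token t;
// while (lex.next_token(t)) { /* feed t to the parser */ }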

Also, avoid doing multiple passes. You should be loading characters from the file, eliminating spaces and comments, combining characters into tokens, combining tokens into a syntax tree, and decorating that tree, all on the fly in a single pass. A second pass is warranted for the link step, but even this is not necessary if you handle your layout well. Multiple passes mean more memory, more stalls due to hard-drive and cache latency, and the overhead of having to move data around. Doing three passes just for the token-generation step is too much.

Error-handling has no reason to be fast: you'll probably only handle very few syntax or token errors when compared to the actual work performed when the source code is correct, so don't bother optimizing them at the cost of making other things slower. You don't care if error reporting takes two milliseconds instead of one.

I personally wouldn't distinguish between character literals and one-character strings. In fact, I would give a token the following interface:

struct token
{
	token_type type; // Type: KEYWORD_FOR, IDENTIFIER, OPERATOR_PLUS ...
	int begin, end; // Start and past-the-end index in the input stream
};


Mundane details (such as the line number) can be deduced automatically in O(log n) time from the begin-end pair without requiring as much storage memory.
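For example, if the lexer appends the offset of every newline to a table while scanning, the line number of any begin index can be recovered later with a binary search (a sketch under that assumption):

#include <algorithm>
#include <vector>

// offsets of every '\n' in the source, appended in order during scanning
std::vector<int> newline_offsets;

// 1-based line number of a character index: count the newlines before it
int line_of (int index)
{
	return int(std::lower_bound(newline_offsets.begin(),
	                            newline_offsets.end(),
	                            index) - newline_offsets.begin()) + 1;
}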

hmm, that makes a lot of sense. You're right about the memory usage, let alone the number of allocations it takes to store individual copies of the tokens from the stream.

Thanks for that. Now I've just got to get my head around complex expression parsing and syntax tree generation and whatnot.

I've just refactored my tokeniser to follow this method and it's cut out all memory allocations, both for the string/identifier copies and for the tokens themselves. My tokeniser does the same checks on things like strings and floating-point/hex constants etc., but certain aspects like escape sequences will only be reformatted when it comes to finishing up compilation.

I've kept the group/type separation, as this reduces the size of switches etc., so I'll be able to quickly determine whether something is a string, keyword or identifier and then evaluate each specific type more quickly.

// token group
enum token_group
{
	TOKEN_G_NUMBER,
	TOKEN_G_STRING,
	TOKEN_G_OPERATOR,
	TOKEN_G_KEYWORD,
	TOKEN_G_IDENTIFIER,
	TOKEN_G_DONTUSE
};

// token type
enum token_type
{
	TOKEN_T_IF,
	TOKEN_T_MINUSEQ,
	TOKEN_T_CLASS // etc etc
};

// holds a token
struct token_t
{
	token_group group;
	token_type type;
	int start, end;
};

// tokenises a script
class token_c
{
public:
	// constructor
	token_c (const char * text, compile_information & info);

	// destructor
	~token_c (void);

	// scans for the next token from a particular position in the input stream
	int scan (int from, token_t & token);

	// gets the line number for a specific index into the input stream
	int lines (int at);

	// copies the token text between two positions in the input stream into 'buf'
	void token_at (int start, int end, char * buf);

private:
	// finds the token definition of a keyword
	const token_def * find_keyword (const char * tok);

	// finds the token definition of an operator
	const token_def * find_operator (const char * tok);

	const char * _text;
	compile_information & _comp_info;
};
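For illustration, the compiler might drive that interface with a loop along these lines; the sketch assumes scan() returns the index just past the token and a negative value at end of input or on a token error, a convention the post does not actually spell out.

// Hypothetical driver loop over token_c::scan (return convention assumed).
void tokenise_all (token_c & scanner)
{
	token_t tok;
	int pos = 0;
	for (;;)
	{
		int next = scanner.scan(pos, tok);
		if (next < 0)
			break; // end of input or token error
		// hand 'tok' to the parser here; no string copy is made, since
		// tok.start / tok.end simply index into the original source buffer
		pos = next;
	}
}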

I have a lexer I coded a while back. It's more complicated than most because it keeps track of indentation. I'm working on a tab-block language and I found the easiest way to do that is to turn a change in indentation into a token (see the sketch below). Feel free to have a look (and feedback would be appreciated as well!).

It's written in C#. I have a C++ version somewhere if you think it would be useful.
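For readers who haven't seen the trick, turning indentation changes into tokens usually means keeping a stack of indentation widths and emitting an INDENT or DEDENT token whenever the width at the start of a line grows or shrinks. A rough C++ sketch (not the linked lexer):

#include <stack>
#include <vector>

enum indent_token { TOK_INDENT, TOK_DEDENT };

// Given the indentation width at the start of a line, emit INDENT when it
// grows and one DEDENT per level it shrinks.
void emit_indentation (int width, std::stack<int> & levels,
                       std::vector<indent_token> & out)
{
	if (levels.empty())
		levels.push(0);

	if (width > levels.top())
	{
		levels.push(width);
		out.push_back(TOK_INDENT);
	}
	else
	{
		while (width < levels.top())
		{
			levels.pop();
			out.push_back(TOK_DEDENT);
		}
		// a mismatch here (width != levels.top()) would be an indentation error
	}
}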

ToohrVyk's comments ring true. If your grammar is simple enough you will only need to know the next token to change state and so you can discard all tokens after you have seen them.

As a suggestion, a recursive descent parser is a good place to start if you are writing it yourself.
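To give a flavour of that approach: in a recursive descent parser each grammar rule becomes a function that inspects the next token and calls the functions for its sub-rules. A toy, purely illustrative sketch for a trivial expression grammar:

#include <vector>

// expr := term (('+' | '-') term)*
// term := NUMBER
enum tok_type { TOK_NUMBER, TOK_PLUS, TOK_MINUS, TOK_END };

struct parser
{
	std::vector<tok_type> toks; // token stream, terminated by TOK_END
	size_t pos = 0;

	tok_type peek () const { return toks[pos]; }
	void advance () { ++pos; }

	// term := NUMBER
	bool parse_term ()
	{
		if (peek() != TOK_NUMBER)
			return false;
		advance();
		return true;
	}

	// expr := term (('+' | '-') term)*
	bool parse_expr ()
	{
		if (!parse_term())
			return false;
		while (peek() == TOK_PLUS || peek() == TOK_MINUS)
		{
			advance();            // consume the operator
			if (!parse_term())    // parse its right-hand operand
				return false;
		}
		return peek() == TOK_END; // the whole input must be consumed
	}
};

// usage:
// parser p;
// p.toks = { TOK_NUMBER, TOK_PLUS, TOK_NUMBER, TOK_END };
// bool ok = p.parse_expr(); // true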
