Token insertion (?)

Started by SMcNeil; 8 comments, last by thre3dee 17 years, 7 months ago
Hi, I have a script language for which I am now able to tokenize. Once I have tokenized a line of "raw code", I place each token into a std::vector for further processing at a later stage. My question is: once I have built a list of tokens, I have to search through the list to find the appropriate token. If the list has (theoretically) hundreds of tokens, I am now faced with the dreaded O(n) problem, which is very slow. Is there a way to make this faster? One possible solution is to use a std::map along with either a 'hash' function or a 'handler' to perform the needed insertion and look-up. I'd like to get some feedback on the problem, and perhaps some hints, suggestions or comments please. Thank you, Sabrina.
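A minimal sketch of the std::map idea mentioned above (the TokenType enum and the lookup function name are illustrative assumptions, not taken from the actual engine): look a token's text up in O(log n) instead of scanning the whole vector.

```cpp
#include <map>
#include <string>

// Hypothetical token types; the real enum in the script engine may differ.
enum TokenType { LITERAL, ASSIGNMENT, INTEGER, SEMICOLON, LEFTBRACE, RIGHTBRACE };

// O(log n) lookup in a std::map instead of an O(n) scan over a std::vector.
// Unknown text falls back to a caller-supplied default (e.g. LITERAL).
TokenType lookupType(const std::map<std::string, TokenType>& table,
                     const std::string& text, TokenType fallback) {
    std::map<std::string, TokenType>::const_iterator it = table.find(text);
    return it != table.end() ? it->second : fallback;
}
```

A std::map gives O(log n) per lookup; a hash table (the 'hash' function route Sabrina mentions, or std::unordered_map in newer C++) gives average O(1).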
Scripts are usually processed linearly. Why do you need to search through it?
If your tokens are strings, surely you would use a binary search tree or something like that if you were really worried about the cost of going through a list of tokens.
Anything posted is personal opinion which does not in anyway reflect or represent my employer. Any code and opinion is expressed “as is” and used at your own risk – it does not constitute a legal relationship of any kind.
It sounds like you're implementing jumps or function calls. In that case, the usual approach is to keep a map with the appropriate identifiers as you suggested that holds pointers/indices to the desired destination.

Otherwise, please be more specific :)
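A small sketch of the jump/call map suggested above, assuming the tokens live in a std::vector and a "destination" is just an index into it (the label names and indices here are hypothetical):

```cpp
#include <cstddef>
#include <map>
#include <string>

// Map each function/label identifier to the index of its first token in the
// token vector. A call or jump is then one map lookup followed by setting the
// parser's current position; returns the fallback index for unknown labels.
std::size_t jumpTo(const std::map<std::string, std::size_t>& jumpTable,
                   const std::string& label, std::size_t fallback) {
    std::map<std::string, std::size_t>::const_iterator it = jumpTable.find(label);
    return it != jumpTable.end() ? it->second : fallback;
}
```

The table is typically filled in a first pass over the tokens (recording where each function body starts), so later jumps never have to re-scan the list.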

Hi guys,

First, I was somewhat vague about what I am trying to ask, so let me back up and explain.

Here is a very primitive script file that has a variable and corresponding value. (The 'script' file is just a fancy ".ini" file.)
[SOURCE]
{
  Size = 1;
}
[/SOURCE]


When I scan the file to create Tokens, I get the following:
[SOURCE]
TokenType: LEFTBRACE           TokenName: {     TokenValue: NONE
TokenType: LITERAL (Variable)  TokenName: Size  TokenValue: NONE
TokenType: ASSIGNMENT          TokenName: =     TokenValue: NONE
TokenType: INTEGER             TokenName: 1     TokenValue: 1
TokenType: SEMICOLON           TokenName: ;     TokenValue: NONE
TokenType: RIGHTBRACE          TokenName: }     TokenValue: NONE
[/SOURCE]


Once I have generated a token, I place it into a list (std::vector). When the EOF marker ('\0', since I'm reading from a buffer and not a file) has been reached, I pass the token list to the Parser. Once the Parser has the list, it iterates through the list validating each token (evaluating the "grammar rules" of my script).
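For reference, the tokenize-then-parse flow described above might be sketched like this. The Token fields mirror the dump earlier in the thread; parseAssignment and its exact grammar rule are my assumptions, not the actual parser.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Field names mirror the token dump above (TokenType / TokenName / TokenValue).
struct Token {
    std::string type;   // e.g. "LITERAL", "ASSIGNMENT", "INTEGER"
    std::string name;   // raw text, e.g. "Size", "=", "1"
    std::string value;  // e.g. "1" for an INTEGER, empty otherwise
};

// The parser walks the list linearly, so no searching is needed to validate
// the grammar: check "LITERAL = INTEGER ;" and advance the index on success.
bool parseAssignment(const std::vector<Token>& toks, std::size_t& i) {
    if (i + 3 >= toks.size()) return false;
    bool ok = toks[i].type     == "LITERAL"
           && toks[i + 1].type == "ASSIGNMENT"
           && toks[i + 2].type == "INTEGER"
           && toks[i + 3].type == "SEMICOLON";
    if (ok) i += 4;
    return ok;
}
```

The key point: validation is a single linear walk over the vector, so the O(n) search Sabrina worries about never arises for grammar checking itself.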

Second, let me back up and ask this question (I think Kylotan hinted on this).
Is this the proper methodology for parsing scripts? That is, instead of creating a list of tokens and passing that list off to the parser, should I instead be creating a token and THEN passing that token to the parser? Is this what you meant, Kylotan, by "processed linearly"? If not, could you clarify please?

Any thoughts, comments, suggestions?

Thanks,


Sabrina
There is no 'proper' methodology; only different methodologies that suit different tasks. What you do with those tokens depends on several things, such as how you're going to execute the language, how large/long the token stream is going to be, whether tokenisation is context dependent and requires knowledge of previous token order, etc.
Quote: Original post by SMcNeil
Any thoughts, comments, suggestions?


If you've got something working right now, then it's good. If / when you get performance issues, then investigate.
The tokenisers and compiler programs I've worked with process the tokens one at a time (with a bit of read-ahead). Is there any particular reason for reading in the whole thing at once and then parsing? It's possible to design a tokeniser to only tokenise when needed.

Parsers I've written call scanner.scan() then access scanner.token then repeat (scan also returns the token for simple operations).
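The scan-on-demand pattern described above might look like the following sketch. This Scanner class is an illustration of the scanner.scan() / scanner.token interface, not the poster's actual code; the tokenizing rules (whitespace-skipping, alphanumeric runs, single punctuation characters) are simplified assumptions.

```cpp
#include <cctype>
#include <string>

// Hypothetical on-demand scanner: scan() advances past the next token and
// also returns it, matching the scanner.scan() / scanner.token pattern.
class Scanner {
public:
    explicit Scanner(const std::string& src) : src_(src), pos_(0) {}

    std::string token;   // last token scanned; empty after end of input

    const std::string& scan() {
        token.clear();
        // Skip whitespace between tokens.
        while (pos_ < src_.size() && std::isspace((unsigned char)src_[pos_]))
            ++pos_;
        if (pos_ >= src_.size()) return token;           // empty == EOF
        if (std::isalnum((unsigned char)src_[pos_])) {   // identifier or number
            while (pos_ < src_.size() && std::isalnum((unsigned char)src_[pos_]))
                token += src_[pos_++];
        } else {
            token += src_[pos_++];                       // single punctuation char
        }
        return token;
    }

private:
    std::string src_;
    std::size_t pos_;
};
```

With input "{ Size = 1; }", successive scan() calls yield "{", "Size", "=", "1", ";", "}" and then an empty token; the parser pulls tokens as it needs them, so no full token list is ever built.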
I guess consider using stacks in your implementation if you've not already.
A simple tokenizer I made uses one big string and a 'current token' char pointer.

The tokens in the string are delimited by a null character, with two nulls after the last one. Then, to advance to the next token, I just do: while (*curtkn) ++curtkn; return ++curtkn;

This could be optimized so you don't have to increment through to the next token, but hey, if most of your tokens are only 3-5 characters it's not going to make much difference.

This topic is closed to new replies.
