# Token insertion (?)

This topic is 4316 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

## Recommended Posts

Hi, I have a script language in which I am now able to tokenize. Once I have tokenized a line of "raw code", I place a token into a "std::vector" for further processing at a later stage. My questions is once I have built a list of tokens, I have to search though the list to find the appropriate token. If the list has (theorically) hundreds of tokens I am now faced with the dreaded O(n) problem, which is very slow. Is there a way to make this faster? One possable solution to the problem is to use a "std::map" along with ether a 'hash' function or 'handler' to perform the needed insertion and look-up. I'd like to get some feed back on the problem and perhaps some hints, suggestion or comments please. Thank you, Sabrina.

##### Share on other sites
Scripts are usually processed linearly. Why do you need to search through it?

##### Share on other sites
if your tokens are strings, surely you would use a binary search tree or something like that if you were really worried about going through a list of tokens.

##### Share on other sites
It sounds like you're implementing jumps or function calls. In that case, the usual approach is to keep a map with the appropriate identifiers as you suggested that holds pointers/indices to the desired destination.

Otherwise, please be more specific :)

##### Share on other sites

Hi guys,

First, I was somewhat vague about what I am trying to ask, so let me back up and explain.

Here is a very primitive script file that has a variable and corresponding value. (The 'script' file is just a fancy ".ini" file.)
[SOURCE]{  Size = 1;}[/SOURCE]

When I scan the file to create Tokens, I get the following:
[SOURCE]TokenType: LEFTBRACETokenName: {TokenValue: NONETokenType: LITERAL (Variable)TokenName: SizeTokenValue: NONETokenType: ASSIGNMENTTokenName: = TokenValue: NONETokenType: INTEGERTokenName: 1TokenValue: 1TokenType: SEMICOLONTokenName: ;TokenValue: NONETokenType: RIGHTBRACETokenName: }TokenValue: NONE[/SOURCE]

Once I have generated a token, I place said token into a list (std::vector). When the EOF marker ('\0' since I'm reading from a buffer and not a file.) has been reached, I pass the token list to the Parser. Once the Parser has the list, it itor's though the list validating each token. (evaluate the "grammar rules" of my script.)

Second, let me back up and ask this question (I think Kylotan hinted on this).
Is this the proper methodology of parsing scripts? That is, instead of creating a list of tokens and passing that list off to the parser, should I be instead be creating a token THEN pass that token to the parser? Is this what you meat Kylotan by .."processed linearly", if not could you clearify please?

Thanks,

Sabrina

##### Share on other sites
There is no 'proper' methodology; only different methodologies that suit different tasks. What you do with those tokens depends on several things, such as how you're going to execute the language, how large/long the token stream is going to be, whether tokenisation is context dependent and requires knowledge of previous token order, etc.

##### Share on other sites
Quote:
 Original post by SMcNeilAny thoughts, comments, suggestions?

If you've got something working right now, then it's good. If / when you get performance issues, then investigate.

##### Share on other sites
The tokenisers and compiler programs I've worked with process the tokens one at a time (with a bit of read ahead). Is there any particular reason for reading in the whole thing at once then parsing? It's possible to design a tokeniser to only tokenise when needed.

Parsers I've written call scanner.scan() then access scanner.token then repeat (scan also returns the token for simple operations).

##### Share on other sites
I guess consider using stacks in your implementation if you've not already.

##### Share on other sites
a simple tokenizer i made uses one big string and has a 'current token' char ptr.

The tokens in the string were delimited by a null character and two for the last. then on 'advance to next' i just do while (*curtkn) ++curtkn; return ++curtkn;

this could be optimized to not have to inc to next token but hey if most of ur tokens are only 3-5 chars its not going to make much differences.

1. 1
Rutin
19
2. 2
JoeJ
15
3. 3
4. 4
5. 5

• 24
• 20
• 13
• 13
• 17
• ### Forum Statistics

• Total Topics
631700
• Total Posts
3001778
×