What is a token?

10 comments, last by ravyne2001 19 years, 9 months ago
All this talk about tokens and tokenizing. What is a token?
In general, I think "token" means "a simple and easy-to-handle thing that represents a more complicated thing".

When you tokenize a string, you break it up into words so that it's easier to work with. For example, if you tokenize the string "fluffy, bunny" or the string " fluffy bunny ", then in both cases you end up with the tokens "fluffy" and "bunny".
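
A minimal sketch of that idea in Python, assuming a "word" is any run of letters and that commas and whitespace only separate words. The function name tokenize_words is made up for illustration.

```python
import re

def tokenize_words(text):
    # Every run of letters is one token; commas and spaces are separators
    # and never show up in the result.
    return re.findall(r"[A-Za-z]+", text)

print(tokenize_words("fluffy, bunny"))     # ['fluffy', 'bunny']
print(tokenize_words("  fluffy   bunny ")) # ['fluffy', 'bunny']
```

Both inputs give the same two tokens, which is the point: the tokens carry the content and the separators are thrown away.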

Tokens aren't always strings, but I think that's the most common usage.
In C++ there are five kinds of tokens: identifiers (names), keywords (template, int, etc.), literals (constants), operators (*, etc.) and other separators.
I'd say that a token is the largest chunk of whatever is being processed that can be processed in one go by whatever is doing the processing. How's that for a general purpose definition? :)

Chris's definition is what I'd use if I were talking about a programming language (or trying to figure out what a compiler was talking about). I think pinacolada said the same thing as I did but differently. :)
Well, tokenizing usually just means splitting one large thing into a container of smaller things. In Java and Python, at least, strings have a split method that returns a container (an array in Java, a list in Python) holding the pieces. So a token is just a string, and you can do anything to it that your language allows you to do to a string.
I use C++, and I guess the definition I'm looking for is "tokenizing" in terms of writing a scripting language.

I'm ALSO wondering about something in C++ that I've seen called a token. It comes up now and then, but there isn't much about it in my books. I believe that, when I see one, it's just a name with an underscore in front of it.

Example:

_name;

That confuses me, because it doesn't seem like a full statement, since there's no specifier. Thanks guys.
Sorry to have to bump, but no replies :(
In parsing theory a token is the smallest string of characters with meaning which doesn't rely on other tokens.

An example would be a number, roughly "[0-9]+[.0-9]*".
This is also called a lexeme (or lexical element).

The lexer divides the input stream into these tokens which are fed to a parser which decides how to interpret them.
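
As a rough illustration, here is a small regex-driven lexer in Python. The token names (NUMBER, IDENT, LPAREN, RPAREN) and the lex function are assumptions made up for this sketch, and the number pattern is a slightly stricter variant of the one quoted above, so treat it as one possible way to do it.

```python
import re

# Hypothetical token kinds for a tiny script language.
TOKEN_SPEC = [
    ("NUMBER", r"[0-9]+(?:\.[0-9]+)?"),      # 200, 3.14
    ("IDENT",  r"[A-Za-z_][A-Za-z0-9_]*"),   # NEWINT, gold
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("SKIP",   r"\s+"),                      # whitespace, discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def lex(text):
    """Divide the input into (kind, lexeme) pairs for a parser to consume."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens

print(lex("NEWINT gold (200)"))
```

Note that the lexer only labels the pieces. Deciding that IDENT IDENT LPAREN NUMBER RPAREN means "declare an integer" is left to the parser.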



So let's say I have a script file that I want to read in. Let's say it contains these lines:

NEWINT gold (200)
NEWINT silver (100)

If I write a program that does this:

While(!EOF) {
-Read until space or newline (to get one word)
-If word == NEWINT
--Read next word
--Allocate new integer / create pointer
--Index the integer by its name (next word)
--Initialize
}

Very simple pseudocode above, so I hope it conveys my way of thinking.

Does it mean that I'm tokenizing the script by reading each word and determining what it is?
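
For what it's worth, the loop above might look something like this in Python. This is only a sketch: run_script is a made-up name, a dictionary stands in for "allocate and index by name", and it assumes the value is always written as one word such as (200) with no spaces inside the parentheses.

```python
def run_script(source):
    variables = {}              # integers indexed by name
    words = source.split()      # split on spaces and newlines
    i = 0
    while i < len(words):
        if words[i] == "NEWINT":
            if i + 2 >= len(words):
                raise ValueError("NEWINT needs a name and a value")
            name = words[i + 1]
            raw_value = words[i + 2]                     # e.g. "(200)"
            variables[name] = int(raw_value.strip("()")) # initialize
            i += 3
        else:
            raise ValueError(f"unknown command: {words[i]!r}")
    return variables

script = "NEWINT gold (200)\nNEWINT silver (100)\n"
print(run_script(script))   # {'gold': 200, 'silver': 100}
```

Splitting the text into words is the tokenizing part. Checking for NEWINT and acting on the words that follow is already parsing and interpreting, done in the same loop.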
I have to ask, are you doing this as a learning experience or to get things done (with a game, I would guess)?

If you are trying to get things done, do yourself a favor and get a scripting library. I suggest Lua (my favorite), AngelScript or Small. They have been tested, are free, and will be faster than a solution you could roll yourself.

If you are just learning, buy a good book on the subject. Compilers and computer languages are both very interesting topics.

Typical hand-written parsers (as opposed to machine-generated ones) are usually what we call recursive descent parsers. They recursively call subroutines looking for particular language features, usually pushing the results of what they find onto a stack of some kind. Look at this:

Here is your line:
NEWINT <varname> <left-parenthesis> <number> <right-parenthesis>

and here is the pseudo-code:

if parse_newint_command() then
  .... store integer, varname, etc

function parse_newint_command()
  if token != 'NEWINT' then return false;
  next_token(); // move from NEWINT to the next token
  if not parse_var() then return false;
  next_token();
  if token != '(' then return false;
  ..
end


The trick here is to save the place you are in the token list before you call parse_newint_command() and if it fails, restore to that point.
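
Here is one way the save-and-restore trick could be sketched in Python. The Parser class, its method names, and the choice to represent tokens as plain strings are all assumptions for illustration, not the only way to lay this out.

```python
class Parser:
    def __init__(self, tokens):
        self.tokens = tokens   # e.g. ["NEWINT", "gold", "(", "200", ")"]
        self.pos = 0           # index of the current token

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def accept(self, expected):
        # Consume the current token only if it is exactly `expected`.
        if self.peek() == expected:
            self.pos += 1
            return True
        return False

    def parse_var(self):
        tok = self.peek()
        if tok is not None and tok.isidentifier():
            self.pos += 1
            return tok
        return None

    def parse_number(self):
        tok = self.peek()
        if tok is not None and tok.isdecimal():
            self.pos += 1
            return int(tok)
        return None

    def parse_newint_command(self):
        saved = self.pos                   # remember where we started
        if self.accept("NEWINT"):
            name = self.parse_var()
            if name is not None and self.accept("("):
                value = self.parse_number()
                if value is not None and self.accept(")"):
                    return (name, value)
        self.pos = saved                   # failed: rewind to the saved place
        return None

good = Parser(["NEWINT", "gold", "(", "200", ")"])
print(good.parse_newint_command())   # ('gold', 200)

bad = Parser(["NEWINT", "gold", "(", "oops", ")"])
print(bad.parse_newint_command(), bad.pos)   # None 0
```

Because a failed attempt rewinds to the saved position, the caller can go on to try a different command parser against the same tokens.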

Sorry if this is unclear.
