# A TernarySearchTree-Based LexicalAnalyzer


## Recommended Posts

/*
Hey,

I am working on a new interpreter, building it with C++ & inline assembly (MASM), using an OOP approach.
I'm currently at the LexicalAnalyzer stage.
It now includes ~>
A) C & C++ style comments & SkipWhiteSpace MASM inline procedures.
B) Full Unicode support.

I hacked the Lex() function using the TernarySearchTree data structure (used for the SymTab as well).
It works as expected. However, there is only one big restriction := tokens must be separated by whitespace || comments:

/* ... */ Foo + Bar ~> $1 := Foo ; $2 := + ; $3 := Bar ;
Foo+ Bar // ... '\n'||'\0' ~> $1 := Foo+ ; $2 := Bar ;

What do you suggest?

*/

What?

##### Share on other sites
What, exactly, is the problem here? And what are you trying to accomplish?

Your post is, unfortunately, very cryptic and not very informative for anyone who isn't actively working on your code - in other words, it isn't useful. Without a lot more detail on what exactly you're doing and trying to accomplish, there's not much any of us can offer, I'm afraid.

##### Share on other sites
Sorry.

I am working on a hand-written lexical analyzer.
Instead of the traditional way, I followed this tutorial on the Dr. Dobb's website.
I modified it to support Unicode, packed it as an object,
and enabled tokens to end with the whole whitespace spectrum, not just '\0'.
For example:
"TokenName\n" , "TokenName//" , "TokenName/*" , "TokenName\r" ... all will point to the same Symbol structure.

Some Snippets:

```cpp
class LexicalAnalyzer {
protected:
    wchar_t * c;
    wchar_t * sRoot;
    /* Ternary members */
    unsigned BUFSIZE;
    Tnode * Root;
    Tnode * buf;
    int bufn;
    int freen;
    void * freearr[10000];
public:
    LexicalAnalyzer(wchar_t * c);
    ~LexicalAnalyzer();
    void Lex();
    inline void SkipWhiteSpace();
    inline bool SkipComments();
    inline wchar_t * TearToken();
    /* Ternary members */
    struct Symbol * Lookup(wchar_t * s);
    inline Symbol * Search(wchar_t * s);
    inline bool IsWhiteSpace(wchar_t * c);
};
```

```cpp
LexicalAnalyzer::LexicalAnalyzer(wchar_t * c) {
    this->c = this->sRoot = c;
    /* Ternary members */
    this->Root = 0;
    this->bufn = this->freen = 0;
    this->BUFSIZE = 1000;
    Symbol * sym = this->Lookup(L"Hello");
    sym->LineNumber = 99;
    Symbol * sym2 = this->Lookup(L"/");
    sym2->LineNumber = 33;
}

LexicalAnalyzer::~LexicalAnalyzer() {
    for (int i = 0; i < this->freen; i++)
        free(this->freearr[i]);  // was free(this->freearr): free each allocated block, not the member array
}

void LexicalAnalyzer::Lex() {
    this->SkipWhiteSpace();
    if (this->SkipComments())
        return this->Lex();
    if (!*this->c)               // was !this->c: test the character, not the pointer
        return;                  /* End Of File '\0' */
    wprintf(this->c);
    std::cout << this->Search(this->c)->LineNumber;
    int n;
    std::cin >> n;
}
```

```cpp
struct Symbol {
    wchar_t * Name;
    wchar_t * Type;
    wchar_t * Scope;
    unsigned Size;
    void * Val;
    int LineNumber;
    wchar_t * FileName;
    struct Symbol * Next;
};

struct Tnode {
    wchar_t s;
    Tnode * l, * h;
    union {
        Tnode * e;      // next level while matching mid-token
        Symbol * sym;   // payload when s terminates a token
    };
};
```

```cpp
Symbol * LexicalAnalyzer::Search(wchar_t * s) {
    // note: this walks this->c directly; the parameter s is unused
    Tnode * p = this->Root;
    while (p) {
        if (this->IsWhiteSpace(this->c))
            return p->sym;
        if (*this->c == p->s) {
            this->c++;
            p = p->e;
        }
        else if (*this->c < p->s)
            p = p->l;
        else
            p = p->h;
    }
    return 0;
}

struct Symbol * LexicalAnalyzer::Lookup(wchar_t * s) {
    int d;
    wchar_t * instr = s;
    struct Symbol * Sym;
    Tnode * pp;
    Tnode ** p = &this->Root;
    while ((pp = *p) != 0) {
        if ((d = *s - pp->s) == 0) {
            if (IsWhiteSpace(s++))
                return pp->sym;
            p = &(pp->e);
        }
        else if (d < 0)
            p = &(pp->l);
        else
            p = &(pp->h);
    }
    for (;;) {                       // was "for (;": the loop header was truncated
        if (this->bufn-- == 0) {
            this->buf = (Tnode *) malloc(this->BUFSIZE * sizeof(Tnode));
            this->freearr[this->freen++] = (void *) buf;
            bufn = BUFSIZE - 1;
        }
        *p = buf++;
        pp = *p;
        pp->s = *s;
        pp->l = pp->e = pp->h = 0;
        if (IsWhiteSpace(s++)) {
            Sym = (Symbol *) malloc(sizeof(Symbol));
            Sym->Name = instr;
            pp->sym = (Symbol *) Sym;
            return pp->sym;
        }
        p = &(pp->e);
    }
}
```
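The snippets declare IsWhiteSpace() but never show its body. Given the earlier description, where a token may end at any whitespace character, at the start of a comment, or at '\0', a minimal sketch of such a terminator test might look like this (the exact terminator set and the name `IsTokenEnd` are my assumptions, inferred from the "TokenName\n" / "TokenName//" / "TokenName/*" examples, not taken from the original code):

```cpp
// Hedged sketch of a terminator test corresponding to the declared
// IsWhiteSpace(wchar_t *) member: returns true when *c ends the
// current token. Looks one character ahead for comment openers.
static bool IsTokenEnd(const wchar_t * c) {
    switch (*c) {
    case L' ':
    case L'\t':
    case L'\r':
    case L'\n':
    case L'\0':
        return true;
    case L'/':  // start of a "//" or "/*" comment also ends a token
        return c[1] == L'/' || c[1] == L'*';
    default:
        return false;
    }
}
```

Because the check peeks at `c[1]`, a lone '/' followed by an ordinary character is still treated as part of a token, which matches the sym2 = Lookup(L"/") entry in the constructor above.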

My problem is that tokens without whitespace between them are treated as a single token here.
So this code will be interpreted correctly: ( A + B * C ) := 7 tokens ~> "(" "A" "+" "B" "*" "C" ")"
But this one will not: (A+B*C) := only one token ~> "(A+B*C)"

##### Share on other sites
Zee goggles, they do nothink!

I'm not going to read through a pile of markup, but you're likely just missing the optional bit from the whitespace rule. (which would be implemented as a simple if... somewhere)

[edit: yay fixed... why is SkipWhitespace not used?

okay, I'm not going to trace through a pile of single letter variables either...]

##### Share on other sites
In most lexers, the usual strategy is to do something like this:

• Start in "unknown" mode
• Loop over all characters in the input
• Classify the current character (whitespace, symbol, literal, etc.)
• Check classification for continuity (see below)
• If classification is different from previous character, emit a token and restart
• Otherwise, append character to current token

Continuity checks are simple: you don't want "Foo123" to be tokenized as "Foo" and the literal "123", so you check some simple state machine rules to see if the current classification can potentially overlap or continue the previous classification. In this case, we might allow "123" to be either a literal or a part of an identifier token (provided at least one character classified as an identifier precedes the "123") so we just throw in a simple if-check to see if it should be counted as its own token or not.

Note that any time we emit a token we should reset to "unknown" mode, and that any mode can continue from unknown mode cleanly. This permits you to start in the middle of a token, for instance, and still get correct results, which is great for incremental parsers.

Fairly simple and straightforward, although as your grammar gets increasingly complex, it can become tedious and fragile. There's a good reason why most people don't hand-roll lexers anymore.
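A compact sketch of that loop in C++ (the `Cls` character classes, the function names, and the single "identifier absorbs digits" continuity rule are illustrative assumptions, not a full grammar):

```cpp
#include <cwctype>
#include <string>
#include <vector>

// Illustrative character classes for the mode-based loop described above.
enum class Cls { Unknown, Space, Ident, Digit, Symbol };

static Cls Classify(wchar_t c) {
    if (iswspace(c)) return Cls::Space;
    if (iswalpha(c) || c == L'_') return Cls::Ident;
    if (iswdigit(c)) return Cls::Digit;
    return Cls::Symbol;
}

// Continuity rule: anything continues from Unknown mode, digits may
// continue an identifier ("Foo123"), otherwise the class must match.
static bool Continues(Cls prev, Cls cur) {
    if (prev == Cls::Unknown) return true;
    if (prev == Cls::Ident && cur == Cls::Digit) return true;
    return prev == cur;
}

static std::vector<std::wstring> Tokenize(const std::wstring & in) {
    std::vector<std::wstring> toks;
    std::wstring cur;
    Cls mode = Cls::Unknown;
    for (wchar_t c : in) {
        Cls cls = Classify(c);
        if (cls == Cls::Space || !Continues(mode, cls)) {
            if (!cur.empty()) toks.push_back(cur);  // emit token, reset to Unknown
            cur.clear();
            mode = Cls::Unknown;
            if (cls == Cls::Space) continue;
        }
        cur += c;
        // identifiers absorb trailing digits without switching mode
        if (!(mode == Cls::Ident && cls == Cls::Digit))
            mode = cls;
    }
    if (!cur.empty()) toks.push_back(cur);
    return toks;
}
```

With this, `(A+B*C)` already splits into the seven tokens from the earlier post, because every symbol character is separated from its neighbors by a class change. Note that with only a same-class rule, adjacent punctuation such as `()` would merge into one `()` token; a real lexer refines the symbol class further (e.g. maximal munch for multi-character operators).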

You can see an example of a simple lexer used for Scintilla syntax highlighting here.

##### Share on other sites
It's a great idea.

Thanks for the link! := I used this method, and without a doubt it's pretty fast.
I really, really want to create a dynamic lexer by myself (add more tokens on the fly).

What do you think about my TernarySearchTree approach, particularly regarding interpreter speed? (I am going to port more of its procedures to assembly.)
Will the suggested new pass significantly decrease the interpreter's speed?

Are there more interesting/famous dynamic lexing techniques?

By the way, I am going to spend a couple of weeks (or more) enjoying trying to understand your programming language's source.
It looks brilliant++ =]

##### Share on other sites
What exactly do you mean by "dynamic lexing"? The purpose of a lexer is to tokenize input, not to do any kind of semantic checks on the token stream; once you start doing semantic checks you're up in parser territory. Of course for things like syntax highlighting it makes sense to have a tight feedback loop between your parser (when, say, it recognizes a new identifier that should be highlighted across the board) and the lexer itself (since lexers are commonly used directly for highlighting purposes, as in Scintilla). However, dynamic parsing is tricky and depends highly on the exact goals you're trying to accomplish.

Can you provide us with a little more detail on exactly what you want your program to be doing?

##### Share on other sites
I can't stop myself from building my own tools.
When I say a dynamic lexer, I mean a lexer generator.
I can't say it's like flex, because it's not a compiler-compiler tool, but the end purpose is the same.

Is there a lexer compiler which generates inline assembly code?

I have an idea in bioinformatics I want to pursue...

##### Share on other sites

When I say a dynamic lexer, I mean a lexer generator.

Generation off of what syntax/format/input?
