Lexical Parser

Started by
5 comments, last by Ectara 11 years, 10 months ago
I am at the point where I have the ability to read in a file describing the grammar of a language, with regular expressions defining a token, and symbols that are either terminal or non-terminal, similar to a BNF description of a language. However, I am now having a hard time figuring out how to parse the input based on this information.

The grammar is stored in a sort of tree. It starts with an array of non-terminals, one of which is designated as the starting point. Each non-terminal holds a group of alternatives, which constitute its definitions. Each alternative has a list of nodes, each of which is a literal string, a token defined by one of the aforementioned regular expressions, or a pointer to one of the non-terminals in the original array. In this way, the grammar is stored in a structure that can be iterated through. My first idea was to use a finite automaton: a state machine with various threads to iterate through. I read the lexemes from a sequential file and cannot move backwards in it, so recursion will not be an option, since it would have to go back and re-evaluate the lexemes if the current path turned out to be the wrong one.
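To make the layout concrete, here is a rough sketch of that structure in Python. All of the class and field names (`Literal`, `Token`, `NonTerminalRef`, `Grammar`, and so on) are made up for illustration, and the one-rule grammar at the bottom is only a toy example.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Literal:
    text: str  # matched verbatim


@dataclass
class Token:
    name: str
    pattern: str  # regular expression defining the token


@dataclass
class NonTerminalRef:
    index: int  # position of the target in Grammar.nonterminals


Node = Union[Literal, Token, NonTerminalRef]


@dataclass
class NonTerminal:
    name: str
    # Each alternative is one possible definition: a list of nodes.
    alternatives: List[List[Node]] = field(default_factory=list)


@dataclass
class Grammar:
    nonterminals: List[NonTerminal]
    start: int  # index of the starting non-terminal


# Toy grammar:  expr := NUMBER "+" expr | NUMBER
NUMBER = Token("NUMBER", r"[0-9]+")
grammar = Grammar(
    nonterminals=[
        NonTerminal(
            "expr",
            [
                [NUMBER, Literal("+"), NonTerminalRef(0)],
                [NUMBER],
            ],
        )
    ],
    start=0,
)
```

Storing references as indices into the array (rather than direct object pointers) is just one way to handle rules that refer to themselves or to each other.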

The problem is that it's difficult to conceptualize: I cannot figure out how to evaluate a non-terminal within a non-terminal, then return to evaluating the topmost non-terminal, without recursion. I hear it is very possible to use a state machine, and that someone somewhere has done it. Can anyone give advice on what worked or might make more sense?
How is recursion not an option? As long as you're not modifying your position based on the parsing (which you should not do until you have a successful parse), there's no issue there.

If on the other hand you're trying to parse as you read, then you're pretty much doomed to failure.
Reading the documentation below, it sounds as if I can parse this with a stack.
http://www.cs.man.ac...1/ho/node6.html

Perhaps I push a stack of lexemes, then decide what to do with them. Not sure yet; I'm not at my workstation right now. I use recursion as a last resort. If this were a finite automaton, parsing one lexeme at a time would work, if I could figure that angle out. I doubt I'm doomed, unless you've tried this yourself.

Also, not sure what you mean by not moving the position until I parse successfully; reading at all will move the position.
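For what it's worth, the stack idea can be sketched as a table-driven predictive (LL(1)) parser: an explicit stack replaces recursion, and the input is consumed one lexeme at a time with a single lexeme of lookahead. This only works if the grammar is LL(1). The grammar, the hand-written `TABLE`, and the names below are assumptions for illustration, not something taken from the linked page.

```python
END = "$"  # hypothetical end-of-input marker

# Toy LL(1) grammar:
#   expr := term rest
#   rest := "+" term rest | (empty)
#   term := NUM | "(" expr ")"
NONTERMINALS = {"expr", "rest", "term"}

# Parse table: (non-terminal, lookahead) -> symbols to push.
TABLE = {
    ("expr", "NUM"): ["term", "rest"],
    ("expr", "("): ["term", "rest"],
    ("rest", "+"): ["+", "term", "rest"],
    ("rest", ")"): [],
    ("rest", END): [],
    ("term", "NUM"): ["NUM"],
    ("term", "("): ["(", "expr", ")"],
}


def ll1_accepts(tokens):
    """Return True if the token sequence matches `expr`."""
    stream = iter(tokens)          # only ever moves forward
    lookahead = next(stream, END)
    stack = [END, "expr"]
    while stack:
        top = stack.pop()
        if top in NONTERMINALS:
            body = TABLE.get((top, lookahead))
            if body is None:
                return False       # no rule applies: syntax error
            stack.extend(reversed(body))  # leftmost symbol ends up on top
        elif top == lookahead:
            if top == END:
                return True        # stack and input finished together
            lookahead = next(stream, END)
        else:
            return False           # expected terminal did not appear
    return False
```

Here `ll1_accepts(["NUM", "+", "NUM"])` succeeds and `ll1_accepts(["NUM", "+"])` fails. The nested non-terminal is handled by pushing its body on the stack, so "returning" to the outer rule is just popping whatever was left underneath.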
Part of attempting to parse a non-terminal involves possible failure. It also involves variable parse depth (since different grammar entries are naturally different lengths). The 'current position' is based on the result of previous parsing, and necessarily needs to involve either an index into a random access collection or a try/undo stack.
Yes, but with multiple threads of a state machine, all parse at once with the same input; only the ones that fail die, whereas the successful ones continue.

And if I'm reading from a file stream, there's only one current file position. I'm not sure what other position there is.

Yes, but with multiple threads of a state machine, all parse at once with the same input; only the ones that fail die, whereas the successful ones continue.


I'm curious how you expect threads to do meaningful work when the nature of lexing is inherently sequential (since any given state in the machine/position in the stream is determined by previous state).


And if I'm reading from a file stream, there's only one current file position.


READING FROM THE STREAM AS YOU GO IS BAD.

I'm not sure how much clearer I can make it. Read the entire file into a random access collection (read: string). Have the different constructs attempt to parse either a subsection of that collection (yielding an index to where they finished or simply 'the rest') or the whole collection starting at an index (faster due to less copying).

That index is the position. It represents the current state of the machine if you're looking at it in that perspective. The state of the machine doesn't change until a whole element is successfully parsed. Parsing part of one isn't good enough. By making this state implied in the position of your in-progress stream, you prevent that sort of "parse part A, now check part B" behavior that is required for common grammars.
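A minimal sketch of that approach, assuming the whole input is already in a string: each construct takes a start index and returns the index where it finished, or `None` on failure, so a failed attempt never disturbs the caller's position. The `GRAMMAR` layout and the `parse` function are hypothetical names for illustration. Alternatives are tried in order and the first one that succeeds wins, so this does not handle left recursion or whitespace.

```python
import re

# Toy grammar: each non-terminal maps to a list of alternatives; each node
# is ("lit", text), ("tok", regex) or ("nt", name of another non-terminal).
GRAMMAR = {
    "expr": [
        [("nt", "term"), ("lit", "+"), ("nt", "expr")],
        [("nt", "term")],
    ],
    "term": [
        [("tok", r"[0-9]+")],
        [("lit", "("), ("nt", "expr"), ("lit", ")")],
    ],
}


def parse(name, text, pos):
    """Try to match non-terminal `name` starting at index `pos`.

    Returns the index just past the match, or None if no alternative fits.
    """
    for alternative in GRAMMAR[name]:
        cur = pos  # every alternative restarts from the same index
        for kind, value in alternative:
            if kind == "lit":
                cur = cur + len(value) if text.startswith(value, cur) else None
            elif kind == "tok":
                match = re.compile(value).match(text, cur)
                cur = match.end() if match else None
            else:  # nested non-terminal
                cur = parse(value, text, cur)
            if cur is None:
                break  # this alternative failed; try the next one
        if cur is not None:
            return cur
    return None


source = "(1+2)+3"
matched_all = parse("expr", source, 0) == len(source)
```

The "undo" is free here: abandoning an alternative just means discarding `cur` and starting again from `pos`.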
Each thread is independent and they each operate sequentially, operating on all possible paths. I'd rather not go this route, so I'll try a PDA next.
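For reference, the "every path advances on the same lexeme and the failed ones die" idea is close to how an Earley recognizer works. Below is a rough sketch under some loud assumptions: the grammar has no empty alternatives, lexemes arrive as plain strings, and any symbol that is not a key of `GRAMMAR` is treated as a terminal. The names are made up for illustration.

```python
# Toy grammar: non-terminal -> tuple-of-symbols alternatives.
GRAMMAR = {
    "expr": [("term", "+", "expr"), ("term",)],
    "term": [("NUM",), ("(", "expr", ")")],
}


def close(chart, i, grammar):
    """Expand set i with predictions and completions until nothing is new."""
    worklist = list(chart[i])
    while worklist:
        head, body, dot, origin = worklist.pop()
        if dot < len(body):
            symbol = body[dot]
            if symbol in grammar:  # predict: start every alternative here
                new_items = [(symbol, alt, 0, i) for alt in grammar[symbol]]
            else:
                new_items = []     # terminal: waits for the next lexeme
        else:                      # complete: resume whoever wanted `head`
            new_items = [
                (h, b, d + 1, o)
                for (h, b, d, o) in chart[origin]
                if d < len(b) and b[d] == head
            ]
        for item in new_items:
            if item not in chart[i]:
                chart[i].add(item)
                worklist.append(item)


def recognize(grammar, start, tokens):
    """Return True if the lexemes form a sentence of `start`."""
    # An item (head, body, dot, origin) is one live "thread".
    chart = [{(start, body, 0, 0) for body in grammar[start]}]
    stream = iter(tokens)          # consumed strictly left to right
    i = 0
    while True:
        close(chart, i, grammar)
        token = next(stream, None)
        if token is None:          # end of input: did any full parse survive?
            return any(
                h == start and d == len(b) and o == 0
                for (h, b, d, o) in chart[i]
            )
        survivors = {
            (h, b, d + 1, o)
            for (h, b, d, o) in chart[i]
            if d < len(b) and b[d] == token and token not in grammar
        }
        if not survivors:
            return False           # every thread died on this lexeme
        chart.append(survivors)
        i += 1
```

Each lexeme is read once and never revisited, but the earlier item sets have to be kept around so that a finished non-terminal can resume the rules that were waiting for it.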

This topic is closed to new replies.