Implementing Regex Operator Modifiers

Started by
10 comments, last by Ectara 12 years, 2 months ago
I have a regular expression tester that compiles regex patterns into an NFA state machine. I ditched the use of traditional states and unlabeled arrows to allow myself greater control over how the loop is handled, and as a result, all counted repetition is done in the same way, with loop instances tracking the minimum, maximum, and current counter.

However, I am wondering how to implement modifiers on operators. So far, I have what seems to be non-greedy, or reluctant, and possessive behavior.

If I attempt to find matches for /[0-9]*4/ or /[0-9]*+4/ in "00404", I get no matches, since all of the numbers are consumed in the character class and Kleene star, which appears to be possessive behavior.

If I attempt to find matches for /[0-9]*?4/ in "00404", I get "004" and "04", which seems to be the correct non-greedy behavior.

What I would like is for /[0-9]*4/ to match "00404" in its entirety, by being greedy and matching as much as it can with the Kleene star while still allowing a match with the ending 4.

The non-greedy works by encountering a loop, and if it is set to non-greedy, it will spawn a thread on every iteration after the minimum is reached that breaks out of the loop, while the current thread continues to iterate, so that if it is possible to match faster with a shorter submatch, the spawned threads will find it before the loop finishes in entirety.

The possessive works by not spawning more threads, and simply continues until the maximum is reached, or the next character does not match, in which case it breaks out of the loop if the minimum was reached, else it fails.

So, my question is, how would I design a greedy operator that will take the longest submatch that will still allow the pattern to match the text? If I create threads as frequently as I can on the off chance it will work, it is non-greedy. If I wait until it breaks, it is possessive. I might not be able to check if the next state matches the character, because the next state might be a metastate that has no character defined, or it might be a special state like a character class or a metacharacter; writing another parser within the parser might be a bit much.
Advertisement
You use more terms than what I am familiar with, so perhaps you know more about this than I do. However, the way I would implement something like this is to build a graph with modes corresponding to the parts of the expression where we might be, and then the state at any given time is the set of nodes indicating the places where we might be.

So for instance, the expression /[0-9]*4/ will result in a graph that contains two nodes (before the "4" or after it), with transitions from the first node to itself labeled with every digit and with one transition from the first node to the second node labeled only with "4". When you parse "00404", you start with your state being only the first node. The first two zeros keep the same state. When you find the "4", you follow both transitions, and your state is now the set of both nodes. After the next "0", you are back to being at the first state only, and after the last "4" your state is again both nodes.

We consider that we found a match whenever the last node is part of the state. So "004" and "00404" are both matches. If you want to return the first match, that's really easy to implement: Whenever the last node is part of the state, you are done. If you want to be greedy, you need to save the longest match so far, and return it when you are done with the string.
I'm not quite sure I understand your answer. If it helps, mine is based off of Thompson's NFA, and I check for a match by checking if any of the threads are at a match metastate; a unique state indicating success. Tracking each match is not possible without a frustrating amount of memory consumption. Also, taking the longest match does not mean that the longest match is the result of a greedy operator; it could be the result of any of the operators, greedy or not. Tracking which is responsible is difficult.
You are right that what I said about making the match greedy is not correct, or at least not general enough. I need to think harder about how something like that would be implemented. I have only ever thought of how to write recognizers for plain regular expressions, without any considerations such as greedy operators.
To even generalize further, what do you think each kind of loop, for, say, a kleene closure, would consist of? I think a simple "must meet minimum, but match until maximum or not a match" describes possessive; it'll match all that it can, within the bounds of its limitations. A non-greedy version would likely be "match until minimum, but after that, check the rest of the expression at the same time on every iteration". This would match a shorter match from the operator sooner than fully completing the loop and continuing a match; shorter matches can also always fail, requiring the use of the longer match.
I don't think of a Kleene closure as a loop, in the usual sense of a loop in a programming language. When you see a "4" in /[0-9]*4/, you don't know if this is the last character and you should interpret it as being outside the loop or if it is part of the `[0-9]*', so you have to assume both as possibilities. As I said earlier, I would describe that by having the state be a set of nodes (i.e., use a nondeterministic finite automaton). There is a discussion of this and other methods here.
Yes, I am using an NFA, as I mentioned earlier. However, if I simply created a new thread every time, it wouldn't be greedy. I also have no way of predicting the next character and its success, and if I did, if I created a new thread every time it might go either way, I would get the non-greedy result.
As far as I can tell, you need to consider every possible match regardless of whether you want the operator to be greedy or not. Greediness only matters as a tiebreaker in case of multiple matches, and can probably be implemented by some scoring system.
Hm... Thompson's NFA design is basically designed to halt after the first match. Modifying it to track all matches would suck. How much memory would it take to match /a*a*a*a*a*a*a*a*a*a*a*a*a*a*a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?/ against aaaaaaaaaaaaaaaaa...?
To put it further, for capture groups, how much memory would it take to keep capture groups for all of those matches if I threw parenthesis around some of those? The memory consumption would be highly exponential in many cases.

This topic is closed to new replies.

Advertisement