Jump to content

  • Log In with Google      Sign In   
  • Create Account

Implementing Regex Operator Modifiers


Old topic!
Guest, the last post of this topic is over 60 days old and at this point you may not reply in this topic. If you wish to continue this conversation start a new topic.

  • You cannot reply to this topic
11 replies to this topic

#1 Ectara   Crossbones+   -  Reputation: 2970

Like
0Likes
Like

Posted 21 February 2012 - 03:08 PM

I have a regular expression tester that compiles regex patterns into an NFA state machine. I ditched the use of traditional states and unlabeled arrows to allow myself greater control over how the loop is handled, and as a result, all counted repetition is done in the same way, with loop instances tracking the minimum, maximum, and current counter.

However, I am wondering how to implement modifiers on operators. So far, I have what seems to be non-greedy, or reluctant, and possessive behavior.

If I attempt to find matches for /[0-9]*4/ or /[0-9]*+4/ in "00404", I get no matches, since all of the numbers are consumed in the character class and Kleene star, which appears to be possessive behavior.

If I attempt to find matches for /[0-9]*?4/ in "00404", I get "004" and "04", which seems to be the correct non-greedy behavior.

What I would like is for /[0-9]*4/ to match "00404" in its entirety, by being greedy and matching as much as it can with the Kleene star while still allowing a match with the ending 4.

The non-greedy works by encountering a loop, and if it is set to non-greedy, it will spawn a thread on every iteration after the minimum is reached that breaks out of the loop, while the current thread continues to iterate, so that if it is possible to match faster with a shorter submatch, the spawned threads will find it before the loop finishes in entirety.

The possessive works by not spawning more threads, and simply continues until the maximum is reached, or the next character does not match, in which case it breaks out of the loop if the minimum was reached, else it fails.

So, my question is, how would I design a greedy operator that will take the longest submatch that will still allow the pattern to match the text? If I create threads as frequently as I can on the off chance it will work, it is non-greedy. If I wait until it breaks, it is possessive. I might not be able to check if the next state matches the character, because the next state might be a metastate that has no character defined, or it might be a special state like a character class or a metacharacter; writing another parser within the parser might be a bit much.

Sponsor:

#2 Álvaro   Crossbones+   -  Reputation: 13317

Like
0Likes
Like

Posted 21 February 2012 - 04:19 PM

You use more terms than what I am familiar with, so perhaps you know more about this than I do. However, the way I would implement something like this is to build a graph with modes corresponding to the parts of the expression where we might be, and then the state at any given time is the set of nodes indicating the places where we might be.

So for instance, the expression /[0-9]*4/ will result in a graph that contains two nodes (before the "4" or after it), with transitions from the first node to itself labeled with every digit and with one transition from the first node to the second node labeled only with "4". When you parse "00404", you start with your state being only the first node. The first two zeros keep the same state. When you find the "4", you follow both transitions, and your state is now the set of both nodes. After the next "0", you are back to being at the first state only, and after the last "4" your state is again both nodes.

We consider that we found a match whenever the last node is part of the state. So "004" and "00404" are both matches. If you want to return the first match, that's really easy to implement: Whenever the last node is part of the state, you are done. If you want to be greedy, you need to save the longest match so far, and return it when you are done with the string.

#3 Ectara   Crossbones+   -  Reputation: 2970

Like
0Likes
Like

Posted 21 February 2012 - 04:28 PM

I'm not quite sure I understand your answer. If it helps, mine is based off of Thompson's NFA, and I check for a match by checking if any of the threads are at a match metastate; a unique state indicating success. Tracking each match is not possible without a frustrating amount of memory consumption. Also, taking the longest match does not mean that the longest match is the result of a greedy operator; it could be the result of any of the operators, greedy or not. Tracking which is responsible is difficult.

#4 Álvaro   Crossbones+   -  Reputation: 13317

Like
0Likes
Like

Posted 21 February 2012 - 08:42 PM

You are right that what I said about making the match greedy is not correct, or at least not general enough. I need to think harder about how something like that would be implemented. I have only ever thought of how to write recognizers for plain regular expressions, without any considerations such as greedy operators.

#5 Ectara   Crossbones+   -  Reputation: 2970

Like
0Likes
Like

Posted 23 February 2012 - 12:31 AM

To even generalize further, what do you think each kind of loop, for, say, a kleene closure, would consist of? I think a simple "must meet minimum, but match until maximum or not a match" describes possessive; it'll match all that it can, within the bounds of its limitations. A non-greedy version would likely be "match until minimum, but after that, check the rest of the expression at the same time on every iteration". This would match a shorter match from the operator sooner than fully completing the loop and continuing a match; shorter matches can also always fail, requiring the use of the longer match.

#6 Álvaro   Crossbones+   -  Reputation: 13317

Like
0Likes
Like

Posted 23 February 2012 - 11:05 AM

I don't think of a Kleene closure as a loop, in the usual sense of a loop in a programming language. When you see a "4" in /[0-9]*4/, you don't know if this is the last character and you should interpret it as being outside the loop or if it is part of the `[0-9]*', so you have to assume both as possibilities. As I said earlier, I would describe that by having the state be a set of nodes (i.e., use a nondeterministic finite automaton). There is a discussion of this and other methods here.

#7 Ectara   Crossbones+   -  Reputation: 2970

Like
0Likes
Like

Posted 23 February 2012 - 11:52 AM

Yes, I am using an NFA, as I mentioned earlier. However, if I simply created a new thread every time, it wouldn't be greedy. I also have no way of predicting the next character and its success, and if I did, if I created a new thread every time it might go either way, I would get the non-greedy result.

#8 Álvaro   Crossbones+   -  Reputation: 13317

Like
0Likes
Like

Posted 23 February 2012 - 01:56 PM

As far as I can tell, you need to consider every possible match regardless of whether you want the operator to be greedy or not. Greediness only matters as a tiebreaker in case of multiple matches, and can probably be implemented by some scoring system.

#9 Ectara   Crossbones+   -  Reputation: 2970

Like
0Likes
Like

Posted 23 February 2012 - 04:00 PM

Hm... Thompson's NFA design is basically designed to halt after the first match. Modifying it to track all matches would suck. How much memory would it take to match /a*a*a*a*a*a*a*a*a*a*a*a*a*a*a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?/ against aaaaaaaaaaaaaaaaa...?

#10 Ectara   Crossbones+   -  Reputation: 2970

Like
0Likes
Like

Posted 23 February 2012 - 04:38 PM

To put it further, for capture groups, how much memory would it take to keep capture groups for all of those matches if I threw parenthesis around some of those? The memory consumption would be highly exponential in many cases.

#11 Álvaro   Crossbones+   -  Reputation: 13317

Like
0Likes
Like

Posted 23 February 2012 - 08:51 PM

If you have some sort of score to decide which match is most appropriate, you don't need to ever keep all the matches in memory: You have a best match that led and its associated score, and you update them when a better one is found. At most you would need to remember the best match that got you to each node, or something of that sort. I don't think memory consumption needs to get out of control.

#12 Ectara   Crossbones+   -  Reputation: 2970

Like
0Likes
Like

Posted 23 February 2012 - 10:58 PM

This seems like an oddly complex solution for only one operator. I don't know how to score a match in such a way that I know which operator is responsible for the longest match. In addition to needing to know if the preferred match is a result of the operator in question, if there are more than one, not only would we need to track a match for this operator, but we'd need to track a match for every operator, because a high "score" for one operator might be a low score for another; with each additional greedy operator, the most common kind, I might add, the amount of work and storage to evaluate it increases exponentially, as we have to track and measure a score for each of them and try to determine which is the best match. For a type of operator that occurs many times in an expression, this work is unacceptable.

Is there really no way to implement this without using something outside of the state machine? Bear in mind that I also have a special set of loop nodes: one before a list of states to be repeated, and one that denotes the end of the repeated chain. A thread iterating between these two has an instance indicating the loop type, its minimum required count, the maximum count, and the current count.

I can't seem to figure this out. Let's try working from the problem backwards. The original example implementation was an anchored match; all of its operators were greedy because it evaluated as many iterations as possible, breaking when absolutely necessary. It checked at the end of the machine's execution if any of the threads were at the match state, and if so, it'd report a match. The thing is, if a match was encountered early, it'd be discarded and the thread terminated, because as an anchored matching expression, it must fully match the input text in its entirety; finding a match before the end of the string would be a false positive, because the string isn't a match. By discarding all of the shorter matches, it ensured each operator was operating at its greediest.

However, in modifying the machine to be unanchored, I changed it to stop at the sight of the first match. This works great, except for the few cases like these where the first match is not the best (or correct) match. Perhaps it'd solve my problem if I change how a match is determined, or if I find my method of unanchoring the match is improper?




Old topic!
Guest, the last post of this topic is over 60 days old and at this point you may not reply in this topic. If you wish to continue this conversation start a new topic.



PARTNERS