Jump to content
  • Advertisement
Sign in to follow this  
Ectara

Finding unanchored Regex matches with an NFA

This topic is 2478 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Recommended Posts

My regular expression tester is made using an NFA, and I'm having a problem where greedy operators don't match correctly; this problem stems from how I test for a match, I presume. Within the NFA machine, there is a special metastate that indicates that a match is found; when one of the threads is at the match state, it found a match. However, this isn't always the best match; if a greedy operator is being used, the shortest match for the expression may not be the correct one. Thus the problem arises that my tester stops after the first possible match is reached, ignoring the leftmost longest rule.

I posted a related question here:
http://www.gamedev.net/topic/620697-implementing-regex-operator-modifiers/

As an example, consider this regex: /(a*)(a*)/ testing against the string "aaaa"

If I were to spawn a new thread to branch out of the kleene enclosure, while the current thread continues to iterate, a match would be found. However, all possible matches and submatches will finish at the same time.

This means that I will find:
()(aaaa)
(a)(aaa)
(aa)(aa)
(aaa)(a)
(aaaa)()

all at the same time. Now, the problem is, they are all the shortest match, and all threads will reach the match state at the same time, I have no way of determining which is correct or desirable. I'm pretty sure the last one is the correct match, following leftmost longest. How would one determine when a correct unanchored match is found with an NFA?

Share this post


Link to post
Share on other sites
Advertisement
This is entirely off the cuff, so take it with the appropriate measure of salt: what happens if you have a unification step after the threads all complete, where you "score" the various matches and see which one matches the most characters of the input string? Then you select that match as the result.

Share this post


Link to post
Share on other sites
Something like that was mentioned in the other thread; I don't know of a way to score the matches that considers all factors. For instance, how do we score the above matches in such a way that it is significant? After that, how do we score /(a*a*?)a*/? I have found no way to score a match, let alone determine which operator is responsible for the particular score; if one scores the highest, but because of misuse of a non-greedy operator, or the in the case of the above, how do we differentiate between ()(aaaa) and (aaaa)()'s scores? They look nearly identical, with a difference of order. I can't determine which operator is responsible, or if that operator should be greedier or not.

Share this post


Link to post
Share on other sites
This really depends on what the user experience should be; otherwise the difference between ()(aaaa) and (aaaa)() is entirely arbitrary. You could simply prefer to match as much as possible in each capture group, and score accordingly; that would result in (aaaa)() being preferred which "feels" more natural to me. AFAIK most regular expression implementations tend to work this way anyways (i.e. greedy matches are always just that - they match as much as possible before continuing to the next operator).

Share this post


Link to post
Share on other sites
Yes, but they must differentiate between possessive operators; /a++a/ will not match "aaaa", while /a+a/ would. Simply matching as much as possible would result in possessive behavior; I would need to explore other avenues, though I cannot backtrack, and I must progress forward one character at a time. (aaaa)() and ()(aaaa) must be distinct; in order to have backreferences, I must have predictable and deterministic behavior that corresponds to at least POSIX standards.

Share this post


Link to post
Share on other sites
Saying that I can't is getting me nowhere. I've modified my regex tester to track matches; if it encounters one with a higher score, it discards the current match, and replaces it with the new one and its score. I have a bad feeling about this, as this now requires me to check the entirety of the string for each match, which could be an entire book, simply because of greedy operators, but I must get this working. How would one suggest I "score" such a thing?

According to POSIX, I'd take the longest. Is this really the best option...?

Share this post


Link to post
Share on other sites
You should explain what notation you are using for a greedy operator (I guess "a*" is greedy and "a*?" isn't, and "a+" is greedy but "a++" isn't?).

I think the interpretation of a greedy operator should be such that the match is the first match that a backtracking algorithm would find, if it explores the branch where the operator loops before the branch where it continues, and vice versa for non-greedy operators. This interpretation would mean that "aaaa" /(a*)(a*)/ should match "(aaaa)()".

With that interpretation, the "score" should be the list of lengths of matches for each operator from left to right, compared lexicographically with each component compared in ascending or descending order depending on the greediness of the operator.

Share this post


Link to post
Share on other sites
a* - greedy
a*? - non-greedy
a*+ - possessive

I've been basing my work on Perl's notations.

I agree that the best match should be one that requires the fewest backtracks; however, since I don't back track, this is not an easy task.

I think for non-greedy, matches that break out early should be given a higher score than those that continue to loop. Vice versa for greedy, and possessive need not concern itself; it will never break out early.

Share this post


Link to post
Share on other sites
I think the greedy/non-greedy/possessive operators are defined with a backtracking solution in mind (see this page, for instance). You should probably start by implementing that, so you have something that you are convinced is correct. I see implementing these regex extensions with an NFA as an advanced optimization, and it would be really handy to have the reference implementation around to make sure you can test its correctness.

Share this post


Link to post
Share on other sites
Sign in to follow this  

  • Advertisement
×

Important Information

By using GameDev.net, you agree to our community Guidelines, Terms of Use, and Privacy Policy.

GameDev.net is your game development community. Create an account for your GameDev Portfolio and participate in the largest developer community in the games industry.

Sign me up!