Implementing Regex Operator Modifiers

Started by
10 comments, last by Ectara 12 years, 2 months ago
If you have some sort of score to decide which match is most appropriate, you don't need to ever keep all the matches in memory: You have a best match that led and its associated score, and you update them when a better one is found. At most you would need to remember the best match that got you to each node, or something of that sort. I don't think memory consumption needs to get out of control.
Advertisement
This seems like an oddly complex solution for only one operator. I don't know how to score a match in such a way that I know which operator is responsible for the longest match. In addition to needing to know if the preferred match is a result of the operator in question, if there are more than one, not only would we need to track a match for this operator, but we'd need to track a match for every operator, because a high "score" for one operator might be a low score for another; with each additional greedy operator, the most common kind, I might add, the amount of work and storage to evaluate it increases exponentially, as we have to track and measure a score for each of them and try to determine which is the best match. For a type of operator that occurs many times in an expression, this work is unacceptable.

Is there really no way to implement this without using something outside of the state machine? Bear in mind that I also have a special set of loop nodes: one before a list of states to be repeated, and one that denotes the end of the repeated chain. A thread iterating between these two has an instance indicating the loop type, its minimum required count, the maximum count, and the current count.

I can't seem to figure this out. Let's try working from the problem backwards. The original example implementation was an anchored match; all of its operators were greedy because it evaluated as many iterations as possible, breaking when absolutely necessary. It checked at the end of the machine's execution if any of the threads were at the match state, and if so, it'd report a match. The thing is, if a match was encountered early, it'd be discarded and the thread terminated, because as an anchored matching expression, it must fully match the input text in its entirety; finding a match before the end of the string would be a false positive, because the string isn't a match. By discarding all of the shorter matches, it ensured each operator was operating at its greediest.

However, in modifying the machine to be unanchored, I changed it to stop at the sight of the first match. This works great, except for the few cases like these where the first match is not the best (or correct) match. Perhaps it'd solve my problem if I change how a match is determined, or if I find my method of unanchoring the match is improper?

This topic is closed to new replies.

Advertisement