• Announcements

    • khawk

      Download the Game Design and Indie Game Marketing Freebook   07/19/17

      GameDev.net and CRC Press have teamed up to bring a free ebook of content curated from top titles published by CRC Press. The freebook, Practices of Game Design & Indie Game Marketing, includes chapters from The Art of Game Design: A Book of Lenses, A Practical Guide to Indie Game Marketing, and An Architectural Approach to Level Design. The GameDev.net FreeBook is relevant to game designers, developers, and those interested in learning more about the challenges in game development. We know game development can be a tough discipline and business, so we picked several chapters from CRC Press titles that we thought would be of interest to you, the GameDev.net audience, in your journey to design, develop, and market your next game. The free ebook is available through CRC Press by clicking here. The Curated Books The Art of Game Design: A Book of Lenses, Second Edition, by Jesse Schell Presents 100+ sets of questions, or different lenses, for viewing a game’s design, encompassing diverse fields such as psychology, architecture, music, film, software engineering, theme park design, mathematics, anthropology, and more. Written by one of the world's top game designers, this book describes the deepest and most fundamental principles of game design, demonstrating how tactics used in board, card, and athletic games also work in video games. It provides practical instruction on creating world-class games that will be played again and again. View it here. A Practical Guide to Indie Game Marketing, by Joel Dreskin Marketing is an essential but too frequently overlooked or minimized component of the release plan for indie games. A Practical Guide to Indie Game Marketing provides you with the tools needed to build visibility and sell your indie games. With special focus on those developers with small budgets and limited staff and resources, this book is packed with tangible recommendations and techniques that you can put to use immediately. As a seasoned professional of the indie game arena, author Joel Dreskin gives you insight into practical, real-world experiences of marketing numerous successful games and also provides stories of the failures. View it here. An Architectural Approach to Level Design This is one of the first books to integrate architectural and spatial design theory with the field of level design. The book presents architectural techniques and theories for level designers to use in their own work. It connects architecture and level design in different ways that address the practical elements of how designers construct space and the experiential elements of how and why humans interact with this space. Throughout the text, readers learn skills for spatial layout, evoking emotion through gamespaces, and creating better levels through architectural theory. View it here. Learn more and download the ebook by clicking here. Did you know? GameDev.net and CRC Press also recently teamed up to bring GDNet+ Members up to a 20% discount on all CRC Press books. Learn more about this and other benefits here.
Sign in to follow this  
Followers 0
Ectara

Implementing Regex Operator Modifiers

11 posts in this topic

I have a regular expression tester that compiles regex patterns into an NFA state machine. I ditched the use of traditional states and unlabeled arrows to allow myself greater control over how the loop is handled, and as a result, all counted repetition is done in the same way, with loop instances tracking the minimum, maximum, and current counter.

However, I am wondering how to implement modifiers on operators. So far, I have what seems to be non-greedy, or reluctant, and possessive behavior.

If I attempt to find matches for /[0-9]*4/ or /[0-9]*+4/ in "00404", I get no matches, since all of the numbers are consumed in the character class and Kleene star, which appears to be possessive behavior.

If I attempt to find matches for /[0-9]*?4/ in "00404", I get "004" and "04", which seems to be the correct non-greedy behavior.

What I would like is for /[0-9]*4/ to match "00404" in its entirety, by being greedy and matching as much as it can with the Kleene star while still allowing a match with the ending 4.

The non-greedy works by encountering a loop, and if it is set to non-greedy, it will spawn a thread on every iteration after the minimum is reached that breaks out of the loop, while the current thread continues to iterate, so that if it is possible to match faster with a shorter submatch, the spawned threads will find it before the loop finishes in entirety.

The possessive works by not spawning more threads, and simply continues until the maximum is reached, or the next character does not match, in which case it breaks out of the loop if the minimum was reached, else it fails.

So, my question is, how would I design a greedy operator that will take the longest submatch that will still allow the pattern to match the text? If I create threads as frequently as I can on the off chance it will work, it is non-greedy. If I wait until it breaks, it is possessive. I might not be able to check if the next state matches the character, because the next state might be a metastate that has no character defined, or it might be a special state like a character class or a metacharacter; writing another parser within the parser might be a bit much.
0

Share this post


Link to post
Share on other sites
You use more terms than what I am familiar with, so perhaps you know more about this than I do. However, the way I would implement something like this is to build a graph with modes corresponding to the parts of the expression where we might be, and then the state at any given time is the set of nodes indicating the places where we might be.

So for instance, the expression /[0-9]*4/ will result in a graph that contains two nodes (before the "4" or after it), with transitions from the first node to itself labeled with every digit and with one transition from the first node to the second node labeled only with "4". When you parse "00404", you start with your state being only the first node. The first two zeros keep the same state. When you find the "4", you follow both transitions, and your state is now the set of both nodes. After the next "0", you are back to being at the first state only, and after the last "4" your state is again both nodes.

We consider that we found a match whenever the last node is part of the state. So "004" and "00404" are both matches. If you want to return the first match, that's really easy to implement: Whenever the last node is part of the state, you are done. If you want to be greedy, you need to save the longest match so far, and return it when you are done with the string.
0

Share this post


Link to post
Share on other sites
I'm not quite sure I understand your answer. If it helps, mine is based off of Thompson's NFA, and I check for a match by checking if any of the threads are at a match metastate; a unique state indicating success. Tracking each match is not possible without a frustrating amount of memory consumption. Also, taking the longest match does not mean that the longest match is the result of a greedy operator; it could be the result of any of the operators, greedy or not. Tracking which is responsible is difficult.
0

Share this post


Link to post
Share on other sites
You are right that what I said about making the match greedy is not correct, or at least not general enough. I need to think harder about how something like that would be implemented. I have only ever thought of how to write recognizers for plain regular expressions, without any considerations such as greedy operators.
0

Share this post


Link to post
Share on other sites
To even generalize further, what do you think each kind of loop, for, say, a kleene closure, would consist of? I think a simple "must meet minimum, but match until maximum or not a match" describes possessive; it'll match all that it can, within the bounds of its limitations. A non-greedy version would likely be "match until minimum, but after that, check the rest of the expression at the same time on every iteration". This would match a shorter match from the operator sooner than fully completing the loop and continuing a match; shorter matches can also always fail, requiring the use of the longer match.
0

Share this post


Link to post
Share on other sites
I don't think of a Kleene closure as a loop, in the usual sense of a loop in a programming language. When you see a "4" in /[0-9]*4/, you don't know if this is the last character and you should interpret it as being outside the loop or if it is part of the `[0-9]*', so you have to assume both as possibilities. As I said earlier, I would describe that by having the state be a set of nodes (i.e., use a [url="http://en.wikipedia.org/wiki/Nondeterministic_finite_automaton"]nondeterministic finite automaton[/url]). There is a discussion of this and other methods [url="http://en.wikipedia.org/wiki/Regular_expression#Implementations_and_running_times"]here[/url].
0

Share this post


Link to post
Share on other sites
Yes, I am using an NFA, as I mentioned earlier. However, if I simply created a new thread every time, it wouldn't be greedy. I also have no way of predicting the next character and its success, and if I did, if I created a new thread every time it might go either way, I would get the non-greedy result.
0

Share this post


Link to post
Share on other sites
As far as I can tell, you need to consider every possible match regardless of whether you want the operator to be greedy or not. Greediness only matters as a tiebreaker in case of multiple matches, and can probably be implemented by some scoring system.
0

Share this post


Link to post
Share on other sites
Hm... Thompson's NFA design is basically designed to halt after the first match. Modifying it to track all matches would suck. How much memory would it take to match /a*a*a*a*a*a*a*a*a*a*a*a*a*a*a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?/ against aaaaaaaaaaaaaaaaa...?
0

Share this post


Link to post
Share on other sites
To put it further, for capture groups, how much memory would it take to keep capture groups for all of those matches if I threw parenthesis around some of those? The memory consumption would be highly exponential in many cases.
0

Share this post


Link to post
Share on other sites
If you have some sort of score to decide which match is most appropriate, you don't need to ever keep all the matches in memory: You have a best match that led and its associated score, and you update them when a better one is found. At most you would need to remember the best match that got you to each node, or something of that sort. I don't think memory consumption needs to get out of control.
0

Share this post


Link to post
Share on other sites
This seems like an oddly complex solution for only one operator. I don't know how to score a match in such a way that I know which operator is responsible for the longest match. In addition to needing to know if the preferred match is a result of the operator in question, if there are more than one, not only would we need to track a match for this operator, but we'd need to track a match for every operator, because a high "score" for one operator might be a low score for another; with each additional greedy operator, the most common kind, I might add, the amount of work and storage to evaluate it increases exponentially, as we have to track and measure a score for each of them and try to determine which is the best match. For a type of operator that occurs many times in an expression, this work is unacceptable.

Is there really no way to implement this without using something outside of the state machine? Bear in mind that I also have a special set of loop nodes: one before a list of states to be repeated, and one that denotes the end of the repeated chain. A thread iterating between these two has an instance indicating the loop type, its minimum required count, the maximum count, and the current count.

I can't seem to figure this out. Let's try working from the problem backwards. The original example implementation was an anchored match; all of its operators were greedy because it evaluated as many iterations as possible, breaking when absolutely necessary. It checked at the end of the machine's execution if any of the threads were at the match state, and if so, it'd report a match. The thing is, if a match was encountered early, it'd be discarded and the thread terminated, because as an anchored matching expression, it must fully match the input text in its entirety; finding a match before the end of the string would be a false positive, because the string isn't a match. By discarding all of the shorter matches, it ensured each operator was operating at its greediest.

However, in modifying the machine to be unanchored, I changed it to stop at the sight of the first match. This works great, except for the few cases like these where the first match is not the best (or correct) match. Perhaps it'd solve my problem if I change how a match is determined, or if I find my method of unanchoring the match is improper?
0

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!


Register a new account

Sign in

Already have an account? Sign in here.


Sign In Now
Sign in to follow this  
Followers 0