Sign in to follow this  
BeanDog

Optimize my regular expression

Recommended Posts

Consider the following regex: (final)?\s*(abstract)?\s*(public|private|protected)?\s*(static)?\s*function\s*([a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*)\s*\(([^\)]*)\)(\s*;)? It is designed to find PHP function declarations. The problem is that it takes minutes to search 4KB of PHP source code. The problem is in the beginning. I'm looking for a sequence of possible modifiers (final, abstract, etc.) that may or may not appear before the function declaration. If I remove them from the regex entirely, it runs instantaneously. So, I've just removed them and manually searched for the modifiers after the simplified regex matched. It's a mess, though... Does anyone know how I could make this work reasonably quickly while still having the regular expression find the modifiers for me? Thanks!

Share this post


Link to post
Share on other sites
Guest Anonymous Poster
Do the modifiers even have to be in that order? Or could some function be "abstract private static final function"? If so, all the more reasons not to use regexp for looking up the modifiers.

Share this post


Link to post
Share on other sites
Guest Anonymous Poster
Oh. Well, I think this *might* help:

(final\s+)?(abstract\s+)?(public|private|protected\s+)?(static\s+)?function\s*

Now you'd need to strip the spaces from after the modifiers though. But the matcher wouldn't try match for substrings that begin with \s (and then always fail when the next word isn't one of "final", "abstract" etc.), but for things that start with words "final", "abstract",..., or "function". Which should be much rarer.

Share this post


Link to post
Share on other sites
Guest Anonymous Poster
Hmm, I tested that with Python and it made the regexp 4 times faster. But it's still a far cry from leaving out the quantifiers which makes it ~700 times faster than the original regexp..

Share this post


Link to post
Share on other sites
Guest Anonymous Poster
With some more testing I saw that the fix can actually make the regexp 100+ times faster if the program has enough whitespace in it. So it's worth trying for you.

Share this post


Link to post
Share on other sites
If you don't need to check what matched within the groupings (i.e. don't care whether or not the function actually is final or abstract etc.), you can use non-capturing groupings:

(?:final\s+)?(?:abstract\s+)?(?:(?:public|private|protected)\s+)?(?:static\s+)?function\s*

Share this post


Link to post
Share on other sites
Does the function need to match only functions exactly? Or is it okay if it gives incorrect results for input that is not valid PHP code? If that is the case then you can simplify the regex to something like:
((final|abstract|public|private|protected|static)\s+)*

Share this post


Link to post
Share on other sites
twanvl beat me to it; you don't need a regexp to match 'real' syntax, so long as your real error checking is elsewhere (which it is; in the actual scripting engine). Your editor can use a much simplified form of the syntax, unless you want to be able to specifically highlight syntax errors, in which case a regexp is far too weak and slow a tool; you want an iterative re-sequencable parser, which takes up more memory to store the reparsing tree, but gives updates near-linear on the input size, better if you break the parsing system into distinct lines of code.

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this