Servant of the Lord

RegEx matching difficulties

[color=#FF0000][size=2][To any GD.net staff: Requesting 'Regex' as a GD.net forum topic prefix][/size][/color]

I'm trying to solve a text parsing problem using C++11's <regex> library. This is somewhat unrelated to C++, however. This is mostly a Regex question, with only a few minor C++ twists. I could solve this problem easily using other C++ methods, but I figure knowing RegEx is a very important skill that I'm currently lacking.

The text I'm trying to parse has this format: (with a varying number of parameters - but for now, I'm assuming it has at least one)
[b][color=#ff8c00]functionName[/color][size=5][color=#ff0000]([/color][/size][color=#008080]"blah"[/color][color=#800080], "Foo", Stuff[/color][size=5][color=#ff0000])[/color][/size][/b]

Quotation marks on the parameters are optional, so I need to accept both cases.
I want to get 'functionName' as one string, and each parameter as another string.

Here's what I'm trying to do: (Code highlighted to match my _expectation_ of where it should line up on the format above).
[b][color=#ff8c00][\w-]+[/color][size=5][color=#ff0000]\([/color][/size][color=#008080](("[\w- ]*")|([\w-]+))[/color][color=#800080](, *[/color][/b][color=#800080][b](("[\w- ]*")|([\w-]+))[/b][/color][b][color=#800080])*[/color][size=5][color=#ff0000]\)[/color][/size] [/b]

I've been using [url="http://www.regextester.com/index2.html"]this site to test[/url], and it's not matching as I'm thinking it should.

Here's what I'm [i]thinking[/i] each part of this code does.

[b][color=#FF8C00][\w-]+[/color][/b]
Function name. Alphanumerical, and includes hyphens.
Should match: [b][color=#ff0000]functionName[/color][size=5]([/size]"blah", "Foo", Stuff[size=5])[/size][/b]

[b][size=5][color=#FF0000]\([/color][/size][/b]
Opening function bracket.
Should match: [b]functionName[color=#ff0000][size=5]([/size][/color]"blah", "Foo", Stuff[size=5])[/size][/b]

[b][color=#008080](("[\w- ]*")|([\w-]+))[/color][/b]
First function parameter.
Should match: [b]functionName[size=5]([/size][color=#ff0000]"blah"[/color], "Foo", Stuff[size=5])[/size][/b]

Look at it like this:
[b][color=#ff0000]( [/color][color=#008080]("[\w- ]*") [/color][size=4][color=#ff0000]| [/color][/size][color=#008080]([\w-]+) [/color][color=#ff0000])[/color][/b]
First half: <quotation-mark> [b]ZeroOrMore[/b]:(alphaNumerical [b]OR[/b] hyphen [b]OR[/b] space) <quotation-mark>
Second half: [b]OneOrMore[/b]:(alphaNumerical [b]OR[/b] hyphen)
The second half should match the same as the first half, but without quotes and without spaces (because arguments should only have spaces within quotes).

[b][color=#800080](, *[/color][/b][color=#800080][b](("[\w- ]*")|([\w-]+))[/b][/color][b][color=#800080])*[/color][/b]
This is exactly the same as the previous, except it is prefixed with a comma and optional spaces. This entire sub-expression occurs zero or more times.
Should match: [b]functionName[size=5]([/size]"blah", [color=#ff0000]"Foo", Stuff[/color][size=5])[/size][/b]

[b][size=5][color=#FF0000]\)[/color][/size][/b]
Closing function bracket.
Should match: [b]functionName[size=5]([/size]"blah", "Foo", Stuff[size=5][color=#FF0000])[/color][/size][/b]

=============================================

My questions are several:
[b]1) [/b]What am I overlooking in the above expression? Why doesn't the expression 'match' the example input?

[b]2) [/b]Assuming it did match, how do I 'pull out' or retrieve the different parts I want? I know how to do this on C++'s side of it, but I don't know how to specify in the expression itself which parts are the parts I want (the function name, and each argument), versus which parts to discard.

[b]3) [/b]How do I repeat a sub-expression, so I don't have to copy + paste it multiple times into the expression? (Example: I have the second+ arguments as a copy and paste of the first argument... how do I avoid that?)

What you are attempting to do cannot be done with one regular expression. Regular expressions cannot have a variable number of capturing groups. Of course, there are multiple ways you could work around this: a simple parser, multiple regular expressions, or some other option.

I highly recommend you read [url="http://www.regular-expressions.info/tutorial.html"]these tutorials on regular expressions[/url], specifically, the article on [url="http://www.regular-expressions.info/brackets.html"]grouping and backreferences[/url].

You're on the right track, but here's how I would do it. Create one regular expression that captures [i]functionName[/i] in one group and [i]parameters[/i] in a second group. By capturing the parameters with the regular expression you are assured that they are correct. Once this is done you can take the results from the [i]parameters[/i] group and parse them any way you like e.g. split the string on commas.
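A rough sketch of that two-step approach with std::regex, assuming quoted arguments never contain commas or parentheses. The names [i]ParsedCall[/i], [i]parseCall[/i] and [i]trimArg[/i] are made up for this example, and it needs a standard library whose <regex> is actually implemented:

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical result type for this sketch.
struct ParsedCall
{
    std::string name;
    std::vector<std::string> args;
};

// Strips surrounding spaces/tabs and one pair of surrounding quotes.
static std::string trimArg(std::string s)
{
    const auto first = s.find_first_not_of(" \t");
    if (first == std::string::npos)
        return "";
    const auto last = s.find_last_not_of(" \t");
    s = s.substr(first, last - first + 1);
    if (s.size() >= 2 && s.front() == '"' && s.back() == '"')
        s = s.substr(1, s.size() - 2);
    return s;
}

// Returns true and fills 'out' if 'text' looks like name(arg, arg, ...).
bool parseCall(const std::string &text, ParsedCall &out)
{
    // Group 1: function name. Group 2: everything between the parentheses.
    static const std::regex callRegex(R"(\s*([\w-]+)\s*\(([^()]*)\)\s*)");
    std::smatch m;
    if (!std::regex_match(text, m, callRegex))
        return false;

    out.name = m[1].str();
    out.args.clear();

    const std::string params = m[2].str();
    if (params.find_first_not_of(" \t") == std::string::npos)
        return true; // empty parameter list

    // Naive split on commas (assumes no commas inside quoted arguments).
    std::string::size_type start = 0;
    while (true)
    {
        const auto comma = params.find(',', start);
        out.args.push_back(trimArg(params.substr(start, comma - start)));
        if (comma == std::string::npos)
            break;
        start = comma + 1;
    }
    return true;
}
```

Calling parseCall on the example input from the first post should give the name functionName and the three arguments blah, Foo and Stuff, with the quotes stripped by trimArg rather than by the regex.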

[quote name='DenzelM' timestamp='1330047711' post='4916056']
What you are attempting to do cannot be done with one regular expression. Regular expressions cannot have a variable number of capturing groups.[/quote]
That's what I was thinking. Multiple separate expressions seemed a lot easier, but I was trying to avoid taking an easy out just because I might've misunderstood something and overlooked a potentially powerful feature.

The only reason I thought it might have been possible to capture more than one, uh, "capture groups", was because [url="http://www.johndcook.com/cpp_regex.html#retrieve"]over here[/url], someone appeared to have done so. He seems to have captured '[i]2[/i]' and '[i]Egg prices[/i]' with a single expression, with different parts of his one expression capturing the different matches. I'm probably just misunderstanding what he did there, but that was the reason why I thought that possible.

[b][Edit:][/b] Ah, that'd be that backreference thing you were talking about. I think I get that.

[quote]I highly recommend you read [url="http://www.regular-expressions.info/tutorial.html"]these tutorials on regular expressions[/url], specifically, the article on [url="http://www.regular-expressions.info/brackets.html"]grouping and backreferences[/url].[/quote]
Thanks, I'll check them out! [img]http://public.gamedev.net//public/style_emoticons/default/smile.png[/img]

[quote name='Servant of the Lord' timestamp='1330052975' post='4916086']
The only reason I thought it might have been possible to capture more than one, uh, "capture groups", was because over here, someone appeared to have done so. He seems to have captured '2' and 'Egg prices' with a single expression, with different parts of his one expression capturing the different matches.
[/quote]
You can have multiple capturing groups, but you cannot have a [i]variable[/i] number of capturing groups. There is a key distinction. Furthermore, as the articles I linked to will show you, anything in parentheses creates a backreference (capturing group) unless capturing has been disabled for that group, e.g. with (?:...).

Look at his regular expression. Notice there are two sets of matching parentheses, therefore two backreferences are created. His regular expression will always capture two groups; never more, never less.
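To see the "fixed number of groups" point in code, here is a small sketch using std::regex with its default ECMAScript grammar. [i]kCallPattern[/i] and [i]captureAll[/i] are names invented for this example. The pattern has three sets of capturing parentheses, so a match always reports the whole match plus three sub-matches, even though the third group sits inside a repetition:

```cpp
#include <regex>
#include <string>
#include <vector>

// Three capturing groups: name, first argument, and a repeated "next argument".
const char *const kCallPattern = R"((\w+)\((\w+)(?:,\s*(\w+))*\))";

// Returns every sub-match of one regex_match call (index 0 is the whole match).
// The result is empty if the text does not match.
std::vector<std::string> captureAll(const std::string &text, const std::string &pattern)
{
    const std::regex re(pattern);
    std::smatch m;
    std::vector<std::string> out;
    if (std::regex_match(text, m, re))
    {
        for (const auto &sub : m)
            out.push_back(sub.str());
    }
    return out;
}
```

With the input f(a, b, c) the repeated group only keeps its last iteration, so the third group holds c and the b is lost. That is why a single expression can't hand you an arbitrary-length argument list.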

After going over the tutorial, here is my current regex which matches correctly. Are there any places I should improve it?
[i](\w+)\s*\(
\s*(?:([\w-]+|"[\w- "]+")\s*)?
(?:,\s*([\w-]+|"[\w- "]+")\s*)?
(?:,\s*([\w-]+|"[\w- "]+")\s*)?
(?:,\s*([\w-]+|"[\w- "]+")\s*)?
(?:,\s*([\w-]+|"[\w- "]+")\s*)?
\)[/i]

([size=2]The newlines are added for legibility, but are not in the actual regex[/size])

It matches something like this, just fine:
functionName("arg1", arg2,arg3, "arg4" , "arg five")

And matches zero to five arguments.

I grab the function name as the first capture group, and each (optional) argument as a separate capture group.

There are two things I've tried to do, that I can't figure out how:
[b]1) [/b]How can I capture the arguments without capturing the quotes themselves? I tried using conditionals, but I'd accidentally end up with more capture groups than I wanted (Two per argument - one invalid and one valid).
[b]2) [/b]The last four capture groups are completely identical - I just copied and pasted them. Is there a way to say, "Use this same expression" instead of "Use the result of this expression (back referencing)"?

Another question is, is there any way to make regular expressions more legible by spacing out the characters, or something? Or is there a good free RegEx editor (with syntax highlighting) you'd recommend?

[quote name='Servant of the Lord' timestamp='1330112292' post='4916300']
[b]2) [/b]The last four capture groups are completely identical - I just copied and pasted them. Is there a way to say, "Use this same expression" instead of "Use the result of this expression (back referencing)"?
[/quote]
Sounds like you want to use std::regex_iterator.
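Something along these lines, as a sketch only: one small expression describes a single argument, and std::sregex_iterator applies it repeatedly to the text between the parentheses. [i]extractArgs[/i] is a made-up name, and the sketch assumes the parameter text has already been isolated and does no validation of what sits between the arguments. Using two alternative groups also takes care of dropping the quotes:

```cpp
#include <regex>
#include <string>
#include <vector>

// Pulls each argument out of the text between the parentheses,
// e.g.  "arg1", arg2,arg3, "arg five"
std::vector<std::string> extractArgs(const std::string &params)
{
    // Group 1: contents of a quoted argument (quotes excluded).
    // Group 2: an unquoted argument.
    static const std::regex argRegex(R"rx("([^"]*)"|([\w-]+))rx");

    std::vector<std::string> args;
    for (std::sregex_iterator it(params.begin(), params.end(), argRegex), end;
         it != end; ++it)
    {
        const std::smatch &m = *it;
        // Exactly one of the two groups takes part in each match.
        args.push_back(m[1].matched ? m[1].str() : m[2].str());
    }
    return args;
}
```

Checking m[1].matched rather than testing for an empty string keeps an empty quoted argument ("") from being confused with the unquoted alternative.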
[quote]
Another question is, is there any way to make regular expressions more legible by spacing out the characters, or something?
[/quote]
Not inside the pattern itself: whitespace is significant in the grammars std::regex supports, and there is no free-spacing mode like Perl's /x. What you can do is paste together segments using regular C++ string concatenation, and you can use named symbols for more descriptive code.
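For instance, a sketch of the concatenation idea (all of the variable names and [i]makeCallRegex[/i] are invented here, and the argument sub-pattern is my own guess at what you want to accept):

```cpp
#include <regex>
#include <string>

// Builds the function-call regex from named pieces, so the repeated
// "argument" sub-pattern is written only once.
std::regex makeCallRegex()
{
    const std::string name     = R"(([\w-]+))";               // group 1: function name
    const std::string spaces   = R"(\s*)";
    const std::string argument = R"rx((?:"[^"]*"|[\w-]+))rx"; // quoted or bare, not captured
    const std::string comma    = spaces + "," + spaces;

    // Group 2: the whole (optional) argument list.
    const std::string argList = "(" + argument + "(?:" + comma + argument + ")*)?";

    return std::regex(name + spaces + R"(\()" + spaces + argList + spaces + R"(\))");
}
```

The raw string literals (R"(...)") are there so the backslashes don't have to be doubled. Group 2 comes back as one string holding the full argument list, which can then be split or iterated separately.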

BTW, you need to be careful when using a hyphen in a bracket expression. If the hyphen appears anywhere but the first or last position, it indicates a range and does not match a literal hyphen. By that I mean [\w- ] means "match any character between [:alnum:] and the space in the current collating order", which makes no sense to me. I would expect it to throw [font=courier new,courier,monospace]std::regex_error[/font] with [font=courier new,courier,monospace]code() == std::regex_constants::error_range[/font].
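A minimal sketch of the safe spelling, with the hyphen moved to the end of the bracket expression ([i]isWordsSpacesHyphens[/i] is just a name for this example):

```cpp
#include <regex>
#include <string>

// True if 'text' consists only of word characters, spaces and hyphens.
bool isWordsSpacesHyphens(const std::string &text)
{
    // The hyphen is the last character inside the brackets, so it is
    // read as a literal '-' and not as the start of a range.
    static const std::regex re(R"([\w -]+)");
    return std::regex_match(text, re);
}
```

How an implementation reacts to the misplaced hyphen in [\w- ] can differ between standard libraries, so I wouldn't rely on any particular behaviour there; putting the hyphen last avoids the question.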

[quote name='Bregma' timestamp='1330113755' post='4916311']
By that I mean [\w- ] means "match any character between [:alnum:] and the space in the current collating order", which makes no sense to me. I would expect it to throw [font=courier new,courier,monospace]std::regex_error[/font] with [font=courier new,courier,monospace]code() == std::regex_constants::error_range[/font].
[/quote]
You're right, it's throwing. I've been testing [url="http://www.regextester.com/index2.html"]here[/url] for learning. Fixed the hyphen issue, but now it's throwing [b]std::regex_constants::error_escape[/b], which apparently means I have a trailing escape or an invalid escape code.

Well, that shouldn't be too hard to find, right? [img]http://public.gamedev.net//public/style_emoticons/default/dry.png[/img]

[color=#800080](\w+)\s*\(\s*(?:([\w-]+|"[\w"-]+")\s*)?(?:,\s*([\w-]+|"[\w"-]+")\s*)?(?:,\s*([\w-]+|"[\w"-]+")\s*)?(?:,\s*([\w-]+|"[\w"-]+")\s*)?(?:,\s*([\w-]+|"[\w"-]+")\s*)?\)[/color]

Anyway, I trimmed it down, and even a regex like [color=#800080]\w[/color] or [color=#800080]\s[/color] throws the error. This makes me believe that \w or \s in C++ is different from the JavaScript ones. I can find so little info on the C++ regex engine that I don't know what special characters C++ regex uses.

Are there any C++11 regex reference sheets available? I can only find Perl and JavaScript stuff.

Even expressions like [color=#800080][a-z][/color] or even [color=#800080][a][/color] throw exceptions. (In those two cases, it throws: "[b]error_brack[/b]")

The only expression that hasn't thrown so far is just "[color=#800080]a[/color]" on its own... And that fails to properly match a string that is also only a single 'a'. Weird.

Anyone want to give this a shot, and see if it works for them?

[CODE]
#include <iostream>
#include <regex>
#include <string>

int main()
{
    std::string stringToSearch = "Book";
    std::string expression = "B";
    try
    {
        std::regex regex(expression, std::regex_constants::ECMAScript);
        std::smatch results;
        if(std::regex_search(stringToSearch, results, regex))
        {
            std::cout << "The regex matched! Results:" << std::endl;
            int num = 0;
            for(const auto &result : results)
            {
                // operator<< prints the int directly, so no conversion helper is needed.
                std::cout << "\tSubmatch" << num++ << ": " << result.str() << std::endl;
            }
        }
        else
        {
            std::cout << "The regex didn't match!"
                      << "\nRegex: " << expression
                      << "\nString: " << stringToSearch << std::endl;
        }
    }
    catch(const std::regex_error &err)
    {
        // Report which std::regex_constants::error_type was thrown.
        if(err.code() == std::regex_constants::error_collate)
            std::cout << "Error: error_collate" << std::endl;
        else if(err.code() == std::regex_constants::error_stack)
            std::cout << "Error: error_stack" << std::endl;
        else if(err.code() == std::regex_constants::error_complexity)
            std::cout << "Error: error_complexity" << std::endl;
        else if(err.code() == std::regex_constants::error_badrepeat)
            std::cout << "Error: error_badrepeat" << std::endl;
        else if(err.code() == std::regex_constants::error_space)
            std::cout << "Error: error_space" << std::endl;
        else if(err.code() == std::regex_constants::error_range)
            std::cout << "Error: error_range" << std::endl;
        else if(err.code() == std::regex_constants::error_badbrace)
            std::cout << "Error: error_badbrace" << std::endl;
        else if(err.code() == std::regex_constants::error_brace)
            std::cout << "Error: error_brace" << std::endl;
        else if(err.code() == std::regex_constants::error_paren)
            std::cout << "Error: error_paren" << std::endl;
        else if(err.code() == std::regex_constants::error_brack)
            std::cout << "Error: error_brack" << std::endl;
        else if(err.code() == std::regex_constants::error_backref)
            std::cout << "Error: error_backref" << std::endl;
        else if(err.code() == std::regex_constants::error_escape)
            std::cout << "Error: error_escape" << std::endl;
        else if(err.code() == std::regex_constants::error_ctype)
            std::cout << "Error: error_ctype" << std::endl;
        else
            std::cout << "Error: [UNKNOWN]" << std::endl;
    }
    return 0;
}
[/CODE]

[b]My output:[/b]

The regex didn't match!
Regex: B
String: Book

Found the problem. Here is GCC's version of the function I'm using:
[code]template<...>
inline bool regex_search(...)
{
return false;
}[/code]

Well, no cookies for me today. Apparently the Regex portion of the C++11 standard library hasn't yet been implemented in GCC (and thus, MinGW), but the functions and classes exist and have stub definitions, so compiling gives you no errors. [img]http://public.gamedev.net//public/style_emoticons/default/laugh.png[/img]

Thanks for all your help. At least I'm learning RegEx.

[quote name='Servant of the Lord' timestamp='1330124561' post='4916358']
Found the problem. Here is GCC's version of the function I'm using:
[code]template<...>
inline bool regex_search(...)
{
return false;
}[/code]

Well, no cookies for me today. Apparently the Regex portion of the C++11 standard library hasn't yet been implemented in GCC (and thus, MinGW), but the functions and classes exist and have stub definitions, so compiling gives you no errors. [img]http://public.gamedev.net//public/style_emoticons/default/laugh.png[/img]
[/quote]
Yeah, sorry. My day job has been sucking all my productivity after I got about half-way through the implementation. If only I could convince my employer it would be worthwhile to pay me for 6 months to finish it, the world would be a much better place.

erm, so use Boost. It works fine...

although, to be honest, a simple parser would probably be quicker to implement.

[quote name='Washu' timestamp='1330136496' post='4916413']
erm, so use Boost. It works fine...[/quote]
Yep, I'm aware of Boost.Regex.
[quote]although, to be honest, a simple parser would probably be quicker to implement.
[/quote]
Certainly, I did so immediately after I found out about GCC's regex. It only took about 15 minutes.
But it is a very good, yet still simple enough, non-contrived problem to get my feet wet in Regex.
