RegEx matching difficulties

Started by
11 comments, last by Servant of the Lord 12 years, 1 month ago
[To any GD.net staff: Requesting 'Regex' as a GD.net forum topic prefix]

I'm trying to solve a text parsing problem using C++11's <regex> library. This is somewhat unrelated to C++, however. This is mostly a Regex question, with only a few minor C++ twists. I could solve this problem easily using other C++ methods, but I figure knowing RegEx is a very important skill that I'm currently lacking.

The text I'm trying to parse has this format (with a varying number of parameters, but for now I'm assuming it has at least one):
functionName("blah", "Foo", Stuff)

Quotation marks on the parameters are optional, so I need to accept both cases.
I want to get 'functionName' as one string, and each parameter as another string.

Here's what I'm trying to do (each part of the expression is meant to line up with a part of the format above):
[\w-]+\((("[\w- ]*")|([\w-]+))(, *(("[\w- ]*")|([\w-]+)))*\)

I've been using this site to test, and it's not matching as I'm thinking it should.

Here's what I'm thinking each part of this code does.

[\w-]+
Function name. Alphanumeric, and includes hyphens.
Should match: the 'functionName' in functionName("blah", "Foo", Stuff)

\(
Opening function bracket.
Should match: the '(' in functionName("blah", "Foo", Stuff)

(("[\w- ]*")|([\w-]+))
First function parameter.
Should match: the '"blah"' in functionName("blah", "Foo", Stuff)

Look at it like this:
( ("[\w- ]*") | ([\w-]+) )

First half: <quotation-mark> ZeroOrMore:(alphaNumerical OR hyphen OR space) <quotation-mark>
Second half: OneOrMore:(alphaNumerical OR hyphen)
The second half should match the same as the first half, but without quotes and without spaces (because arguments should only have spaces within quotes).

(, *(("[\w- ]*")|([\w-]+)))*
This is exactly the same as the previous part, except that it is prefixed with a comma and optional spaces. This entire sub-expression occurs zero or more times.
Should match: the ', "Foo", Stuff' in functionName("blah", "Foo", Stuff)

\)
Closing function bracket.
Should match: the ')' in functionName("blah", "Foo", Stuff)

=============================================

My questions are several:
1) What am I overlooking in the above expression? Why doesn't the expression 'match' the example input?

2) Assuming it did match, how do I 'pull out' or retrieve the different parts I want? I know how to do this on C++'s side of it, but I don't know how to specify in the expression itself which parts are the parts I want (the function name, and each argument), versus which parts to discard.

3) How do I repeat a sub-expression, so I don't have to copy and paste it multiple times into the expression? (Example: I have the second and later arguments as a copy and paste of the first argument... how do I avoid that?)
What you are attempting to do cannot be done with one regular expression. Regular expressions cannot have a variable number of capturing groups. Of course, there are multiple ways you could work around this: a simple parser, multiple regular expressions, or some other option.

I highly recommend you read these tutorials on regular expressions, specifically, the article on grouping and backreferences.

You're on the right track, but here's how I would do it. Create one regular expression that captures functionName in one group and the whole parameter list in a second group. Because the regular expression has matched the parameter list, you know it is well-formed. Once this is done, you can take the contents of the parameters group and parse them any way you like, e.g. by splitting the string on commas.
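A minimal sketch of that two-step approach, assuming C++11 <regex> with the default ECMAScript grammar. The names ParsedCall and parseCall are hypothetical, made up for this illustration. Note that a plain split on commas would break a quoted argument that itself contains a comma.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical result type for this sketch.
struct ParsedCall {
    std::string name;
    std::vector<std::string> args;
};

// Returns true and fills `out` if `text` looks like name(arg, arg, ...).
bool parseCall(const std::string& text, ParsedCall& out) {
    // Group 1: function name. Group 2: everything between the parentheses.
    static const std::regex callPattern(R"(([\w-]+)\s*\(([^()]*)\))");
    std::smatch match;
    if (!std::regex_match(text, match, callPattern)) {
        return false;
    }
    out.name = match[1].str();
    out.args.clear();

    // Split the parameter list on commas, trimming spaces and tabs.
    const std::string params = match[2].str();
    std::size_t start = 0;
    while (start <= params.size()) {
        std::size_t comma = params.find(',', start);
        if (comma == std::string::npos) {
            comma = params.size();
        }
        const std::string arg = params.substr(start, comma - start);
        const std::size_t first = arg.find_first_not_of(" \t");
        const std::size_t last = arg.find_last_not_of(" \t");
        if (first != std::string::npos) {
            out.args.push_back(arg.substr(first, last - first + 1));
        }
        start = comma + 1;
    }
    return true;
}
```

Called on functionName("blah", "Foo", Stuff), this should yield the name plus three arguments, with the quotation marks still attached to the first two.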
Denzel Morris (@drdizzy) :: Software Engineer :: SkyTech Enterprises, Inc.
"When men are most sure and arrogant they are commonly most mistaken, giving views to passion without that proper deliberation which alone can secure them from the grossest absurdities." - David Hume

[quote]What you are attempting to do cannot be done with one regular expression. Regular expressions cannot have a variable number of capturing groups.[/quote]

That's what I was thinking. Multiple separate expressions seemed a lot easier, but I was trying to avoid taking an easy out just because I might've misunderstood something and overlooked a potentially powerful feature.

The only reason I thought it might have been possible to capture more than one, uh, "capture group" was that over here, someone appeared to have done so. He seems to have captured '2' and 'Egg prices' with a single expression, with different parts of his one expression capturing the different matches. I'm probably just misunderstanding what he did there, but that was the reason why I thought it possible.

[Edit:] Ah, that'd be that backreference thing you were talking about. I think I get that.

[quote]I highly recommend you read these tutorials on regular expressions, specifically, the article on grouping and backreferences.[/quote]
Thanks, I'll check them out!

[quote]The only reason I thought it might have been possible to capture more than one, uh, "capture groups", was because over here, someone appeared to have done so. He seems to have captured '2' and 'Egg prices' with a single expression, with different parts of his one expression capturing the different matches.[/quote]

You can have multiple capturing groups, but you cannot have a variable number of capturing groups. That is the key distinction. Furthermore, as the articles I linked to will show you, anything in parentheses creates a backreference (capturing group) unless capturing has been disabled for that group.

Look at his regular expression. Notice there are two sets of matching parentheses, therefore two backreferences are created. His regular expression will always capture two groups; never more, never less.
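To illustrate the distinction, here is a small sketch assuming C++11 <regex> with the ECMAScript grammar (the function name lastRepeatedCapture is hypothetical). The pattern has exactly two capturing groups no matter how many arguments the input has; when a group sits inside a repetition, only its last iteration is kept.

```cpp
#include <regex>
#include <string>

// Returns the text held by capture group 2 once the whole input has matched,
// or an empty string if the input does not match at all.
std::string lastRepeatedCapture(const std::string& text) {
    // Group 1: the name. Group 2 sits inside a repeated non-capturing group,
    // so each repetition overwrites what the previous one captured.
    static const std::regex pattern(R"((\w+)\((?:(\w+),?)*\))");
    std::smatch match;
    if (!std::regex_match(text, match, pattern)) {
        return "";
    }
    // match.size() is always 3 here: the whole match, group 1, and group 2.
    return match[2].str();
}
```

For an input such as f(a,b,c), the earlier arguments are matched but not retained, which is why a single expression cannot hand back a variable-length argument list through its groups.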
Denzel Morris (@drdizzy) :: Software Engineer :: SkyTech Enterprises, Inc.
"When men are most sure and arrogant they are commonly most mistaken, giving views to passion without that proper deliberation which alone can secure them from the grossest absurdities." - David Hume
After going over the tutorial, here is my current regex which matches correctly. Are there any places I should improve it?
(\w+)\s*\(
\s*(?:([\w-]+|"[\w- "]+")\s*)?
(?:,\s*([\w-]+|"[\w- "]+")\s*)?
(?:,\s*([\w-]+|"[\w- "]+")\s*)?
(?:,\s*([\w-]+|"[\w- "]+")\s*)?
(?:,\s*([\w-]+|"[\w- "]+")\s*)?
\)


(The newlines were added for legibility; they are not in the actual regex.)

It matches something like this, just fine:
functionName("arg1", arg2,arg3, "arg4" , "arg five")

And matches zero to five arguments.

I grab the function name as the first capture group, and each (optional) argument as a separate capture group.

There are two things I've tried to do but can't figure out:
1) How can I capture the arguments without capturing the quotes themselves? I tried using conditionals, but I'd accidentally end up with more capture groups than I wanted (Two per argument - one invalid and one valid).
2) The last four capture groups are completely identical - I just copied and pasted them. Is there a way to say, "Use this same expression" instead of "Use the result of this expression (back referencing)"?

Another question is, is there any way to make regular expressions more legible by spacing out the characters, or something? Or is there a good free RegEx editor (with syntax highlighting) you'd recommend?

[quote]2) The last four capture groups are completely identical - I just copied and pasted them. Is there a way to say, "Use this same expression" instead of "Use the result of this expression (back referencing)"?[/quote]

Sounds like you want to use std::regex_iterator.
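A rough sketch of what that could look like, assuming the parameter list has already been isolated as a string. The function name extractArguments and the pattern are assumptions for illustration, not taken from the thread. Using one group for quoted contents and another for bare words also returns quoted arguments without their quotation marks.

```cpp
#include <regex>
#include <string>
#include <vector>

// Extracts each argument from a parameter list such as:  "arg1", arg2, "arg five"
// Quoted arguments are returned without their quotation marks.
std::vector<std::string> extractArguments(const std::string& params) {
    // Group 1: contents of a quoted argument. Group 2: an unquoted argument.
    static const std::regex argPattern(R"re("([^"]*)"|([\w-]+))re");
    std::vector<std::string> args;
    const std::sregex_iterator end;
    for (std::sregex_iterator it(params.begin(), params.end(), argPattern); it != end; ++it) {
        const std::smatch& m = *it;
        // Exactly one of the two groups participates in each match.
        args.push_back(m[1].matched ? m[1].str() : m[2].str());
    }
    return args;
}
```

The single-argument pattern is written once and applied repeatedly, so nothing needs to be copied and pasted. This sketch does not validate the separators between arguments; that check would have to come from a separate expression.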

[quote]Another question is, is there any way to make regular expressions more legible by spacing out the characters, or something?[/quote]
Well, whitespace is significant in the regex grammars that std::regex supports, so no, there is no way to write the expression with extra spaces. What you can do is paste together segments using regular C++ string concatenation, and you can use named symbols for more descriptive code.
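As a sketch of the concatenation idea, the five-argument expression could be assembled from named pieces like this. The variable names and the exact character classes are assumptions for illustration; the hyphen is placed last in each bracket expression.

```cpp
#include <regex>
#include <string>

// Builds a pattern for name(arg, ...) with zero to five arguments
// from named pieces instead of one long literal.
std::regex buildCallPattern() {
    const std::string name     = "(\\w+)";
    // One argument: a bare word, or a quoted run of word characters,
    // spaces, and hyphens.
    const std::string argument = "([\\w-]+|\"[\\w -]+\")";
    const std::string firstArg = "\\s*(?:" + argument + "\\s*)?";
    const std::string nextArg  = "(?:,\\s*" + argument + "\\s*)?";

    return std::regex(name + "\\s*\\(" +
                      firstArg + nextArg + nextArg + nextArg + nextArg +
                      "\\)");
}
```

The repeated nextArg piece still produces one capture group per occurrence, so the resulting expression has six groups in total: the name and up to five arguments.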

BTW, you need to be careful when using a hyphen in a bracket expression. If the hyphen appears anywhere but the first or last position, it indicates a range and does not match a literal hyphen. By that I mean [\w- ] means "match any character between [:alnum:] and the space in the current collating order" which makes no sense to me. I would expect it to throw std::regex_error with code() == std::regex_constants::error_range.
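A small sketch of the safer placement, assuming the default ECMAScript grammar. The helper names compiles and isWordSpaceOrHyphenRun are hypothetical. Whether a given implementation actually rejects [\w- ] may vary, so the sketch only relies on the hyphen-last form.

```cpp
#include <regex>
#include <string>

// Returns true if `pattern` compiles under the default ECMAScript grammar.
bool compiles(const std::string& pattern) {
    try {
        std::regex re(pattern);
        return true;
    } catch (const std::regex_error&) {
        return false;
    }
}

// The hyphen is placed last in the bracket expression, so it cannot be
// read as the start of a range.
bool isWordSpaceOrHyphenRun(const std::string& text) {
    static const std::regex safeClass(R"([\w -]+)");
    return std::regex_match(text, safeClass);
}
```

Passing a candidate expression to compiles() before using it is a cheap way to find out how a particular standard library treats a questionable bracket expression.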

Stephen M. Webb
Professional Free Software Developer


[quote]By that I mean [\w- ] means "match any character between [:alnum:] and the space in the current collating order" which makes no sense to me. I would expect it to throw std::regex_error with code() == std::regex_constants::error_range.[/quote]

You're right, it's throwing. I've been testing here for learning. Fixed the hyphen issue, but now it's throwing std::regex_constants::error_escape, which apparently means I have an invalid escaped character or a trailing escape.

Well, that shouldn't be too hard to find, right?

(\w+)\s*\(\s*(?:([\w-]+|"[\w"-]+")\s*)?(?:,\s*([\w-]+|"[\w"-]+")\s*)?(?:,\s*([\w-]+|"[\w"-]+")\s*)?(?:,\s*([\w-]+|"[\w"-]+")\s*)?(?:,\s*([\w-]+|"[\w"-]+")\s*)?\)

Anyway, I trimmed it down, and even a regex like '\w' or '\s' throws the error. This makes me believe that \w or \s is different in C++ from the JavaScript ones. I can find so little info on the C++ regex engine that I don't know what special characters C++ regex uses.

Are there any C++11 regex reference sheets available? I can only find Perl and JavaScript stuff.
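For what it's worth, the default grammar of std::regex is a modified ECMAScript one, and it does support \w, \s, and \d. One thing worth ruling out, assuming the expression is written as a literal in the source (the post does not say), is the compiler consuming the backslash before the regex engine ever sees it. A minimal sketch, with the hypothetical function name countWordRuns:

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Counts the runs of word characters in `text` using the \w class.
std::size_t countWordRuns(const std::string& text) {
    // Raw string literal: the regex engine receives \w+ unchanged.
    // As an ordinary literal the same pattern needs a doubled backslash: "\\w+".
    static const std::regex word(R"(\w+)");
    const std::sregex_iterator begin(text.begin(), text.end(), word);
    const std::sregex_iterator end;
    return static_cast<std::size_t>(std::distance(begin, end));
}
```

Writing "\w+" with a single backslash in an ordinary literal is an unrecognized escape that compilers typically warn about, so the pattern that reaches std::regex may not be the one intended.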

Even expressions like '[a-z]' or even '[a]' throw exceptions. (In those two cases, it throws error_brack.)

The only expression that hasn't thrown so far is just 'a' on its own... and that fails to match a string that is also only a single 'a'. Weird.
Anyone want to give this a shot, and see if it works for them?


#include <iostream>
#include <regex>
#include <string>

int main()
{
    std::string stringToSearch = "Book";
    std::string expression = "B";
    try
    {
        std::regex regex(expression, std::regex_constants::ECMAScript);
        std::smatch results;
        if(std::regex_search(stringToSearch, results, regex))
        {
            std::cout << "The regex matched! Results:" << std::endl;
            int num = 0;
            for(const auto &result : results)
            {
                // Stream the index directly; IntToString() is not a standard function.
                std::cout << "\tSubmatch" << num++ << ": " << result.str() << std::endl;
            }
        }
        else
        {
            std::cout << "The regex didn't match!"
                      << "\nRegex: " << expression
                      << "\nString: " << stringToSearch << std::endl;
        }
    }
    catch(const std::regex_error &err)
    {
        // Report which regex_constants error code was thrown.
        if(err.code() == std::regex_constants::error_collate)
            std::cout << "Error: error_collate" << std::endl;
        else if(err.code() == std::regex_constants::error_stack)
            std::cout << "Error: error_stack" << std::endl;
        else if(err.code() == std::regex_constants::error_complexity)
            std::cout << "Error: error_complexity" << std::endl;
        else if(err.code() == std::regex_constants::error_badrepeat)
            std::cout << "Error: error_badrepeat" << std::endl;
        else if(err.code() == std::regex_constants::error_space)
            std::cout << "Error: error_space" << std::endl;
        else if(err.code() == std::regex_constants::error_range)
            std::cout << "Error: error_range" << std::endl;
        else if(err.code() == std::regex_constants::error_badbrace)
            std::cout << "Error: error_badbrace" << std::endl;
        else if(err.code() == std::regex_constants::error_brace)
            std::cout << "Error: error_brace" << std::endl;
        else if(err.code() == std::regex_constants::error_paren)
            std::cout << "Error: error_paren" << std::endl;
        else if(err.code() == std::regex_constants::error_brack)
            std::cout << "Error: error_brack" << std::endl;
        else if(err.code() == std::regex_constants::error_backref)
            std::cout << "Error: error_backref" << std::endl;
        else if(err.code() == std::regex_constants::error_escape)
            std::cout << "Error: error_escape" << std::endl;
        else if(err.code() == std::regex_constants::error_ctype)
            std::cout << "Error: error_ctype" << std::endl;
        else
            std::cout << "Error: [UNKNOWN]" << std::endl;
    }
    return 0;
}


My output:

The regex didn't match!
Regex: B
String: Book
Found the problem. Here is GCC's version of the function I'm using:
template<...>
inline bool regex_search(...)
{
    return false;
}


Well, no cookies for me today. Apparently the regex portion of the C++11 standard library hasn't yet been implemented in GCC (and thus, MinGW), but the functions and classes exist and have stub definitions, so compiling gives you no errors.

Thanks for all your help. At least I'm learning RegEx.
