
# RegEx matching difficulties


12 replies to this topic

### #1Servant of the Lord  Crossbones+   -  Reputation: 12496

Posted 23 February 2012 - 05:20 PM

[To any GD.net staff: Requesting 'Regex' as a GD.net forum topic prefix]

I'm trying to solve a text parsing problem using C++11's <regex> library. This is somewhat unrelated to C++, however. This is mostly a Regex question, with only a few minor C++ twists. I could solve this problem easily using other C++ methods, but I figure knowing RegEx is a very important skill that I'm currently lacking.

The text I'm trying to parse has this format: (with a varying number of parameters - but for now, I'm assuming it has at least one)
functionName("blah", "Foo", Stuff)

Quotation marks on the parameters are optional, so I need to accept both cases.
I want to get 'functionName' as one string, and each parameter as another string.

Here's what I'm trying to do: (Code highlighted to match my _expectation_ of where it should line up on the format above).
[\w-]+\((("[\w- ]*")|([\w-]+))(, *(("[\w- ]*")|([\w-]+)))*\)

I've been using this site to test, and it's not matching as I'm thinking it should.

Here's what I'm thinking each part of this code does.

[\w-]+
Function name. Alphanumerical, and includes hyphens.
Should match: functionName("blah", "Foo", Stuff)

\(
Opening function bracket.
Should match: functionName("blah", "Foo", Stuff)

(("[\w- ]*")|([\w-]+))
First function parameter.
Should match: functionName("blah", "Foo", Stuff)

Look at it like this: ( ("[\w- ]*") | ([\w-]+) )
First half: <quotation-mark> ZeroOrMore:(alphaNumerical OR hyphen OR space) <quotation-mark>
Second half: OneOrMore:(alphaNumerical OR hyphen)

The second half should match the same as the first half, but without quotes and without spaces (because arguments should only have spaces within quotes).

(, *(("[\w- ]*")|([\w-]+)))*
This is the exact same as the previous, except it is prefixed with a comma and optional spaces. This entire sub-expression occurs zero or more times.
Should match: functionName("blah", "Foo", Stuff)

\)
Closing function bracket.
Should match: functionName("blah", "Foo", Stuff)

=============================================

My questions are several:
1) What am I overlooking in this above expression? Why doesn't the expression 'match' the example input?

2) Assuming it did match, how do I 'pull out' or retrieve the different parts I want? I know how to do this on C++'s side of it, but I don't know how to specify in the expression itself which parts are the parts I want (the function name, and each argument), versus which parts to discard.

3) How do I repeat a sub-expression, so I don't have to copy + paste it multiple times into the expression? (Example: I have the second+ arguments as a copy and paste of the first argument... how do I avoid that?)

It's perfectly fine to abbreviate my username to 'Servant' rather than copy+pasting it all the time.

All glory be to the Man at the right hand... On David's throne the King will reign, and the Government will rest upon His shoulders. All the earth will see the salvation of God.                                                                                                                                                       [Need free cloud storage? I personally like DropBox]

### #2DenzelM  Members   -  Reputation: 295

Posted 23 February 2012 - 07:41 PM

What you are attempting to do cannot be done with one regular expression. Regular expressions cannot have a variable number of capturing groups. Of course, there are multiple ways you could work around this: a simple parser, multiple regular expressions, or some other option.

I highly recommend you read these tutorials on regular expressions, specifically, the article on grouping and backreferences.

You're on the right track, but here's how I would do it. Create one regular expression that captures functionName in one group and parameters in a second group. By capturing the parameters with the regular expression you are assured that they are correct. Once this is done you can take the results from the parameters group and parse them any way you like e.g. split the string on commas.
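A minimal sketch of that two-step approach, assuming a working `<regex>` implementation. The names `ParsedCall`, `trimAndUnquote` and `parseCall` are made up for illustration. As a simplification, the second group here grabs everything between the outer parentheses without validating each argument, a comma inside a quoted argument would split it, and empty arguments are dropped.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical result type: the function name plus each argument as a string.
struct ParsedCall {
    std::string name;
    std::vector<std::string> args;
};

// Removes leading/trailing spaces and tabs, then one pair of surrounding quotes.
static std::string trimAndUnquote(const std::string& raw) {
    const std::string::size_type first = raw.find_first_not_of(" \t");
    if (first == std::string::npos) return "";
    const std::string::size_type last = raw.find_last_not_of(" \t");
    std::string arg = raw.substr(first, last - first + 1);
    if (arg.size() >= 2 && arg.front() == '"' && arg.back() == '"')
        arg = arg.substr(1, arg.size() - 2);
    return arg;
}

// Step 1: one regex with exactly two capture groups.
// Step 2: split the second group on commas with ordinary string code.
// Returns false when the text does not look like a call at all.
bool parseCall(const std::string& text, ParsedCall& out) {
    // Group 1: function name. Group 2: everything between the outer parentheses.
    static const std::regex callPattern(R"(\s*([\w-]+)\s*\((.*)\)\s*)");
    std::smatch m;
    if (!std::regex_match(text, m, callPattern)) return false;

    out.name = m[1].str();
    out.args.clear();
    const std::string list = m[2].str();
    std::string::size_type start = 0;
    while (start <= list.size()) {
        std::string::size_type comma = list.find(',', start);
        if (comma == std::string::npos) comma = list.size();
        const std::string arg = trimAndUnquote(list.substr(start, comma - start));
        if (!arg.empty()) out.args.push_back(arg);
        start = comma + 1;
    }
    return true;
}
```

Calling `parseCall` on `functionName("blah", "Foo", Stuff)` should fill `name` with `functionName` and `args` with the three parameters, quotes removed.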
Denzel Morris (@drdizzy) :: Software Engineer :: SkyTech Enterprises, Inc.
"When men are most sure and arrogant they are commonly most mistaken, giving views to passion without that proper deliberation which alone can secure them from the grossest absurdities." - David Hume

### #3Servant of the Lord  Crossbones+   -  Reputation: 12496

Posted 23 February 2012 - 09:09 PM

> What you are attempting to do cannot be done with one regular expression. Regular expressions cannot have a variable number of capturing groups.

That's what I was thinking. Multiple separate expressions seemed a lot easier, but I was trying to avoid taking an easy out just because I might've misunderstood something and overlooked a potentially powerful feature.

The only reason I thought it might have been possible to capture more than one, uh, "capture groups", was because over here, someone appeared to have done so. He seems to have captured '2' and 'Egg prices' with a single expression, with different parts of his one expression capturing the different matches. I'm probably just misunderstanding what he did there, but that was the reason why I thought that possible.

[Edit:] Ah, that'd be that backreference thing you were talking about. I think I get that.

> I highly recommend you read these tutorials on regular expressions, specifically, the article on grouping and backreferences.

Thanks, I'll check them out!


### #4DenzelM  Members   -  Reputation: 295

Posted 23 February 2012 - 09:31 PM

> The only reason I thought it might have been possible to capture more than one, uh, "capture groups", was because over here, someone appeared to have done so. He seems to have captured '2' and 'Egg prices' with a single expression, with different parts of his one expression capturing the different matches.

You can have multiple capturing groups, but you cannot have a variable amount of capturing groups. There is a key distinction. Furthermore, as the articles I linked to will show you, anything in parentheses will create a backreference (capturing group) assuming that it has not been disabled.

Look at his regular expression. Notice there are two sets of matching parentheses, therefore two backreferences are created. His regular expression will always capture two groups; never more, never less.
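To make the "fixed number of groups" point concrete, here is a small sketch. The patterns in it are illustrative, not the one from the linked post, and the helper names `groupCount` and `lastRepeatedCapture` are invented. `std::basic_regex::mark_count()` reports how many capturing groups a compiled pattern has, and a group inside a repeated, non-capturing `(?:...)*` still occupies a single slot that ends up holding only the last repetition.

```cpp
#include <regex>
#include <string>

// Number of capturing groups, decided when the pattern is compiled.
unsigned groupCount(const std::string& pattern) {
    return std::regex(pattern).mark_count();
}

// Matches a comma-separated list of words and returns what the repeated
// group holds afterwards. Returns "" when the text does not match.
std::string lastRepeatedCapture(const std::string& text) {
    // Group 1: first word. Group 2: sits inside a repeated non-capturing group.
    static const std::regex listPattern(R"((\w+)(?:,(\w+))*)");
    std::smatch m;
    if (!std::regex_match(text, m, listPattern)) return "";
    // m.size() is always 3 here: the whole match plus two groups,
    // no matter how many words the list contained.
    return m[2].str();
}
```

For the input `a,b,c`, group 2 ends up holding only `c`; the earlier repetitions are overwritten, which is why the variable-length part needs a second pass.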

### #5Servant of the Lord  Crossbones+   -  Reputation: 12496

Posted 24 February 2012 - 01:38 PM

After going over the tutorial, here is my current regex which matches correctly. Are there any places I should improve it?
(\w+)\s*\(\s*
(?:([\w-]+|"[\w- "]+")\s*)?
(?:,\s*([\w-]+|"[\w- "]+")\s*)?
(?:,\s*([\w-]+|"[\w- "]+")\s*)?
(?:,\s*([\w-]+|"[\w- "]+")\s*)?
(?:,\s*([\w-]+|"[\w- "]+")\s*)?
\)

(The newlines are added for legibility, but are not in the actual regex.)

It matches something like this, just fine:
functionName("arg1", arg2,arg3, "arg4" , "arg five")

And matches zero to five arguments.

I grab the function name as the first capture group, and each (optional) argument as different capture group.

There are two things I've tried to do but can't figure out:
1) How can I capture the arguments without capturing the quotes themselves? I tried using conditionals, but I'd accidentally end up with more capture groups than I wanted (Two per argument - one invalid and one valid).
2) The last four capture groups are completely identical - I just copied and pasted them. Is there a way to say, "Use this same expression" instead of "Use the result of this expression (back referencing)"?

Another question is, is there any way to make regular expressions more legible by spacing out the characters, or something? Or is there a good free RegEx editor (with syntax highlighting) you'd recommend?


### #6Bregma  Crossbones+   -  Reputation: 3444

Posted 24 February 2012 - 02:02 PM

> 2) The last four capture groups are completely identical - I just copied and pasted them. Is there a way to say, "Use this same expression" instead of "Use the result of this expression (back referencing)"?

Sounds like you want to use std::regex_iterator.
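A rough sketch of how that could look, assuming the argument list (the text between the parentheses) has already been isolated. The function name `extractArgs` is made up. The pattern describes one argument only, and `std::sregex_iterator` applies it repeatedly, so nothing has to be copied and pasted. Using two alternatives with one group each also keeps the quotes out of the captured text.

```cpp
#include <regex>
#include <string>
#include <vector>

// Collects every argument from a list such as:  "arg1", arg2,arg3, "arg five"
std::vector<std::string> extractArgs(const std::string& argList) {
    // Group 1: contents of a quoted argument (the quotes themselves excluded).
    // Group 2: an unquoted argument.
    static const std::regex argPattern(R"re("([\w -]*)"|([\w-]+))re");

    std::vector<std::string> args;
    for (std::sregex_iterator it(argList.begin(), argList.end(), argPattern), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        // Exactly one of the two groups took part in each match.
        args.push_back(m[1].matched ? m[1].str() : m[2].str());
    }
    return args;
}
```

Each step of the iterator yields one match, so the number of arguments is no longer limited by the number of groups written into the pattern. Note that this only finds arguments; it does not check that the separators between them are commas.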

> Another question is, is there any way to make regular expressions more legible by spacing out the characters, or something?

Well, the regex grammars that std::regex supports treat whitespace in a pattern as literal characters to match, so no, there is no way of writing the pattern with extra spaces. What you can do is paste together segments using regular C++ string concatenation, and you can use named symbols for more descriptive code.
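As a sketch of the concatenation idea, here is the five-argument expression from the previous post assembled from named pieces. The variable names and the function `makeCallRegex` are invented for illustration, and the hyphen has been moved to the end of the bracket expression.

```cpp
#include <regex>
#include <string>

// Builds: name ( optional-first-arg, up to four more optional args )
std::regex makeCallRegex() {
    const std::string name     = R"((\w+))";                     // capture: function name
    const std::string ws       = R"(\s*)";                       // optional whitespace
    const std::string arg      = R"re(([\w-]+|"[\w -]+"))re";    // capture: bare or quoted argument
    const std::string firstArg = "(?:" + arg + ws + ")?";        // optional first argument
    const std::string nextArg  = "(?:," + ws + arg + ws + ")?";  // optional ", argument"

    // "\\(" and "\\)" are the escaped literal parentheses of the call.
    return std::regex(name + ws + "\\(" + ws
                      + firstArg + nextArg + nextArg + nextArg + nextArg
                      + "\\)");
}
```

The compiled pattern is the same kind of single-line expression as before; only the C++ source is easier to read, and the repeated fragment is written once.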

BTW, you need to be careful when using a hyphen in a subset expression. If the hyphen appears anywhere but the first or last position, it indicates a subset range, and does not match explicitly. By that I mean [\w- ] means "match any character between [:alnum:] and the space in the current collating order" which makes no sense to me. I would expect it to throw std::regex_error with code() == std::regex_error::error_range.
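A small sketch of the safe spelling, with the hyphen moved to the last position so it can only be read as a literal. The helper names `isQuotedBody` and `compiles` are hypothetical. Whether a mid-set hyphen such as `[\w- ]` actually throws appears to vary between standard library implementations, so `compiles` is offered as a way to probe a given toolchain rather than as a statement of what it will report.

```cpp
#include <regex>
#include <string>

// Hyphen placed last in the bracket expression: word characters, space, or '-'.
bool isQuotedBody(const std::string& text) {
    static const std::regex body(R"([\w -]+)");
    return std::regex_match(text, body);
}

// Reports whether the local <regex> implementation accepts a pattern at all.
bool compiles(const std::string& pattern) {
    try {
        std::regex re(pattern);
        return true;
    } catch (const std::regex_error&) {
        return false;
    }
}
```

Running `compiles("[\\w- ]")` on a particular compiler shows how that implementation treats the ambiguous form.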
Stephen M. Webb
Professional Free Software Developer

### #7Servant of the Lord  Crossbones+   -  Reputation: 12496

Posted 24 February 2012 - 04:18 PM

> By that I mean [\w- ] means "match any character between [:alnum:] and the space in the current collating order" which makes no sense to me. I would expect it to throw std::regex_error with code() == std::regex_error::error_range.

You're right, it's throwing. I've been testing here for learning. Fixed the hyphen issue, but now it's throwing std::regex_constants::error_escape, which apparently means I have a trailing escape or an invalid escape code.

Well, that shouldn't be too hard to find, right?

(\w+)\s*\(\s*(?:([\w-]+|"[\w"-]+")\s*)?(?:,\s*([\w-]+|"[\w"-]+")\s*)?(?:,\s*([\w-]+|"[\w"-]+")\s*)?(?:,\s*([\w-]+|"[\w"-]+")\s*)?(?:,\s*([\w-]+|"[\w"-]+")\s*)?\)

Anyway, I trimmed it down, and even a regex like \w or \s throws the error. This makes me believe that \w or \s is different in C++ from the Javascript ones. I can find so little info on the C++ regex engine, so I don't know what special characters C++ RegEx uses.

Are there any C++11 regex reference sheets available? I can only find perl and javascript stuff.

Even having expressions like: [a-z] or even [a] throws exceptions. (In those two cases, it throws: "error_brack")

The only expression that hasn't thrown so far is just " a " on its own... And that fails to properly match a string of also only a single 'a'. Weird.


### #8Servant of the Lord  Crossbones+   -  Reputation: 12496

Posted 24 February 2012 - 04:40 PM

Anyone want to give this a shot, and see if it works for them?

    #include <iostream>
    #include <regex>
    #include <string>

    int main()
    {
        std::string stringToSearch = "Book";
        std::string expression = "B";
        try
        {
            std::regex regex(expression, std::regex_constants::ECMAScript);
            std::smatch results;
            if(std::regex_search(stringToSearch, results, regex))
            {
                std::cout << "The regex matched! Results:" << std::endl;
                int num = 0;
                for(const auto &result : results)
                {
                    // The index is streamed directly, so no IntToString helper is needed.
                    std::cout << "\tSubmatch" << num++ << ": " << result.str() << std::endl;
                }
            }
            else
            {
                std::cout << "The regex didn't match!"
                          << "\nRegex: " << expression
                          << "\nString: " << stringToSearch << std::endl;
            }
        }
        catch(const std::regex_error &err)
        {
            // Translate the error code into its name.
            if(err.code() == std::regex_constants::error_collate)
                std::cout << "Error: error_collate" << std::endl;
            else if(err.code() == std::regex_constants::error_stack)
                std::cout << "Error: error_stack" << std::endl;
            else if(err.code() == std::regex_constants::error_complexity)
                std::cout << "Error: error_complexity" << std::endl;
            else if(err.code() == std::regex_constants::error_badrepeat)
                std::cout << "Error: error_badrepeat" << std::endl;
            else if(err.code() == std::regex_constants::error_space)
                std::cout << "Error: error_space" << std::endl;
            else if(err.code() == std::regex_constants::error_range)
                std::cout << "Error: error_range" << std::endl;
            else if(err.code() == std::regex_constants::error_badbrace)
                std::cout << "Error: error_badbrace" << std::endl;
            else if(err.code() == std::regex_constants::error_brace)
                std::cout << "Error: error_brace" << std::endl;
            else if(err.code() == std::regex_constants::error_paren)
                std::cout << "Error: error_paren" << std::endl;
            else if(err.code() == std::regex_constants::error_brack)
                std::cout << "Error: error_brack" << std::endl;
            else if(err.code() == std::regex_constants::error_backref)
                std::cout << "Error: error_backref" << std::endl;
            else if(err.code() == std::regex_constants::error_escape)
                std::cout << "Error: error_escape" << std::endl;
            else if(err.code() == std::regex_constants::error_ctype)
                std::cout << "Error: error_ctype" << std::endl;
            else
                std::cout << "Error: [UNKNOWN]" << std::endl;
        }
        return 0;
    }


My output:

The regex didn't match!
Regex: B
String: Book


### #9Servant of the Lord  Crossbones+   -  Reputation: 12496

Posted 24 February 2012 - 05:02 PM

Found the problem. Here is GCC's version of the function I'm using:
template<...>
inline bool regex_search(...)
{
return false;
}

Well, no cookies for me today. Apparently the Regex portion of the C++11 standard library hasn't yet been implemented in GCC (and thus, MinGW), but the functions and classes exist and have stub definitions, so compiling gives you no errors.

Thanks for all your help. At least I'm learning RegEx.


### #10Bregma  Crossbones+   -  Reputation: 3444

Posted 24 February 2012 - 05:38 PM

> Found the problem. Here is GCC's version of the function I'm using:
>
> template<...>
> inline bool regex_search(...)
> {
> return false;
> }
>
> Well, no cookies for me today. Apparently the Regex portion of the C++11 standard library hasn't yet been implemented in GCC (and thus, MinGW), but the functions and classes exist and have stub definitions, so compiling gives you no errors.

Yeah, sorry. My day job has been sucking all my productivity after I got about half-way through the implementation. If only I could convince my employer it would be worthwhile to pay me for 6 months to finish it, the world would be a much better place.

### #11Servant of the Lord  Crossbones+   -  Reputation: 12496

Posted 24 February 2012 - 06:01 PM

You're the one implementing the GCC Regex library? Well, completed or not, thank you for having moved it forward! As a less experienced developer, it is invaluable that tools and compilers and IDEs exist that allow one to get started programming quickly and smoothly.


### #12Washu  Senior Moderators   -  Reputation: 3693

Posted 24 February 2012 - 08:21 PM

erm, so use boost. It works fine...

although, to be honest, a simple parser would probably be quicker to implement.
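For comparison, a hand-written scanner for the same format might look roughly like this. It is only a sketch under the rules stated earlier in the thread (names and bare arguments made of word characters and hyphens, quoted arguments allowed to contain anything up to the closing quote); the type `Call` and the function `parseCallByHand` are invented names.

```cpp
#include <cctype>
#include <string>
#include <vector>

struct Call {
    std::string name;
    std::vector<std::string> args;
};

// Scans: name ( arg , "quoted arg" , ... )   Returns false on malformed input.
bool parseCallByHand(const std::string& s, Call& out) {
    std::size_t i = 0;
    auto skipWs = [&] {
        while (i < s.size() && std::isspace(static_cast<unsigned char>(s[i]))) ++i;
    };
    auto isWord = [](char c) {
        return std::isalnum(static_cast<unsigned char>(c)) || c == '_' || c == '-';
    };

    out = Call{};
    skipWs();
    while (i < s.size() && isWord(s[i])) out.name += s[i++];
    skipWs();
    if (out.name.empty() || i >= s.size() || s[i] != '(') return false;
    ++i;
    skipWs();
    if (i < s.size() && s[i] == ')') {  // call with no arguments
        ++i;
        skipWs();
        return i == s.size();
    }

    while (true) {
        skipWs();
        std::string arg;
        if (i < s.size() && s[i] == '"') {
            ++i;  // skip the opening quote
            while (i < s.size() && s[i] != '"') arg += s[i++];
            if (i >= s.size()) return false;  // unterminated quote
            ++i;  // skip the closing quote
        } else {
            while (i < s.size() && isWord(s[i])) arg += s[i++];
            if (arg.empty()) return false;  // missing argument
        }
        out.args.push_back(arg);
        skipWs();
        if (i >= s.size()) return false;  // missing ')'
        if (s[i] == ',') { ++i; continue; }
        if (s[i] == ')') { ++i; skipWs(); return i == s.size(); }
        return false;  // unexpected character
    }
}
```

Unlike the fixed five-argument regex, this accepts any number of arguments and returns them without their quotes.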

In time the project grows, the ignorance of its devs it shows, with many a convoluted function, it plunges into deep compunction, the price of failure is high, Washu's mirth is nigh.
ScapeCode - Blog | SlimDX

### #13Servant of the Lord  Crossbones+   -  Reputation: 12496

Posted 24 February 2012 - 09:04 PM

> erm, so use boost. It works fine...

Yep, I'm aware of Boost.Regex.

> although, to be honest, a simple parser would probably be quicker to implement.

Certainly, I did so immediately after I found out about GCC's regex. It only took about 15 minutes.
But it is a very good, yet still simple enough, non-contrived problem to get my feet wet in Regex.

