#1 Marketplace Seller - Reputation: 1413
Posted 23 February 2012 - 05:20 PM
I'm trying to solve a text parsing problem using C++11's <regex> library. This is somewhat unrelated to C++, however. This is mostly a Regex question, with only a few minor C++ twists. I could solve this problem easily using other C++ methods, but I figure knowing RegEx is a very important skill that I'm currently lacking.
The text I'm trying to parse has this format: (with a varying number of parameters - but for now, I'm assuming it has at least one)
functionName("blah", "Foo", Stuff)
Quotation marks on the parameters are optional, so I need to accept both cases.
I want to get 'functionName' as one string, and each parameter as another string.
Here's what I'm trying to do: (Code highlighted to match my _expectation_ of where it should line up on the format above).
[\w-]+\((("[\w- ]*")|([\w-]+))(, *(("[\w- ]*")|([\w-]+)))*\)
I've been using this site to test, and it's not matching as I'm thinking it should.
Here's what I'm thinking each part of this code does.
[\w-]+
Function name. Alphanumeric (plus underscore, since \w includes it), and includes hyphens.
Should match the "functionName" part of: functionName("blah", "Foo", Stuff)
\(
Opening function bracket.
Should match the opening "(" of: functionName("blah", "Foo", Stuff)
(("[\w- ]*")|([\w-]+))
First function parameter.
Should match the first argument, "blah" (quotes included), of: functionName("blah", "Foo", Stuff)
Look at it like this:
( ("[\w- ]*") | ([\w-]+) )
First half: <quotation-mark> ZeroOrMore:(alphaNumerical OR hyphen OR space) <quotation-mark>
Second half: OneOrMore:(alphaNumerical OR hyphen)
The second half should match the same as the first half, but without quotes and without spaces (because arguments should only have spaces within quotes).
(, *(("[\w- ]*")|([\w-]+)))*
This is exactly the same as the previous part, except that it is prefixed with a comma and optional spaces. This entire sub-expression occurs zero or more times.
Should match the remaining arguments, each with its leading comma (, "Foo", Stuff), of: functionName("blah", "Foo", Stuff)
\)
Closing function bracket.
Should match the closing ")" of: functionName("blah", "Foo", Stuff)
=============================================
My questions are several:
1) What am I overlooking in the above expression? Why doesn't the expression 'match' the example input?
2) Assuming it did match, how do I 'pull out' or retrieve the different parts I want? I know how to do this on C++'s side of it, but I don't know how to specify in the expression itself which parts are the parts I want (the function name, and each argument), versus which parts to discard.
3) How do I repeat a sub-expression, so I don't have to copy + paste it multiple times into the expression? (Example: I have the second+ arguments as a copy and paste of the first argument... how do I avoid that?)
On David's throne the King will reign, and the Government will rest upon His shoulders. All the earth will see the salvation of God.
#2 Members - Reputation: 295
Posted 23 February 2012 - 07:41 PM
I highly recommend you read these tutorials on regular expressions, specifically, the article on grouping and backreferences.
You're on the right track, but here's how I would do it. Create one regular expression that captures functionName in one group and parameters in a second group. By capturing the parameters with the regular expression you are assured that they are correct. Once this is done you can take the results from the parameters group and parse them any way you like e.g. split the string on commas.
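A minimal sketch of the two-group approach described above, assuming a standard library with a working <regex>. The names ParsedCall and parseCall are hypothetical, and the argument grammar is borrowed from the first post, with the hyphen moved to the end of each bracket expression. Group 1 captures the function name, group 2 captures the validated parameter list, and the list is then split on commas:

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical result type, not something from the thread.
struct ParsedCall {
    std::string name;
    std::vector<std::string> params;
};

// Returns true and fills `out` when `text` has the form name(arg, "arg", ...).
bool parseCall(const std::string& text, ParsedCall& out) {
    // One argument: a quoted string or a bare word (non-capturing).
    const std::string arg = R"re((?:"[\w -]*"|[\w-]+))re";
    // Group 1: function name. Group 2: the whole comma-separated list.
    const std::regex callRe("([\\w-]+)\\((" + arg + "(?:, *" + arg + ")*)\\)");

    std::smatch m;
    if (!std::regex_match(text, m, callRe)) return false;

    out.name = m[1].str();
    out.params.clear();

    // Split the captured list on "comma plus optional spaces".
    const std::string list = m[2].str();
    const std::regex sep(", *");
    for (std::sregex_token_iterator it(list.begin(), list.end(), sep, -1), end;
         it != end; ++it) {
        out.params.push_back(it->str());
    }
    return true;
}
```

With this sketch the quotation marks stay on each parameter: parsing functionName("blah", "Foo", Stuff) gives the name functionName and the parameters "blah", "Foo" and Stuff. Splitting on commas is only safe because the grammar does not allow commas inside quotes.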
"When men are most sure and arrogant they are commonly most mistaken, giving views to passion without that proper deliberation which alone can secure them from the grossest absurdities." - David Hume
#3 Marketplace Seller - Reputation: 1413
Posted 23 February 2012 - 09:09 PM
DenzelM, on 23 February 2012 - 07:41 PM, said:
The only reason I thought it might have been possible to capture more than one, uh, "capture groups", was because over here, someone appeared to have done so. He seems to have captured '2' and 'Egg prices' with a single expression, with different parts of his one expression capturing the different matches. I'm probably just misunderstanding what he did there, but that was the reason why I thought that possible.
[Edit:] Ah, that'd be that backreference thing you were talking about. I think I get that.
#4 Members - Reputation: 295
Posted 23 February 2012 - 09:31 PM
Servant of the Lord, on 23 February 2012 - 09:09 PM, said:
Look at his regular expression. Notice there are two sets of matching parentheses, therefore two backreferences are created. His regular expression will always capture two groups; never more, never less.
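A small sketch of that point, assuming a working std::regex. The patterns here are made up for illustration (the expression from the linked post is not reproduced in this thread), and groupCount and lastRepeatedCapture are hypothetical helper names. The number of groups is fixed by the parentheses in the pattern, and a group placed inside a repetition keeps only its last repetition:

```cpp
#include <regex>
#include <string>

// Number of capture groups the pattern defines; fixed once the pattern is compiled.
unsigned groupCount(const std::string& pattern) {
    return std::regex(pattern).mark_count();
}

// Matches `(\w+)(?:,(\w+))*` against `text` and returns what group 2 holds.
// Group 2 sits inside a repeated sub-expression, so only the last repetition survives.
std::string lastRepeatedCapture(const std::string& text) {
    static const std::regex re(R"re((\w+)(?:,(\w+))*)re");
    std::smatch m;
    if (!std::regex_match(text, m, re)) return "";
    return m[2].str();
}
```

For example, groupCount("(\\d+) (\\w+ \\w+)") is 2 no matter what text is searched, and lastRepeatedCapture("a,b,c") returns only c. This is why a single repeated group cannot hand back every argument of a variable-length list.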
#5 Marketplace Seller - Reputation: 1413
Posted 24 February 2012 - 01:38 PM
(\w+)\s*\(
\s*(?: ([\w-]+|"[\w- "]+")\s*)?
(?:,\s*([\w-]+|"[\w- "]+")\s*)?
(?:,\s*([\w-]+|"[\w- "]+")\s*)?
(?:,\s*([\w-]+|"[\w- "]+")\s*)?
(?:,\s*([\w-]+|"[\w- "]+")\s*)?
\)
(The newlines were added for legibility, but are not in the actual regex)
It matches something like this, just fine:
functionName("arg1", arg2,arg3, "arg4" , "arg five")
And matches zero to five arguments.
I grab the function name as the first capture group, and each (optional) argument as different capture group.
There are two things I've tried to do, that I can't figure out how:
1) How can I capture the arguments without capturing the quotes themselves? I tried using conditionals, but I'd accidentally end up with more capture groups than I wanted (Two per argument - one invalid and one valid).
2) The last four capture groups are completely identical - I just copied and pasted them. Is there a way to say, "Use this same expression" instead of "Use the result of this expression (back referencing)"?
Another question is, is there any way to make regular expressions more legible by spacing out the characters, or something? Or is there a good free RegEx editor (with syntax highlighting) you'd recommend?
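One possible way to approach questions 1 and 2, sketched under assumptions rather than offered as the definitive answer. The ECMAScript grammar that std::regex uses by default has no "reuse this sub-expression" construct, so the sketch reuses the argument pattern on the C++ side by keeping it in a string. It does still use two groups per argument (quoted or bare), and the C++ code picks whichever one matched, which keeps the quotes out of the result. The names kArg, extractArgs and buildCallPattern are hypothetical:

```cpp
#include <regex>
#include <string>
#include <vector>

// One argument: quoted text (group 1, quotes excluded) or a bare word (group 2).
const std::string kArg = R"re((?:"([\w -]*)"|([\w-]+)))re";

// Pulls every argument out of a list such as: "arg1", arg2,arg3, "arg five"
std::vector<std::string> extractArgs(const std::string& argList) {
    static const std::regex argRe(kArg);
    std::vector<std::string> args;
    for (std::sregex_iterator it(argList.begin(), argList.end(), argRe), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        // Exactly one of the two groups took part in this match.
        args.push_back(m[1].matched ? m[1].str() : m[2].str());
    }
    return args;
}

// Builds "name( up to maxArgs optional arguments )" by repeating kArg in code
// instead of copying and pasting it into the pattern by hand.
std::string buildCallPattern(int maxArgs) {
    std::string p = "(\\w+)\\s*\\(\\s*";
    for (int i = 0; i < maxArgs; ++i) {
        p += (i == 0) ? "(?:" : "(?:,\\s*";
        p += kArg;
        p += "\\s*)?";
    }
    p += "\\)";
    return p;
}
```

With buildCallPattern(5), group 1 is the function name and each argument owns a pair of groups (2 and 3 for the first argument, 4 and 5 for the second, and so on); only one group of each pair is set. extractArgs avoids the fixed upper limit altogether, at the cost of not validating the commas between arguments.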
#6 Members - Reputation: 1217
Posted 24 February 2012 - 02:02 PM
Servant of the Lord, on 24 February 2012 - 01:38 PM, said:
BTW, you need to be careful when using a hyphen in a subset expression. If the hyphen appears anywhere but the first or last position, it indicates a subset range, and does not match explicitly. By that I mean [\w- ] means "match any character between [:alnum:] and the space in the current collating order" which makes no sense to me. I would expect it to throw std::regex_error with code() == std::regex_error::error_range.
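A short sketch of the safer spelling, assuming a working std::regex: with the hyphen placed last in the bracket expression it can only be read as a literal '-'. The function name isQuotedArg is hypothetical.

```cpp
#include <regex>
#include <string>

// True when `s` is a double-quoted argument made of word characters,
// spaces and hyphens, e.g. "arg five" or "a-b" (quotes included in `s`).
bool isQuotedArg(const std::string& s) {
    // Hyphen is the last character inside [...], so no range is implied.
    static const std::regex re(R"re("[\w -]*")re");
    return std::regex_match(s, re);
}
```

Escaping the hyphen as \- inside the brackets is another common way to get the same effect.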
Professional Free Software Developer
#7 Marketplace Seller - Reputation: 1413
Posted 24 February 2012 - 04:18 PM
Bregma, on 24 February 2012 - 02:02 PM, said:
Well, that shouldn't be too hard to find, right?
(\w+)\s*\(\s*(?:([\w-]+|"[\w"-]+")\s*)?(?:,\s*([\w-]+|"[\w"-]+")\s*)?(?:,\s*([\w-]+|"[\w"-]+")\s*)?(?:,\s*([\w-]+|"[\w"-]+")\s*)?(?:,\s*([\w-]+|"[\w"-]+")\s*)?\)
Anyway, I trimmed it down, and even a regex like \w or \s throws the error. This makes me believe that \w or \s is different in C++ from the JavaScript ones. I can find so little info on the C++ regex engine, so I don't know what special characters C++ RegEx uses.
Are there any C++11 regex reference sheets available? I can only find Perl and JavaScript stuff.
Even having expressions like: [a-z] or even [a] throws exceptions. (In those two cases, it throws: "error_brack")
The only expression that hasn't thrown so far is just " a " on its own... And that fails to properly match a string of also only a single 'a'. Weird.
#8 Marketplace Seller - Reputation: 1413
Posted 24 February 2012 - 04:40 PM
#include <iostream>
#include <regex>
#include <string>

int main()
{
    std::string stringToSearch = "Book";
    std::string expression = "B";
    try
    {
        std::regex regex(expression, std::regex_constants::ECMAScript);
        std::smatch results;
        if(std::regex_search(stringToSearch, results, regex))
        {
            std::cout << "The regex matched! Results:" << std::endl;
            int num = 0;
            for(const auto &result : results)
            {
                // Submatch 0 is the whole match; 1 and up are the capture groups.
                std::cout << "\tSubmatch" << num++ << ": " << result.str() << std::endl;
            }
        }
        else
        {
            std::cout << "The regex didn't match!"
                      << "\nRegex: " << expression
                      << "\nString: " << stringToSearch << std::endl;
        }
    }
    catch(const std::regex_error &err)
    {
        // Translate the error code into the name of its constant.
        if(err.code() == std::regex_constants::error_collate)
            std::cout << "Error: error_collate" << std::endl;
        else if(err.code() == std::regex_constants::error_stack)
            std::cout << "Error: error_stack" << std::endl;
        else if(err.code() == std::regex_constants::error_complexity)
            std::cout << "Error: error_complexity" << std::endl;
        else if(err.code() == std::regex_constants::error_badrepeat)
            std::cout << "Error: error_badrepeat" << std::endl;
        else if(err.code() == std::regex_constants::error_space)
            std::cout << "Error: error_space" << std::endl;
        else if(err.code() == std::regex_constants::error_range)
            std::cout << "Error: error_range" << std::endl;
        else if(err.code() == std::regex_constants::error_badbrace)
            std::cout << "Error: error_badbrace" << std::endl;
        else if(err.code() == std::regex_constants::error_brace)
            std::cout << "Error: error_brace" << std::endl;
        else if(err.code() == std::regex_constants::error_paren)
            std::cout << "Error: error_paren" << std::endl;
        else if(err.code() == std::regex_constants::error_brack)
            std::cout << "Error: error_brack" << std::endl;
        else if(err.code() == std::regex_constants::error_backref)
            std::cout << "Error: error_backref" << std::endl;
        else if(err.code() == std::regex_constants::error_escape)
            std::cout << "Error: error_escape" << std::endl;
        else if(err.code() == std::regex_constants::error_ctype)
            std::cout << "Error: error_ctype" << std::endl;
        else
            std::cout << "Error: [UNKNOWN]" << std::endl;
    }
    return 0;
}
My output:
The regex didn't match!
Regex: B
String: Book
#9 Marketplace Seller - Reputation: 1413
Posted 24 February 2012 - 05:02 PM
template<...>
inline bool regex_search(...)
{
return false;
}
Well, no cookies for me today. Apparently the Regex portion of the C++11 standard library hasn't yet been implemented in GCC (and thus, MinGW), but the functions and classes exist and have stub definitions, so compiling gives you no errors.
Thanks for all your help. At least I'm learning RegEx.
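Since stubbed-out functions like the one above compile without complaint, one way to notice the problem early is a tiny runtime sanity check. This is only a sketch, and regexLooksImplemented is a hypothetical name. As far as I know, libstdc++ gained a working <regex> with GCC 4.9, so on that version or later the check should pass.

```cpp
#include <regex>
#include <string>

// Returns true when the library's regex engine finds an obvious match.
// A stub that always returns false, or one that throws on any pattern,
// makes this report false.
bool regexLooksImplemented() {
    try {
        const std::string text = "Book";
        const std::regex re("B");
        return std::regex_search(text, re);
    } catch (const std::regex_error&) {
        return false;
    }
}
```

Calling this once at startup (or in a unit test) separates "my expression is wrong" from "the library cannot match anything yet".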
#10 Members - Reputation: 1217
Posted 24 February 2012 - 05:38 PM
Servant of the Lord, on 24 February 2012 - 05:02 PM, said:
#11 Marketplace Seller - Reputation: 1413
Posted 24 February 2012 - 06:01 PM
#12 Senior Moderators - Reputation: 2448
Posted 24 February 2012 - 08:21 PM
although, to be honest, a simple parser would probably be quicker to implement.
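For comparison, a hand-written parser for the same format might look roughly like the sketch below. It is an illustration under the thread's stated grammar (word characters and hyphens for names and bare arguments, anything up to the closing quote for quoted ones), not code from any poster; Call and parseCallByHand are hypothetical names.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

struct Call {
    std::string name;
    std::vector<std::string> args;
};

namespace {
bool isWordChar(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) || c == '_' || c == '-';
}
void skipSpaces(const std::string& s, std::size_t& i) {
    while (i < s.size() && s[i] == ' ') ++i;
}
}  // namespace

// Parses name(arg, "quoted arg", ...); returns false on malformed input.
bool parseCallByHand(const std::string& s, Call& out) {
    std::size_t i = 0;
    out = Call{};
    while (i < s.size() && isWordChar(s[i])) out.name += s[i++];
    if (out.name.empty() || i >= s.size() || s[i++] != '(') return false;

    skipSpaces(s, i);
    if (i < s.size() && s[i] == ')') return i + 1 == s.size();  // zero arguments

    while (true) {
        skipSpaces(s, i);
        std::string arg;
        if (i < s.size() && s[i] == '"') {
            ++i;  // opening quote is not stored
            while (i < s.size() && s[i] != '"') arg += s[i++];
            if (i >= s.size()) return false;  // unterminated quote
            ++i;  // closing quote is not stored
        } else {
            while (i < s.size() && isWordChar(s[i])) arg += s[i++];
            if (arg.empty()) return false;  // expected an argument here
        }
        out.args.push_back(arg);

        skipSpaces(s, i);
        if (i >= s.size()) return false;            // missing ')'
        if (s[i] == ')') return i + 1 == s.size();  // nothing may follow ')'
        if (s[i++] != ',') return false;            // arguments need a comma between them
    }
}
```

Unlike the fixed-size regex, this handles any number of arguments and returns them without their quotes, for example the five arguments of functionName("arg1", arg2,arg3, "arg4" , "arg five").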
ScapeCode - Blog | SlimDX
#13 Marketplace Seller - Reputation: 1413
Posted 24 February 2012 - 09:04 PM
Washu, on 24 February 2012 - 08:21 PM, said:
But it is a very good, yet still simple enough, non-contrived problem to get my feet wet in Regex.


















