boost::regex and boost::match_results

This topic is 3593 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.


Hi, I've been playing with boost::regex, and to learn it, I've been trying to parse simple XML nodes of the form: <nodeName attrib1="value1" attrib2="value2" ... /> This is the piece of code I've come up with:
boost::regex node_regex("\\s*<([[:alpha:]_]+\\w*)\\s*(([[:alpha:]_]+\\w*)=\"([^\"]+)\"\\s+)*/>.*");
boost::match_results< std::string::const_iterator > matches;
if (boost::regex_search(line, matches, node_regex))
{
    std::cout << "size() : " << matches.size() << "\n";
    for (boost::match_results< std::string::const_iterator >::const_iterator i = matches.begin(); i != matches.end(); ++i)
    {
        if (i->matched) std::cout << "matches :       [" << i->str() << "]\n";
        else            std::cout << "doesn't match : [" << i->str() << "]\n";
    }
}
First of all, I'm pretty sure this regex is NOT "safe". For example, it requires whitespace after each attribute, so I can't have: <node attrib="value"/> But my main problem is that it only outputs 5 lines. The first is the complete node match, the second is the node name (the word right after the opening <), and the last 3 are the matches for the last attribute only. For example:
// input :
<element attr1="value1" attr2="value2" attr3="value3" />

// output:
size() : 5
matches :  [<element attr1="value1" attr2="value2" attr3="value3" />]
matches :  [element]
matches :  [attr3="value3" ]
matches :  [attr3]
matches :  [value3]
Why don't I get the other attributes? Forgive me if my questions are a little trivial, I'm quite new to regular expressions, and a complete newbie with boost :) Thx in advance for any help!
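
On the whitespace problem mentioned above: one possible fix is to require the whitespace before each attribute instead of after it, and to allow optional whitespace before the closing />. A minimal sketch, assuming std::regex (C++11) in place of boost::regex so that it builds without Boost; the constructs used here should behave the same in both. The function name is_simple_node is made up for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `line` is a single self-closing node such as
// <node/>, <node attrib="value"/> or <node a="1" b="2" />.
// Assumption: std::regex stands in for boost::regex.
bool is_simple_node(const std::string& line)
{
    static const std::regex node_regex(
        "\\s*<([[:alpha:]_]\\w*)"                // node name
        "(?:\\s+[[:alpha:]_]\\w*=\"[^\"]+\")*"   // zero or more  name="value", each preceded by whitespace
        "\\s*/>\\s*");                           // optional whitespace, then the closing />
    return std::regex_match(line, node_regex);
}
```

This only checks the shape of the node. It still does not hand back every attribute, because a repeated group keeps only its last iteration.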

ok, I know I should wait 24Hrs before bumping this, but I'm surprised nobody answered, and I really need an answer to continue.

Let's take a simpler example :

std::string str("aaa aaa aaa ");
boost::regex e("(aaa )+"); // this doesn't do what I'd like to

I'd like to know if it's possible to get a boost::match_results which would contain something like :
m[0] = "aaa aaa aaa ";
m[1] = "aaa ";
m[2] = "aaa ";
m[3] = "aaa ";

?? If it's not possible, if I'm completely wrong in my understanding of regular expressions and / or boost::regex, could you please tell me ? Point me to a good article / documentation ? (I've read, and re-read the boost::regex doc, but it's not really noob' friendly :))

Thx !

I think you're on the right track. This seems to be related to how boost regex works (this may be universal, I'm not sure). As far as I can tell, the part intended to match the attributes IS matching all the attributes, but it's only keeping the last match.

If you're familiar with other regex flavors, you'll know that you can use "backreferences", written either $n or \n, where n is an integer, generally from 1 to the number of parenthesized groups.

So in, say, sed:

\0 refers to the entire matched text
\1 refers to the first group (the first parenthesis, counting from left to right, i.e. element)
\2 refers to the second parenthesis, counting from left to right
\3 etc
\4 etc

I guess what I'm trying to get at here is that a repeated group doesn't get one slot per repetition, so the behavior is just to keep the last match (subsequent matches overwrite previous ones).

One answer might be to "advance" through the string, from one match to the next, by hand, though I suspect there may be a better way (I'd hope so).
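
To make the "last match wins" behavior concrete, here is a small sketch. It assumes std::regex (C++11) rather than boost::regex so that it builds without Boost; the default behavior of boost::regex for a repeated group matches what the original post observed. The helper name sub_matches is made up for illustration.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns every sub-match of `pattern` against the whole
// of `text` ($0, $1, ...), or an empty vector if there is no match.
// Assumption: std::regex stands in for boost::regex.
std::vector<std::string> sub_matches(const std::string& text, const std::string& pattern)
{
    const std::regex e(pattern);
    std::smatch m;
    std::vector<std::string> out;
    if (std::regex_match(text, m, e))
    {
        // One slot per parenthesized group, no matter how often it repeated.
        for (std::size_t i = 0; i < m.size(); ++i)
            out.push_back(m[i].str());
    }
    return out;
}
```

Calling sub_matches("one two three ", "(\\w+ )+") gives two entries: the whole string and "three ". The group matched three times, but only the final iteration is stored.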

HAH! There is a way

Quote:

Repeated Captures

When a marked sub-expression is repeated, then the sub-expression gets "captured" multiple times, however normally only the final capture is available, for example if

(?:(\w+)\W+)+

is matched against

one fine day

Then $1 will contain the string "day", and all the previous captures will have been forgotten.

However, Boost.Regex has an experimental feature that allows all the capture information to be retained - this is accessed either via the match_results::captures member function or the sub_match::captures member function. These functions return a container that contains a sequence of all the captures obtained during the regular expression matching. The following example program shows how this information may be used:

#include <boost/regex.hpp>
#include <iostream>


void print_captures(const std::string& regx, const std::string& text)
{
    boost::regex e(regx);
    boost::smatch what;
    std::cout << "Expression: \"" << regx << "\"\n";
    std::cout << "Text: \"" << text << "\"\n";
    if(boost::regex_match(text, what, e, boost::match_extra))
    {
        unsigned i, j;
        std::cout << "** Match found **\n Sub-Expressions:\n";
        for(i = 0; i < what.size(); ++i)
            std::cout << " $" << i << " = \"" << what[i] << "\"\n";
        std::cout << " Captures:\n";
        for(i = 0; i < what.size(); ++i)
        {
            std::cout << " $" << i << " = {";
            for(j = 0; j < what.captures(i).size(); ++j)
            {
                if(j)
                    std::cout << ", ";
                else
                    std::cout << " ";
                std::cout << "\"" << what.captures(i)[j] << "\"";
            }
            std::cout << " }\n";
        }
    }
    else
    {
        std::cout << "** No Match found **\n";
    }
}

int main(int , char* [])
{
    print_captures("(([[:lower:]]+)|([[:upper:]]+))+", "aBBcccDDDDDeeeeeeee");
    print_captures("(.*)bar|(.*)bah", "abcbar");
    print_captures("(.*)bar|(.*)bah", "abcbah");
    print_captures("^(?:(\\w+)|(?>\\W+))*$", "now is the time for all good men to come to the aid of the party");
    return 0;
}

Which produces the following output:

Expression: "(([[:lower:]]+)|([[:upper:]]+))+"
Text: "aBBcccDDDDDeeeeeeee"
** Match found **
Sub-Expressions:
$0 = "aBBcccDDDDDeeeeeeee"
$1 = "eeeeeeee"
$2 = "eeeeeeee"
$3 = "DDDDD"
Captures:
$0 = { "aBBcccDDDDDeeeeeeee" }
$1 = { "a", "BB", "ccc", "DDDDD", "eeeeeeee" }
$2 = { "a", "ccc", "eeeeeeee" }
$3 = { "BB", "DDDDD" }
Expression: "(.*)bar|(.*)bah"
Text: "abcbar"
** Match found **
Sub-Expressions:
$0 = "abcbar"
$1 = "abc"
$2 = ""
Captures:
$0 = { "abcbar" }
$1 = { "abc" }
$2 = { }
Expression: "(.*)bar|(.*)bah"
Text: "abcbah"
** Match found **
Sub-Expressions:
$0 = "abcbah"
$1 = ""
$2 = "abc"
Captures:
$0 = { "abcbah" }
$1 = { }
$2 = { "abc" }
Expression: "^(?:(\w+)|(?>\W+))*$"
Text: "now is the time for all good men to come to the aid of the party"
** Match found **
Sub-Expressions:
$0 = "now is the time for all good men to come to the aid of the party"
$1 = "party"
Captures:
$0 = { "now is the time for all good men to come to the aid of the party" }
$1 = { "now", "is", "the", "time", "for", "all", "good", "men", "to", "come", "to", "the", "aid", "of", "the", "party" }

Unfortunately enabling this feature has an impact on performance (even if you don't use it), and a much bigger impact if you do use it, therefore to use this feature you need to:

* Define BOOST_REGEX_MATCH_EXTRA for all translation units including the library source (the best way to do this is to uncomment this define in boost/regex/user.hpp and then rebuild everything).
* Pass the match_extra flag to the particular algorithms where you actually need the captures information (regex_search, regex_match, or regex_iterator).


from: http://www.boost.org/libs/regex/doc/captures.html

Ademan555> thx... now I remember seeing this example (but at the very beginning, when I was still trying to understand regex, so I ignored it :)
But "has an impact on performance (even if you don't use it)" ...

Another way around that: can you tell boost::regex_match to match only the FIRST occurrence? So that I could just "advance" through my string?

And another question about boost. Isn't there a way to do something like :

boost::regex word("[[:alpha:]_]+\\w*");
boost::regex attribute = word + "=\"([^\"]+)\"";

Don't focus on the meaning of the regex. Just want to know if you can define simple regex, and combine them ? It would be reeeeaaally nice if you could, would make the code a loooot more human readable !!

Thx for the answers so far :)

Sorry I had gone to sleep (it was way into the morning hours when i originally posted lol). As far as I know what you described isn't possible :-/

You could do something like this though:


std::string word = "[[:alpha:]_]+\\w*";

boost::regex attribute(word + "=\"([^\"]+)\"");



Of course that's not as convenient.
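
The string-building approach can be pushed a bit further by naming each fragment and assembling the final pattern once. A rough sketch, assuming std::regex (C++11) in place of boost::regex so that it builds without Boost; the names word, quoted, attribute and parse_attribute are made up for illustration.

```cpp
#include <regex>
#include <string>
#include <utility>

// Named pattern fragments, kept as plain strings so they can be concatenated.
const std::string word      = "[[:alpha:]_]\\w*";          // identifier
const std::string quoted    = "\"([^\"]+)\"";              // "value", capturing the value
const std::string attribute = "(" + word + ")=" + quoted;  // name="value"

// Hypothetical helper: splits  name="value"  into {name, value}.
// Returns a pair of empty strings if `text` is not a single attribute.
// Assumption: std::regex stands in for boost::regex.
std::pair<std::string, std::string> parse_attribute(const std::string& text)
{
    static const std::regex e(attribute);
    std::smatch m;
    if (std::regex_match(text, m, e))
        return std::make_pair(m[1].str(), m[2].str());
    return std::make_pair(std::string(), std::string());
}
```

Wrapping a fragment in parentheses at the point of use, as done for word above, keeps the fragment itself free of capture groups, so group numbers stay easy to predict when fragments are reused.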

As far as matching only the first occurrence, I briefly looked through the docs and unfortunately didn't really see anything of the sort.

Although if you think about it, I can't imagine that if you hand wrote your own system for iterating through the matches, that it would be faster than what boost came up with, they're generally pretty good about optimizing things (and even if it's not too fast now, it could get way faster without you having to change the code)

So really my recommendation is to use their multiple matches system.

Hi,
Thx for the answer. Well, for iterating through the matches, I've found another solution: I use boost::regex_search, which stops at the first match. Then I take the end of that match and search only the remaining string. Something like:

// str is the input string to test; matches is the match_results object
std::string::const_iterator start = str.begin();
const std::string::const_iterator end = str.end();

// regex_search stops at the first match in [start, end).
// match_partial is left out: with it, regex_search also returns true
// for a partial match at the end of the input.
while (boost::regex_search(start, end, matches, regex))
{
    // do something with the match

    // resume right after the end of this match
    // (no "+ 1", which would skip one character of input)
    start = matches[0].second;

    // an empty match would not advance, so step over one character
    if (matches[0].length() == 0)
    {
        if (start == end)
            break;
        ++start;
    }
}

I think it's quite optimal to do this since the parsing stops at the first match and we update the position where the algorithm starts. At least I prefer that instead of recompiling boost and make it globally slower :)
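
For comparison, the same "advance from match to match" idea is available ready-made as a regex iterator (boost::sregex_iterator in Boost, std::sregex_iterator in C++11). A minimal sketch applied to the attribute problem from the first post, assuming std::regex so that it builds without Boost; the function name attributes is made up for illustration.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: collects every  name="value"  pair found in `node`.
// Assumption: std::regex / std::sregex_iterator stand in for the Boost types.
std::vector<std::pair<std::string, std::string> > attributes(const std::string& node)
{
    static const std::regex attr("([[:alpha:]_]\\w*)=\"([^\"]+)\"");
    std::vector<std::pair<std::string, std::string> > out;

    // A default-constructed iterator marks the end of the sequence;
    // each increment searches again from the end of the previous match.
    for (std::sregex_iterator it(node.begin(), node.end(), attr), last; it != last; ++it)
        out.push_back(std::make_pair((*it)[1].str(), (*it)[2].str()));

    return out;
}
```

With the sample input from the first post, this yields three pairs, attr1/value1 through attr3/value3, without needing BOOST_REGEX_MATCH_EXTRA or a rebuild of the library.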

Seems it might be of use :)
There are a lot of libraries, and I didn't think that "spirit" was a parser library, that's why I started using regex :)
Thx !
