# boost::regex and boost::match_results

Hi, I've been playing with boost::regex, and to learn it, I've been trying to parse simple XML nodes of the form:

    <nodeName attrib1="value1" attrib2="value2" ... />

This is the piece of code I've come up with:

    boost::regex node_regex("\\s*<([[:alpha:]_]+\\w*)\\s*(([[:alpha:]_]+\\w*)=\"([^\"]+)\"\\s+)*/>.*");
    boost::match_results< std::string::const_iterator > matches;
    if (boost::regex_search(line, matches, node_regex))
    {
        std::cout << "size() : " << matches.size() << "\n";
        for (boost::match_results< std::string::const_iterator >::const_iterator i = matches.begin(); i != matches.end(); ++i)
        {
            if (i->matched) std::cout << "matches :       [" << i->str() << "]\n";
            else            std::cout << "doesn't match : [" << i->str() << "]\n";
        }
    }

First of all, I'm pretty sure this regex is NOT "safe". For example, it requires whitespace after each attribute, so I can't have:

    <node attrib="value"/>

But my main problem is that it only outputs 5 lines. The first is the complete node match, the second is the node name (the word right after the opening <), and the last 3 are the matches for the last attribute only. For example:

    // input :
    <element attr1="value1" attr2="value2" attr3="value3" />

    // output:
    size() : 5
    matches :  [<element attr1="value1" attr2="value2" attr3="value3" />]
    matches :  [element]
    matches :  [attr3="value3" ]
    matches :  [attr3]
    matches :  [value3]

Why don't I get the other attributes? Forgive me if my questions are a little trivial; I'm quite new to regular expressions, and a complete newbie with Boost :) Thx in advance for any help!

##### Share on other sites
ok, I know I should wait 24Hrs before bumping this, but I'm surprised nobody answered, and I really need an answer to continue.

Let's take a simpler example :
    std::string  str("aaa aaa aaa ");
    boost::regex e("(aaa )+");         // this doesn't do what I'd like it to

I'd like to know if it's possible to get a boost::match_results which would contain something like:

    m[0] = "aaa aaa aaa ";
    m[1] = "aaa ";
    m[2] = "aaa ";
    m[3] = "aaa ";

?? If it's not possible, if I'm completely wrong in my understanding of regular expressions and / or boost::regex, could you please tell me ? Point me to a good article / documentation ? (I've read, and re-read the boost::regex doc, but it's not really noob' friendly :))

Thx !

##### Share on other sites
Hah, reading someone else's regex is like gouging your eyes out with a spoon, I'll take a crack at it though.

##### Share on other sites
I think you're on the right track; this seems to be related to how Boost regex works (this may be universal, I'm not sure). As far as I can tell, the part intended to match the attributes IS matching all the attributes, but it's only keeping the last match.

If you're familiar with other regex flavors, you'll know that you can do "backreferences" with either $n or \n, where n is an integer, generally from 1 to the number of parenthesized groups. So in, say, sed:

* \0 matches the entire string passed
* \1 matches the first group (the first parenthesis, counting from left to right, i.e. element)
* \2 matches the second parenthesis, from left to right
* \3 etc.
* \4 etc.

I guess what I'm trying to get at here is that there's no support for/recognition of repetition, so it appears the behavior is just to keep the last match (subsequent matches overwrite previous ones). One answer might be to "advance" through the string, from one match to the next, by hand, though I suspect there may be a better way (I'd hope so).

##### Share on other sites
HAH! There is a way.

Quote:

Repeated Captures

When a marked sub-expression is repeated, then the sub-expression gets "captured" multiple times, however normally only the final capture is available, for example if

    (?:(\w+)\W+)+

is matched against

    one fine day

Then $1 will contain the string "day", and all the previous captures will have been forgotten.

However, Boost.Regex has an experimental feature that allows all the capture information to be retained - this is accessed either via the match_results::captures member function or the sub_match::captures member function. These functions return a container that contains a sequence of all the captures obtained during the regular expression matching.

The following example program shows how this information may be used:

    #include <boost/regex.hpp>
    #include <iostream>

    void print_captures(const std::string& regx, const std::string& text)
    {
        boost::regex e(regx);
        boost::smatch what;
        std::cout << "Expression:  \"" << regx << "\"\n";
        std::cout << "Text:        \"" << text << "\"\n";
        if(boost::regex_match(text, what, e, boost::match_extra))
        {
            unsigned i, j;
            std::cout << "** Match found **\n   Sub-Expressions:\n";
            for(i = 0; i < what.size(); ++i)
                std::cout << "      $" << i << " = \"" << what[i] << "\"\n";
            std::cout << "   Captures:\n";
            for(i = 0; i < what.size(); ++i)
            {
                std::cout << "      $" << i << " = {";
                for(j = 0; j < what.captures(i).size(); ++j)
                {
                    if(j)
                        std::cout << ", ";
                    else
                        std::cout << " ";
                    std::cout << "\"" << what.captures(i)[j] << "\"";
                }
                std::cout << " }\n";
            }
        }
        else
        {
            std::cout << "** No Match found **\n";
        }
    }

    int main(int , char* [])
    {
        print_captures("(([[:lower:]]+)|([[:upper:]]+))+", "aBBcccDDDDDeeeeeeee");
        print_captures("(.*)bar|(.*)bah", "abcbar");
        print_captures("(.*)bar|(.*)bah", "abcbah");
        print_captures("^(?:(\\w+)|(?>\\W+))*$", "now is the time for all good men to come to the aid of the party");
        return 0;
    }

Which produces the following output:

    Expression:  "(([[:lower:]]+)|([[:upper:]]+))+"
    Text:        "aBBcccDDDDDeeeeeeee"
    ** Match found **
       Sub-Expressions:
          $0 = "aBBcccDDDDDeeeeeeee"
          $1 = "eeeeeeee"
          $2 = "eeeeeeee"
          $3 = "DDDDD"
       Captures:
          $0 = { "aBBcccDDDDDeeeeeeee" }
          $1 = { "a", "BB", "ccc", "DDDDD", "eeeeeeee" }
          $2 = { "a", "ccc", "eeeeeeee" }
          $3 = { "BB", "DDDDD" }
    Expression:  "(.*)bar|(.*)bah"
    Text:        "abcbar"
    ** Match found **
       Sub-Expressions:
          $0 = "abcbar"
          $1 = "abc"
          $2 = ""
       Captures:
          $0 = { "abcbar" }
          $1 = { "abc" }
          $2 = { }
    Expression:  "(.*)bar|(.*)bah"
    Text:        "abcbah"
    ** Match found **
       Sub-Expressions:
          $0 = "abcbah"
          $1 = ""
          $2 = "abc"
       Captures:
          $0 = { "abcbah" }
          $1 = { }
          $2 = { "abc" }
    Expression:  "^(?:(\w+)|(?>\W+))*$"
    Text:        "now is the time for all good men to come to the aid of the party"
    ** Match found **
       Sub-Expressions:
          $0 = "now is the time for all good men to come to the aid of the party"
          $1 = "party"
       Captures:
          $0 = { "now is the time for all good men to come to the aid of the party" }
          $1 = { "now", "is", "the", "time", "for", "all", "good", "men", "to", "come", "to", "the", "aid", "of", "the", "party" }

Unfortunately enabling this feature has an impact on performance (even if you don't use it), and a much bigger impact if you do use it, therefore to use this feature you need to:

* Define BOOST_REGEX_MATCH_EXTRA for all translation units including the library source (the best way to do this is to uncomment this define in boost/regex/user.hpp and then rebuild everything).
* Pass the match_extra flag to the particular algorithms where you actually need the captures information (regex_search, regex_match, or regex_iterator).

from: http://www.boost.org/libs/regex/doc/captures.html
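
The "only the final capture is available" behaviour described in the quote can be checked without rebuilding anything. The sketch below is written against std::regex (C++11) rather than boost::regex so that it compiles with no extra library; the assumption is that boost::regex and boost::smatch behave the same way for this simple case. The names `CaptureInfo` and `last_capture` are made up for illustration.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <utility>

// Hypothetical helper type: {number of sub-matches, text of group 1}.
using CaptureInfo = std::pair<std::size_t, std::string>;

// Matches the whole of `text` against `pattern` and reports how many
// sub-matches the result holds and what group 1 ended up containing.
// Returns {0, ""} when the pattern does not match the whole input.
CaptureInfo last_capture(const std::string& text, const std::string& pattern)
{
    const std::regex re(pattern);
    std::smatch m;
    if (!std::regex_match(text, m, re))
        return CaptureInfo(0, std::string());

    // m.size() counts groups in the pattern, not repetitions in the input,
    // so a repeated group still occupies exactly one slot.
    return CaptureInfo(m.size(), m[1].str());
}
```

Called as `last_capture("aaa aaa aaa ", "(aaa )+")`, this reports two sub-matches (the whole match plus the single group), and group 1 holds only the last repetition, `"aaa "`. That matches the five-line output in the first post: one slot per pair of parentheses, however many attributes were consumed.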

##### Share on other sites
Ademan555> thx... now I remember seeing this example (but at the very beginning, when I was still trying to understand regex, so I ignored it :)
But "has an impact on performance (even if you don't use it)" ...

Another way around that: can you tell boost::regex_match to match the FIRST occurrence, so that I could just "advance" through my string?

And another question about boost. Isn't there a way to do something like :
    boost::regex   word("[[:alpha:]_]+\\w*");
    boost::regex   attribute = word + "=\"([^\"]+)\"";

Don't focus on the meaning of the regex. Just want to know if you can define simple regex, and combine them ? It would be reeeeaaally nice if you could, would make the code a loooot more human readable !!

Thx for the answers so far :)

##### Share on other sites
Sorry I had gone to sleep (it was way into the morning hours when i originally posted lol). As far as I know what you described isn't possible :-/

You could do something like this though:

    std::string word = "[[:alpha:]_]+\\w*";
    boost::regex   attribute(word + "=\"([^\"]+)\"");

Of course that's not as convenient.
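
The string-concatenation idea can be pushed a little further by keeping the fragments as named strings and compiling only the final pattern. This is a rough sketch, not a library feature: `word_pattern`, `value_pattern`, `attribute_pattern` and `is_attribute` are hypothetical names, and the code uses std::regex so it builds without Boost (the assumption being that boost::regex accepts the same pattern strings).

```cpp
#include <regex>
#include <string>

// Pattern fragments kept as plain strings so they can be combined freely.
const std::string word_pattern  = "[[:alpha:]_]+\\w*";  // identifier-like name
const std::string value_pattern = "\"([^\"]+)\"";        // quoted, non-empty value

// Builds the pattern text for one attribute: name="value".
// Group 1 captures the name, group 2 the value.
std::string attribute_pattern()
{
    return "(" + word_pattern + ")=" + value_pattern;
}

// True when `text` is exactly one attribute of the form name="value".
bool is_attribute(const std::string& text)
{
    // Compiled once, on first use, from the combined fragments.
    static const std::regex re(attribute_pattern());
    return std::regex_match(text, re);
}
```

One thing to watch when gluing fragments together: if a fragment contains a top-level `|`, wrap it in a non-capturing group `(?:...)` before concatenating, otherwise the alternation spills over into the neighbouring fragments. As far as I know, boost::regex also exposes its source text through `str()`, which would allow building a new pattern from an already compiled one, but treat that as something to check against the docs.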

As far as matching only the first occurrence, I briefly looked through the docs and unfortunately didn't really see anything of the sort.

Although if you think about it, I can't imagine that if you hand wrote your own system for iterating through the matches, that it would be faster than what boost came up with, they're generally pretty good about optimizing things (and even if it's not too fast now, it could get way faster without you having to change the code)

So really my recommendation is to use their multiple matches system.
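
For completeness, there is also a route that needs neither BOOST_REGEX_MATCH_EXTRA nor a hand-written loop: match one attribute at a time and let a regex iterator walk the string. The sketch below assumes std::regex and std::sregex_iterator (boost::sregex_iterator is assumed to be a drop-in equivalent); `Attribute` and `parse_attributes` are hypothetical names, and the pattern reuses the fragments from the first post with the trailing `\s+` dropped.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical result type: {attribute name, attribute value}.
using Attribute = std::pair<std::string, std::string>;

// Collects every name="value" pair found in `node`, in order of appearance.
std::vector<Attribute> parse_attributes(const std::string& node)
{
    // One attribute only; the repetition is handled by the iterator,
    // so every occurrence gets its own match_results.
    static const std::regex attr_re("([[:alpha:]_]+\\w*)=\"([^\"]+)\"");

    std::vector<Attribute> attributes;
    const std::sregex_iterator end;  // default-constructed = end of sequence
    for (std::sregex_iterator it(node.begin(), node.end(), attr_re); it != end; ++it)
    {
        const std::smatch& m = *it;
        attributes.emplace_back(m[1].str(), m[2].str());
    }
    return attributes;
}
```

Because no whitespace is required after the value, this also accepts `<node attrib="value"/>`. It does not validate the surrounding `<name ... />` syntax; that would still need a separate check on the node as a whole.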

##### Share on other sites
Hi,
Thx for the answer. Well, for iterating through the matches, I've found another solution : I use boost::regex_search which stops at the first match. Then, I get the position and size of the match, and use them to match only the remaining string. Something like :
    // str is the input string to test
    int start = 0;
    int end   = (int)str.size();
    for (;;)
    {
        // check for the first match
        if (boost::regex_search((std::string::const_iterator)(str.begin() + start),
                                (std::string::const_iterator)str.end(),
                                matches, regex, boost::match_partial))
        {
            // do something with the match
            // update the beginning position
            start += (int)matches.position(0) + (int)matches[0].length() + 1;
            // check if there's still something to parse
            if (start >= end)
                break;
        }
        else
            break;
    }

I think it's quite optimal to do this, since the parsing stops at the first match and we update the position where the algorithm starts. At least I prefer that to recompiling Boost and making it globally slower :)
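
A variant of the same "search, then advance" loop, sketched with iterators instead of integer offsets. It assumes std::regex (the boost::regex_search overload taking two iterators is assumed to behave the same), and `find_all` is a hypothetical name. Two deliberate differences from the loop above: it resumes exactly where the previous match ended rather than one character later, so a match starting right after the previous one is not skipped, and it does not pass a partial-match flag, since only complete matches are wanted here.

```cpp
#include <regex>
#include <string>
#include <vector>

// Returns the text of every non-overlapping match of `re` in `str`,
// found by repeatedly searching the not-yet-consumed tail of the string.
std::vector<std::string> find_all(const std::string& str, const std::regex& re)
{
    std::vector<std::string> found;
    std::smatch m;
    std::string::const_iterator start = str.begin();
    const std::string::const_iterator end = str.end();

    while (start != end && std::regex_search(start, end, m, re))
    {
        found.push_back(m[0].str());

        // Resume right after the match that was just reported.
        start = m[0].second;

        // An empty match would not move `start`; step over one character
        // (or stop at the end) so the loop always makes progress.
        if (m[0].length() == 0)
        {
            if (start == end)
                break;
            ++start;
        }
    }
    return found;
}
```

One caveat with restarting the search mid-string: anchors and word boundaries are evaluated as if the new start were the beginning of the input unless the library is told otherwise, so patterns using `^` or `\b` may behave differently than they would with a regex iterator.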

##### Share on other sites
You might consider if you'd have more luck with, say, boost::spirit? :/

##### Share on other sites
Seems it might be of use :)
There are a lot of libraries, and I didn't think that "spirit" was a parser library, that's why I started using regex :)
Thx !
