std::regex

7 comments, last by swiftcoder 7 years, 4 months ago
I am doing basic parsing via basic pattern matching, so basically just basic RegEx.

When calling std::regex_match(), “groups” (things inside parentheses) will be returned in std::smatch.
This is great, except that this causes stack overflow in many cases due to how it is implemented in Microsoft® Visual Studio™.

I only want to match the start of the string, but std::regex_match() matches the whole string.
In order to get around this, I postfix my pattern with “.*”.

This fails to catch \r and \n, but otherwise works.
So I change it to “(.|[\r\n])*”

Now here is the problem. Because I have added (), it is now a match that will be returned to std::smatch, which means it goes recursive for every trailing character until the end of the file.
Thus, on long files, it causes stack-overflow. “.*” works on any-sized files as it does not evaluate the rule recursively.

Temporarily I have removed \r and \n from the input and gone back to using “.*”, but I need a better solution.


There are several ways in which I could get around this, if only one of them is possible.
#1: This might be just because it is trying to return the contents of the grouping. Is there any way to add () as part of a rule, but to not have it in std::smatch?
#2: Is there a better way to match any character until the end of the input?
#3: Is there a flag or something I have overlooked on std::regex_match() so that I can tell it just to match the start of the input rather than all of the input?


L. Spiro

I restore Nintendo 64 video-game OST’s into HD! https://www.youtube.com/channel/UCCtX_wedtZ5BoyQBXEhnVZw/playlists?view=1&sort=lad&flow=grid


Can't you explicitly match the start of a string with the ^ character? Top of this list: https://msdn.microsoft.com/en-us/library/h5181w5w(v=vs.110).aspx

That’s done it (along with swapping to std::regex_search()).
Thank you.


L. Spiro


I have no experience with std::regex, but

I only want to match the start of the string, but std::regex_match() matches the whole string.

This is what http://en.cppreference.com/w/cpp/regex promises, so that's working.

Would regex_search work instead?
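To make the difference concrete, here is a minimal sketch of the two calls side by side. The helper names wholeStringMatches and containsMatch are made up for this illustration; only std::regex_match and std::regex_search come from the standard library.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true only if the pattern consumes the ENTIRE text.
bool wholeStringMatches(const std::string& text, const std::string& pattern) {
    return std::regex_match(text, std::regex(pattern));
}

// Hypothetical helper: true if the pattern matches ANYWHERE in the text.
bool containsMatch(const std::string& text, const std::string& pattern) {
    return std::regex_search(text, std::regex(pattern));
}
```

With the pattern "[a-z]+", wholeStringMatches rejects "abc123" because of the trailing digits, while containsMatch accepts both "abc123" and "123abc". That second case is the catch: a plain search does not care where the match starts.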

#1: This might be just because it is trying to return the contents of the grouping. Is there any way to add () as part of a rule, but to not have it in std::smatch?

The "nosubs" entry mentions "(?:expr)" as a group that is not saved as a sub-match; maybe that works?
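A small sketch of what that would look like, assuming the default ECMAScript grammar. The helper names captureCount and matchesWithAnyTail are invented for the example. Whether dropping the capture actually avoids the stack overflow in Visual Studio is not confirmed here; the recursion may come from the alternation itself rather than from recording the group.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: number of capture groups the pattern defines
// (the whole-match entry at index 0 of std::smatch is not counted).
unsigned captureCount(const std::string& pattern) {
    return std::regex(pattern).mark_count();
}

// Hypothetical helper: true if `text` is `prefix` followed by anything,
// including \r and \n, without creating a capture for the tail.
bool matchesWithAnyTail(const std::string& text, const std::string& prefix) {
    const std::regex re(prefix + "(?:.|[\\r\\n])*");
    return std::regex_match(text, re);
}
```

The "(?:" form groups exactly like "(" does, so the "*" still applies to the whole alternation, but nothing extra shows up in std::smatch.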

#2: Is there a better way to match any character until the end of the input?

Afaik, only by not doing it. REs are designed to match short, simple sequences, and that goes especially for implementations that are not built on a finite automaton.

#3: Is there a flag or something I have overlooked on std::regex_match() so that I can tell it just to match the start of the input rather than all of the input?

The site suggests "match" matches the entire text, at least that's how I read it. If you can manage to get it to match a part, there are usually several grades of "greedy" in the '*' (and '+' if you have it). Two common forms are 'as little as possible' and 'as much as possible'. Often you can specify how greedy a quantifier should be by using a different symbol or an annotation or so.
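In the default ECMAScript grammar of std::regex, the "different symbol" is a '?' placed after the quantifier. A minimal sketch, where firstMatch is a hypothetical helper written for this example:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: text of the first match found, or "" if there is none.
std::string firstMatch(const std::string& text, const std::string& pattern) {
    std::smatch m;
    if (std::regex_search(text, m, std::regex(pattern))) {
        return m.str();
    }
    return "";
}
```

On the input "aXbYb", the greedy pattern "a.*b" takes as much as possible and yields "aXbYb", while the lazy pattern "a.*?b" stops at the first 'b' and yields "aXb".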

A different direction is to go up a level and use a scanner generator like Lex, specifying your parse problem as a Lex specification. Input of the generated scanner is a stream of characters, by default a FILE *, but often there are hooks to divert the calls to a custom function. Output of the scanner is a stream of token numbers, plus any meta-data you associate with a token (typically line number, position index, and e.g. the matched text).

Edit: Ninja'd :)

I spoke too soon.
Prefixing with ^ does not tell it to only match from the start of the string.
But there should be some way to do that.
If so, it would fix the problem.

Would regex_search work instead?

Only if it matches from the start of the input. I don’t need to know if my pattern exists anywhere in the string, I need to know it exists immediately where I am looking.
This is because my patterns alternate and even depend on each other, so order matters.


L. Spiro


std::regex_constants::match_continuous does what I want.
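For anyone finding this later, a minimal sketch of that flag in use. The helper matchLengthAt is a hypothetical name; std::regex_search and std::regex_constants::match_continuous are the standard pieces. With the flag set, a match only counts if it begins at the first character of the range handed to regex_search.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: length of the match of `re` that starts exactly at
// `pos`, or 0 if the pattern does not match there (an empty match also
// reports 0 in this sketch).
std::size_t matchLengthAt(const std::string& text, std::size_t pos,
                          const std::regex& re) {
    if (pos > text.size()) {
        return 0;
    }
    std::smatch m;
    const auto first =
        text.begin() + static_cast<std::string::difference_type>(pos);
    if (std::regex_search(first, text.end(), m, re,
                          std::regex_constants::match_continuous)) {
        return static_cast<std::size_t>(m.length(0));
    }
    return 0;
}
```

For the pattern "[0-9]+" and the text "abc123", this reports 0 at position 0 and 3 at position 3, and there is no need to append ".*" to swallow the rest of the input.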


L. Spiro


I spoke too soon.
Prefixing with ^ does not tell it to only match from the start of the string.

Well, I'm a little skeptical. If you're using ^ as the first character of a BRE expression (which you implied in the first post; it requires the std::regex_constants::basic flag to be set), it matches the start of the string. If you're using the default regex grammar (ECMAScript, a.k.a. Perl 5 style), it asserts on the first character in the string or (if std::regex_constants::multiline is set and you're using C++17, otherwise by default) the first character after any newline. If you're using the default grammar instead of the basic regex grammar like you implied, and your target string contains newlines, you may be in for surprises.

But there should be some way to do that.
If so, it would fix the problem.

Would regex_search work instead?

Only if it matches from the start of the input. I don’t need to know if my pattern exists anywhere in the string, I need to know it exists immediately where I am looking.
This is because my patterns alternate and even depend on each other, so order matters.

You should probably be aware that it is not possible to write context-dependent match rules using regular expressions. I don't know if you're trying to do that. Be aware that trying to do that using backreferences may seem like it works until you hit some cases in which it will *always* result in infinite recursion or factorial explosion. If you have multiple patterns dependent on each other, you're better off having multiple matchers and algorithmic correlation.
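One possible reading of "multiple matchers and algorithmic correlation" is a small loop that tries several independent patterns, in priority order, at the current position. The sketch below is only an illustration: the Token struct, the rule names, and the patterns are all invented for the example, and it relies on std::regex_constants::match_continuous to pin each attempt to the current position.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical token type for this sketch.
struct Token {
    std::string kind;
    std::string text;
};

// Tries each rule in order at the current position; the first rule that
// matches there wins, so rule order encodes priority.
std::vector<Token> tokenize(const std::string& input) {
    static const std::vector<std::pair<std::string, std::regex>> rules = {
        {"space",  std::regex("[ \\t\\r\\n]+")},
        {"number", std::regex("[0-9]+")},
        {"name",   std::regex("[A-Za-z_][A-Za-z0-9_]*")},
        {"symbol", std::regex("[-+*/=()]")},
    };
    std::vector<Token> tokens;
    auto it = input.begin();
    while (it != input.end()) {
        bool matched = false;
        for (const auto& rule : rules) {
            std::smatch m;
            if (std::regex_search(it, input.end(), m, rule.second,
                                  std::regex_constants::match_continuous) &&
                m.length(0) > 0) {
                if (rule.first != "space") {  // whitespace is skipped
                    tokens.push_back({rule.first, m.str(0)});
                }
                it += m.length(0);
                matched = true;
                break;
            }
        }
        if (!matched) {  // no rule applies: report one character and move on
            tokens.push_back({"error", std::string(1, *it)});
            ++it;
        }
    }
    return tokens;
}
```

Any dependency between patterns (for example, "a name is only valid after a symbol") then lives in ordinary code that inspects the token list, not inside a single regular expression.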

Stephen M. Webb
Professional Free Software Developer

For future reference, this website is great for building up regular expressions and learning how they work using its interactive debugger.

https://regex101.com/

Our Current Game: Smith and Winston Ikari Warriors + Space Harrier + Voxel Destruction

I've got to warn you that while regex works fine for simple lexing cases, it is just not a good fit for parsing problems. You are typically much better off with a simple recursive descent parser (something hand rolled, or boost::spirit). Or firing up ANTLR and generating a parser...
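To give a feel for how small a hand-rolled recursive descent parser can be, here is a sketch for integer arithmetic. The grammar and the ExprParser class are made up for this illustration and have nothing to do with the original poster's input format.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>

// Grammar (hypothetical, for illustration only):
//   expr   := term (('+' | '-') term)*
//   term   := factor (('*' | '/') factor)*
//   factor := NUMBER | '(' expr ')'
class ExprParser {
public:
    explicit ExprParser(const std::string& text) : text_(text), pos_(0) {}

    // Parses and evaluates the whole input; throws on malformed input.
    long parse() {
        const long value = expr();
        skipSpaces();
        if (pos_ != text_.size()) {
            throw std::runtime_error("unexpected trailing input");
        }
        return value;
    }

private:
    long expr() {
        long value = term();
        for (;;) {
            if (accept('+')) value += term();
            else if (accept('-')) value -= term();
            else return value;
        }
    }

    long term() {
        long value = factor();
        for (;;) {
            if (accept('*')) {
                value *= factor();
            } else if (accept('/')) {
                const long divisor = factor();
                if (divisor == 0) throw std::runtime_error("division by zero");
                value /= divisor;
            } else {
                return value;
            }
        }
    }

    long factor() {
        if (accept('(')) {
            const long value = expr();
            if (!accept(')')) throw std::runtime_error("expected ')'");
            return value;
        }
        skipSpaces();
        if (!atDigit()) throw std::runtime_error("expected number");
        long value = 0;
        while (atDigit()) value = value * 10 + (text_[pos_++] - '0');
        return value;
    }

    bool atDigit() const {
        return pos_ < text_.size() &&
               std::isdigit(static_cast<unsigned char>(text_[pos_])) != 0;
    }

    // Consumes `c` (after optional whitespace) if it is next in the input.
    bool accept(char c) {
        skipSpaces();
        if (pos_ < text_.size() && text_[pos_] == c) {
            ++pos_;
            return true;
        }
        return false;
    }

    void skipSpaces() {
        while (pos_ < text_.size() &&
               std::isspace(static_cast<unsigned char>(text_[pos_])) != 0) {
            ++pos_;
        }
    }

    std::string text_;
    std::size_t pos_;
};
```

Each grammar rule becomes one function, and nesting such as parentheses falls out of ordinary function recursion, which is exactly the part regular expressions cannot express. For example, ExprParser("(1 + 2) * 3").parse() evaluates to 9.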

Tristam MacDonald. Ex-BigTech Software Engineer. Future farmer. [https://trist.am]

