# Noobie Regex question

Playing around with regexes in C#, I seem to have a problem matching the character '[' in strings like "[bob]" or even "[[".  I can match bob by itself, but surrounded by '[' and ']' it doesn't match.  I can match a single '[', but it never matches the double.  Here's what I currently have:

private static string token_pattern =
@"(?<string>""(?:\\.|[^""\\])*"")|" +
@"(?<whitespace>[\s\t\r]+)|" +
@"(?<newline>\n)|" +

@"(?<profile>\[\[)|" +
@"(?<profile>\[ps_4_0\])|" +
@"(?<profile>ps_4_1)|" +

@"(?<identifier>[a-zA-Z_$][a-zA-Z0-9_$]*)|" +
@"(?<number>[0-9]+)|" +
@"(?<pp_line>#line)|" +
@"(?<left_brace>\{)|" +
@"(?<right_brace>\})|" +
@"(?<left_paren>\()|" +
@"(?<right_paren>\))|" +
//@"(?<left_bracket>\[)|" +
//@"(?<right_bracket>\])|" +
@"(?<other>[^\s\t\r\n]+)"; // anything else not whitespace, and not identified

I was trying to write a regex to tokenize .hlsl files, as part of a little side project but mostly just to play around with C#.  If I remove the <profile> alternatives, the string "[ps_4_0]" won't match as a <profile>, but rather as <other> <identifier> <other>.  If I uncomment the <left_bracket> and <right_bracket> lines I get <left_bracket> <identifier> <right_bracket>.  If I enter the string [ps_4_1] I get <other> <profile> <other>.  But I can find no way to make it match something like "[ps_4_0]" as just a <profile>.

Now I do some processing with the tokens afterwards, and I can work around this, but the point was to learn.  So the question is, why is the character '[' giving me such a problem?  Where have I gone wrong?  And if you were to write a regex to parse something similar, how would you go about it?

##### Share on other sites
Your "other" rule is being greedy and screwing up the multiple-match behavior that you want. Put a +? instead of a + and it should do what you want.

static void Main(string[] args)
{
string token_pattern =
@"(?<string>""(?:\\.|[^""\\])*"")|" +
@"(?<whitespace>[\s\t\r]+)|" +
@"(?<newline>\n)|" +

@"(?<profile>\[\[)|" +
@"(?<profile>\[ps_4_0\])|" +
@"(?<profile>ps_4_1)|" +

@"(?<identifier>[a-zA-Z_$][a-zA-Z0-9_$]*)|" +
@"(?<number>[0-9]+)|" +
@"(?<pp_line>#line)|" +
@"(?<left_brace>\{)|" +
@"(?<right_brace>\})|" +
@"(?<left_paren>\()|" +
@"(?<right_paren>\))|" +
//@"(?<left_bracket>\[)|" +
//@"(?<right_bracket>\])|" +
@"(?<other>[^\s\t\r\n]+?)"; // anything else not whitespace, and not identified

Regex rx = new Regex(token_pattern);

foreach (Match match in rx.Matches("[bob]"))
foreach (var groupName in rx.GetGroupNames())
{
var group = match.Groups[groupName];
foreach (Capture capture in group.Captures)
{
Console.WriteLine("Group: {0}, Capture: {1}", groupName, capture);
}
}
}
I get:

Group: 0, Capture: [
Group: other, Capture: [
Group: 0, Capture: bob
Group: identifier, Capture: bob
Group: 0, Capture: ]
Group: other, Capture: ]
For "[ps_4_0]" I get <profile> as desired:

Group: 0, Capture: [ps_4_0]
Group: profile, Capture: [ps_4_0]
Normally, I run into LOTS of problems with Regex being too greedy and screwing up my complex expressions, so I have gotten into the habit of always putting ? after *s and +s. (especially .* - I *always* use .*? )

I also tested the expression after uncommenting the bracket lines and it seems fine. Edited by Nypyren

##### Share on other sites

I figured it was something like that.  TBH I don't understand how precedence in regexes works.  How does the ? at the end cause it not to be 'greedy'?

##### Share on other sites
As far as I understand it, * and + will continue matching whatever they're currently matching and will ignore other options until an invalid character is found, but *? and +? will prioritize other alternatives first before continuing with the current subexpression.

Unfortunately I don't 100% understand how it works under the hood to give a more detailed explanation, or whether there are any expressions you could write that still mess up the precedence.
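
For what it's worth, here is a small sketch of the difference, not taken from the original code. A greedy quantifier takes as many characters as it can and only gives some back if the rest of the pattern fails; a lazy one takes as few as it can and only grows if the rest of the pattern fails. Alternatives are tried left to right at each position, so the last alternative is only reached when everything before it fails there. The class name `GreedyVsLazyDemo` and the two-alternative patterns are made up for illustration; they stand in for the `identifier` and `other` rules.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class GreedyVsLazyDemo
{
    // Collects the text of every successive match of pattern in input.
    public static List<string> Tokens(string pattern, string input)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
            result.Add(m.Value);
        return result;
    }

    public static void Main()
    {
        // Greedy catch-all: at position 0 the identifier alternative fails on '[',
        // so \S+ takes over and swallows the whole string in one match.
        Console.WriteLine(string.Join(" ", Tokens(@"[a-z]+|\S+", "[bob]")));   // [bob]

        // Lazy catch-all: \S+? stops after one character, so the next match
        // starts at 'b' and the identifier alternative gets its turn.
        Console.WriteLine(string.Join(" ", Tokens(@"[a-z]+|\S+?", "[bob]")));  // [ bob ]
    }
}
```

Because nothing follows `\S+?` in the pattern, the lazy version never has a reason to take more than one character, which is why each unidentified symbol comes out as its own token.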

I typically don't use regular expressions as complex as yours due to this - I've had cases where I just don't understand what's going on well enough to diagnose the issue. In those cases I usually make separate expressions that start with \G (match must start at the current position) and use the optional "int startat" parameter to .Match, and have my C# loop over the entire input string manually, trying the individual token regexes one at a time until one matches. Using \G with startat is similar to using the ^ anchor to ensure that a match starts at the beginning of the string: \G lets you require the match to occur at the explicit position in the string that you ask for. Edited by Nypyren
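
A rough sketch of that \G approach might look like the following. The rule table, the class name `AnchoredTokenizer`, and the "name - value" output format are assumptions made for illustration, and only a handful of the token types from the thread are included.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AnchoredTokenizer
{
    // Hypothetical rule table, tried in order; the first rule that matches wins.
    // \G pins each pattern to the startat position passed to Regex.Match.
    private static readonly KeyValuePair<string, Regex>[] Rules =
    {
        Rule("whitespace",    @"\G\s+"),
        Rule("profile",       @"\G\[ps_4_0\]"),
        Rule("identifier",    @"\G[a-zA-Z_$][a-zA-Z0-9_$]*"),
        Rule("number",        @"\G[0-9]+"),
        Rule("left_bracket",  @"\G\["),
        Rule("right_bracket", @"\G\]"),
        Rule("other",         @"\G\S"),
    };

    private static KeyValuePair<string, Regex> Rule(string name, string pattern)
    {
        return new KeyValuePair<string, Regex>(name, new Regex(pattern));
    }

    // Walks the input from left to right, emitting "name - value" per token.
    public static List<string> Tokenize(string input)
    {
        var tokens = new List<string>();
        int pos = 0;
        while (pos < input.Length)
        {
            bool matched = false;
            foreach (var rule in Rules)
            {
                Match m = rule.Value.Match(input, pos);
                if (m.Success && m.Length > 0)
                {
                    if (rule.Key != "whitespace")
                        tokens.Add(rule.Key + " - " + m.Value);
                    pos += m.Length;   // continue right after this token
                    matched = true;
                    break;
                }
            }
            if (!matched)
                throw new FormatException("No rule matched at position " + pos);
        }
        return tokens;
    }

    public static void Main()
    {
        foreach (string token in Tokenize("[ps_4_0] [bob]"))
            Console.WriteLine(token);
    }
}
```

With one small regex per token type, a rule that misbehaves can be tested on its own, and the priority between rules is just the order of the table.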

##### Share on other sites

I tried that and it's still not working, so I'll post the full code:

// tokenizing regular expression
private static string token_pattern =
@"(?<string>""(?:\\.|[^""\\])*"")|" +
@"(?<whitespace>[\s\t\r]+)|" +
@"(?<newline>\n)|" +

@"(?<profile>\[vs_1_1\])|" +				// vs_1_1

@"(?<profile>\[ps_2_0\])|" +				// ps_2_0
@"(?<profile>\[ps_2_x\])|" +				// ps_2_x
@"(?<profile>\[vs_2_0\])|" +				// vs_2_0
@"(?<profile>\[vs_2_x\])|" +				// vs_2_x

@"(?<profile>\[ps_4_0_level_9_0\])|" +		// ps_4_0_level_9_0
@"(?<profile>\[ps_4_0_level_9_1\])|" +		// ps_4_0_level_9_1
@"(?<profile>\[ps_4_0_level_9_3\])|" +		// ps_4_0_level_9_3

@"(?<profile>\[vs_4_0_level_9_0\])|" +		// vs_4_0_level_9_0
@"(?<profile>\[vs_4_0_level_9_1\])|" +		// vs_4_0_level_9_1
@"(?<profile>\[vs_4_0_level_9_3\])|" +		// vs_4_0_level_9_3

@"(?<profile>\[lib_4_0_level_9_1\])|" +		// lib_4_0_level_9_1
@"(?<profile>\[lib_4_0_level_9_3\])|" +		// lib_4_0_level_9_3

@"(?<profile>\[ps_3_0\])|" +				// ps_3_0
@"(?<profile>\[vs_3_0\])|" +				// vs_3_0

@"(?<profile>\[cs_4_0\])|" +				// cs_4_0
@"(?<profile>\[gs_4_0\])|" +				// gs_4_0
@"(?<profile>\[ps_4_0\])|" +				// ps_4_0
@"(?<profile>\[vs_4_0\])|" +				// vs_4_0

@"(?<profile>\[cs_4_1\])|" +				// cs_4_1
@"(?<profile>\[gs_4_1\])|" +				// gs_4_1
@"(?<profile>\[ps_4_1\])|" +				// ps_4_1
@"(?<profile>\[vs_4_1\])|" +				// vs_4_1

@"(?<profile>\[lib_4_0\])|" +				// lib_4_0
@"(?<profile>\[lib_4_1\])|" +				// lib_4_1

@"(?<profile>\[cs_5_0\])|" +				// cs_5_0
@"(?<profile>\[ds_5_0\])|" +				// ds_5_0
@"(?<profile>\[gs_5_0\])|" +				// gs_5_0
@"(?<profile>\[hs_5_0\])|" +				// hs_5_0
@"(?<profile>\[ps_5_0\])|" +				// ps_5_0
@"(?<profile>\[vs_5_0\])|" +				// vs_5_0
@"(?<profile>\[lib_5_0\])|" +				// lib_5_0

@"(?<profile>\[fx_1_0\])|" +				// fx_1_0
@"(?<profile>\[fx_2_0\])|" +				// fx_2_0
@"(?<profile>\[fx_4_0\])|" +				// fx_4_0
@"(?<profile>\[fx_4_1\])|" +				// fx_4_1
@"(?<profile>\[fx_5_0\])|" +				// fx_5_0

@"(?<identifier>[a-zA-Z_$][a-zA-Z0-9_$]*)|" +
@"(?<number>[0-9]+)|" +
@"(?<pp_line>#line)|" +
@"(?<left_brace>\{)|" +
@"(?<right_brace>\})|" +
@"(?<left_paren>\()|" +
@"(?<right_paren>\))|" +
@"(?<left_bracket>\[)|" +
@"(?<right_bracket>\])|" +
@"(?<other>[^\s\t\r\n]+?)";	// anything else not whitespace, and not identified

//  GetToken
Token GetToken (Regex regex, Match match) {
int i = 0;
foreach (Group group in match.Groups) {
if (group.Success && i > 0) {
return new Token(regex.GroupNameFromNumber(i),group.Value);
}
i++;
}
return new Token();			// nothing matched
}

void LoadFile () {
// loads the file, preprocesses it, saves it into tokens

SharpDX.Direct3D.ShaderMacro[] macros = {};
SharpDX.D3DCompiler.Include include = new ShaderInclude(init_file_name);

// load and preprocess file
string pp_errors;
string pp_file = SharpDX.D3DCompiler.ShaderBytecode.PreprocessFromFile(
init_file_name,
macros,
include,
out pp_errors
);

// check for errors
if (pp_errors != null) {
string err_msg = String.Format("{0} : error R002: errors encountered while preparsing file:",init_file_name);
Console.WriteLine(err_msg);
Console.WriteLine(pp_errors);
throw new Exception();
}

// tokenize
Regex regex = new Regex(token_pattern,RegexOptions.Compiled);
MatchCollection matches = regex.Matches(pp_file);
List<Token> token_list = new List<Token>();
foreach (Match match in matches) {
Token token = GetToken(regex,match);
if (token.name != "whitespace") {
Console.WriteLine(String.Format("{0} - {1}",token.name,token.value));
}
}

}

Token is just a simple struct storing two strings, name and value.

##### Share on other sites

Well, thanks for the help.  I've always just used a hand-written DFA for this sort of thing.  Ahh well, thanks for the help anyway, very much appreciated.

##### Share on other sites
Yeah, malfunctioning Regexes aren't very pleasant to debug

The only thing that looks suspicious to me is your loop in GetToken. I am not sure whether the GroupNameFromNumber(i) will stay in parallel with the GroupCollection enumerator. What do you get if you loop over group names like I do in my example instead of using a dual-variable loop?

(edit) I tried your code with some modifications (I used a Tuple<string,string> instead of your Token class), and it works for me. At least on small strings like "[bob]" or "[ps_4_0]".
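
As a sketch of the name-based lookup suggested above (the class `GroupNameLookup`, the method `Describe`, and the shortened three-alternative pattern are invented for illustration), asking the match for each group by name avoids relying on a separate counter staying in step with the enumerator:

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupNameLookup
{
    // Returns "name - value" for the first named group that took part in the
    // match, or null when no named group succeeded.
    public static string Describe(Regex regex, Match match)
    {
        foreach (string name in regex.GetGroupNames())
        {
            if (name == "0") continue;   // group "0" is the whole match
            Group group = match.Groups[name];
            if (group.Success)
                return name + " - " + group.Value;
        }
        return null;
    }

    public static void Main()
    {
        // Shortened stand-in for the full token pattern.
        var regex = new Regex(
            @"(?<profile>\[ps_4_0\])|" +
            @"(?<identifier>[a-zA-Z_$][a-zA-Z0-9_$]*)|" +
            @"(?<other>\S+?)");

        foreach (Match match in regex.Matches("[ps_4_0] main"))
            Console.WriteLine(Describe(regex, match));
        // profile - [ps_4_0]
        // identifier - main
    }
}
```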

Can you post the entire file (or a dropbox link to it or something) you're trying to parse? Edited by Nypyren

##### Share on other sites

I'm a C# noob, so there's a good chance I've messed something up ; )  These are the two files I'm using:

Test1.hlsl:

#include "Test2.hlsl"

//  stuff... stuff

[[]]
[ps_4_0] float4 main (float4 t : SV_POSITION) : SV_TARGET0 {

return float4(1,0,1,1);
}

[ps_4_1] float4 main2 (float4 t : SV_POSITION) : SV_TARGET0 {

return float4(1, 0, 1, 1);
}

[vs_4_0][loop] float4 main3(float4 t : SV_POSITION) : SV_TARGET0 {
return float4(1, 0, 1, 1);

"string test 1 2 3   three spaces"
}

and Test2.hlsl:

int4 some_func() {
return int4(0,0,0,0);
}

I run it first through the D3D preprocessor, and then the next step is tokenizing, so that I can extract information before I feed it into D3DCompiler.  My loop was taken from a website somewhere, I'll play around with the modifications you suggested, see what I can come up with.

##### Share on other sites
I don't have SharpDX installed; can you debug and copy/paste the pp_file value after it's done preprocessing? I tried hand-preprocessing the files using C-style preprocessing rules, since I'm unfamiliar with HLSL preprocessing rules, and it looked like your program worked fine. (Your C# code looks fine to me BTW.)

Here's what I used as input:

int4 some_func() {
return int4(0,0,0,0);
}

//  stuff... stuff

[[]]
[ps_4_0] float4 main (float4 t : SV_POSITION) : SV_TARGET0 {

return float4(1,0,1,1);
}

[ps_4_1] float4 main2 (float4 t : SV_POSITION) : SV_TARGET0 {

return float4(1, 0, 1, 1);
}

[vs_4_0][loop] float4 main3(float4 t : SV_POSITION) : SV_TARGET0 {
return float4(1, 0, 1, 1);

"string test 1 2 3   three spaces"
}
Here's what I got as output:

identifier - int4
identifier - some_func
left_paren - (
right_paren - )
left_brace - {
identifier - return
identifier - int4
left_paren - (
number - 0
other - ,
number - 0
other - ,
number - 0
other - ,
number - 0
right_paren - )
other - ;
right_brace - }
other - /
other - /
identifier - stuff
other - .
other - .
other - .
identifier - stuff
left_bracket - [
left_bracket - [
right_bracket - ]
right_bracket - ]
profile - [ps_4_0]
identifier - float4
identifier - main
left_paren - (
identifier - float4
identifier - t
other - :
identifier - SV_POSITION
right_paren - )
other - :
identifier - SV_TARGET0
left_brace - {
identifier - return
identifier - float4
left_paren - (
number - 1
other - ,
number - 0
other - ,
number - 1
other - ,
number - 1
right_paren - )
other - ;
right_brace - }
profile - [ps_4_1]
identifier - float4
identifier - main2
left_paren - (
identifier - float4
identifier - t
other - :
identifier - SV_POSITION
right_paren - )
other - :
identifier - SV_TARGET0
left_brace - {
identifier - return
identifier - float4
left_paren - (
number - 1
other - ,
number - 0
other - ,
number - 1
other - ,
number - 1
right_paren - )
other - ;
right_brace - }
profile - [vs_4_0]
left_bracket - [
identifier - loop
right_bracket - ]
identifier - float4
identifier - main3
left_paren - (
identifier - float4
identifier - t
other - :
identifier - SV_POSITION
right_paren - )
other - :
identifier - SV_TARGET0
left_brace - {
identifier - return
identifier - float4
left_paren - (
number - 1
other - ,
number - 0
other - ,
number - 1
other - ,
number - 1
right_paren - )
other - ;
string - "string test 1 2 3   three spaces"
right_brace - }
I glanced at it and it seems to tokenize pretty well.

##### Share on other sites

I figured it out.  Always something silly.  The D3D preprocessor does some tokenizing of its own: the string "[ps_4_0]" in the source is actually turned into "[ ps_4_0 ]", with whitespace added, which explains why it didn't match properly.
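
One way to cope with that, sketched here under the assumption that the inserted whitespace is the only change, is to allow optional whitespace inside the brackets. The class `ProfileAttribute`, the inner group name `name`, and the three-profile list are hypothetical; the real list would be the full set of profiles from the earlier post.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ProfileAttribute
{
    // \s* on both sides accepts "[ps_4_0]" as well as "[ ps_4_0 ]".
    // Only three profiles are listed to keep the sketch short.
    private static readonly Regex Profile =
        new Regex(@"\[\s*(?<name>ps_4_0|ps_4_1|vs_4_0)\s*\]");

    // Returns the profile name without brackets, or null if none is found.
    public static string ExtractProfile(string text)
    {
        Match m = Profile.Match(text);
        return m.Success ? m.Groups["name"].Value : null;
    }

    public static void Main()
    {
        Console.WriteLine(ExtractProfile("[ ps_4_0 ] float4 main"));  // ps_4_0
        Console.WriteLine(ExtractProfile("[ps_4_1] float4 main2"));   // ps_4_1
        Console.WriteLine(ExtractProfile("[ loop ]") == null);        // True
    }
}
```

Capturing the name in an inner group also means later processing gets the bare profile string, whatever spacing the preprocessor chose.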

Again, thank you very much for your help.  I wasn't sure if the problem was my misunderstanding of regexes or of C#.
