Question about Regular Expressions (might be Perl-specific)

3 comments, last by Excors 15 years, 1 month ago
Perl is my first encounter with regex and so I'm not sure if this is specific to Perl or not. Check out this bit of code:

while (<>){
      while ( /(.*?<!--)(.[^-]*)(.*$)/){
          print $2."\n";
          $_=$3;
      }
}

I can't seem to figure out what .*? does in the first group. It seems that the first two characters, .*, would mean "any character, 0 or more times". However, what does appending a ? to it do? Restrict it to 0 or 1 times? If so, why use the *? (I'm guessing I'm way off, which is why I need help).
From the Perl docs ("perldoc perlre"):
Quote:
By default, a quantified subpattern is "greedy", that is, it will match
as many times as possible (given a particular starting location) while
still allowing the rest of the pattern to match. If you want it to
match the minimum number of times possible, follow the quantifier with
a "?". Note that the meanings don’t change, just the "greediness":


In your regex, this means it tries to match the smallest number of characters before "< !--" is encountered, i.e. it finds the earliest occurrence of "< !--" (I had to put a space to make it show up in my post).
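To see the difference side by side, here is a minimal sketch. The sample string and variable names are made up for illustration; only the first group of the regex from the question is exercised, once greedy and once lazy.

```perl
use strict;
use warnings;

# Hypothetical sample line containing two comment openers.
my $line = 'aaa<!--one bbb<!--two';

# Greedy: .* takes as much as it can, so group 1 runs up to the LAST "<!--".
my ($greedy) = $line =~ /(.*<!--)/;

# Lazy: .*? takes as little as it can, so group 1 stops at the FIRST "<!--".
my ($lazy) = $line =~ /(.*?<!--)/;

print "greedy: $greedy\n";   # greedy: aaa<!--one bbb<!--
print "lazy:   $lazy\n";     # lazy:   aaa<!--
```

So the ? after * does not mean "0 or 1 times" here; it only switches which end the engine starts trying from.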
That's a quantifier that:
Quote: Repeats the previous item zero or more times. Lazy, so the engine first attempts to skip the previous item, before trying permutations with ever increasing matches of the preceding item.

Taken from this page, have a look.

Edit: Barius, part of what you posted made my post get mixed with yours until you changed it. That was pretty strange.
Ok thanks I will look into that when I have a chance (btw, looks like the HTML parsing robbed a bit of your post, hehehe, it's cool though I got the gist).

Edit: nm, you fixed it. It looks like it commented out some HTML, therefore concatenating two separate replies. Haha, that is wild.
That seems a slightly unusual way to loop over all the matches in a string. It would be more common to write
while (<>){
      while (/<!--(.[^-]*)/g){
          print $1."\n";
      }
}
(using the /g flag to make it match the next occurrence each time through the loop).
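As a rough, self-contained sketch of that /g loop (the input line and the @found array are invented for the example; in the real script the line would come from <>):

```perl
use strict;
use warnings;

# Hypothetical input line standing in for one line read from <>.
my $line = 'x <!--first--> y <!--second--> z';

my @found;
# In scalar context, /g resumes from where the previous match ended,
# so each pass through the loop picks up the next occurrence.
while ($line =~ /<!--(.[^-]*)/g) {
    push @found, $1;   # the pattern has a single capture group, hence $1
}

print "$_\n" for @found;   # prints "first" then "second"
```

There is no need to chop the matched part off the string by hand, because Perl tracks the match position for you.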

If this is meant to actually be extracting HTML comments, then you'd want something more like
/<!--(.*?)-->/
because it's perfectly valid for comments to contain individual dashes. (The .*? in that regexp is the same as before - it means it will match as few characters as possible before finding the -->, instead of as many as possible, which will matter if there are two or more comments on the line.)
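A small sketch comparing the three patterns on one line. The test line is hypothetical, chosen so that it has two comments and one of them contains a dash; it is not a full HTML comment parser.

```perl
use strict;
use warnings;

# Hypothetical line with two comments, one of which contains a dash.
my $line = '<!--pre-release--> text <!--done-->';

# Lazy (.*?): each match ends at the nearest "-->".
my @lazy = $line =~ /<!--(.*?)-->/g;

# Greedy (.*): one match that runs all the way to the last "-->".
my @greedy = $line =~ /<!--(.*)-->/g;

# The [^-]* pattern from earlier in the thread stops at the first dash.
my @no_dash = $line =~ /<!--(.[^-]*)/g;

print "lazy:    ", join(' | ', @lazy), "\n";      # pre-release | done
print "greedy:  ", join(' | ', @greedy), "\n";    # pre-release--> text <!--done
print "no dash: ", join(' | ', @no_dash), "\n";   # pre | done
```

In list context with /g, the match returns all the captured groups at once, which is handy for a quick comparison like this.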

