"anything but the sequence ]]>" in Perl regex?
Hi,
As the title suggests, I'm trying to find a regex that matches a string containing anything but the ]]> sequence.
(aaa)+ means the sequence aaa appearing at least once. Is there a way to say "anything but aaa"?
Thanks in advance!
I've had the same problem several times and found a solution.
I know an ugly way, but it should work: [^a] means "any character as long as it is not a".
[^a][^a][^a] -> !aaa. For your brackets, you can use the same technique if you escape them, like \[ \]
[^\]][^\]][^>] wouldn't work: I want to be able to have ]] in my string. It's the complete ]]> sequence I don't want.
But I've found another way. boost::regex uses Perl regex syntax by default, and instead of "match anything except ]]>" I now say "match anything that stops at the first occurrence of ]]>".
The regex is (for CDATA sections in xml) :
boost::regex cdata("(<!\\[CDATA\\["      // opening CDATA tag
                   "(.+)"                // anything, and at least one character
                   "(?:\\]\\]>){1}?)");  // closing CDATA tag. Stops at the first occurrence.
It now stops at the first occurrence of the sequence ]]> and matches the whole section plus the content.
But if anyone knows another way to achieve that, feel free to tell :)
OK, now I understand why people are slow to answer regex questions: it's not because nobody knows, it's because even a simple regex is nearly impossible for a normal human being to read.
I'll try that in a few hours, once I've understood what it does.
thx !
Quote:Original post by Sicaine
[^a][^a][^a] -> !aaa.
This doesn't completely work though. For example (taking the aaa example), the string xy will not match, because [^a][^a][^a] requires there to be three characters.
What you're looking for is negative lookahead: (?!expression) will check that at a certain position, the coming text does not match the expression.
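To make the difference concrete, here is a minimal sketch comparing the two approaches on the aaa example. It assumes std::regex in its default ECMAScript mode as a stand-in for boost::regex (both accept this lookahead syntax), and the function names are made up for illustration.

```cpp
#include <regex>
#include <string>

// Sicaine's suggestion: three negated character classes in a row.
// This looks for any run of three non-'a' characters, which is not
// the same thing as "the string does not contain aaa".
bool three_negated_classes(const std::string& s) {
    static const std::regex re("[^a][^a][^a]");
    return std::regex_search(s, re);
}

// Negative lookahead: before consuming each character, check that
// "aaa" does not start at the current position.
bool no_aaa_anywhere(const std::string& s) {
    static const std::regex re("^((?!aaa).)*$");
    return std::regex_match(s, re);
}
```

With these, three_negated_classes("xy") is false because the string is too short, and three_negated_classes("aaaxyz") is true even though the string contains aaa. no_aaa_anywhere accepts "xy" and "aab" and rejects "xaaay".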
As for the OP's issue, your solution will only work in general if the regex is set to lazy matching. By default, the + operator is greedy, so your current solution will not stop at the first, but rather at the last occurrence of ]]>. Also, since you use +, not *, your cdata section may not be empty.
The regex thus becomes:
(<!\[CDATA\[(.*?)\]\]>)
I've left out making the ]]> optional, since in XML you cannot do that anyway.
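The greedy versus lazy difference shows up as soon as the input holds more than one CDATA section. The sketch below assumes std::regex (ECMAScript mode) in place of boost::regex; find_cdata and the two pattern constants are hypothetical names for illustration only.

```cpp
#include <regex>
#include <string>

// Greedy content, at least one character (as in the original attempt).
const std::string greedy_pattern = R"re((<!\[CDATA\[(.+)\]\]>))re";
// Lazy content, possibly empty (the corrected regex).
const std::string lazy_pattern = R"re((<!\[CDATA\[(.*?)\]\]>))re";

// Searches text for the first CDATA section using the given pattern.
// On success, stores capture group 2 (the content) and returns true.
bool find_cdata(const std::string& text, const std::string& pattern,
                std::string& content) {
    const std::regex re(pattern);
    std::smatch m;
    if (!std::regex_search(text, m, re)) {
        return false;
    }
    content = m[2].str();
    return true;
}
```

On "<![CDATA[one]]> gap <![CDATA[two]]>" the greedy pattern captures "one]]> gap <![CDATA[two", while the lazy one captures "one". On the empty section "<![CDATA[]]>" only the lazy pattern matches.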
Edit : this post is in answer to Kippesoep :)
... well, either I did not understand the expression or it doesn't work in my case ...
And I don't even understand the documentation :
"(?!pattern) consumes zero characters, only if pattern does not match."
What does it mean ? Is it a game to make everything related to regex completely obscure ? Or is it just me ??
I'll break ^((?!]]>).)*$ down for you:
^ is the start of the string and $ is the end of the string. Together they ensure that everything in between has to match the entire input.
((?!]]>).)* is next. The * means that the expression ((?!]]>).) should match 0 or more times. Combined with the rule above, that means every character of the input has to match that expression.
((?!]]>).) is next. Think of this as (.) -- so, one character, any input whatsoever.
(?!]]>) is added as a qualifier. The ?! specifies that the parenthesized expression is "negative lookahead", so, when looking forward, it may not match the expression ]]>.
And Perl is just executable line noise :)
When building/testing regular expressions, try Regex Coach.
Quote:Original post by Forfaox
As for the OP's issue, your solution will only work in general if the regex is set to lazy matching. By default, the + operator is greedy, so your current solution will not stop at the first, but rather at the last occurrence of ]]>. Also, since you use +, not *, your cdata section may not be empty.
The regex thus becomes:
(<!\[CDATA\[(.*?)\]\]>)
I've left out making the ]]> optional, since in XML you cannot do that anyway.
OK, that works! The reason I used the + operator for the content of the CDATA is that when I used only (.*) it reported an error, since it could be empty ... but with the added ? it now works :)
Well, I still feel uneasy with regex (especially the concept of "negative lookahead" and stuff like that), but at least I'm learning something :)
Regexes can be finicky beasts. A large part of the problem is figuring out what it is you want. The regex I gave you is what you originally specified: match a string (the whole string) if (and only if) it doesn't contain ']]>'. What you probably actually needed is what Forfaox gave you -- the subsection of a string that matches everything between '<![CDATA[' and ']]>'.
In short:
^((?!]]>).)*$
matches all of "bla bla <![CDATA[ more bla bla ]] yet more bla bla", as well as basically any string as long as it doesn't contain ']]>' anywhere, so it does not match "bla bla <![CDATA[ more bla bla ]]> yet more bla bla";
(<!\[CDATA\[(.*?)\]\]>)
matches within "bla bla <![CDATA[ more bla bla ]]> yet more bla bla", with the first capture group being "<![CDATA[ more bla bla ]]>" and the second being " more bla bla ".
Welcome to the wonderful world of regexes :)
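For anyone who wants to check both behaviours side by side, here is a rough sketch that runs the two regexes on the sample strings above. It assumes std::regex (ECMAScript mode) rather than boost::regex, writes the brackets escaped as \] to stay on the safe side, and uses made-up names (lacks_cdata_end, CdataMatch, find_cdata_section).

```cpp
#include <regex>
#include <string>

// Whole-string check: true only if "]]>" appears nowhere in s.
bool lacks_cdata_end(const std::string& s) {
    static const std::regex re(R"re(^((?!\]\]>).)*$)re");
    return std::regex_match(s, re);
}

// Result of searching for a CDATA section inside a larger string.
struct CdataMatch {
    bool found;
    std::string section;  // capture group 1: the whole <![CDATA[...]]>
    std::string content;  // capture group 2: the text between the tags
};

// Substring search: finds the first CDATA section, stopping at the
// first "]]>" thanks to the lazy quantifier.
CdataMatch find_cdata_section(const std::string& s) {
    static const std::regex re(R"re((<!\[CDATA\[(.*?)\]\]>))re");
    std::smatch m;
    if (!std::regex_search(s, m, re)) {
        return {false, "", ""};
    }
    return {true, m[1].str(), m[2].str()};
}
```

The first function answers "does this string avoid ]]> entirely?", while the second extracts a section from a longer string, which is the distinction drawn above.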