"anything but the sequence ]]>" in PERL regex ??

This topic is 3593 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

Hi, as the title suggests, I'm trying to find a regex that matches a string containing anything but the ]]> sequence ... (aaa)+ means the sequence aaa appearing at least once. Isn't there anything to say "anything but aaa"? Thx in advance!

I had the same problem several times and I found a solution.

I know an ugly way, but it should work: [^a] means "any character as long as it is not a".
[^a][^a][^a] -> !aaa. For your brackets, you can use the same technique if you escape them like \[ \]

[^\]][^\]][^>] wouldn't work ... I want to be able to have ]] in my string. It's the complete ]]> sequence I don't want.

But I've found another way. boost::regex uses Perl syntax for regexes (by default), and instead of "match anything except ]]>" I now say "match anything that stops at the first occurrence of ]]>".
The regex is (for CDATA sections in XML):

boost::regex cdata("(<!\\[CDATA\\[" // opening CDATA tag
"(.+)" // anything, and at least one character
"(?:\\]\\]>){1}?)"); // closing CDATA tag. Stops at the first occurrence.

It now stops at the first occurrence of the sequence ]]> and matches the whole section plus the content.

But if anyone knows another way to achieve that, feel free to tell :)

ok now I understand why people are slow to answer regex questions ... it's not because nobody knows, it's simply because even a simple regex is nearly impossible to read for a normal human being

I'll try that in a few hours, once I've understood what it does

thx !

Quote:
Original post by Sicaine
[^a][^a][^a] -> !aaa.


This doesn't completely work though. For example (taking the aaa example), the string xy will not match, because [^a][^a][^a] requires there to be three characters.

What you're looking for is negative lookahead: (?!expression) will check that at a certain position, the coming text does not match the expression.
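To make the difference concrete, here is a minimal sketch in C++. It assumes std::regex (default ECMAScript grammar) rather than boost::regex, and the helper names three_non_a and no_aaa are made up for illustration. Both helpers use regex_match, so the pattern has to cover the whole string.

```cpp
#include <regex>
#include <string>

// Exactly three characters, none of which is 'a'.
// This is NOT the same as "does not contain aaa".
bool three_non_a(const std::string& s) {
    static const std::regex re("[^a][^a][^a]");
    return std::regex_match(s, re);
}

// Whole string, where the sequence "aaa" does not start at any position.
bool no_aaa(const std::string& s) {
    static const std::regex re("^((?!aaa).)*$");
    return std::regex_match(s, re);
}
```

With these helpers, "xy" and "aab" are rejected by the first and accepted by the second, while "xaaay" is rejected by the lookahead version because aaa starts at its second character.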


As for the OP's issue, your solution will only work in general if the match is lazy. By default, the + operator is greedy, so your current solution will not stop at the first, but rather at the last occurrence of ]]>. Also, since you use +, not *, your CDATA section cannot be empty.

The regex thus becomes:

(<!\[CDATA\[(.*?)\]\]>)

I've left out making the ]]> optional, since in XML you cannot do that anyway.
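A quick way to see the greedy/lazy difference is to run both patterns over an input with two CDATA sections. This is only a sketch: it assumes std::regex with its default ECMAScript grammar instead of boost::regex, and first_match, kInput, kGreedy and kLazy are illustrative names, not anything from the thread.

```cpp
#include <regex>
#include <string>

// Returns the first substring of `text` matched by `pattern`, or "" if there is none.
std::string first_match(const std::string& text, const std::string& pattern) {
    const std::regex re(pattern);
    std::smatch m;
    return std::regex_search(text, m, re) ? m.str(0) : std::string();
}

// Two CDATA sections in one input (made-up sample data).
const std::string kInput = "<![CDATA[one]]> gap <![CDATA[two]]>";

// Greedy: .+ runs as far as it can, then backs up to the LAST ]]>.
const std::string kGreedy = R"(<!\[CDATA\[(.+)\]\]>)";

// Lazy: .*? grows one character at a time and stops at the FIRST ]]>.
const std::string kLazy = R"(<!\[CDATA\[(.*?)\]\]>)";
```

Here first_match(kInput, kGreedy) returns the whole input, while first_match(kInput, kLazy) returns only the first section. The {1}? in the earlier pattern makes the closing-tag group lazy, but that group has a fixed count of one anyway, so it leaves .+ greedy.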

Edit : this post is in answer to Kippesoep :)

... well, either I did not understand the expression or it doesn't work in my case ...

And I don't even understand the documentation :
"(?!pattern) consumes zero characters, only if pattern does not match."

What does it mean ? Is it a game to make everything related to regex completely obscure ? Or is it just me ??

I'll break ^((?!]]>).)*$ down for you:

^ is the start of the string. Combined with $ (the end of the string), it ensures that whatever sits in between has to match the entire input.

((?!]]>).)* is next. The * means that the expression ((?!]]>).) should match 0 or more times. Combined with the rule above, that means every character of the input has to match that expression.

((?!]]>).) is next. Think of this as (.) -- so, one character, whatever it is (by default, anything except a newline).

(?!]]>) is added as a qualifier. The ?! marks the parenthesized expression as a "negative lookahead": at the current position, the text that follows must not match ]]>. A lookahead consumes no characters itself, so the . right after it still matches the character at that same position.
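Spelled out as a loop, the pattern does roughly the following. This is a sketch assuming std::regex (ECMAScript grammar) in place of boost::regex; regex_says_clean and loop_says_clean are hypothetical names. One difference to keep in mind: . does not match a line break in this grammar, so the regex version rejects input containing newlines while the loop does not.

```cpp
#include <regex>
#include <string>

// Regex version: every character must sit at a position where "]]>" does not start.
bool regex_says_clean(const std::string& s) {
    static const std::regex re(R"(^((?!\]\]>).)*$)");
    return std::regex_match(s, re);
}

// The same idea by hand: visit each position and "look ahead" three characters.
bool loop_says_clean(const std::string& s) {
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        // compare() looks at up to 3 characters starting at i without moving i,
        // which is what "consumes zero characters" means for the lookahead.
        if (s.compare(i, 3, "]]>") == 0) return false;
    }
    return true;
}
```

If all you need is this yes/no check, s.find("]]>") == std::string::npos says the same thing without a regex.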

And Perl is just executable line noise :)

When building/testing regular expressions, try Regex Coach.

Quote:
Original post by Forfaox
As for the OP's issue, your solution will only work in general if the match is lazy. By default, the + operator is greedy, so your current solution will not stop at the first, but rather at the last occurrence of ]]>. Also, since you use +, not *, your CDATA section cannot be empty.

The regex thus becomes:

(<!\[CDATA\[(.*?)\]\]>)

I've left out making the ]]> optional, since in XML you cannot do that anyway.


Ok, that works! The reason I used the + operator for the content of the CDATA is that when I used only (.*) it reported an error since it could be empty ... but with the added ? it now works :)

Well, I still feel uneasy with regex (especially with concepts like "negative lookahead" and stuff like that), but at least I'm learning something :)

Regexes can be finicky beasts. A large part of the problem is figuring out what it is you want. The regex I gave you is what you originally specified: match a string (the whole string) if (and only if) it doesn't contain ']]>'. What you probably actually needed is what Forfoax gave you -- the subsection of a string that matches everything between '<![CDATA[' and ']]>'.

In short:
^((?!]]>).)*$

matches all of "bla bla <![CDATA[ more bla bla ]] yet more bla bla", as well as basically any string that doesn't contain ']]>' anywhere, so it does not match "bla bla <![CDATA[ more bla bla ]]> yet more bla bla";

(<!\[CDATA\[(.*?)\]\]>)

matches inside "bla bla <![CDATA[ more bla bla ]]> yet more bla bla", with the first capture group being "<![CDATA[ more bla bla ]]>" and the second being " more bla bla ".
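For reference, pulling those two pieces out in C++ could look like this. It is a sketch that assumes std::regex instead of boost::regex, and first_cdata is a made-up helper name. It uses regex_search, which looks for the pattern anywhere in the input, not regex_match, which would require the pattern to cover the whole string.

```cpp
#include <regex>
#include <string>
#include <utility>

// Returns {group 1, group 2} for the first CDATA section found in `text`:
// group 1 is the whole section including its tags, group 2 only the content.
// Returns two empty strings when no complete section is found.
std::pair<std::string, std::string> first_cdata(const std::string& text) {
    static const std::regex re(R"((<!\[CDATA\[(.*?)\]\]>))");
    std::smatch m;
    if (!std::regex_search(text, m, re)) return {};
    return {m.str(1), m.str(2)};
}
```

Called on the sample sentence above, it returns the full section as the first element and " more bla bla " as the second. On the variant that ends in ]] without the >, both elements come back empty.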

Welcome to the wonderful world of regexes :)
