[web] [PHP] Regular Expressions are horrible: URL matching

This topic is 4465 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.


When displaying messages on a forum, I run them through this:

echo preg_replace('#(http|ftp|https)://([A-Za-z0-9\./_~]*)#i', "<a href=\"$1://$2\">$2</a>", $body);

This works great for most URLs. HOWEVER, my forum software allows [img][/img] tags, as well as replacing smilies with the relevant HTML, which means that I get rubbish like this:

<img src="<a href="http://path.to.images.jpg">path.to.images.jpg</a>" />

How can I stop this from butchering URLs that are already inside HTML tags?
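To make the problem concrete, here is a small sketch that reproduces it. The pattern is the one from the post; the sample `$body` and its URLs are made up for illustration and assume the [img] tag has already been turned into HTML.

```php
<?php
// Hypothetical sample post, after [img] tags have already become <img> HTML.
$body = 'Look: <img src="http://example.com/pic.jpg" /> and http://example.com/page';

// The pattern and replacement from the post, unchanged apart from quoting.
$out = preg_replace(
    '#(http|ftp|https)://([A-Za-z0-9\./_~]*)#i',
    '<a href="$1://$2">$2</a>',
    $body
);

// Both URLs are rewritten, including the one inside the src attribute,
// which leaves a link nested inside the <img> tag.
echo $out, "\n";
```

The pattern has no notion of where a URL sits, so the one in the attribute is treated exactly like the one in running text.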

Match a space before and after the URL, so that a quoted URL such as "http://www.example.com" (as it appears inside a tag attribute) won't match. The drawback is that if somebody puts a URL in their message without spaces around it, it will not be changed to a link, but how often does that happen?
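One way this suggestion could look in PHP is sketched below. It assumes lookarounds are acceptable in place of literal spaces, so that a URL at the very start or end of the text still qualifies. The character class and the sample text are illustrative choices, not taken from the thread.

```php
<?php
// (?<!\S) : the URL must be preceded by whitespace or the start of the text.
// (?!\S)  : the URL must be followed by whitespace or the end of the text.
$pattern = '#(?<!\S)(https?|ftp)://([\w./~-]+)(?!\S)#i';
$replace = '<a href="$1://$2">$2</a>';

// Hypothetical sample: one URL in running text, one inside an attribute.
$body = 'See http://example.com/page and <img src="http://example.com/pic.jpg" />';

$out = preg_replace($pattern, $replace, $body);
echo $out, "\n";
```

The URL inside the attribute is left alone because it is preceded by a quote character. The cost is the one mentioned above: a URL followed directly by a comma or wrapped in parentheses is not linked, and a trailing full stop would be swallowed into the link.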

This is a Ruby-style regex, but you should be able to convert it to PHP fairly easily:

gsub( /(?!<.*)(http|ftp|https):\/\/([\w.\\\/_~]*)(?!.*>)/im, '<a href="\1://\2">\2</a>')




Basically, (?!...) means "don't match", so the expression says: don't match the URL if it's between < and >.

I'm pretty sure that PHP's preg_replace supports (?!...) as well.

Quote:
Original post by Colin Jeanne
Quote:
Original post by kryat
i'm pretty sure that php preg_replace supports (?!...) commands.

PHP's form of (?!...) is (?:...)


Not quite. (?:...) matches the expression but doesn't add it to the collection of captured groups (e.g. \1, \2, \3...), whereas (?!...) is a negative lookahead, with (?=...) being the positive lookahead. Things get a little tricky with lookaheads...

For example, with the string "bob goes home":

/(bob) (\w*) (home)/ => "bob goes home" \1 = "bob" \2 = "goes" \3 = "home"
/(?:bob) (\w*) (?:home)/ => "bob goes home" \1 = "goes"
/bob (\w*) (?!home)/ => no match.
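The same three patterns can be checked in PHP with preg_match, which confirms that PCRE treats (?:...) and (?!...) as described. This is only a sketch; the variable names are illustrative, and the fourth pattern is added to show the positive lookahead mentioned above.

```php
<?php
$s = 'bob goes home';

// Capturing groups: each (...) gets its own entry in the match array.
preg_match('/(bob) (\w*) (home)/', $s, $capturing);

// Non-capturing groups: (?:...) still has to match but is not stored.
preg_match('/(?:bob) (\w*) (?:home)/', $s, $nonCapturing);

// Negative lookahead: fails because "home" does follow the space.
$negative = preg_match('/bob (\w*) (?!home)/', $s);

// Positive lookahead: "home" must follow, but it is not part of the match.
preg_match('/bob (\w*) (?=home)/', $s, $positive);

var_dump($capturing, $nonCapturing, $negative, $positive);
```

Note that with the positive lookahead the overall match stops before "home", because a lookahead only tests what follows and does not consume it.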

[edit] Still trying to come up with a single regex that works properly. The one from before will fail if there is a greater-than sign anywhere after a link (it thinks it's the closing of a tag).

[Edited by - kryat on September 10, 2005 1:39:52 AM]
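The failure described in the edit can be reproduced with a rough PHP port of the Ruby expression. The port is an assumption, not something posted in the thread: Ruby's /m flag is written as the "s" flag in PCRE, the character class is simplified, and the leading (?!<.*) is left out because it cannot fail at a position where the scheme itself matches. The sample strings are made up.

```php
<?php
// Trailing lookahead only: reject the URL if any ">" appears later in the text.
$pattern = '#(https?|ftp)://([\w./~]*)(?!.*>)#is';
$replace = '<a href="$1://$2">$2</a>';

// No ">" after the URL, so the lookahead succeeds and the URL is linked.
$linked = preg_replace($pattern, $replace, '<b>Visit</b> http://example.com');

// A ">" later on, here from an unrelated tag, blocks the match entirely.
$skipped = preg_replace($pattern, $replace, 'Visit http://example.com <b>now</b>');

echo $linked, "\n", $skipped, "\n";
```

The second call returns its input unchanged: the lookahead cannot tell a ">" that closes the enclosing tag from one that belongs to a tag further along in the post.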

I'm sure it's possible to come up with a single expression that works; I just can't figure one out. However, here is some code that does work:

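// $body = post content, with [img] tags and smilies already converted to HTML
// URL regex. The final [\w/] keeps a trailing full stop, comma or newline out of the link.
$exp = '#(https?|ftp)://([\w./~?=%+]*[\w/])#i';
$rep = '<a href="$1://$2">$2</a>'; // URL replacement
$htmlexp = '/<[^>]+>/'; // generic HTML tag regex

preg_match_all($htmlexp, $body, $html_arr); // collect all HTML tags
$text_arr = preg_replace($exp, $rep, preg_split($htmlexp, $body)); // linkify only the text between tags
$body = $text_arr[0]; // rebuild the post, starting with the text before the first tag
foreach ($html_arr[0] as $key => $h) {
    $body .= $h . $text_arr[$key + 1]; // each tag, then the text that followed it
}
// $body is ready.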
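For what it's worth, a single expression can do the job if the replacement is a callback rather than a fixed string. The sketch below is one possible approach, not the method used above. It assumes PHP 7 or later for the type declarations, and the function name linkify_outside_tags is made up for illustration.

```php
<?php
// Hypothetical helper: linkify URLs in $body, leaving HTML tags untouched.
function linkify_outside_tags(string $body): string
{
    // First alternative: a whole HTML tag. Second alternative: a URL.
    // The engine scans left to right, so a URL inside a tag is consumed
    // as part of the tag and never reaches the URL alternative.
    $pattern = '#<[^>]+>|(https?|ftp)://([\w./~?=%+-]*[\w/])#i';

    return preg_replace_callback($pattern, function (array $m): string {
        if (!isset($m[1])) {
            return $m[0]; // matched a tag: return it unchanged
        }
        return '<a href="' . $m[1] . '://' . $m[2] . '">' . $m[2] . '</a>';
    }, $body);
}

// Hypothetical sample input.
echo linkify_outside_tags(
    'Pic: <img src="http://example.com/pic.jpg" /> see http://example.com/page.'
), "\n";
```

Like the split-and-rebuild code, this only protects URLs inside the tags themselves. A URL that appears as text between an existing <a> and </a> would still be wrapped a second time.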
