[web] [PHP] Regular Expressions are horrible: URL matching

8 comments, last by GameDev.net 18 years, 6 months ago
When displaying messages on a forum, I run them through this:

echo preg_replace('#(http|ftp|https)://([A-Za-z0-9\./_~]*)#i', "<a href=\"$1://$2\">$2</a>", $body);

This works great for most URLs. HOWEVER, my forum software allows [img][/img] tags, as well as replacing smilies with the relevant HTML, which means that I get output like this:

<img src="<a href="http://path.to.images.jpg">path.to.images.jpg</a>" />

...which is rubbish. How can I stop this from butchering URLs that are already inside HTML tags?

[Website] [+++ Divide By Cucumber Error. Please Reinstall Universe And Reboot +++]

Match a space before and after the URL, so that a quoted URL such as "http://www.example.com" inside a tag attribute won't match. The downside is that a URL someone types without spaces around it will not be turned into a link, but how often does that happen?
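
A minimal sketch of that idea, assuming the URL character class from the original post. The function name linkify_spaced is made up for illustration, and lookarounds are used instead of literal spaces so that a URL at the very start or end of the message still counts as "surrounded by whitespace":

```php
<?php
// Hypothetical helper: only linkify URLs delimited by whitespace
// (or by the start/end of the string).
function linkify_spaced(string $body): string
{
    // (?<!\S) : the character before the URL is not a non-space
    // (?!\S)  : the character after the URL is not a non-space
    $pattern = '#(?<!\S)(https?|ftp)://([A-Za-z0-9./_~]+)(?!\S)#i';
    return preg_replace($pattern, '<a href="$1://$2">$2</a>', $body);
}

echo linkify_spaced('see http://www.example.com/a today'), "\n";
// see <a href="http://www.example.com/a">www.example.com/a</a> today

echo linkify_spaced('<img src="http://path.to/image.jpg" />'), "\n";
// <img src="http://path.to/image.jpg" />   (left alone: the URL is preceded by a quote)
```

The trade-off is the one mentioned above: a URL followed directly by a comma or a closing bracket is not linked at all.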
Or just correct afterwards by replacing again:
$body = preg_replace ("#<img src=\"<a href=\"(.*?)\">(.*?)</a>\" />#", "<img src=\"$1\" />", $body);
Kippesoep
You can play around with The Regex Coach; you will probably find a suitable regex much faster that way.
this is a ruby style regex, but you should be able to convert it to php fairly easily:

gsub( /(?!<.*)(http|ftp|https):\/\/([\w.\\\/_~]*)(?!.*>)/im, '<a href="\1://\2">\2</a>')


basically (?!...) is a negative lookahead: the match is rejected if the pattern inside it matches at that point. so the idea is: don't match the URL if it's between < and >

i'm pretty sure that php preg_replace supports (?!...) commands.
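
For what it's worth, PCRE (the engine behind preg_replace) does accept (?!...). A rough PHP port of the Ruby pattern above might look like this. It is only a sketch: the function name is hypothetical, the literal backslash has been dropped from the character class to keep the quoting simple, and Ruby's /m (dot matches newline) is assumed to map to PCRE's "s" modifier:

```php
<?php
// Rough port of the Ruby pattern above; not the poster's own code.
function linkify_lookahead(string $body): string
{
    // Leading (?!<.*) and trailing (?!.*>) are kept as in the Ruby version.
    // "i" = case-insensitive, "s" = dot also matches newlines.
    $pattern = '#(?!<.*)(http|ftp|https)://([\w./_~]*)(?!.*>)#is';
    return preg_replace($pattern, '<a href="$1://$2">$2</a>', $body);
}

echo linkify_lookahead('visit http://example.com now'), "\n";
// visit <a href="http://example.com">example.com</a> now

echo linkify_lookahead('<img src="http://path.to/image.jpg" />'), "\n";
// <img src="http://path.to/image.jpg" />   (skipped: a ">" follows the URL)
```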
Quote:Original post by kryat
i'm pretty sure that php preg_replace supports (?!...) commands.

PHP's form of (?!...) is (?:...)

Quote:Original post by Colin Jeanne
Quote:Original post by kryat
i'm pretty sure that php preg_replace supports (?!...) commands.

PHP's form of (?!...) is (?:...)


not quite. (?:...) matches the expression but doesn't add it to the captured groups (e.g. \1 \2 \3...), whereas (?!...) is a negative look-ahead, with (?=...) being the positive look-ahead. Things get a little tricky with look-aheads...

for example, with the string "bob goes home":

/(bob) (\w*) (home)/ => "bob goes home", \1 = "bob", \2 = "goes", \3 = "home"
/(?:bob) (\w*) (?:home)/ => "bob goes home", \1 = "goes"
/bob (\w*) (?!home)/ => no match.
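
The same three cases can be checked in PHP with preg_match. This is just a sketch of the examples above (variable names are arbitrary), plus one extra line for the positive look-ahead:

```php
<?php
$subject = 'bob goes home';

// Capturing groups: three numbered captures.
preg_match('/(bob) (\w*) (home)/', $subject, $m1);
// $m1 is ['bob goes home', 'bob', 'goes', 'home']

// Non-capturing groups (?:...): still matched, but not numbered.
preg_match('/(?:bob) (\w*) (?:home)/', $subject, $m2);
// $m2 is ['bob goes home', 'goes']

// Negative look-ahead (?!home): fails, because "home" does follow the space.
$found = preg_match('/bob (\w*) (?!home)/', $subject, $m3);
// $found is 0

// Positive look-ahead (?=home): "home" must follow, but is not consumed.
preg_match('/bob (\w*) (?=home)/', $subject, $m4);
// $m4 is ['bob goes ', 'goes']
```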

[edit] still trying to come up with a single regex that works properly. the one from before will fail if there is a greater-than sign anywhere after a link (it thinks it's the closing of a tag)...

[Edited by - kryat on September 10, 2005 1:39:52 AM]
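
One possible way around the limitation described in the edit above is to narrow the look-ahead so that it only rejects a URL when the next angle bracket after it is a ">". This is a sketch under assumptions, not a tested production pattern: the function name is hypothetical, and it assumes the post contains well-formed tags and that any literal ">" in ordinary text has already been escaped as &gt;.

```php
<?php
// Hypothetical single-regex variant: skip URLs that sit inside a tag.
function linkify_outside_tags(string $body): string
{
    // (?![^<>]*>) : reject the match if, scanning forward from the end of
    // the URL, a ">" is reached before any "<" (i.e. we are inside a tag).
    $pattern = '#(https?|ftp)://([\w./~]+)(?![^<>]*>)#i';
    return preg_replace($pattern, '<a href="$1://$2">$2</a>', $body);
}

echo linkify_outside_tags('see http://example.com <b>now</b>'), "\n";
// see <a href="http://example.com">example.com</a> <b>now</b>

echo linkify_outside_tags('<img src="http://path.to/image.jpg" />'), "\n";
// <img src="http://path.to/image.jpg" />
```

It still cannot tell that a URL is already the text of an existing link, so it would wrap that text in a second anchor.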
My mistake. I misinterpreted your post.
I'm sure it's possible to come up with a single expression that works, I just can't figure one out. However, here is some code that does work:

// $body = post content
$exp = "/(http|ftp|https):\/\/([\w.\_\/~\?=%+]*[^. ])/i";              // URL regex
$rep = "<a href=\"$1://$2\">$2</a>";                                   // URL replacement
$htmlexp = "/<[^>]+>/";                                                // generic HTML tag regex
preg_match_all($htmlexp, $body, $html_arr);                            // collect all HTML tags
$text_arr = preg_replace($exp, $rep, preg_split($htmlexp, $body));     // process URL replacement on the text between tags
$body = $text_arr[0];                                                  // rebuild the post
foreach ($html_arr[0] as $key => $h) $body .= $h . $text_arr[$key+1];
// $body is ready.
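
The same split-and-rebuild idea can be wrapped in a function. The sketch below is a variation, not the code above: the name linkify_text_only is hypothetical, it uses preg_split with PREG_SPLIT_DELIM_CAPTURE so the tags and the text come back in one array, and the URL pattern is changed to end on a word character or slash so that a trailing full stop stays outside the link.

```php
<?php
// Hypothetical wrapper: linkify URLs in text, leaving HTML tags untouched.
function linkify_text_only(string $body): string
{
    $url = '#(https?|ftp)://([\w./~?=%+]*[\w/])#i';

    // With DELIM_CAPTURE the result alternates: text, tag, text, tag, ...
    // so even indexes are plain text and odd indexes are tags.
    $parts = preg_split('/(<[^>]+>)/', $body, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            $parts[$i] = preg_replace($url, '<a href="$1://$2">$2</a>', $part);
        }
    }
    return implode('', $parts);
}

echo linkify_text_only('<img src="http://path.to/image.jpg" /> see http://example.com/page.'), "\n";
// <img src="http://path.to/image.jpg" /> see <a href="http://example.com/page">example.com/page</a>.
```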
I think what you mean to say is www.morphinenation.com

This topic is closed to new replies.
