benryves

[web] [PHP] Regular Expressions are horrible: URL matching


When displaying messages on a forum, I run them through this:

echo preg_replace('#(http|ftp|https)://([A-Za-z0-9\./_~]*)#i', "<a href=\"$1://$2\">$2</a>", $body);

This works great for most URLs. However, my forum software allows [img][/img] tags and also replaces smilies with the relevant HTML, which means that I get output like this:

<img src="<a href="http://path.to.images.jpg">path.to.images.jpg</a>" />

...which is rubbish. How can I stop this from butchering URLs that are already inside HTML tags?

Match a space before and after the URL, so that a quoted URL such as "http://www.example.com" won't match. The drawback is that a URL someone writes without spaces around it won't be turned into a link, but how often does that happen?
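
A minimal sketch of the whitespace idea, assuming PHP 7+ and a hypothetical helper name (linkify_standalone_urls is not from the thread). Instead of matching literal spaces, it uses the look-arounds (?<!\S) and (?!\S), which also accept the start and end of the message and leave the surrounding whitespace in place:

```php
<?php
// Hypothetical helper: only links URLs that stand alone between whitespace
// (or at the very start/end of the text).
function linkify_standalone_urls(string $body): string
{
    // (?<!\S) = not preceded by a non-space character
    // (?!\S)  = not followed by a non-space character
    $pattern = '#(?<!\S)(https?|ftp)://([A-Za-z0-9./_~]+)(?!\S)#i';

    return preg_replace($pattern, '<a href="$1://$2">$2</a>', $body);
}

echo linkify_standalone_urls('See http://www.example.com for details'), "\n";
echo linkify_standalone_urls('<img src="http://path.to/images.jpg" />'), "\n";
```

The first call produces a link; the second leaves the tag alone because the URL is preceded by a quote. As noted above, a URL directly followed by a comma or a closing bracket is skipped as well.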

Or just correct it afterwards with a second replacement:
$body = preg_replace ("#<img src=\"<a href=\"(.*?)\">(.*?)</a>\" />#", "<img src=\"$1\" />", $body);
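
To see the two passes together, here is a small sketch. The sample $body is made up for illustration; the two patterns are the ones already posted in this thread, only requoted with single quotes:

```php
<?php
// Made-up sample post: one image tag and one bare URL.
$body = 'Look: <img src="http://path.to/images.jpg" /> and http://example.com/page';

// Pass 1: the original link replacement, which also mangles the <img> tag.
$body = preg_replace(
    '#(http|ftp|https)://([A-Za-z0-9\./_~]*)#i',
    '<a href="$1://$2">$2</a>',
    $body
);

// Pass 2: undo the damage inside <img src="..." /> tags.
$body = preg_replace(
    '#<img src="<a href="(.*?)">(.*?)</a>" />#',
    '<img src="$1" />',
    $body
);

echo $body, "\n";
```

Note that the second pass only repairs tags written exactly as <img src="..." />. Any other tag or attribute that contains a URL would need its own correction.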

This is a Ruby-style regex, but you should be able to convert it to PHP fairly easily:

gsub( /(?!<.*)(http|ftp|https):\/\/([\w.\\\/_~]*)(?!.*>)/im, '<a href="\1://\2">\2</a>')

Basically, (?!...) is a negative assertion: the match fails if the enclosed pattern matches at that point. So the regex says: don't match the URL if it's between < and >.

I'm pretty sure that PHP's preg_replace supports (?!...) as well.

Quote:
Original post by Colin Jeanne
Quote:
Original post by kryat
i'm pretty sure that php preg_replace supports (?!...) commands.

PHP's form of (?!...) is (?:...)


Not quite. (?:...) matches the expression but doesn't add it to the captured groups (e.g. \1, \2, \3...), whereas (?!...) is a negative look-ahead, with (?=...) being the positive look-ahead. Things get a little tricky with look-aheads...

For example, take the string "bob goes home":

/(bob) (\w*) (home)/ => "bob goes home" \1 = "bob" \2 = "goes" \3 = "home"
/(?:bob) (\w*) (?:home)/ => "bob goes home" \1 = "goes"
/bob (\w*) (?!home)/ => no match.
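
The same three patterns can be checked in PHP with preg_match. This is just a sketch; the variable names are made up, and the fourth pattern (a positive look-ahead) is added for comparison:

```php
<?php
$subject = 'bob goes home';

// Capturing groups: every (...) is recorded in the match array.
preg_match('/(bob) (\w*) (home)/', $subject, $capturing);

// Non-capturing groups: (?:...) still has to match, but is not recorded.
preg_match('/(?:bob) (\w*) (?:home)/', $subject, $nonCapturing);

// Negative look-ahead: fails here because "home" does follow the space.
$negativeFound = preg_match('/bob (\w*) (?!home)/', $subject);

// Positive look-ahead: "home" must follow, but is not part of the match.
preg_match('/bob (\w*) (?=home)/', $subject, $positive);

print_r($capturing);     // full match, then "bob", "goes", "home"
print_r($nonCapturing);  // full match, then "goes"
var_dump($negativeFound); // int(0)
print_r($positive);      // "bob goes " (without "home"), then "goes"
```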

[edit] Still trying to come up with a single regex that works properly. The one from before will fail if there is a greater-than sign anywhere after a link (it thinks it's the closing of a tag)...

[Edited by - kryat on September 10, 2005 1:39:52 AM]
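
One candidate for such a single regex, offered as a sketch only: it is a commonly used look-ahead, (?![^<>]*>), and not the pattern from the post above. The helper name and the URL character class are assumptions. The look-ahead rejects a URL when the next angle bracket after it is a ">", which is the case inside a tag:

```php
<?php
// Hypothetical helper name; the pattern is an assumption, not the thread's regex.
function linkify_with_lookahead(string $body): string
{
    // (?![^<>]*>) fails the match when the next angle bracket after the
    // URL is a ">", i.e. when the URL sits inside a tag's attributes.
    $pattern = '#(https?|ftp)://([\w./~]+)(?![^<>]*>)#i';

    return preg_replace($pattern, '<a href="$1://$2">$2</a>', $body);
}

$in = '<img src="http://path.to/images.jpg" /> see http://example.com/a now';
echo linkify_with_lookahead($in), "\n";

// Remaining weakness: a literal ">" in the text before the next tag
// still blocks the link.
echo linkify_with_lookahead('http://example.com > x'), "\n";
```

This narrows the problem described in the edit rather than removing it: a stray ">" only hurts when it appears before the next "<", not anywhere later in the message.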

I'm sure it's possible to come up with a single expression that works; I just can't figure one out. However, here is some code that does work:

// $body = post content
$exp = "/(http|ftp|https):\/\/([\w.\_\/~\?=%+]*[^. ])/i"; // URL regex
$rep = "<a href=\"$1://$2\">$2</a>"; // URL replacement
$htmlexp = "/<[^>]+>/"; // generic HTML tag regex

// Collect all HTML tags
preg_match_all($htmlexp, $body, $html_arr);

// Split the post on the tags and link URLs in the text pieces only
$text_arr = preg_replace($exp, $rep, preg_split($htmlexp, $body));

// Rebuild the post by interleaving the text pieces with the untouched tags
$body = $text_arr[0];
foreach ($html_arr[0] as $key => $h) {
    $body .= $h . $text_arr[$key + 1];
}
// $body is ready.
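
For comparison, the same "skip the tags, link the rest" idea can be sketched as one pass with preg_replace_callback. This is an alternative technique, not the code above: the pattern matches either a whole tag or a URL, and the callback returns tags unchanged. The function name is hypothetical, and the final character class [\w/] is an assumption that differs from the [^. ] used above:

```php
<?php
// Hypothetical helper: links URLs in text, leaves anything inside <...> alone.
function linkify_outside_tags(string $body): string
{
    // Alternative 1 swallows a complete HTML tag.
    // Alternative 2 is a URL that must end in a word character or a slash,
    // so a trailing full stop is left outside the link.
    $pattern = '#<[^>]+>|(https?|ftp)://([\w./~?=%+]*[\w/])#i';

    return preg_replace_callback($pattern, function (array $m): string {
        // No URL groups captured means a tag was matched: return it as-is.
        if (empty($m[1])) {
            return $m[0];
        }
        return '<a href="' . $m[1] . '://' . $m[2] . '">' . $m[2] . '</a>';
    }, $body);
}

echo linkify_outside_tags('<img src="http://path.to/images.jpg" /> Visit http://example.com.'), "\n";
```

Because the tag alternative is tried first at every "<", a URL inside a tag is consumed together with the tag and never reaches the URL branch.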

