[web] [PHP] Regular Expressions are horrible: URL matching
When displaying messages on a forum, I run them through this:
echo preg_replace('#(http|ftp|https)://([A-Za-z0-9\./_~]*)#i', "<a href=\"$1://$2\">$2</a>", $body);
This works great for most URLs. HOWEVER, my forum software allows [img][/img] tags, as well as replacing smilies with the relevant HTML, which means that I get rubbish like this:
<img src="<a href="http://path.to.images.jpg">path.to.images.jpg</a>" />
...which is rubbish. How can I stop this from butchering URLs that are already inside HTML tags?
Match a space before and after the URL so that "http://www.example.com" (a URL inside quotes) won't match. The downside is that if somebody puts a URL in their message without spaces around it, it won't be turned into a link, but how often does that happen?
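For what it's worth, here is a minimal sketch of that idea. The function name linkify_spaced is made up for illustration, and lookarounds are used in place of literal spaces so that a URL at the very start or end of the post still matches; treat it as one possible way to do it, not a tested drop-in.

```php
<?php
// Hypothetical helper (not from the thread): only link URLs that are
// delimited by whitespace or by the start/end of the string.
function linkify_spaced(string $body): string
{
    // (?<!\S) : the URL is not preceded by a non-space character
    // (?!\S)  : the URL is not followed by a non-space character
    $pattern = '#(?<!\S)(https?|ftp)://([A-Za-z0-9./_~]+)(?!\S)#i';

    return preg_replace($pattern, '<a href="$1://$2">$2</a>', $body);
}

// A bare URL gets linked; one sitting inside src="..." is left alone.
echo linkify_spaced('see http://www.example.com now'), "\n";
echo linkify_spaced('<img src="http://path.to/image.jpg" />'), "\n";
```

The trade-off is the one already mentioned: a URL followed directly by a comma or a closing bracket is skipped as well.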
Or just correct afterwards by replacing again:
$body = preg_replace ("#<img src=\"<a href=\"(.*?)\">(.*?)</a>\" />#", "<img src=\"$1\" />", $body);
This is a Ruby-style regex, but you should be able to convert it to PHP fairly easily.
Basically, (?!...) means "not followed by", so the pattern is meant to say: don't match the URL if it sits between < and >.
i'm pretty sure that php preg_replace supports (?!...) commands.
gsub( /(?!<.*)(http|ftp|https):\/\/([\w.\\\/_~]*)(?!.*>)/im, '<a href="\1://\2">\2</a>')
Quote:Original post by kryat
i'm pretty sure that php preg_replace supports (?!...) commands.
PHP's form of (?!...) is (?:...)
Quote:Original post by Colin Jeanne
Quote:Original post by kryat
i'm pretty sure that php preg_replace supports (?!...) commands.
PHP's form of (?!...) is (?:...)
Not quite. (?:...) still matches the expression, but doesn't add it to the collection of captured groups (e.g. \1, \2, \3...), whereas (?!...) is a negative lookahead, with (?=...) being the positive lookahead. Things get a little tricky with lookaheads...
For example, against the string "bob goes home":
/(bob) (\w*) (home)/ => "bob goes home" \1 = "bob" \2 = "goes" \3 = "home"
/(?:bob) (\w*) (?:home)/ => "bob goes home" \1 = "goes"
/bob (\w*) (?!home)/ => no match.
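The same three patterns can be checked in PHP with preg_match, which suggests PHP's PCRE treats (?:...) and (?!...) the way described above. The variable names are just for illustration.

```php
<?php
$subject = 'bob goes home';

// Capturing groups: every (...) fills a numbered slot.
preg_match('/(bob) (\w*) (home)/', $subject, $capturing);
// $capturing is ['bob goes home', 'bob', 'goes', 'home']

// Non-capturing groups: (?:...) still has to match, but gets no slot.
preg_match('/(?:bob) (\w*) (?:home)/', $subject, $nonCapturing);
// $nonCapturing is ['bob goes home', 'goes']

// Negative lookahead: (?!home) fails because "home" does follow the space,
// so preg_match reports 0 matches.
$matched = preg_match('/bob (\w*) (?!home)/', $subject);

var_dump($capturing, $nonCapturing, $matched);
```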
[edit] Still trying to come up with a single regex that works properly. The one from before will fail if there is a greater-than sign anywhere after a link (it thinks it's the closing of a tag).
[Edited by - kryat on September 10, 2005 1:39:52 AM]
I'm sure it's possible to come up with a single expression that works, I just can't figure one out. However, here is some code that does work:
//$body = Post content
$exp = "/(http|ftp|https):\/\/([\w.\_\/~\?=%+]*[^. ])/i"; //URLs regex
$rep = "<a href=\"$1://$2\">$2</a>"; //URL Replacement
$htmlexp = "/<[^>]+>/"; //generic HTML regex
preg_match_all($htmlexp,$body, $html_arr); //Collect all HTML tags
$text_arr = preg_replace($exp, $rep, preg_split($htmlexp,$body)); //Process URL replacement
$body = $text_arr[0]; //Rebuild the post
foreach ($html_arr[0] as $key => $h) $body .= $h . $text_arr[$key+1];
//$body is ready.
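The same split-and-rebuild idea can also be done in a single pass by having preg_split keep the tags it splits on. This is only a sketch: the function name linkify_outside_tags is hypothetical, and the URL pattern is copied from the code above rather than improved.

```php
<?php
// Hypothetical wrapper (name not from the thread): link URLs in the text
// between HTML tags and leave the tags themselves untouched.
function linkify_outside_tags(string $body): string
{
    $urlPattern  = '/(http|ftp|https):\/\/([\w.\_\/~\?=%+]*[^. ])/i';
    $replacement = '<a href="$1://$2">$2</a>';

    // The capturing group plus PREG_SPLIT_DELIM_CAPTURE keeps each tag in
    // the result, so the pieces alternate: text, tag, text, tag, ..., text.
    $parts = preg_split('/(<[^>]+>)/', $body, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($parts as $i => $part) {
        // Even indexes are text between tags; odd indexes are the tags.
        if ($i % 2 === 0) {
            $parts[$i] = preg_replace($urlPattern, $replacement, $part);
        }
    }

    return implode('', $parts);
}

echo linkify_outside_tags('Look: http://example.com/a <img src="http://path.to/image.jpg" /> done'), "\n";
```

Like the version above, this only protects URLs inside a tag. A URL that appears as the visible text between an existing <a> and </a> would still be linked a second time.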