[web] Regular expression help!

I'm writing a simple web board, and I'd like people to be able to post pictures and links. Currently, it escapes all the HTML, then looks for the escaped &lt; and &gt; and converts the img and a href tags back to proper HTML. I'm not very good at regular expressions, and mine currently look like this:
$pattern = "/&lt;img[ ]+src[^\"]*\"([^\"\r\n]*)\"[^\"]*&gt;/i";
$replacement = "<img src=\"$1\">";
$text = preg_replace($pattern, $replacement, $text);

$pattern = "/&lt;a[ ]+href[^\"]*\"([^\"\r\n]*)\"[^\"]*&gt;([^\"\r\n]*)&lt;\/a&gt;/i";
$replacement = "<a href=\"$1\">$2</a>";
$text = preg_replace($pattern, $replacement, $text);
...which break in a number of cases. Any ideas?

Because strip_tags() actually removes them, as opposed to encoding them.

Anyway, aside from the fact that you seem to be missing the = signs after src and href in the patterns, I don't see much wrong. Can you give some examples of what breaks?

Quote:
Original post by Sander
Anyway, aside from the fact that you seem to be missing the = signs after src and href in the patterns, I don't see much wrong. Can you give some examples of what breaks?
Adding "=" anywhere in the regex breaks it completely, and I have no idea why. One example of input that breaks them:
<a href="http://site"><img src="pic.jpg"><br />Caption!</a>


Also, alt and title attributes, width/height and so on break images (the tags are not converted back to HTML).

Weird.. anyway, I built this from the ground up. It should work (unless I made a typo):


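// images: \s+ matches the whitespace between the tag name and src
$pattern = '#&lt;img\s+src="([^"]+)"&gt;#';
$replacement = '<img src="\\1" />';
$text = preg_replace($pattern, $replacement, $text);

// links: the opening tag must stay open so the link text ends up inside it
$pattern = '#&lt;a\s+href="([^"]+)"&gt;#';
$replacement = '<a href="\\1">';
$text = preg_replace($pattern, $replacement, $text);
// closing tags carry no attributes, so a plain string replace is enough
$text = str_replace('&lt;/a&gt;', '</a>', $text);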
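
For the cases mentioned earlier in the thread (extra attributes such as alt or title, and an image nested inside a link), a more tolerant variant might look like the sketch below. It assumes the text was escaped with htmlspecialchars() and ENT_NOQUOTES, so double quotes are left alone. restore_tags() is a made-up name, the extra attributes are dropped instead of being preserved, and the URLs are not validated.

```php
<?php
// Hypothetical helper (not from the thread): turns escaped <img> and <a>
// tags back into HTML, keeping only the src / href attribute.
function restore_tags(string $text): string
{
    // (?:(?!&gt;).)*? lazily skips other attributes but can never run
    // past the &gt; that closes the current tag.
    $skip = '(?:(?!&gt;).)*?';

    $img  = '#&lt;img\s+' . $skip . 'src="([^"\r\n]+)"' . $skip . '&gt;#i';
    $text = preg_replace($img, '<img src="$1">', $text);

    $link = '#&lt;a\s+' . $skip . 'href="([^"\r\n]+)"' . $skip . '&gt;#i';
    $text = preg_replace($link, '<a href="$1">', $text);

    // Closing tags have no attributes, so no regex is needed.
    return str_ireplace('&lt;/a&gt;', '</a>', $text);
}

// ENT_NOQUOTES is an assumption: it leaves " untouched so the patterns match.
$raw  = '<a href="http://site"><img src="pic.jpg" alt="A cat"><br />Caption!</a>';
$html = restore_tags(htmlspecialchars($raw, ENT_NOQUOTES));
echo $html, PHP_EOL;
// <a href="http://site"><img src="pic.jpg">&lt;br /&gt;Caption!</a>
```

The br tag stays escaped because nothing restores it, which is the intended behaviour for tags outside the allowed set.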


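= is one of the characters that preg_quote() treats as special, so try escaping it as \= in your pattern. To quote the documentation for that function: "The special regular expression characters are: . \ + * ? [ ^ ] $ ( ) { } = ! < > | :". Strictly speaking, = only has a special meaning to PCRE inside constructs such as (?=...), so an unescaped = should match literally as well.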
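
A quick way to check how = behaves is to try it directly. The snippet below is only a sketch with made-up sample strings: it shows what preg_quote() does to an attribute, and that a pattern matches with or without the backslash in front of =.

```php
<?php
// preg_quote() escapes every character it considers special, '=' included.
$quoted = preg_quote('src="pic.jpg"', '/');
// $quoted now holds: src\="pic\.jpg"

$subject = '<img src="pic.jpg">';

// An unescaped '=' matches literally...
$plain = preg_match('/src="([^"]+)"/', $subject, $m1);

// ...and so does an escaped one.
$escaped = preg_match('/src\="([^"]+)"/', $subject, $m2);

var_dump($plain === 1, $escaped === 1, $m1[1] === $m2[1]);
```

If adding = breaks a pattern, the cause is more likely something around it, for example a preceding [^"]* that has already consumed the = sign.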

Design critique: I don't think you should escape the HTML first, since it prevents you from manually escaping those tags you don't want processed. You should look for < and > first, not &lt; and &gt;, and only convert < and > to &lt; and &gt; for those tags you do not recognize.
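
A minimal sketch of that approach, assuming only bare img, a and closing a tags are allowed. escape_unknown_tags() and the whitelist patterns are made up for illustration, and URL schemes are not checked, which a real board would want to do.

```php
<?php
// Hypothetical helper (not from the thread): leaves whitelisted tags alone
// and converts < and > to &lt; and &gt; everywhere else.
function escape_unknown_tags(string $text): string
{
    // Strict patterns: no extra attributes are accepted on allowed tags.
    $allowed = [
        '#^<img\s+src="[^"<>\r\n]+"\s*/?>$#i',
        '#^<a\s+href="[^"<>\r\n]+">$#i',
        '#^</a>$#i',
    ];

    // Match either a complete <...> run or a stray < or >.
    return preg_replace_callback(
        '#<[^<>]*>|[<>]#',
        function (array $m) use ($allowed): string {
            foreach ($allowed as $pattern) {
                if (preg_match($pattern, $m[0]) === 1) {
                    return $m[0]; // recognised tag: keep as HTML
                }
            }
            // Anything else: escape only the angle brackets.
            return str_replace(['<', '>'], ['&lt;', '&gt;'], $m[0]);
        },
        $text
    );
}

echo escape_unknown_tags('<b>hi</b> <img src="pic.jpg"> 1 < 2'), PHP_EOL;
// &lt;b&gt;hi&lt;/b&gt; <img src="pic.jpg"> 1 &lt; 2
```

Because existing &lt; and &gt; are never touched, a poster can still type them by hand to show a tag literally.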
