[web] Regular expression help!

5 comments, last by Fruny 18 years, 10 months ago
I'm writing a simple web board, and I'd like people to be able to post pictures and links. Currently, it escapes all the HTML, then hunts for the &lt; and &gt; entities and converts the img and a href tags back to proper HTML. I'm not very good at regular expressions, and they currently look like:
	$pattern = "/&lt;img[ ]+src[^\"]*\"([^\"\r\n]*)\"[^\"]*&gt;/i";
	$replacement = "<img src=\"$1\">";
	$text = preg_replace($pattern, $replacement, $text); 
	
	$pattern = "/&lt;a[ ]+href[^\"]*\"([^\"\r\n]*)\"[^\"]*&gt;([^\"\r\n]*)&lt;\/a&gt;/i"; 
	$replacement = "<a href=\"$1\">$2</a>";
	$text = preg_replace($pattern, $replacement, $text); 
...which break in a number of cases. Any ideas?

[Website] [+++ Divide By Cucumber Error. Please Reinstall Universe And Reboot +++]

Have you looked at PHP's strip_tags function?
Free Mac Mini (I know, I'm a tool)
Because strip_tags() actually removes them, as opposed to encoding them.
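
To make the difference concrete, here is a minimal sketch that runs both functions on the same input. The sample string is made up for illustration.

```php
<?php
// Made-up sample input, for illustration only.
$input = '<b>bold</b> text';

// strip_tags() deletes the markup and keeps only the text content.
$stripped = strip_tags($input);

// htmlspecialchars() keeps everything but turns the markup into entities,
// so the tags show up as literal text in the browser.
$encoded = htmlspecialchars($input);

echo $stripped, "\n"; // bold text
echo $encoded, "\n";  // &lt;b&gt;bold&lt;/b&gt; text
```

A board that wants to show rejected tags to the reader instead of silently dropping them needs the second behavior.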

Anyway, aside from the fact that you seem to be missing the = signs after src and href in the original text, I don't see much wrong. Can you give some examples of what breaks?

Sander Marechal [Lone Wolves] [Hearts for GNOME] [E-mail] [Forum FAQ]

Quote: Original post by Sander
Anyway, aside from the fact that you seem to be missing the = signs after src and href in the original text, I don't see much wrong. Can you give some examples of what breaks?
Adding "=" anywhere in the regex breaks it completely, and I have no idea why. Things that break them include, for example:
<a href="http://site"><img src="pic.jpg"><br />Caption!</a>

Also, alt and title attributes, width/height, and so on break images (they are not converted back to HTML).
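
A small experiment suggests why the nested example fails. This is a sketch under one assumption: that the board escapes posts without touching double quotes (ENT_NOQUOTES), since the patterns look for literal quotes. The `$tolerant` pattern is a hypothetical alternative, not a tested fix for the whole board.

```php
<?php
// The link pattern from the original post, unchanged.
$pattern = "/&lt;a[ ]+href[^\"]*\"([^\"\r\n]*)\"[^\"]*&gt;([^\"\r\n]*)&lt;\/a&gt;/i";

// Assumption: posts are escaped without converting double quotes.
$plain  = htmlspecialchars('<a href="http://site">Caption!</a>', ENT_NOQUOTES);
$nested = htmlspecialchars(
    '<a href="http://site"><img src="pic.jpg"><br />Caption!</a>',
    ENT_NOQUOTES
);

// A plain link matches.
$plainMatches = preg_match($pattern, $plain);

// The nested image does not: the link-text group ([^\"\r\n]*) cannot
// cross the quotes inside the escaped <img> tag.
$nestedMatches = preg_match($pattern, $nested);

// Hypothetical, more tolerant variant: the link text is matched lazily
// up to the first escaped closing tag, quotes included.
$tolerant = '/&lt;a[ ]+href="([^"\r\n]*)"&gt;(.*?)&lt;\/a&gt;/is';
$tolerantMatches = preg_match($tolerant, $nested, $m);

echo $plainMatches, $nestedMatches, $tolerantMatches, "\n"; // 101
```

The same quote-excluding class, `[^\"]*`, sits between the src value and the closing `&gt;` in the image pattern, which would explain why any extra quoted attribute such as alt or title stops the match.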


Weird... Anyway, I built this from the ground up. It should work (unless I made a typo):

	// images: \s+ matches the whitespace between the tag name and src
	$pattern = '#&lt;img\s+src="([^"]+)"&gt;#';
	$replacement = '<img src="\\1" />';
	$text = preg_replace($pattern, $replacement, $text);

	// links: the opening tag must not be self-closing
	$pattern = '#&lt;a\s+href="([^"]+)"&gt;#';
	$replacement = '<a href="\\1">';
	$text = preg_replace($pattern, $replacement, $text);

	// closing tags of links
	$text = str_replace('&lt;/a&gt;', '</a>', $text);


= is on the list of special characters for PHP regexps, so try escaping it as \=. See the manual page for preg_quote(). That function escapes all special characters, and I quote: "The special regular expression characters are: . \ + * ? [ ^ ] $ ( ) { } = ! < > | :"
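
A quick way to check how = behaves is to try it. As far as I can tell, PCRE only gives = a meaning inside group syntax such as (?=...), so escaping it elsewhere looks harmless rather than required. The sample string below is made up.

```php
<?php
// preg_quote() escapes every character on its list, including = and .
$quoted = preg_quote('src="a.jpg"');
echo $quoted, "\n"; // src\="a\.jpg"

// Both the bare and the escaped form match a literal equals sign.
$bare    = preg_match('/src=/', 'src="a.jpg"');
$escaped = preg_match('/src\=/', 'src="a.jpg"');
echo $bare, $escaped, "\n"; // 11
```

If a pattern stops matching after an = is added, it is worth checking what surrounds the = in the pattern and in the escaped text before blaming the character itself.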
Design critique: I don't think you should escape the HTML first, since it prevents you from manually escaping those tags you don't want processed. You should look for < and > first, not &lt; and &gt;, and only convert < and > to &lt; and &gt; for those tags you do not recognize.
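
One way to read that suggestion, sketched under my own assumptions: split the raw post into tag-like chunks and plain text, rebuild only the tags you recognize, and encode everything else. The names `is_safe_url()` and `format_post()` are hypothetical, and so are the whitelist (img with src, a with href) and the URL rule (http, https, or a path with no colon).

```php
<?php
// Hypothetical helpers: is_safe_url() and format_post() are illustrative
// names, not part of PHP or of the original board code.

// Accept http(s) URLs and scheme-less relative paths; reject anything else
// that contains a colon (for example javascript: URLs).
function is_safe_url(string $url): bool
{
    return preg_match('#^(?:https?://|[^:]*$)#i', $url) === 1;
}

function format_post(string $raw): string
{
    // Split into tag-like chunks and the text between them. The capture
    // group plus PREG_SPLIT_DELIM_CAPTURE keeps the tags in the result.
    $parts = preg_split('/(<[^<>]*>)/', $raw, -1, PREG_SPLIT_DELIM_CAPTURE);

    // Extra attributes (alt, title, width, ...) are tolerated but ignored.
    $imgTag  = '/^<img\s+(?:[^>]*?\s)?src\s*=\s*"([^"]+)"[^>]*>$/i';
    $linkTag = '/^<a\s+(?:[^>]*?\s)?href\s*=\s*"([^"]+)"[^>]*>$/i';

    $out = '';
    $openLinks = 0;
    foreach ($parts as $part) {
        if (preg_match($imgTag, $part, $m) && is_safe_url($m[1])) {
            // Rebuild the tag from the URL alone; other attributes are dropped.
            $out .= '<img src="' . htmlspecialchars($m[1], ENT_QUOTES) . '">';
        } elseif (preg_match($linkTag, $part, $m) && is_safe_url($m[1])) {
            $out .= '<a href="' . htmlspecialchars($m[1], ENT_QUOTES) . '">';
            $openLinks++;
        } elseif ($openLinks > 0 && preg_match('/^<\/a\s*>$/i', $part)) {
            $out .= '</a>';
            $openLinks--;
        } else {
            // Plain text and every unrecognized tag are shown literally.
            $out .= htmlspecialchars($part, ENT_QUOTES);
        }
    }
    // Close any links the poster left open.
    return $out . str_repeat('</a>', $openLinks);
}

echo format_post('<a href="http://site"><img src="pic.jpg"><br />Caption!</a>'), "\n";
// <a href="http://site"><img src="pic.jpg">&lt;br /&gt;Caption!</a>
```

Because recognized tags are rebuilt from the captured URL instead of being copied through, anything else a poster puts inside the tag never reaches the page. A poster who wants a literal tag to appear as text can type the entities by hand, although this sketch would encode the ampersand again, so a real version would need to decide how to treat entities that are already present.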
"Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." — Brian W. Kernighan

