benryves

[web] Regular expression help!

I'm writing a simple web board, and I'd like people to be able to post pictures and links. Currently, it escapes all the HTML, then hunts for the escaped &lt; and &gt; and converts the img and a href tags back to proper HTML. I'm not very good at regular expressions, and mine currently look like this:
	$pattern = "/&lt;img[ ]+src[^\"]*\"([^\"\r\n]*)\"[^\"]*&gt;/i";
	$replacement = "<img src=\"$1\">";
	$text = preg_replace($pattern, $replacement, $text); 
	
	$pattern = "/&lt;a[ ]+href[^\"]*\"([^\"\r\n]*)\"[^\"]*&gt;([^\"\r\n]*)&lt;\/a&gt;/i"; 
	$replacement = "<a href=\"$1\">$2</a>";
	$text = preg_replace($pattern, $replacement, $text); 
...which break in a number of cases. Any ideas?

Because strip_tags() actually removes them, as opposed to encoding them.

Anyway, aside from the fact that you seem to be missing the = signs after src and href in the original text, I don't see much wrong. Can you give some examples of what breaks?

Quote:
Original post by Sander
Anyway, aside from the fact that you seem to be missing the = signs after src and href in the original text, I don't see much wrong. Can you give some examples of what breaks?
Adding "=" anywhere in the regex breaks it completely, and I have no idea why. Inputs that break the patterns include, for example:
<a href="http://site"><img src="pic.jpg"><br />Caption!</a>


Also, alt attributes, title attributes, width/height and so on break images (the tags are not converted back to HTML).
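
The two failures described above can be reproduced in isolation. This is only a sketch: it assumes the board escapes with htmlspecialchars() and ENT_NOQUOTES (so the double quotes survive for the patterns to see), it writes both patterns against the escaped &lt;/&gt; form, and the alt text is made up for illustration.

```php
<?php
// Patterns from the first post, in single-quoted strings.
$imgPattern  = '/&lt;img[ ]+src[^"]*"([^"\r\n]*)"[^"]*&gt;/i';
$linkPattern = '/&lt;a[ ]+href[^"]*"([^"\r\n]*)"[^"]*&gt;([^"\r\n]*)&lt;\/a&gt;/i';

// Assumption: escaping leaves double quotes alone (ENT_NOQUOTES).
$plainImg = htmlspecialchars('<img src="pic.jpg">', ENT_NOQUOTES);
$imgAlt   = htmlspecialchars('<img src="pic.jpg" alt="A picture">', ENT_NOQUOTES);
$nested   = htmlspecialchars('<a href="http://site"><img src="pic.jpg"><br />Caption!</a>', ENT_NOQUOTES);

// preg_match() returns 1 when the pattern matches and 0 when it does not.
$plainImgMatches = preg_match($imgPattern, $plainImg);  // bare src attribute
$imgAltMatches   = preg_match($imgPattern, $imgAlt);    // second quoted attribute
$nestedMatches   = preg_match($linkPattern, $nested);   // quotes inside the link text
```

Only the plain image matches. In the other two cases a [^"] class runs into a double quote it is not allowed to consume: the one opening the alt value, and the one inside the nested img that sits in the link text.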

Weird.. anyway, I built this from the ground up. It should work (unless I made a typo):


// images: \s+ matches the whitespace between the tag name and src
$pattern = '#&lt;img\s+src="([^"]+)"&gt;#';
$replacement = '<img src="\\1" />';
$text = preg_replace($pattern, $replacement, $text);

// links: the opening tag is not self-closed, because </a> is restored below
$pattern = '#&lt;a\s+href="([^"]+)"&gt;#';
$replacement = '<a href="\\1">';
$text = preg_replace($pattern, $replacement, $text);
$text = str_replace('&lt;/a&gt;', '</a>', $text);
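
To also cope with the alt/title/width case mentioned earlier, one possible variation is to skip over any further name="value" attributes and simply drop them. The function name restore_tags() and the $extra sub-pattern are hypothetical, not from this thread, and the sketch assumes the text was escaped with ENT_NOQUOTES so the attribute quotes are still literal.

```php
<?php
// Hypothetical helper: turns escaped <img src>, <a href> and </a> back into HTML.
function restore_tags(string $text): string
{
    // Optional run of extra name="value" attributes and an optional "/",
    // matched only so they can be discarded from the output.
    $extra = '(?:\s+[\w-]+="[^"]*")*\s*/?';

    // \s+ requires whitespace between the tag name and its first attribute.
    $text = preg_replace('#&lt;img\s+src="([^"]+)"' . $extra . '&gt;#i', '<img src="$1" />', $text);
    $text = preg_replace('#&lt;a\s+href="([^"]+)"' . $extra . '&gt;#i', '<a href="$1">', $text);

    // Closing tags carry no attributes, so a plain string replace is enough.
    return str_replace('&lt;/a&gt;', '</a>', $text);
}

$escaped  = htmlspecialchars('<a href="http://site"><img src="pic.jpg" alt="A picture"><br />Caption!</a>', ENT_NOQUOTES);
$restored = restore_tags($escaped);
```

With this input the link and image come back as HTML, the alt attribute is dropped, and the br stays escaped because nothing restores it. Dropping unknown attributes rather than copying them through is a deliberate choice in this sketch.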


= is a special character that needs to be escaped in PHP regexps. See preg_quote(). That function escapes all special characters, and I quote: "The special regular expression characters are: . \ + * ? [ ^ ] $ ( ) { } = ! < > | :"
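
Whether = really needs the backslash is easy to check directly. In PCRE it only carries special meaning inside constructs such as (?=...), so the minimal sketch below (sample string made up for illustration) matches with and without the escape; preg_quote() adds the backslash anyway, which is harmless.

```php
<?php
// Unescaped "=" used as an ordinary literal.
$plain = preg_match('/src="([^"]*)"/', 'img src="pic.jpg"', $m1);

// The escaped form is accepted too and matches the same text.
$quoted = preg_match('/src\="([^"]*)"/', 'img src="pic.jpg"', $m2);

// preg_quote() escapes "=" along with the other characters it lists.
$escapedLiteral = preg_quote('src=');
```

Both calls match and capture pic.jpg, which suggests the breakage seen when adding = has some other cause.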

Design critique: I don't think you should escape the HTML first, since it prevents you from manually escaping those tags you don't want processed. You should look for < and > first, not &lt; and &gt;, and only convert < and > to &lt; and &gt; for those tags you do not recognize.
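
A rough sketch of that order of operations: split the raw text on anything tag-shaped, leave the recognised tags alone, and escape everything else. The function name escape_unknown_tags() and the exact set of accepted tag shapes are assumptions for illustration only.

```php
<?php
// Hypothetical helper: escapes everything except a small set of recognised tags.
function escape_unknown_tags(string $text): string
{
    // Tag shapes allowed through as raw HTML (assumed whitelist).
    $allowed = '#^<(?:img\s+src="[^"<>]+"\s*/?|a\s+href="[^"<>]+"|/a)>$#i';

    // Split into tag-like chunks and the text between them; with
    // PREG_SPLIT_DELIM_CAPTURE the captured chunks land on the odd indexes.
    $parts = preg_split('#(<[^<>]*>)#', $text, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($parts as $i => $part) {
        if ($i % 2 === 1 && preg_match($allowed, $part)) {
            continue; // recognised tag: keep it unescaped
        }
        // double_encode = false keeps entities the poster typed by hand,
        // such as &lt;, from being escaped a second time.
        $parts[$i] = htmlspecialchars($part, ENT_NOQUOTES, 'UTF-8', false);
    }

    return implode('', $parts);
}

$html = escape_unknown_tags('<a href="http://site"><img src="pic.jpg"></a> <b>x</b> &lt;img&gt;');
```

Here the link and image pass through, the unrecognised b tags are escaped, and the hand-escaped &lt;img&gt; at the end is left as typed, so it displays as literal text.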
