Text sanitization

My work over the past few days has mostly been on the text sanitizer.

The sanitizer is an interesting beast. Its basic task is to take a chunk of text that may only loosely resemble XHTML (annotated with custom GDNet extensions), lex and parse it into an XML tree, strip away any elements or attributes that aren't permitted, and ensure that the result is valid XHTML (or that it would be when wrapped inside a DIV).

The first part - generating the XML tree - is actually the simplest. I'm using HTML Tidy, an open-source library for this kind of thing, which can take arbitrary input and return valid XML, adding closing tags and so on where necessary.

The next steps - stripping forbidden elements and attributes - are harder. The sanitizer supports different sanitization 'profiles' that describe what is and is not allowed for a given chunk of text; this means we can, for example, set a profile for the forums that only grants basic text and formatting tags, but set a profile for the journals that grants things like tables and embedded video.
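
To make the idea of profiles concrete, here is a minimal sketch of how one might be represented as data. The `Profile` class, the profile names, and the specific tags and attributes listed are all assumptions made for illustration; they are not the actual GDNet configuration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Profile:
    """Hypothetical description of what one area of the site permits."""
    name: str
    # Maps each permitted tag to the set of attributes permitted on it.
    allowed: dict = field(default_factory=dict)

    def allows_tag(self, tag):
        return tag in self.allowed

    def allows_attribute(self, tag, attribute):
        return attribute in self.allowed.get(tag, frozenset())


# Illustrative tag lists only; a real profile would be far longer.
BASIC_TEXT = {
    "p": frozenset(),
    "b": frozenset(),
    "i": frozenset(),
    "a": frozenset({"href"}),
}

FORUM = Profile("forum", BASIC_TEXT)
JOURNAL = Profile("journal", {
    **BASIC_TEXT,
    "table": frozenset(),
    "tr": frozenset(),
    "td": frozenset({"colspan"}),
    "video": frozenset({"src"}),  # hypothetical embedded-video tag
})
```

With this shape, the same sanitizer code can be run against `FORUM` or `JOURNAL` and only the data changes: `FORUM.allows_tag("table")` is false while `JOURNAL.allows_tag("table")` is true.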

One significant decision is whether to take an inclusive (only the named tags are allowed, everything else is removed) approach, or an exclusive (only the named tags are removed, everything else is kept) approach. Inclusive is better in that it's more secure, but it also means that the sanitizer needs to know about every possible tag you might want to use, including the attributes permitted on each. The exclusive approach is much easier to write - I just 'blacklist' the disallowed tags and attributes - but it's much more open to abuse, in that if I forget a tag then we've got problems. Things are complicated further by the way in which children of tags should be removed - if you've used the bold tag and it's not allowed, then the tag should be removed without removing the text within it. , on the other hand...
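
As a rough illustration of the inclusive approach and of the two ways a forbidden tag can be removed, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. It assumes the input is already well-formed XML (the job HTML Tidy does in the real pipeline). The names `ALLOWED`, `DROP_WITH_CONTENT`, and `sanitize` are invented for the example, and which tags get dropped together with their content is an assumption: `script` and `style` are used here purely as examples.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Inclusive list: permitted tags mapped to their permitted attributes.
# "b" is deliberately left out so the unwrapping behaviour is visible.
ALLOWED = {"div": set(), "p": set(), "i": set(), "a": {"href"}}

# Assumption: tags whose text content is as unwelcome as the tag itself.
DROP_WITH_CONTENT = {"script", "style"}


def _render(node, out):
    """Append the sanitized serialization of node (without its tail) to out."""
    if node.tag in DROP_WITH_CONTENT:
        return  # drop the element and everything inside it
    keep = node.tag in ALLOWED
    if keep:
        attrs = "".join(
            f" {name}={quoteattr(value)}"
            for name, value in node.attrib.items()
            if name in ALLOWED[node.tag]
        )
        out.append(f"<{node.tag}{attrs}>")
    # A forbidden tag that is not in DROP_WITH_CONTENT is unwrapped:
    # its markup is omitted, but its text and children are still emitted.
    if node.text:
        out.append(escape(node.text))
    for child in node:
        _render(child, out)
        if child.tail:
            out.append(escape(child.tail))
    if keep:
        out.append(f"</{node.tag}>")


def sanitize(xml_text):
    out = []
    _render(ET.fromstring(xml_text), out)
    return "".join(out)
```

For example, `sanitize('<div><b>bold</b> and <script>alert(1)</script><a href="x" onclick="y">link</a></div>')` returns `<div>bold and <a href="x">link</a></div>`: the bold tag is removed but its text survives, the script element disappears entirely, and the `onclick` attribute is stripped. This sketch only checks attribute names; it does not inspect attribute values such as the target of an `href`.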

One thing I'm doing to ease the development burden is to use unit tests. I'm building a collection of bits of malformed or malicious text, coupled with the result that the sanitizer should produce.
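
A collection like that fits naturally into a table-driven test. The sketch below shows one possible shape using Python's `unittest`. Note that `sanitize_stub` is a stand-in invented for this example (it simply escapes all markup) so that the test runs on its own; the real sanitizer and its expected outputs would take its place.

```python
import html
import unittest


def sanitize_stub(text):
    """Stand-in for the real sanitizer: escapes every piece of markup."""
    return html.escape(text)


# Each case pairs a hostile or malformed input with the output we expect.
CASES = [
    ("plain text", "plain text"),
    ("<b>unclosed", "&lt;b&gt;unclosed"),
    ("<script>alert(1)</script>", "&lt;script&gt;alert(1)&lt;/script&gt;"),
]


class SanitizerCases(unittest.TestCase):
    def test_known_inputs(self):
        for raw, expected in CASES:
            # subTest reports each failing case separately instead of
            # stopping at the first mismatch.
            with self.subTest(raw=raw):
                self.assertEqual(sanitize_stub(raw), expected)
```

Saved in a file, this can be run with `python -m unittest`. Adding a newly discovered trick is then a one-line change to `CASES`.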

This is where you can help. What test cases should I have? What finicky tricks and traps do you think the sanitizer should be watching out for?

Recommended Comments

Hidden/non-renderable characters? Bunch of tricks that can make something appear legit to the human reader but not to the actual PC...
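
One possible way to handle this, sketched here with Python's standard `unicodedata` module, is to drop characters in the Unicode "format" (Cf) and "control" (Cc) categories, which covers zero-width spaces and direction overrides. The function name and the choice of categories are assumptions for illustration, not a description of what the sanitizer does.

```python
import unicodedata

# Ordinary whitespace control characters that should survive.
KEEP = {"\n", "\r", "\t"}

# Cf = invisible formatting characters, Cc = control characters.
INVISIBLE_CATEGORIES = {"Cf", "Cc"}


def strip_invisible(text):
    """Remove characters that render as nothing but still reach the parser."""
    return "".join(
        ch for ch in text
        if ch in KEEP or unicodedata.category(ch) not in INVISIBLE_CATEGORIES
    )
```

For instance, `strip_invisible("pay\u200bpal")` gives `"paypal"`. The Cf category also contains the zero-width joiner used in some scripts and emoji sequences, so a real filter would probably need to be more selective.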

Any tag that can import or link to other content - images, hyperlinks, scripts etc... probably endless ways they can cause chaos, but equally may not be the subject of what you're working on now?!

In general though... are you clearing up the GDNet 'tag language'?? I'm perfectly used to it now but it always struck me as being a little bit of an odd mix of HTML and BBCode [lol]

Keep up the good work,

Yes, the input languages are being reworked. The 'common' language is XHTML, augmented with some gdnet-specific tags (like <smiley> or <latex>). That's the language that text will be stored in inside the DB.

You'll be able to work directly with that language if you like, but the plan is to also offer a WYSIWYG editor as well as a couple of other input methods (e.g. MediaWiki markup). All those other input methods are just extra layers over the XHTML+GDNetXML setup though.

Font colours or sizes that aren't visible against the background. Hyperlink filtering would be nice, to block links to sites that we know are bad. I'd like to be able to embed things like YouTube videos or Flash movies, but without opening obvious exploit routes such as the object tag. Offsite links (iframe/frames) in journals could be sanitised, and the same goes for image links - we might want to consider how we approach offsite hosting of images/files. Paranoid me is in fear of people running executables linked/promoted by members of the site - there's probably very little we can do about that, though.
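
The hyperlink filtering idea could look something like the following sketch, which uses Python's standard `urllib.parse`. The scheme allowlist, the blocklisted host (`malware.example`), and the function name are all made up for the example.

```python
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}

# Hypothetical blocklist of known-bad sites.
BLOCKED_HOSTS = {"malware.example"}


def is_link_allowed(url):
    """Accept only absolute http(s) links to hosts that are not blocklisted."""
    try:
        parts = urlsplit(url.strip())
    except ValueError:
        return False  # malformed URLs are rejected outright
    if parts.scheme not in ALLOWED_SCHEMES:
        return False  # rejects javascript:, data:, and relative links
    host = (parts.hostname or "").lower()
    if not host:
        return False
    # Block the listed host and any of its subdomains.
    return not any(
        host == blocked or host.endswith("." + blocked)
        for blocked in BLOCKED_HOSTS
    )
```

Checking the scheme against an allowlist, rather than looking for `javascript:` specifically, follows the same inclusive reasoning as the tag filtering in the post. This sketch rejects relative links as well, which a real site would likely want to permit.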

Oh so many things.

This is way off topic from your request, but I am a web development noob and was wondering what language/framework GDNet is written in? What is the target language for the new version? I apologize if you have already discussed these details in previous posts - I have begun looking backward, but have only just begun...

Thanks in advance for any info - I'm just curious...

The best implementation of an HTML input sanitizer I've ever seen is: http://htmlpurifier.org/

It takes a very nice whitelist approach.

May be a silly suggestion, but could you write an XML schema for the sanitizer, then just validate the XHTML against that? Of course, this implies that you're happy with the sanitizer stage doing: invalid content -> error message, rather than: invalid content -> valid content.
