Text sanitization

posted in Continuous Refinement

Published July 01, 2009

My work over the past few days has mostly been on the text sanitizer.

The sanitizer is an interesting beast. The basic task it faces is to take a chunk of what may only loosely resemble XHTML (annotated with custom GDNet extensions), lex and parse it into an XML tree, strip away any elements or attributes that aren't permitted, and ensure that the result is valid XHTML (or that it would be when wrapped inside a DIV).

The first part - generating the XML tree - is actually the simplest. I'm using HTML Tidy, an open-source library for this kind of thing, which can take arbitrary input and return well-formed XML, adding closing tags and the like where necessary.
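To give a feel for what that first step involves, here is a rough Python sketch that turns sloppy markup into a well-formed tree. It is not HTML Tidy and does not use Tidy's API: it swaps in the standard library's html.parser and ElementTree as a stand-in, and the names TreeBuilder, to_xml and VOID are made up for illustration. It closes unclosed tags and ignores stray closing tags, but does none of Tidy's other repairs (odd attribute names, for instance, pass through unchecked).

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Assumption: a small set of tags that never have content or a closing tag.
VOID = {"br", "hr", "img"}


class TreeBuilder(HTMLParser):
    """Builds an ElementTree from tag soup, wrapped in a single <div> root."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("div")
        self.stack = [self.root]  # chain of currently open elements

    def handle_starttag(self, tag, attrs):
        node = ET.SubElement(self.stack[-1], tag, {k: v or "" for k, v in attrs})
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open tag; ignore stray closers.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):  # text after a child element lives in that child's tail
            parent[-1].tail = (parent[-1].tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def to_xml(fragment):
    """Return the fragment as a serialized tree; open tags are closed implicitly."""
    builder = TreeBuilder()
    builder.feed(fragment)
    builder.close()  # flush any buffered trailing text
    return ET.tostring(builder.root, encoding="unicode")
```

Unclosed tags get closed simply because serializing a tree always emits matching end tags; a misnested closer shuts everything opened inside the element it names.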

The next steps - stripping forbidden elements and attributes - are harder. The sanitizer supports different sanitization 'profiles', which describe what is and is not allowed for a given chunk of text; this means we can, for example, set a profile for the forums that only grants basic text and formatting tags, but set a profile for the journals that grants things like tables and embedded video.
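One way to picture a profile is as a table mapping each permitted tag to the attributes permitted on it. The Python sketch below is only an illustration of that idea, not the real data structure: the Profile class, the FORUM and JOURNAL names, the specific tag lists and the "video" tag are all assumptions of mine.

```python
from dataclasses import dataclass, field


@dataclass
class Profile:
    """Hypothetical profile: tag name -> set of attribute names allowed on it."""
    name: str
    allowed: dict = field(default_factory=dict)

    def allows_tag(self, tag):
        return tag in self.allowed

    def allows_attr(self, tag, attr):
        return attr in self.allowed.get(tag, frozenset())


# Assumed "basic text and formatting" set shared by both profiles.
BASIC = {
    "p": frozenset(),
    "b": frozenset(),
    "i": frozenset(),
    "a": frozenset({"href"}),
}

FORUM = Profile("forum", dict(BASIC))

# The journal profile extends the basic set with tables and a made-up video tag.
JOURNAL = Profile("journal", {
    **BASIC,
    "table": frozenset(),
    "tr": frozenset(),
    "td": frozenset({"colspan"}),
    "video": frozenset({"src"}),
})
```

With this shape, the same sanitizer code can be handed either profile: FORUM.allows_tag("table") is false while JOURNAL.allows_tag("table") is true.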

One significant decision is whether to take an inclusive (only the named tags are allowed, everything else is removed) approach, or an exclusive (only the named tags are removed, everything else is kept) approach. Inclusive is better in that it's more secure, but it also means that the sanitizer needs to know about every possible tag you might want to use, including the attributes permitted on each. The exclusive approach is much easier to write - I just 'blacklist' the disallowed tags and attributes - but it's much more open to abuse, in that if I forget a tag then we've got problems. Things are complicated further by the way in which children of tags should be removed - if you've used the bold tag and it's not allowed, then the tag should be removed without removing the text within it. , on the other hand...
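The inclusive approach, together with the "remove the tag but keep its text" rule, might look something like the Python sketch below. It assumes the input has already been repaired into well-formed XML, and it uses the standard library's ElementTree rather than anything from the actual sanitizer. The ALLOWED table is a made-up profile in which bold is not permitted, and DROP_WITH_CONTENT is my own assumption about which tags should lose their contents as well.

```python
import xml.etree.ElementTree as ET

# Assumed whitelist: tag -> attributes permitted on it. Bold is deliberately absent.
ALLOWED = {"div": set(), "p": set(), "i": set(), "a": {"href"}}
# Assumed set of tags whose contents are discarded along with the tag itself.
DROP_WITH_CONTENT = {"script", "style"}


def _append_text(parent, text):
    """Append text after whatever content parent already holds."""
    if not text:
        return
    if len(parent):  # text following a child is stored in that child's tail
        parent[-1].tail = (parent[-1].tail or "") + text
    else:
        parent.text = (parent.text or "") + text


def _copy_into(src, dst):
    """Copy the permitted parts of src's content into dst."""
    _append_text(dst, src.text)
    for child in src:
        if child.tag in DROP_WITH_CONTENT:
            pass  # drop the whole subtree
        elif child.tag in ALLOWED:
            attrs = {k: v for k, v in child.attrib.items() if k in ALLOWED[child.tag]}
            _copy_into(child, ET.SubElement(dst, child.tag, attrs))
        else:
            _copy_into(child, dst)  # unwrap: lose the tag, keep what is inside
        _append_text(dst, child.tail)


def sanitize(xml_fragment):
    """Sanitize a well-formed fragment; the result is wrapped in a <div>."""
    src = ET.fromstring("<div>" + xml_fragment + "</div>")
    out = ET.Element("div")
    _copy_into(src, out)
    return ET.tostring(out, encoding="unicode")
```

Rebuilding a fresh tree, rather than deleting nodes from the parsed one, means anything not explicitly copied simply never reaches the output. Note that this sketch only checks attribute names; it does not inspect attribute values such as the scheme of an href.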

One thing I'm doing to ease the development burden is to use unit tests. I'm building a collection of bits of malformed or malicious text, coupled with the result that the sanitizer should produce.
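The shape of such a test collection can be very simple: a list of (input, expected output) pairs and a loop that runs each one through the sanitizer. The Python sketch below shows that shape only. The cases are examples I made up, and placeholder_sanitize is a deliberately crude regex stand-in (it strips every tag) so that the harness has something to run against; it is not how the real sanitizer works.

```python
import re

# Each case pairs a malformed or malicious input with the output we expect.
CASES = [
    ("plain text", "plain text"),
    ("<b>unclosed bold", "unclosed bold"),
    ("<script>alert(1)</script>hello", "hello"),
    ('<a href="javascript:alert(1)">click</a>', "click"),
    ("<SCRIPT>alert(1)</SCRIPT>x", "x"),
]


def placeholder_sanitize(text):
    """Crude stand-in: drop script blocks entirely, then strip all other tags."""
    text = re.sub(r"(?is)<script\b.*?</script\s*>", "", text)
    return re.sub(r"(?s)<[^>]*>", "", text)


def run_cases(sanitize, cases):
    """Return a list of (input, expected, actual) for every case that fails."""
    failures = []
    for raw, expected in cases:
        actual = sanitize(raw)
        if actual != expected:
            failures.append((raw, expected, actual))
    return failures
```

Keeping the cases as plain data means a new suggestion can be added as a single line, and run_cases(placeholder_sanitize, CASES) coming back empty is the pass condition.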

This is where you can help. What test cases should I have? What finicky tricks and traps do you think the sanitizer should be watching out for?

Comments

jollyjeffers

Hidden/non-renderable characters? Bunch of tricks that can make something appear legit to the human reader but not to the actual PC...

Any tag that can import or link to other content - images, hyperlinks, scripts etc... probably endless ways they can cause chaos, but equally may not be the subject of what you're working on now?!

In general though... are you clearing up the GDNet 'tag language'?? I'm perfectly used to it now but it always struck me as being a little bit of an odd mix of HTML and BBCode [lol]

Keep up the good work,
Jack

July 02, 2009 07:24 AM

superpig

Yes, the input languages are being reworked. The 'common' language is XHTML, augmented with some gdnet-specific tags (like <smiley> or <latex>). That's the language that text will be stored in inside the DB.

You'll be able to work directly with that language if you like, but the plan is to also offer a WYSIWYG editor as well as a couple of other input methods (e.g. MediaWiki markup). All those other input methods are just extra layers over the XHTML+GDNetXML setup though.

July 02, 2009 11:42 AM

evolutional

Font colours or sizes that aren't visible against the background. Hyperlink filtering would be nice, to block links to sites that we know are bad. I'd like to embed things like YouTube videos or flash movies, but by avoiding obvious exploit routes such as the object tag. Offsite-links (iframe/frames) in journals could be sanitised, same with image links - might want to consider how we approach offsite hosting of images/files. Paranoid me is in fear of running executables linked/promoted by members of the site - probably very little we can do about that though

Oh so many things.

July 02, 2009 02:47 PM

Washu

You broke BanMan.

July 03, 2009 12:22 AM

Jason Z

This is way off topic from your request, but I am a web development noob and was wondering what language/framework GDNet is written in? What is the target language for the new version? I apologize if you have already discussed these details in previous posts - I have begun looking backward, but have only just begun...

Thanks in advance for any info - I'm just curious...

July 05, 2009 02:46 PM

acidwillburnyou

The best implementation of an HTML input sanitizer I've ever seen is: http://htmlpurifier.org/

It takes a very nice whitelist approach.

July 05, 2009 10:03 PM

ajones

May be a silly suggestion, but could you write an XML schema for the sanitizer, then just validate the XHTML against that? Of course, this implies that you're happy with the sanitizer stage doing: invalid content -> error message, rather than: invalid content -> valid content.

July 16, 2009 05:59 AM

superpig

Author
