V5 Guts: Text Sanitizer

posted in Continuous Refinement

Published February 01, 2010

One of the biggest causes of security issues in sites - XSS attacks, SQL injection, etc. - is a failure to handle user input properly: that is, to make sure it doesn't contain undesirable elements.

This is potentially a very complex task, and it gets more complex the more the user's allowed to do and the more you care about the output. In V5, I want to expand the capabilities of the markup users can include through things like attributes; I also want to keep the data on the server end in a highly flexible format, making it easy to do things like strip out smilies, find posts associated by quotations, and so on. XML seemed the obvious choice.

Another thing I really, really wanted to fix is the way HTML entities get handled. At the moment, if you make a post with &lt; and &gt; entities, they get turned back into < and > when you edit the post, and then treated as HTML when you save the post again... there are also problems with how to encode stuff when putting it out as RSS or similar. I wanted to put a stop to all these encoding issues.
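To make the round-trip problem concrete, here is a minimal sketch in Python (not the site's actual C# code). The function names and the sample post are made up for illustration. It contrasts decoding the whole string for the edit box with keeping the post as a tree, where text nodes hold real characters and escaping only happens at serialization time.

```python
import html
import xml.etree.ElementTree as ET

# What a user might type to show a literal tag name in their post.
typed = "<p>Use &lt;b&gt; for bold</p>"


def naive_edit_cycle(markup: str) -> str:
    """Decode every entity for the edit box, then save the text unchanged."""
    return html.unescape(markup)


def tree_edit_cycle(markup: str) -> str:
    """Parse into a tree, then serialize it again.

    Text nodes in the tree hold the real characters; the serializer
    re-escapes them on the way out, so the cycle is stable.
    """
    node = ET.fromstring(markup)
    return ET.tostring(node, encoding="unicode")


print(naive_edit_cycle(typed))  # the escaped <b> has turned into a real tag
print(tree_edit_cycle(typed))   # same markup as before the cycle
```

The point of the second function is that nothing ever has to guess whether a `<` is markup or text: the tree already knows.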

Happily, we've now got a pretty solid pipeline in place. A combination of HTML Agility and OWASP AntiSamy, with my own extensions and modifications, does the bulk of the work.

HTML Agility takes the tag soup you guys will throw at the site and turns it into an XML document. At its core is a normal state-machine based parser that generates DOM nodes as it encounters them. Agility also handles encoding issues, turning HTML entities like &trade; into the actual characters they stand for (™). I've also extended it to allow tag names that have namespace prefixes.
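As a rough illustration of the tag-soup-to-tree step, here is a sketch using Python's standard-library `HTMLParser` rather than HTML Agility itself. The class name `SoupToTree`, the `post` wrapper element, the short `VOID_TAGS` list and the recovery rule (a close tag closes everything opened since its matching open tag; stray close tags are ignored) are all my assumptions, not a description of how Agility behaves.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never have content, so they are not pushed on the stack.
VOID_TAGS = {"br", "hr", "img"}


class SoupToTree(HTMLParser):
    """Build an ElementTree from sloppy markup, one parser event at a time."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &trade; in text.
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("post")
        self.stack = [self.root]  # currently open elements, innermost last

    def handle_starttag(self, tag, attrs):
        attributes = {name: (value or "") for name, value in attrs}
        node = ET.SubElement(self.stack[-1], tag, attributes)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open element, if there is one.
        for depth in range(len(self.stack) - 1, 0, -1):
            if self.stack[depth].tag == tag:
                del self.stack[depth:]
                return

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):  # text after a child element is that child's tail
            parent[-1].tail = (parent[-1].tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def soup_to_tree(soup: str) -> ET.Element:
    parser = SoupToTree()
    parser.feed(soup)
    parser.close()  # flush any buffered trailing text
    return parser.root


tree = soup_to_tree("<b>Bold <i>both</b> tail &trade;<br>done")
print(ET.tostring(tree, encoding="unicode"))
```

In this example the unclosed `<i>` gets closed when its parent `<b>` closes, and the entity ends up in the tree as a plain ™ character.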

The output from Agility is a near-as-dammit-valid XML document that I feed to AntiSamy.NET. Now, AntiSamy I have made some fairly extensive changes to, updating it for C# 2 and multithreading it all. Still, the core concept remains the same: AntiSamy has a 'policy' of which tags are allowed, and which attributes and CSS properties are allowed on them (along with regexps defining the values those attributes and properties can take). When something isn't allowed, it can be dropped entirely - such as I might do to certain tags - or it can be 'filtered,' removing the tag but leaving its contents. I've set it up to support multiple policies, so I can permit one set of tags when writing articles, another when writing journal entries, and another when writing forum posts, etc.
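The policy idea can be sketched in a few lines of Python. This is not AntiSamy's API or policy file format: the `Policy` class, the `FORUM` and `ARTICLE` policies, the `SAFE_URL` pattern and the choice of `script` as a dropped tag are all assumptions made up for the example, and CSS property checking is left out entirely.

```python
import re
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Policy:
    # tag name -> {attribute name -> regex the whole value must match}
    allowed: dict
    # tags removed together with everything inside them
    dropped: set = field(default_factory=set)
    # any other tag is "filtered": the tag goes, its contents stay


def _append_text(parent, text):
    """Add text at the current end of parent's content."""
    if not text:
        return
    if len(parent):
        parent[-1].tail = (parent[-1].tail or "") + text
    else:
        parent.text = (parent.text or "") + text


def _copy_children(src, dest, policy):
    _append_text(dest, src.text)
    for child in src:
        if child.tag in policy.dropped:
            pass  # drop: skip the tag and all of its contents
        elif child.tag in policy.allowed:
            rules = policy.allowed[child.tag]
            attrs = {k: v for k, v in child.attrib.items()
                     if k in rules and rules[k].fullmatch(v)}
            _copy_children(child, ET.SubElement(dest, child.tag, attrs), policy)
        else:
            _copy_children(child, dest, policy)  # filter: keep contents only
        _append_text(dest, child.tail)


def sanitize(root, policy):
    """Return a cleaned copy of root; the root element itself is kept."""
    clean = ET.Element(root.tag)
    _copy_children(root, clean, policy)
    return clean


SAFE_URL = re.compile(r"https?://[^\s\"'<>]+")
FORUM = Policy(allowed={"b": {}, "i": {}, "a": {"href": SAFE_URL}},
               dropped={"script"})
ARTICLE = Policy(allowed={**FORUM.allowed, "h2": {}}, dropped={"script"})

post = ET.fromstring(
    '<post>Hi <b onclick="x()">there</b><script>alert(1)</script> '
    '<h2>Title</h2> <a href="javascript:alert(1)">link</a></post>'
)
print(ET.tostring(sanitize(post, FORUM), encoding="unicode"))
print(ET.tostring(sanitize(post, ARTICLE), encoding="unicode"))
```

Running the same input through both policies shows the difference: under `FORUM` the `h2` is filtered down to its text, while under `ARTICLE` it survives. In both cases the `onclick` attribute and the `javascript:` link target are removed because nothing in the policy permits them.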

The result is a neatly-filtered XML fragment that I can quickly and easily perform XPath queries against, or feed to the renderer for processing by the XSLT stylesheets and outputting.
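For a flavour of what querying the stored fragment might look like, here is a small Python sketch using the limited XPath support in `xml.etree.ElementTree`. The element and attribute names (`quote`, `post`, `author`, `smiley`) are hypothetical; the real V5 schema isn't described here.

```python
import xml.etree.ElementTree as ET

# A made-up stored post; the markup vocabulary is an assumption.
fragment = ET.fromstring(
    '<post id="12">'
    '<quote post="7" author="alice">XML seemed the obvious choice.</quote>'
    'Agreed <smiley code=":)"/> and see '
    '<quote post="9" author="bob">also this</quote>'
    '</post>'
)


def quoted_post_ids(post):
    """IDs of the posts this one quotes, in document order."""
    return [q.get("post") for q in post.findall(".//quote[@post]")]


def quotes_by(post, author):
    """All quote elements attributed to the given author."""
    return post.findall(f".//quote[@author='{author}']")


print(quoted_post_ids(fragment))
print([q.text for q in quotes_by(fragment, "alice")])
print(len(fragment.findall(".//smiley")))
```

Because quotations and smilies are real elements rather than patterns buried in a string, finding related posts is a one-line query instead of a regex hunt.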


superpig

Author
