The sanitizer is an interesting beast. Its basic task is to take a chunk of text that may only loosely resemble XHTML (annotated with custom GDNet extensions), lex and parse it into an XML tree, strip away any elements or attributes that aren't permitted, and ensure that the result is valid XHTML (or that it would be when wrapped inside a DIV).
The first part - generating the XML tree - is actually the simplest. I'm using HTML Tidy, an open-source library built for exactly this kind of thing: it can take arbitrary input and return well-formed XML, adding closing tags and similar fixes where necessary.
The next step - stripping forbidden elements and attributes - is harder. The sanitizer supports different sanitization 'profiles' that describe what is and is not allowed for a given chunk of text; this means we can, for example, set a profile for the forums that only grants basic text and formatting tags, but set a profile for the journals that grants things like tables and embedded video.
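To make the idea of profiles a little more concrete, here is a minimal sketch in Python of how they might be written down as plain data. Everything in it is an assumption for illustration: the names `FORUM_PROFILE`, `JOURNAL_PROFILE` and `is_allowed` are hypothetical, the tag and attribute lists are made up, and it presumes a profile lists what is permitted rather than what is forbidden.

```python
# Hypothetical profiles: each maps a permitted tag name to the set of
# attributes permitted on that tag. The real lists would be much longer.
FORUM_PROFILE = {
    "p": set(),
    "b": set(),
    "i": set(),
    "a": {"href"},
}

# The journal profile grants everything the forum profile does, plus
# tables and embedded content (illustrative choices only).
JOURNAL_PROFILE = {
    **FORUM_PROFILE,
    "table": set(),
    "tr": set(),
    "td": {"colspan", "rowspan"},
    "object": {"data", "type", "width", "height"},
}


def is_allowed(profile, tag, attribute=None):
    """Return True if the tag (and, if given, the attribute on it) is permitted."""
    if tag not in profile:
        return False
    return attribute is None or attribute in profile[tag]


print(is_allowed(FORUM_PROFILE, "table"))
print(is_allowed(JOURNAL_PROFILE, "td", "colspan"))
```

Keeping profiles as data rather than code would mean a new profile is just another dictionary, though whether that fits the real sanitizer is a guess.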
One significant decision is whether to take an inclusive approach (only the named tags are allowed, everything else is removed) or an exclusive approach (only the named tags are removed, everything else is kept). Inclusive is better in that it's more secure, because anything I haven't explicitly considered is removed by default, but it also means that the sanitizer needs to know about every possible tag you might want to use, including the attributes permitted on each. The exclusive approach is much easier to write - I just 'blacklist' the disallowed tags and attributes - but it's much more open to abuse, in that if I forget a tag then we've got problems. Things are complicated further by the way in which children of tags should be removed - if you've used the bold tag and it's not allowed, then the tag should be removed without removing the text within it. Other tags, on the other hand...
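Here is a rough sketch of the inclusive approach in Python, including the "remove the tag but keep its text" behaviour. It is not the real sanitizer: `ALLOWED`, `DROP_WITH_CONTENT`, `render` and `sanitize` are hypothetical names, the choice of `script` and `style` as tags that lose their contents is my assumption, and the input is assumed to be well-formed XML already (the job Tidy does). It also makes no attempt to check what an `href` points at.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Hypothetical inclusive profile: permitted tag -> permitted attributes.
ALLOWED = {"p": set(), "i": set(), "a": {"href"}}

# Assumption: these tags are removed together with everything inside them.
DROP_WITH_CONTENT = {"script", "style"}


def render_children(node):
    """Serialize the text and sanitized children found inside a node."""
    out = escape(node.text or "")
    for child in node:
        # The tail is the text that follows the child, so it is always kept.
        out += render(child) + escape(child.tail or "")
    return out


def render(node):
    """Serialize one element, keeping only permitted tags and attributes."""
    if node.tag in DROP_WITH_CONTENT:
        return ""                      # tag and contents both disappear
    inner = render_children(node)
    if node.tag not in ALLOWED:
        return inner                   # unwrap: drop the tag, keep the text
    attrs = "".join(
        f" {name}={quoteattr(value)}"
        for name, value in node.attrib.items()
        if name in ALLOWED[node.tag]
    )
    return f"<{node.tag}{attrs}>{inner}</{node.tag}>"


def sanitize(fragment):
    """Sanitize a well-formed XML fragment; the wrapping div is not emitted."""
    root = ET.fromstring(f"<div>{fragment}</div>")
    return render_children(root)


print(sanitize('<p onclick="x()">Hi <b>bold</b> text<script>alert(1)</script>!</p>'))
# prints: <p>Hi bold text!</p>
```

Because the profile here does not list the bold tag, `<b>` is unwrapped and its text survives, while the script element vanishes along with its contents.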
One thing I'm doing to ease the development burden is to use unit tests. I'm building a collection of bits of malformed or malicious text, coupled with the result that the sanitizer should produce.
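The collection of test cases could be kept as a simple table of input and expected output, along these lines. This is only a sketch in Python: `CASES`, `run_cases` and `escape_everything` are hypothetical names, and `escape_everything` is a stand-in that escapes all markup so the example runs on its own. The expected strings match that stand-in, not the output of any real profile.

```python
import html

# Each case: (description, input text, expected sanitizer output).
CASES = [
    ("plain text is untouched", "hello", "hello"),
    ("markup characters are escaped", "<script>alert(1)</script>",
     "&lt;script&gt;alert(1)&lt;/script&gt;"),
    ("bare ampersand is escaped", "fish & chips", "fish &amp; chips"),
]


def run_cases(sanitize, cases):
    """Return (description, actual output) for every case that does not match."""
    failures = []
    for description, raw, expected in cases:
        actual = sanitize(raw)
        if actual != expected:
            failures.append((description, actual))
    return failures


def escape_everything(text):
    """Stand-in sanitizer so the sketch runs; it simply escapes all markup."""
    return html.escape(text, quote=False)


for description, actual in run_cases(escape_everything, CASES):
    print(f"FAILED: {description} -> {actual!r}")
```

Adding a newly discovered trick then means adding one more row to the table rather than writing a new test function.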
This is where you can help. What test cases should I have? What finicky tricks and traps do you think the sanitizer should be watching out for?
Any tag that can import or link to other content - images, hyperlinks, scripts etc. There are probably endless ways they can cause chaos, but equally they may not be the subject of what you're working on now?!
In general though... are you clearing up the GDNet 'tag language'?? I'm perfectly used to it now but it always struck me as being a little bit of an odd mix of HTML and BBCode [lol]
Keep up the good work,
Jack