Word filter?

Started by
12 comments, last by Moomin 13 years, 7 months ago
Say I want to filter from bad words the user may input. Now, I have a wordlist already, but I'm wondering about filtering variations, such as replacing E with 3. I was thinking of keeping the wordlist in a vector of strings and doing an std::find_if with the predicate a custom std::equal that checks each character for possible variations. But this won't work for cases where a character could be replaced by a combination of characters, such as F with PH. So, other than using a regexp library, what's the easiest way to do this?
"But who prays for Satan? Who, in eighteen centuries, has had the common humanity to pray for the one sinner that needed it most?" --Mark Twain

~~~~~~~~~~~~~~~Looking for a high-performance, easy to use, and lightweight math library? http://www.cmldev.net/ (note: I'm not associated with that project; just a user)
Advertisement
Assuming your word list doesn't already include all these variations then my take on it is this:

You define a function that takes as input a word and outputs a set of similarly illegal variants, e.g:

variants("fatty") -> {"fatty", "phatty", "f8tty", ...}

Exactly what algorithms you use to perform the transformations is up to you. My first thought, to keep this manageable, is to define a set of simple transform operators such as:

{f_to_ph, e_to_3, ate_to_8, ...}

Then you apply every permutation of every subset of those operators and add each resultant word to the output set.

Then it's just a matter of:

foreach word in wordlist:    foreach variant in variants(word):        if sentence.contains(variant):            kill(user)


It's not going to be the fastest approach, but you asked for easy.
You could always generate the variants once and just use the resultant list of words, rather than re-generating each time; then it would be fast and easy at the expense of memory required for storing the bigger word list.
One idea is to represent your blacklist as a Context free grammar.

You'd have production rules,

Bad_word_symbol --> Fuck_symbolBad_word_symbol --> Shit_symbol...Fuck_symbol --> F_sound_symbol, U_sound_symbol,...,K_sound_symbol...F_sound_symbol --> 'f'F_sound_symbol --> 'ph'E_sound_symbol --> 'e'E_sound_symbol --> '3'...etc.


and then you could throw the CYK algorithm at it, maybe.

Another idea is to brute-force it as dmatter suggested, but to do this offline to generate a decision tree, which you'd bake into your executable.

[Edited by - Emergent on August 25, 2010 1:42:20 PM]
You could canonicalize your texts.
Search and replace all variations of a spelling into one "normal" form. Now you can simply compare words to a equally canonicalized blacklist.

A whitelist check of original and new words as a last step is recommended though.
I understand the production rules, but I'm interested in efficiency. If I use production rules as above, I'll be comparing words to word variations instead of characters to one or two character variations.

There are a lot of production rules reasonable to consider. From a php script I saw on Stack Overflow:
    $pattern['a'] = '/[a]/'; $replace['a'] = '[a A @]';    $pattern['b'] = '//'; $replace['b'] = '';<br>    $pattern['c'] = '/[c]/'; $replace['c'] = '(?:[c C (]|[k K])';<br>    $pattern['d'] = '/[d]/'; $replace['d'] = '[d D]';<br>    $pattern['e'] = '/[e]/'; $replace['e'] = '[e E <span class="cpp-number">3</span>]';<br>    $pattern['f'] = '/[f]/'; $replace['f'] = '(?:[f F]|[ph pH Ph PH])';<br>    $pattern['g'] = '/[g]/'; $replace['g'] = '[g G <span class="cpp-number">6</span>]';<br>    $pattern['h'] = '/[h]/'; $replace['h'] = '[h H]';<br>    $pattern['i'] = '/<span style="font-weight:bold;">/'; $replace['i'] = '';<br>    $pattern['j'] = '/[j]/'; $replace['j'] = '[j J]';<br>    $pattern['k'] = '/[k]/'; $replace['k'] = '(?:[c C (]|[k K])';<br>    $pattern['l'] = '/[l]/'; $replace['l'] = '[l L <span class="cpp-number">1</span> ! i]';<br>    $pattern['m'] = '/[m]/'; $replace['m'] = '[m M]';<br>    $pattern['n'] = '/[n]/'; $replace['n'] = '[n N]';<br>    $pattern['o'] = '/[o]/'; $replace['o'] = '[o O <span class="cpp-number">0</span>]';<br>    $pattern['p'] = '/<p>/'; $replace['p'] = '[p P]';<br>    $pattern['q'] = '/[q]/'; $replace['q'] = '[q Q <span class="cpp-number">9</span>]';<br>    $pattern['r'] = '/[r]/'; $replace['r'] = '[r R]';<br>    $pattern['s'] = '//'; $replace['s'] = '';<br>    $pattern['t'] = '/[t]/'; $replace['t'] = '[t T <span class="cpp-number">7</span>]';<br>    $pattern['u'] = '/<span style="text-decoration:underline;">/'; $replace['u'] = '';<br>    $pattern['v'] = '/[v]/'; $replace['v'] = '[v V u U]';<br>    $pattern['w'] = '/[w]/'; $replace['w'] = '[w W vv VV]';<br>    $pattern['x'] = '/[x]/'; $replace['x'] = '[x X]';<br>    $pattern['y'] = '/[y]/'; $replace['y'] = '[y Y]';<br>    $pattern['z'] = '/[z]/'; $replace['z'] = '[z Z <span class="cpp-number">2</span>]';<br><br><br></pre></div><!–ENDSCRIPT–><br>I'm not sure how canonicalization would be done when it can be ambiguous.  For example, the word phone would match ph as a variation of f, giving fone.  But it could also actually be phone.  So the mapping is not a function.<br>(By the way, can someone explain the ?:[]|[] stuff? I'm not familiar with this syntax)<br>The most efficient way I came up with is to compare character by character but do a look-ahead of &#111;ne in the input word for those cases where 2-character variations are possible.<br><br>Whitelist definitely won't work as I'm filtering names entered for top score list.<br><br>Also, while I have an English dirty word list, I need a French version as well… any pointers?
"But who prays for Satan? Who, in eighteen centuries, has had the common humanity to pray for the one sinner that needed it most?" --Mark Twain

~~~~~~~~~~~~~~~Looking for a high-performance, easy to use, and lightweight math library? http://www.cmldev.net/ (note: I'm not associated with that project; just a user)
Quote:Original post by Emergent
One idea is to represent your blacklist as a Context free grammar.

You'd have production rules,

Bad_word_symbol --> Fuck_symbolBad_word_symbol --> Shit_symbol...[...]...etc.


and then you could throw the CYK algorithm at it, maybe.


I love this idea.

I'm not familiar with the CYK algorithm but given production rules like the above couldn't you just use them to generate a parser for dirty words? Tokens would be alphanumeric characters, give it a word and it tells you if it parses or not, if it parses then it's a dirty word.
Quote:Original post by jwezorek
I love this idea.

I'm not familiar with the CYK algorithm but given production rules like the above couldn't you just use them to generate a parser for dirty words? Tokens would be alphanumeric characters, give it a word and it tells you if it parses or not, if it parses then it's a dirty word.


Thanks. Yep, "If it parses, it's dirty," is what I was going for. It could even explain to a user why a name was denied by showing the parse tree!

One could also extend the CFG idea a little by using a stochastic context-free grammar and computing a probability that the given word is dirty. I don't think there's any good way to get empirical statistics for how common, say, substituting "3" for "e" is, but at the very least this would give some knobs to tweak, if the parser started being too creative with the substitutions it found.

It doesn't do the stochastic kind, but Bison apparently generates C code for parsers from production rules. I haven't used it myself, but it might be worth looking at.

All this said, to be honest I'd probably just use Konfusius' canonicalization method; it's nice and simple, and probably good enough (Who cares if it confuses "phone" and "fone" so long as it screens out dirty words without too many false positives?).
Quote:Original post by Prune
(By the way, can someone explain the ?:[]|[] stuff? I'm not familiar with this syntax)

Normally is like an if else statement. If the statement before the ? is true, the first [] is used. If it is false, the part after the | is used.
Link

So what is the "statement before the ?"... I see two operands given to a ternary operator
"But who prays for Satan? Who, in eighteen centuries, has had the common humanity to pray for the one sinner that needed it most?" --Mark Twain

~~~~~~~~~~~~~~~Looking for a high-performance, easy to use, and lightweight math library? http://www.cmldev.net/ (note: I'm not associated with that project; just a user)
I'm assuming they use the code that appears before it when it is called for the statement before the ?. Although I'm not entirely sure on that. Do you have the rest of that code.

This topic is closed to new replies.

Advertisement