Word filter?



Say I want to filter out bad words the user may input. I already have a wordlist, but I'm wondering about filtering variations, such as replacing E with 3. I was thinking of keeping the wordlist in a vector of strings and doing an std::find_if whose predicate is a custom std::equal-style comparison that checks each character for possible variations. But this won't work for cases where a character can be replaced by a combination of characters, such as F with PH. So, other than using a regexp library, what's the easiest way to do this?

Assuming your word list doesn't already include all these variations, my take on it is this:

You define a function that takes as input a word and outputs a set of similarly illegal variants, e.g:

variants("fatty") -> {"fatty", "phatty", "f8tty", ...}

Exactly what algorithms you use to perform the transformations is up to you. My first thought, to keep this manageable, is to define a set of simple transform operators such as:

{f_to_ph, e_to_3, ate_to_8, ...}

Then you apply every permutation of every subset of those operators and add each resultant word to the output set.

Then it's just a matter of:

foreach word in wordlist:
    foreach variant in variants(word):
        if sentence.contains(variant):
            kill(user)

It's not going to be the fastest approach, but you asked for easy.
You could always generate the variants once and just use the resultant list of words, rather than re-generating each time; then it would be fast and easy at the expense of memory required for storing the bigger word list.
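The variant-generation step above can be sketched in C++; the transform table and function names here are illustrative, not from the post:

```cpp
// Illustrative sketch: expand a word by applying every substitution
// operator at every position and recursing on each result. The
// seen-set stops reprocessing, so cyclic operators still terminate.
#include <set>
#include <string>
#include <utility>
#include <vector>

// Each operator rewrites one spelling into another, e.g. "f" -> "ph".
using Transform = std::pair<std::string, std::string>;

void expand(const std::string& word,
            const std::vector<Transform>& ops,
            std::set<std::string>& out) {
    if (!out.insert(word).second) return;  // already generated
    for (const auto& [from, to] : ops) {
        for (std::size_t pos = word.find(from); pos != std::string::npos;
             pos = word.find(from, pos + 1)) {
            std::string variant = word;
            variant.replace(pos, from.size(), to);
            expand(variant, ops, out);
        }
    }
}

// expand("fatty", {{"f","ph"}, {"e","3"}, {"ate","8"}}, out)
// leaves out == {"fatty", "phatty"}.
```

Run once offline over the whole word list, as suggested, and you only pay the expansion cost at build time.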

One idea is to represent your blacklist as a context-free grammar.

You'd have production rules,

Bad_word_symbol --> Fuck_symbol
Bad_word_symbol --> Shit_symbol
...
Fuck_symbol --> F_sound_symbol, U_sound_symbol, ..., K_sound_symbol
...
F_sound_symbol --> 'f'
F_sound_symbol --> 'ph'
E_sound_symbol --> 'e'
E_sound_symbol --> '3'
...etc.

and then you could throw the CYK algorithm at it, maybe.

Another idea is to brute-force it as dmatter suggested, but to do this offline to generate a decision tree, which you'd bake into your executable.
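One way the baked-in "decision tree" could look, as a rough sketch: store the pre-expanded blacklist in a trie and walk it from every position of the input. The names here are illustrative:

```cpp
// Illustrative sketch: a trie over the pre-expanded blacklist acts as
// the decision tree; lookup is linear in the text for short words.
#include <map>
#include <memory>
#include <string>

struct TrieNode {
    bool terminal = false;  // a blacklisted word ends here
    std::map<char, std::unique_ptr<TrieNode>> children;
};

void insert(TrieNode& root, const std::string& word) {
    TrieNode* node = &root;
    for (char c : word) {
        auto& child = node->children[c];
        if (!child) child = std::make_unique<TrieNode>();
        node = child.get();
    }
    node->terminal = true;
}

// True if any blacklisted word occurs anywhere in the text.
bool contains_bad_word(const TrieNode& root, const std::string& text) {
    for (std::size_t start = 0; start < text.size(); ++start) {
        const TrieNode* node = &root;
        for (std::size_t i = start; i < text.size(); ++i) {
            auto it = node->children.find(text[i]);
            if (it == node->children.end()) break;
            node = it->second.get();
            if (node->terminal) return true;
        }
    }
    return false;
}
```

The trie itself could be generated as a constant table at build time rather than built at startup.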

[Edited by - Emergent on August 25, 2010 1:42:20 PM]

Search and replace all variations of a spelling into one "normal" form. Then you can simply compare words against an equally canonicalized blacklist.

A whitelist check of original and new words as a last step is recommended though.
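A minimal sketch of this canonicalization idea, with an illustrative substitution table (not from the post):

```cpp
// Illustrative sketch: rewrite known variant spellings back to one
// canonical form; compare against a blacklist canonicalized the same way.
#include <string>
#include <utility>
#include <vector>

std::string canonicalize(std::string word) {
    // Longer substitutions first, so "ph" is handled before single chars.
    static const std::vector<std::pair<std::string, std::string>> subs = {
        {"ph", "f"}, {"3", "e"}, {"@", "a"}, {"0", "o"}, {"1", "i"}};
    for (const auto& [variant, canonical] : subs) {
        std::size_t pos = 0;
        while ((pos = word.find(variant, pos)) != std::string::npos) {
            word.replace(pos, variant.size(), canonical);
            pos += canonical.size();
        }
    }
    return word;
}

// canonicalize("ph0ne") == "fone"; canonicalize("f@tty") == "fatty".
```

Note the mapping is deliberately many-to-one: both "phone" and "fone" canonicalize to "fone", which is exactly the ambiguity discussed below in the thread.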

I understand the production rules, but I'm concerned about efficiency. With production rules as above, I'd be comparing words against whole word variations instead of characters against one- or two-character variations.

There are a lot of production rules reasonable to consider. From a php script I saw on Stack Overflow:
    $pattern['a'] = '/[a]/'; $replace['a'] = '[a A @]';
    $pattern['b'] = '/[b]/'; $replace['b'] = '[b B I3 l3 i3]';
    $pattern['c'] = '/[c]/'; $replace['c'] = '(?:[c C (]|[k K])';
    $pattern['d'] = '/[d]/'; $replace['d'] = '[d D]';
    $pattern['e'] = '/[e]/'; $replace['e'] = '[e E 3]';
    $pattern['f'] = '/[f]/'; $replace['f'] = '(?:[f F]|[ph pH Ph PH])';
    $pattern['g'] = '/[g]/'; $replace['g'] = '[g G 6]';
    $pattern['h'] = '/[h]/'; $replace['h'] = '[h H]';
    $pattern['i'] = '/[i]/'; $replace['i'] = '[i I l ! 1]';
    $pattern['j'] = '/[j]/'; $replace['j'] = '[j J]';
    $pattern['k'] = '/[k]/'; $replace['k'] = '(?:[c C (]|[k K])';
    $pattern['l'] = '/[l]/'; $replace['l'] = '[l L 1 ! i]';
    $pattern['m'] = '/[m]/'; $replace['m'] = '[m M]';
    $pattern['n'] = '/[n]/'; $replace['n'] = '[n N]';
    $pattern['o'] = '/[o]/'; $replace['o'] = '[o O 0]';
    $pattern['p'] = '/[p]/'; $replace['p'] = '[p P]';
    $pattern['q'] = '/[q]/'; $replace['q'] = '[q Q 9]';
    $pattern['r'] = '/[r]/'; $replace['r'] = '[r R]';
    $pattern['s'] = '/[s]/'; $replace['s'] = '[s S $ 5]';
    $pattern['t'] = '/[t]/'; $replace['t'] = '[t T 7]';
    $pattern['u'] = '/[u]/'; $replace['u'] = '[u U v V]';
    $pattern['v'] = '/[v]/'; $replace['v'] = '[v V u U]';
    $pattern['w'] = '/[w]/'; $replace['w'] = '[w W vv VV]';
    $pattern['x'] = '/[x]/'; $replace['x'] = '[x X]';
    $pattern['y'] = '/[y]/'; $replace['y'] = '[y Y]';
    $pattern['z'] = '/[z]/'; $replace['z'] = '[z Z 2]';

I'm not sure how canonicalization would be done when it can be ambiguous. For example, the word phone would match ph as a variation of f, giving fone. But it could also actually be phone. So the mapping is not a function.
(By the way, can someone explain the ?:[]|[] stuff? I'm not familiar with this syntax)
The most efficient way I came up with is to compare character by character, but do a look-ahead of one character in the input word for those cases where two-character variations are possible.
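That look-ahead comparison could be sketched like this; the variant table and function names are hypothetical, and it checks a whole word (a name, say) rather than scanning a sentence:

```cpp
// Illustrative sketch: match the input against one blacklist word,
// peeking ahead when a multi-character spelling (e.g. "ph" for 'f')
// is possible for the current canonical character.
#include <map>
#include <string>
#include <vector>

// For each canonical character, the spellings that may stand in for it.
const std::map<char, std::vector<std::string>> kVariants = {
    {'f', {"f", "ph"}}, {'e', {"e", "3"}}, {'a', {"a", "@", "4"}}};

// Does input, starting at pos, spell out the whole blacklisted word?
bool matches(const std::string& word, const std::string& input,
             std::size_t wi = 0, std::size_t pos = 0) {
    if (wi == word.size()) return pos == input.size();
    std::vector<std::string> spellings = {std::string(1, word[wi])};
    if (auto it = kVariants.find(word[wi]); it != kVariants.end())
        spellings = it->second;
    for (const auto& s : spellings)
        if (input.compare(pos, s.size(), s) == 0 &&
            matches(word, input, wi + 1, pos + s.size()))
            return true;
    return false;
}

// matches("fatty", "ph@tty") -> true; matches("fatty", "party") -> false.
```

Because spellings can differ in length, a little backtracking (the recursion above) is the price for handling the two-character cases.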

Whitelist definitely won't work as I'm filtering names entered for top score list.

Also, while I have an English dirty word list, I need a French version as well... any pointers?

Quote:
 Original post by Emergent
 One idea is to represent your blacklist as a context-free grammar. You'd have production rules, Bad_word_symbol --> Fuck_symbol, Bad_word_symbol --> Shit_symbol, ... [...] ...etc., and then you could throw the CYK algorithm at it, maybe.

I love this idea.

I'm not familiar with the CYK algorithm, but given production rules like the above, couldn't you just use them to generate a parser for dirty words? Tokens would be alphanumeric characters: give it a word and it tells you whether it parses; if it parses, it's a dirty word.

Quote:
 Original post by jwezorek
 I love this idea. I'm not familiar with the CYK algorithm but given production rules like the above couldn't you just use them to generate a parser for dirty words? Tokens would be alphanumeric characters, give it a word and it tells you if it parses or not, if it parses then it's a dirty word.

Thanks. Yep, "If it parses, it's dirty," is what I was going for. It could even explain to a user why a name was denied by showing the parse tree!

One could also extend the CFG idea a little by using a stochastic context-free grammar and computing a probability that the given word is dirty. I don't think there's any good way to get empirical statistics for how common, say, substituting "3" for "e" is, but at the very least this would give some knobs to tweak, if the parser started being too creative with the substitutions it found.

It doesn't do the stochastic kind, but Bison apparently generates C code for parsers from production rules. I haven't used it myself, but it might be worth looking at.

All this said, to be honest I'd probably just use Konfusius' canonicalization method; it's nice and simple, and probably good enough (Who cares if it confuses "phone" and "fone" so long as it screens out dirty words without too many false positives?).

Quote:
 Original post by Prune
 (By the way, can someone explain the ?:[]|[] stuff? I'm not familiar with this syntax)

It's not a conditional, despite the question mark. The (?: opens a non-capturing group: it groups alternatives just like plain parentheses but doesn't create a numbered capture. The | inside is alternation, so the pattern matches whatever is on either side of it. For example, (?:[f F]|[ph pH Ph PH]) matches either something from the first bracketed class or something from the second.
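For a concrete illustration, the same syntax works in C++ with std::regex (the function name and pattern here are made up for the example):

```cpp
// Illustrative sketch: (?:...) groups alternatives without capturing,
// and | separates them. (?:[fF]|[pP][hH]) matches a single f/F or the
// two-letter spelling ph in any case mix.
#include <regex>
#include <string>

bool matches_f_sound(const std::string& s) {
    static const std::regex re("(?:[fF]|[pP][hH])one");
    return std::regex_match(s, re);
}

// matches_f_sound("phone") and matches_f_sound("fone") are both true.
```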

Ah, so the ?: isn't a ternary operator at all, it just marks the parentheses as non-capturing, and the | chooses between the two bracketed alternatives? I was reading the two bracketed parts as operands of a conditional.