Word Filtering - The "Interceptor"

I've started looking into modifying my older KWIC keyword Extractor / Analyzer so that it can better trap spam, bad words, etc. Basically, we have a need for this ourselves on our own Elgg-based web sites. Most word filtering algorithms, including those based on the time-honored Bayesian techniques, would seem to fall off the cliff when faced with a few simple tricks that bypass such filtering.
e.g.
Let us say we block the bad word "badword".
Users, even kids, can figure out that they will be blocked,
so they do not use "badword" anymore;
they switch to variations,
e.g.
"b.a.d.w.o.r.d"
OR
"b-a-d-w-o-r-d"
OR
"b.a.d.d.w.o.r.d.d"

A human can still read this text and understand it, but our funky word filtering algorithms get a failing grade..!!! ;-)

I wrote those earlier KWIC routines to extract and classify Elgg's Google group posts so that I could create a nicely indexed archive which I could then search with ease. For that I had used a combination of StopWords + KeyWords, so that it would ignore the Stopwords while paying particular attention to the Keywords (which had special meaning to me, even if these were on the Stopwords list).
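A minimal sketch of that StopWords + KeyWords idea, not the original KWIC code: the word lists and the `extract_terms` name are made up for illustration. A token is kept if it is on the Keywords list, even when it is also on the Stopwords list; otherwise it is kept only if it is not a Stopword.

```python
import re
from collections import Counter

# Hypothetical lists, purely illustrative; the real routines used their own.
STOPWORDS = {"the", "a", "is", "to", "and", "of", "it", "on"}
KEYWORDS = {"plugin", "on"}  # "on" is also a stopword, but the keyword list wins

def extract_terms(text: str) -> Counter:
    """Count the terms worth indexing: keywords always, stopwords never."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(t for t in tokens if t in KEYWORDS or t not in STOPWORDS)

counts = extract_terms("The plugin is on the river and the plugin is on")
print(counts)
```

Checking the Keywords list first is what lets a term with special meaning survive even though it would normally be thrown away as noise.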

So, anyway, the first problem to solve will be :=
How to match the dictionary "badword" with a misspelt "b.a.d.w.o.r.d" and similar...
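One possible starting point, offered as a rough sketch rather than the Interceptor's actual method: normalize both the dictionary word and the incoming token before comparing them. The `squeeze` and `is_blocked` names and the one-word list are assumptions made for this example.

```python
import re

# Hypothetical word list; a real filter would load its own dictionary.
BAD_WORDS = {"badword"}

def squeeze(text: str) -> str:
    """Lowercase, drop every non-letter, then collapse runs of a repeated letter."""
    letters = re.sub(r"[^a-z]", "", text.lower())
    return re.sub(r"(.)\1+", r"\1", letters)

# Squeeze the dictionary the same way so both sides are comparable.
SQUEEZED_BAD_WORDS = {squeeze(word) for word in BAD_WORDS}

def is_blocked(token: str) -> bool:
    """True if the squeezed token equals a squeezed dictionary word."""
    return squeeze(token) in SQUEEZED_BAD_WORDS

for sample in ["badword", "b.a.d.w.o.r.d", "b-a-d-w-o-r-d", "b.a.d.d.w.o.r.d.d", "hello"]:
    print(sample, is_blocked(sample))
```

Collapsing repeated letters is a blunt instrument: it also turns an innocent word like "good" into "god", so a real filter would need to weigh false positives against the misspellings it catches. This sketch also works on one token at a time and does not handle letters spread across spaces in a longer message.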

The end result I hope to achieve is word filtering for UserID, UserName, Messages content (Carlos' "Messages Interceptor"), and any other content posting where there might be a need for filtering and safety.

I'm not particularly looking for coding help here; however, if you're reading this and have experience with such techniques, feel welcome to post and discuss.

  • Do I gotta cry and beg on my knees...?

    I am really looking for some help here from those of you who will gain some commercially oriented power for your web sites with this technology. OR maybe we "GoofGang" are the only ones staying awake and coding neat shit for y'all?

    I had really expected all those interested in the (200% high-octane) M.I. PlugIn to be barraging me with their test data ;-)

    INVITATION FOR COMMUNITY INPUT:
    Sometime soon, as I get that KWIC routine code patched up to do more intelligent "intercepting", I would like to have some very realistic test data with real bad language so that I can test my code against it.
    ****************************************************************************************
    It would be nice to see some / many from the Elgg community step up and offer help.
    The actual test data :=
    ****************************************************************************************

     

  • sheesh i need to read and digest first

  • something i have noticed often is that when people try to get around anything that might be governing their input, not only do they separate their words, but they go phonetic on you...

    but if i were trying to do what you are after, matching bad words when people are using separators, i would just key in on the separator... in other words, check for every other character being the same... so b.a.d.w.o.r.d raises a flag simply because it has a period in between each letter and the letters that are left match a word on the list

    and if you were checking like this, it wouldn't matter so much if it were using b_a_d_w_o_r_d or even b a d w o r d
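The "every other character is the same" check from the reply above could be sketched roughly like this. The function names and the one-word list are hypothetical, and this is only one reading of the suggestion.

```python
# Hypothetical word list, purely for illustration.
BAD_WORDS = {"badword"}

def letters_between_separators(token: str):
    """Return the letters if every other character is one and the same
    non-alphanumeric separator (l.i.k.e t.h.i.s), otherwise None."""
    if len(token) < 3 or len(token) % 2 == 0:
        return None
    letters = token[0::2]          # characters at positions 0, 2, 4, ...
    separators = set(token[1::2])  # characters at positions 1, 3, 5, ...
    if len(separators) != 1:
        return None
    separator = separators.pop()
    if separator.isalnum() or not letters.isalpha():
        return None
    return letters.lower()

def is_flagged(token: str) -> bool:
    """Flag the token if its de-separated letters match the list."""
    letters = letters_between_separators(token)
    return letters is not None and letters in BAD_WORDS

for sample in ["b.a.d.w.o.r.d", "b_a_d_w_o_r_d", "b a d w o r d", "b.a-d.w.o.r.d", "banana"]:
    print(repr(sample), is_flagged(sample))
```

Because the check only cares that the separator is consistent, it does not need a list of separators up front. For the space-separated case the whole string has to be passed in as one token, before any splitting on whitespace. It will not catch mixed separators or doubled letters such as "b.a.d.d.w.o.r.d.d".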

  • YEP ;-)

    Bret Profitt at ElggCamp2009 Boston mentioned this straight off the bat!!! Kids very quickly figured out how to put in that fullstop to obfuscate the real bad language... ;-(

    So.. now... am trying to figure out how to extend that mickey mouse KWIC parser code to cater for smarty-arse mis-spelling that leads to linguistic "phishing" which bypasses standard traps.. Bit of a headache.. but I reckon I kin do the fixes.. eventually....

  • so you want a file with "bad" language with separators, but like @Zak says, why bother defining the separator? I am happy to make the file, but what is the difference between S_O_B and S O B compared with SOB? Is this what you need in the file?

  • how about allowing parental influence in a similar way to how reported content works, but in this case each additional badword can be automatically added to your database, eventually leading to the possibility of outsourcing the database for more $ because it becomes so comprehensive?

  • how about, "dude need ya to drop a dime at 7pm tonight, the rocks are low man make sure you get here"  "smokie will bring foil full for 65, see you then"  What will you filter here?

  • @Goofers
    You got right idea..
    Just try to beat my code...

     

    @Zaks

    re: "parental influence" ? takes us to a somewhat different arena.. If parents want to influence something like the "Interceptor" algorithms, they will have to talk privately with a web site's Admin(s).

    The major issue comes to a "confluence" because even techniques based on Bayesian algorithms (whole word / preg match based) do not detect intentional and accidental mis-spellings ==> THAT is the loop-hole that my "smarty-ass" little brother Carlos ;-) and I are facing and trying to conquer..

     

  • @...smokie will bring foil full for 65, see you then"  What will you filter here?..

    Answer:

    I will filter the water in the bong!

    ROFLMAO!!!

  • ok I understand, however I have a question. what will your output be, and do you want character limits like, say, the wire? a couple of dozen, or long narratives like a msg?