Word Filtering - The "Interceptor"

I've started looking into modifying my older KWIC keyword Extractor / Analyzer so that it can better trap spam, bad words, etc. Basically, we have a need for this ourselves on our own Elgg-based web sites. Most word filtering algorithms, including those based on the time-honored Bayesian techniques, seem to fall off a cliff when faced with a few simple tricks that bypass such filtering.
e.g.
Let us say we block for a bad word "badword"
users, even kids.. can figure out they will be blocked,
so.. they do not use "badword" anymore
they switch to variations
e.g.
"b.a.d.w.o.r.d"
OR
"b-a-d-w-o-r-d"
OR
"b.a.d.d.w.o.r.d.d"

A human can still read and understand this text, but our funky word filtering algorithms get a failing grade..!!! ;-)

I did those earlier KWIC routines to extract and classify Elgg's Google group posts so that I could create a nicely indexed archive which I could then search with ease. For that I had used a combination of StopWords + KeyWords -- so that it would ignore the Stopwords, while paying particular attention to the Keywords ( which had special meaning to me, even if these were on the Stopwords list )

So, anyway, the first problem to solve will be :=
How to match the dictionary "badword" with a misspelt "b.a.d.w.o.r.d" and similar...
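
One way this matching could go, as a rough sketch only: lower-case the token, throw away everything that is not a letter or digit, then collapse repeated letters so that "b.a.d.d.w.o.r.d.d" lands on the same canonical form as "badword". The function names `normalize_token` and `is_bad_word` are made up for illustration and are not part of the existing KWIC code.

```php
<?php
// Hypothetical helper: reduce a token to a canonical form before lookup.
function normalize_token(string $token): string
{
    $t = strtolower($token);
    // Drop everything that is not a letter or digit (dots, dashes, spaces...).
    $t = preg_replace('/[^a-z0-9]+/', '', $t);
    // Collapse runs of the same character: "baddwordd" -> "badword".
    return preg_replace('/(.)\1+/', '$1', $t);
}

// Hypothetical helper: true if the token normalizes to a dictionary entry.
function is_bad_word(string $token, array $badWords): bool
{
    // Normalize the dictionary the same way so both sides are comparable.
    $canonical = array_map('normalize_token', $badWords);
    return in_array(normalize_token($token), $canonical, true);
}

$badWords = ['badword'];
foreach (['b.a.d.w.o.r.d', 'b-a-d-w-o-r-d', 'b.a.d.d.w.o.r.d.d', 'goodword'] as $sample) {
    echo $sample, ' => ', is_bad_word($sample, $badWords) ? 'blocked' : 'ok', PHP_EOL;
}
```

The trade-off is that collapsing repeats also merges legitimate words ("good" and "god" end up identical), so a real filter would probably need a whitelist or a less aggressive rule.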

The end result I hope to achieve is word filtering for UserID, UserName, Messages content ( Carlos' "Messages Interceptor" ), and any other *content* posting where there might be a need for *filtering* and *safety*.

I'm not particularly looking for coding help here, however if you're reading this and have experience with such techniques, feel welcome to post and discuss.

 

 

  • @MJ

    I was writing this while you were writing that ;-)
    I'm copying your notes into here

    = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =

    Imposters, Name filtering
    [Malaga Jack]

    There is a desperate need for some kind of name filtering system within the Elgg administration process, covering both the login name and the display name.

    I have just seen someone on this community with the absurd login name and display name of ElggAdmin. This person could do damage by throwing their weight around with unsuspecting newbies and telling them what they can and cannot do.

    If I had a site called, let's say, goofbucket, I would not want someone registering with the name goofbucketAdmin, or using it as their display name.
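
A name check along these lines could be sketched as follows. This is only an illustration: `looks_like_impostor` and the list of reserved roles are assumptions, not anything Elgg ships with.

```php
<?php
// Hypothetical check: does a login or display name impersonate site staff?
function looks_like_impostor(string $name, string $siteName, array $reservedRoles): bool
{
    // Lower-case and strip separators so "Goof_Bucket-Admin" is caught too.
    $clean = preg_replace('/[^a-z0-9]+/', '', strtolower($name));
    $site  = preg_replace('/[^a-z0-9]+/', '', strtolower($siteName));

    foreach ($reservedRoles as $role) {
        $role = strtolower($role);
        // The bare role name on its own is always rejected.
        if ($clean === $role) {
            return true;
        }
        // Otherwise reject only when site name and role appear together.
        if ($site !== '' && strpos($clean, $site) !== false && strpos($clean, $role) !== false) {
            return true;
        }
    }
    return false;
}

$roles = ['admin', 'moderator']; // assumed list, adjust per site
var_dump(looks_like_impostor('ElggAdmin', 'elgg', $roles));
var_dump(looks_like_impostor('goofbucketAdmin', 'goofbucket', $roles));
var_dump(looks_like_impostor('badminton_fan', 'goofbucket', $roles));
```

Requiring the site name alongside the role keeps names like "badminton_fan" from being rejected just because they contain "admin". It would not catch digit-for-letter swaps such as "Adm1n".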

     

  • GrabWords (strtolower($vP));
    . . .
    function GrabWords ($Sentence)
    {
    global   
            $StopWords,
            $KeyWords,
            $WordTable,
            $CountTable
            ;

    this is the function which currently "grabs" words - I need to modify / extend it to be able to grab essentially fudged, obfuscated and permuted versions of the real bad words...

    Oops, as I'm reading this code.. which I haven't looked at for some months, I realize I actually coded some smarty filtering of non-word characters out before trying to recognize the word.. might be less coding than I was thinking ;-)
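
For anyone following along, here is a rough stand-in for what such a word grabber might look like. It is not the actual GrabWords code (which uses the globals shown above). The function name `grab_words_sketch` and its parameters are hypothetical, and it only illustrates stripping non-word characters before recognizing a word, plus the rule that Keywords win over Stopwords.

```php
<?php
// Hypothetical stand-in for GrabWords: returns [word => count] for one sentence.
function grab_words_sketch(string $sentence, array $stopWords, array $keyWords): array
{
    $counts = [];
    // Split on whitespace first, so "b.a.d.w.o.r.d" stays one token.
    $tokens = preg_split('/\s+/', strtolower($sentence), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($tokens as $token) {
        // Filter non-word characters out before trying to recognize the word.
        $word = preg_replace('/[^a-z0-9]+/', '', $token);
        if ($word === '') {
            continue;
        }
        // Keywords always count, even when they are also on the stopword list.
        $isKey = in_array($word, $keyWords, true);
        if (!$isKey && in_array($word, $stopWords, true)) {
            continue;
        }
        $counts[$word] = ($counts[$word] ?? 0) + 1;
    }
    return $counts;
}

print_r(grab_words_sketch(
    'The b.a.d.w.o.r.d is not the B-A-D-W-O-R-D.',
    ['the', 'is', 'not'],   // stopwords
    ['not']                 // keywords (overrides the stopword entry)
));
```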

     

  • Great minds think alike

    Or

    is it fools seldom differ

    I always get those 2 mixed up

  • I'm not sure about your hangups ;-)
    but I was once told by an 11-year-old that
    I was
    " like a 5 year old kid...
    with a car and a credit card..." ;-)

  • INVITATION FOR COMMUNITY INPUT:

    Sometime soon, as I get that KWIC routine code patched up to do more intelligent "intercepting",
    I would like to have some **very** realistic test data with *real* bad language so that I can test my code against it.

    It would be nice to see some (or many) from the Elgg community step up and offer help.
    The actual test data :=

    You'll need to upload your test data to a URL, e.g. YourDomain.com/testdata.txt
    and make that directory public read

    Make sure you put a comment in the header
    that the file is a test file for security testing in case your ISP kills you
    private message the URL to me...

  • Will do

    Let's run it on the other one. I'll PM you. It's getting close to 4am here.

  • Do you want a list of bad words in different forms uploaded?

  • @Dhrup and Woodward..

    Yeah.. I can contribute with bad words as well... I got many and I use'em all the time..LOL

  • Just the actual full text with supposedly bad content - e.g. a sample message with your choice of intelligent "F" words, etc.
    NB: Try to make your sample message as difficult and funky to break as you can - so that maybe I can't break it.
    Now..
    [ PLZ do not be offended...]
    I will post a realistic example
    ------------------------------------------------------
    I have been re-thinking my keyword algorithms for use with M.I.
    e.g. a smart kid will type --
    f.u.c.k.y.o.u.
    i.w.i.l.l.s.e.l.l.y.o.u.g.r.a.s.s.f.o.r.5.0.b.u.c.k.s.
    ==> this *will* bypass a straightforward keyword-driven BigBrother type of blocker.
    The code I've got so far *is* word driven, but I will look at extending it to filter out noise characters and then, most likely, do a preg_match to hunt for "bad" words...
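
A possible shape for that preg_match approach, sketched under assumptions: instead of cleaning the message first, build a pattern per bad word that tolerates noise characters and repeated letters between the real letters. The helper names `noise_tolerant_pattern` and `find_bad_words` are invented for this example, and "grass" stands in for whatever is on the real dictionary.

```php
<?php
// Hypothetical helper: build a regex that matches $word even when noise
// characters or repeated letters are inserted between its letters.
function noise_tolerant_pattern(string $word): string
{
    $parts = [];
    foreach (str_split(strtolower($word)) as $ch) {
        // One or more copies of the letter, each optionally followed by noise.
        $parts[] = '(?:' . preg_quote($ch, '/') . '[^a-z0-9]*)+';
    }
    return '/' . implode('', $parts) . '/i';
}

// Hypothetical helper: list the dictionary words found anywhere in $message.
function find_bad_words(string $message, array $badWords): array
{
    $hits = [];
    foreach ($badWords as $word) {
        if ($word === '') {
            continue; // an empty entry would produce an invalid pattern
        }
        if (preg_match(noise_tolerant_pattern($word), $message) === 1) {
            $hits[] = $word;
        }
    }
    return $hits;
}

$message = 'i.w.i.l.l.s.e.l.l.y.o.u.g.r.a.s.s.f.o.r.5.0.b.u.c.k.s.';
print_r(find_bad_words($message, ['grass', 'badword']));
```

Because this matches anywhere in the text, it will also fire inside innocent words ("grasshopper") and across spaces ("bad word"), so it would likely need word-boundary rules or a whitelist before being used on real messages.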