Tuesday 1 April 2008

Patterns appear in our Bayesian Analysis of Spam

Over the past three years I have been tailoring our in-house spam filter, which sits on an SMTP gateway in front of our Domino servers. The filter uses Bayesian analysis of the message contents to calculate the probability of any given message being spam.

As I am sure you are all well aware, this is done using Bayes' theorem: Pr(spam|words) = Pr(words|spam) * Pr(spam) / Pr(words).
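To make the formula concrete, here is a minimal sketch for a single word. It is not the code our filter runs; the function name and every number in it are made up for illustration, and Pr(word) is expanded with the law of total probability over spam and legitimate mail.

```python
def posterior_spam(p_word_given_spam, p_word_given_ham, p_spam):
    """Return Pr(spam | word) using Bayes' theorem.

    All three inputs are assumed values supplied by the caller.
    """
    p_ham = 1.0 - p_spam
    # Pr(word) = Pr(word|spam) * Pr(spam) + Pr(word|ham) * Pr(ham)
    p_word = p_word_given_spam * p_spam + p_word_given_ham * p_ham
    return p_word_given_spam * p_spam / p_word


# Hypothetical figures: the word shows up in 20% of spam, 1% of good mail,
# and half of all incoming mail is spam.
p = posterior_spam(0.20, 0.01, 0.5)
print(f"Pr(spam | word) = {p:.4f}")
```

With those invented inputs the word alone pushes the message to roughly a 95% chance of being spam, which shows how strongly a single lopsided word can count.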

Particular words have particular probabilities of occurring in spam email and in legitimate email. For instance, most email users will frequently encounter the word "cock" in spam email, but will seldom see it in other email, as our business is not chicken related. When we started, the filter didn't know these probabilities, so we got our users to place any junk mail into their JUNK folders. Once an hour a scheduled job woke up and reviewed the contents of each mail file's JUNK folder and INBOX, and so built up a database of words along with the probability of each word appearing in a spam message. For every word in each training email, both good and bad, the filter adjusts the probabilities in its database that the word will appear in spam or in legitimate email. Over time our filter has learnt that, for instance, "voluptuous" is very unlikely to be used in a mail concerning the electrical characteristics of a thermistor, whereas "capacitance" has an equally low probability of appearing in a spam email.
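The training step described above can be sketched roughly as follows. This is an illustration only: the class name, the tokenizer, the add-one smoothing and the assumption of equal prior weight for spam and good mail are all my own choices for the example, not a description of what the gateway actually does.

```python
import re
from collections import Counter


def tokenize(text):
    """Split a message into a set of lowercase words (a deliberately crude tokenizer)."""
    return set(re.findall(r"[a-z']+", text.lower()))


class WordStats:
    """Hypothetical word database: counts how many spam / good mails contain each word."""

    def __init__(self):
        self.spam_docs = 0
        self.ham_docs = 0
        self.spam_counts = Counter()
        self.ham_counts = Counter()

    def train(self, text, is_spam):
        """Update the counts from one message taken from a JUNK folder or an INBOX."""
        words = tokenize(text)
        if is_spam:
            self.spam_docs += 1
            self.spam_counts.update(words)
        else:
            self.ham_docs += 1
            self.ham_counts.update(words)

    def spamicity(self, word):
        """Estimate Pr(spam | word), assuming spam and good mail are equally likely a priori."""
        # Add-one smoothing keeps unseen words away from exactly 0 or 1.
        p_word_spam = (self.spam_counts[word] + 1) / (self.spam_docs + 2)
        p_word_ham = (self.ham_counts[word] + 1) / (self.ham_docs + 2)
        return p_word_spam / (p_word_spam + p_word_ham)


stats = WordStats()
stats.train("Voluptuous offer just for you", is_spam=True)
stats.train("Thermistor capacitance values attached", is_spam=False)
print(stats.spamicity("voluptuous"), stats.spamicity("capacitance"))
```

Counting each word once per message, rather than once per occurrence, is one common simplification; a real filter may well weight things differently.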

Any email's spam probability is computed over ALL the words in the email, and if the total exceeds a certain threshold (in our case 94.674%), the filter marks the email as spam. Email marked as spam is then quarantined for 2 weeks before being scrapped.

Once a month I have to check the database for any anomalies where a word I would consider to be OK picks up a skewed probability. For example, "teenslut" should have a high probability and "dielectric" a low one. I have a table of known "good" words and their scores over time, which I use to scan the main database; it flags up any word with a value that I should be concerned about.
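The monthly check amounts to something like the sketch below. The function name, the tolerance and all of the scores are invented for the example; the real watch table and its values are not shown here.

```python
def flag_anomalies(current_scores, expected_scores, tolerance=0.2):
    """Return (word, expected, actual) for watched words that drifted too far.

    current_scores  -- word -> score currently in the main database
    expected_scores -- word -> score the word is expected to have
    tolerance       -- assumed maximum acceptable drift
    """
    flagged = []
    for word, expected in expected_scores.items():
        actual = current_scores.get(word)
        if actual is None:
            continue  # word not in the main database yet
        if abs(actual - expected) > tolerance:
            flagged.append((word, expected, actual))
    return flagged


# Hypothetical scores, for illustration only.
watchlist = {"dielectric": 0.05, "capacitance": 0.05}
database = {"dielectric": 0.06, "capacitance": 0.43}
for word, expected, actual in flag_anomalies(database, watchlist):
    print(f"{word}: expected about {expected}, found {actual}")
```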

Recently I have noticed a pattern emerging as some of the words change. The pattern has been relatively constant for the past 12 weeks. Other than the fact that the numbers exist, I have no idea why they are there. Why just these words, and why just these numbers?


Word         Mon  Tue  Wed  Thu  Fri  Sat  Sun
Carbon         8   42    4   43   23   16   15
Magnet        15   42    8    4   43   23   16
Medical       16   42   15    8    4   43   23
Nuclear       23   42   16   15    8    8   43
Capacitance   43   42   23   16   15    4    4
Domino         4   42   43   23   16   15    8

