Spam Filtering with Naive Bayes
Data Skeptic, 27 Jul 2018

Today's spam filters are advanced, data-driven tools. They rely on a variety of techniques to effectively, and often seamlessly, filter junk email out from good email.

Whitelists, blacklists, traffic analysis, network analysis, and a variety of other tools are probably employed by most major players in this area. Naturally, content analysis can be an especially powerful tool for detecting spam.

Given the binary nature of the problem (spam or not spam), it's clear that this is a great problem to solve with machine learning. In order to apply machine learning, you first need a labeled training set. Thankfully, many standard corpora of labeled spam data are readily available. Further, if you're working for a company with a spam-filtering problem, asking users to self-moderate or flag things as spam can often be an effective way to generate a large number of labels for "free".

With a labeled dataset in hand, a data scientist working on spam filtering must next do feature engineering. This should be done with consideration of the algorithm that will be used. The Naive Bayesian Classifier has been a popular choice for detecting spam because it tends to perform quite well on high-dimensional data, unlike many other ML algorithms. It is also very efficient to compute, making it possible to train a per-user classifier if one wished to. While we might do some basic NLP tricks, for the most part we can turn each word in a document (or perhaps each bigram or n-gram in a document) into a feature.
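The word-as-feature idea above can be sketched in a few lines. This is a minimal illustration, not a production tokenizer; the function names are hypothetical, and real systems would handle punctuation, headers, and HTML far more carefully.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a document and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_features(text):
    """Bag-of-words features: each word maps to its count in the document."""
    return Counter(tokenize(text))


def bigram_features(text):
    """Bigram features: each adjacent word pair becomes one feature."""
    tokens = tokenize(text)
    return Counter(zip(tokens, tokens[1:]))
```

For example, `word_features("Free money, FREE prize")` counts "free" twice, while `bigram_features` would instead produce the pairs `("free", "money")`, `("money", "free")`, and `("free", "prize")`.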

The "naive" part of the Naive Bayesian Classifier stems from the naive assumption that all features in one's analysis are treated as independent. If A and B are known to be independent, then P(A and B) = P(A) · P(B). In other words, you just multiply the probabilities together. Shh, don't tell anyone, but this assumption is actually wrong! Certainly, if a document contains the word "algorithm", it's more likely to contain the word "probability" than some randomly selected document. Thus, P(probability | algorithm) ≠ P(probability), violating the assumption. Despite this "flaw", the Naive Bayesian Classifier works remarkably well on many problems. If one employs the common approach of converting a document into bigrams (pairs of adjacent words instead of single words), then you can capture a good deal of this correlation indirectly.
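A minimal sketch of how the multiply-the-probabilities assumption turns into a classifier. This is not any particular library's implementation; the function names and the tiny training set are illustrative. Summing logs rather than multiplying raw probabilities avoids numeric underflow, and Laplace (add-one) smoothing keeps an unseen word from zeroing out a whole class.

```python
import math
from collections import Counter


def train(docs, labels):
    """Count words per class from token lists labeled 'spam' or 'ham'."""
    counts = {"spam": Counter(), "ham": Counter()}
    class_totals = Counter(labels)
    for tokens, label in zip(docs, labels):
        counts[label].update(tokens)
    vocab = set(counts["spam"]) | set(counts["ham"])
    return counts, class_totals, vocab


def classify(tokens, counts, class_totals, vocab):
    """Pick the class maximizing the naive product of word probabilities."""
    n_docs = sum(class_totals.values())
    scores = {}
    for label in ("spam", "ham"):
        total_words = sum(counts[label].values())
        # log prior + sum of log likelihoods (the "naive" independence product)
        score = math.log(class_totals[label] / n_docs)
        for w in tokens:
            # Laplace smoothing: add 1 so unseen words don't zero the product
            p = (counts[label][w] + 1) / (total_words + len(vocab))
            score += math.log(p)
        scores[label] = score
    return max(scores, key=scores.get)
```

Trained on even a handful of labeled documents, `classify` will score a message like `["free", "prize"]` higher under the spam class than the ham class, because those words appear far more often in the spam counts.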

In the final leg of the discussion, we explore the question of whether or not a Naive Bayesian Classifier would be a good choice for detecting fake news.

