Spam Filtering with Naive Bayes
Data Skeptic · 27 Jul 2018

Today's spam filters are advanced data-driven tools. They rely on a variety of techniques to effectively, and often seamlessly, filter junk email out of good email.

Whitelists, blacklists, traffic analysis, network analysis, and a variety of other tools are probably employed by most major players in this area. Naturally, content analysis can be an especially powerful tool for detecting spam.

Given the binary nature of the problem (spam or not spam), it's clear that this is a great problem to solve with machine learning. In order to apply machine learning, you first need a labeled training set. Thankfully, many standard corpora of labeled spam data are readily available. Further, if you're working for a company with a spam filtering problem, asking users to self-moderate or flag things as spam can often be an effective way to generate a large number of labels for "free".

With a labeled dataset in hand, a data scientist working on spam filtering must next do feature engineering. This should be done with consideration of the algorithm that will be used. The Naive Bayesian Classifier has been a popular choice for detecting spam because it tends to perform well on high dimensional data, unlike many other ML algorithms. It is also very efficient to compute, making it possible to train a per-user classifier if one wished to. While we might do some basic NLP tricks, for the most part we can turn each word in a document (or perhaps each bigram or n-gram in a document) into a feature.
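As a rough illustration of that feature-engineering step, here is a minimal sketch of turning a document into word or n-gram count features. The function name and tokenization are my own simplification, not anything specific from the episode; a real pipeline would handle punctuation, casing, and rare tokens more carefully.

```python
from collections import Counter

def extract_features(text, n=1):
    """Turn a document into a bag of n-gram counts.

    Each unigram (or n-gram) becomes one feature; its value is
    how many times it occurs in the document.
    """
    tokens = text.lower().split()  # naive whitespace tokenizer
    if n == 1:
        grams = tokens
    else:
        grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return Counter(grams)

# Unigram features of a short (hypothetical) spammy message:
extract_features("win a free prize now")
# Bigram features capture word pairs instead:
extract_features("win a free prize now", n=2)
```

With `n=2`, each adjacent pair of words becomes a single feature, which is the bigram trick mentioned later for indirectly capturing word correlations.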

The Naive part of the Naive Bayesian Classifier stems from the naive assumption that all features in one's analysis are considered to be independent. If A and B are known to be independent, then P(A and B) = P(A) · P(B). In other words, you just multiply the probabilities together. Shh, don't tell anyone, but this assumption is actually wrong! Certainly, if a document contains the word algorithm, it's more likely to contain the word probability than some randomly selected document. Thus, P(probability | algorithm) ≠ P(probability), violating the assumption. Despite this "flaw", the Naive Bayesian Classifier works remarkably well on many problems. If one employs the common approach of converting a document into bigrams (pairs of words instead of single words), then you can capture a good deal of this correlation indirectly.
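The "multiply the probabilities together" idea can be sketched in a few lines. This is a toy Naive Bayes classifier of my own construction (the tiny training set, function names, and the fixed spam/ham labels are all illustrative assumptions, not from the episode). It works in log space so the product of many small probabilities doesn't underflow, and uses Laplace (add-one) smoothing so an unseen word doesn't zero out an entire class.

```python
import math
from collections import Counter

def train(docs):
    """docs: list of (token_list, label) pairs with labels 'spam' or 'ham'.

    Returns per-class word counts and per-class document counts (priors).
    """
    counts = {"spam": Counter(), "ham": Counter()}
    class_totals = Counter()
    for tokens, label in docs:
        counts[label].update(tokens)
        class_totals[label] += 1
    return counts, class_totals

def predict(tokens, counts, class_totals):
    """Pick the class maximizing log P(class) + sum of log P(word | class)."""
    vocab = set(counts["spam"]) | set(counts["ham"])
    scores = {}
    for label in counts:
        # log prior: fraction of training documents in this class
        score = math.log(class_totals[label] / sum(class_totals.values()))
        total = sum(counts[label].values())
        for w in tokens:
            # naive independence: just add the per-word log likelihoods
            # (+1 / +len(vocab) is Laplace smoothing for unseen words)
            score += math.log((counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)
```

For example, after training on a handful of labeled messages, `predict("free prize".split(), ...)` leans spam because those words appear mostly in spam training documents. Swapping the unigram tokens for the bigrams described above requires no change to the classifier itself, only to the feature extraction.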

In the final leg of the discussion, we explore the question of whether or not a Naive Bayesian Classifier would be a good choice for detecting fake news.
