Data Skeptic, 27 July 2018

Spam Filtering with Naive Bayes

Today's spam filters are advanced, data-driven tools. They rely on a variety of techniques to separate junk email from good email effectively and often seamlessly.

Whitelists, blacklists, traffic analysis, network analysis, and a variety of other tools are probably employed by most major players in this area. Naturally, content analysis can be an especially powerful tool for detecting spam.

Given the binary nature of the problem (spam or not spam), it's clear that this is a great problem to solve with machine learning. In order to apply machine learning, you first need a labelled training set. Thankfully, many standard corpora of labelled spam data are readily available. Further, if you're working for a company with a spam filtering problem, asking users to self-moderate or flag things as spam can often be an effective way to generate a large number of labels for "free".

With a labelled dataset in hand, a data scientist working on spam filtering must next do feature engineering. This should be done with consideration of the algorithm that will be used. The Naive Bayesian Classifier has been a popular choice for detecting spam because it tends to perform pretty well on high-dimensional data, unlike a lot of other ML algorithms. It is also very efficient to compute, making it possible to train a per-user classifier if one wished to. While we might do some basic NLP tricks, for the most part we can turn each word in a document (or perhaps each bigram or n-gram in a document) into a feature.
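As a minimal sketch of that feature step, the snippet below turns a message into counts of single words and bigrams. The function names `tokenize` and `ngram_features`, the regular expression, and the sample message are all assumptions made for illustration; a real filter would likely use more careful tokenization.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out runs of letters (and apostrophes) as tokens."""
    return re.findall(r"[a-z']+", text.lower())


def ngram_features(text, n_max=2):
    """Count every n-gram of length 1..n_max; each distinct n-gram is one feature."""
    tokens = tokenize(text)
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


# Hypothetical message, used only to show the shape of the output.
features = ngram_features("Win a free prize, win now")
print(features["win"])         # count of the unigram "win"
print(features["free prize"])  # count of the bigram "free prize"
```

Each distinct word or bigram becomes its own dimension, which is why the feature space grows so quickly and why an algorithm that copes with high-dimensional data is attractive here.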

The Naive part of the Naive Bayesian Classifier stems from the naive assumption that all features in one's analysis are considered to be independent. If $A$ and $B$ are known to be independent, then $P(A \cap B) = P(A)\,P(B)$. In other words, you just multiply the probabilities together. Shh, don't tell anyone, but this assumption is actually wrong! Certainly, if a document contains the word algorithm, it's more likely to contain the word probability than some randomly selected document. Thus, $P(\text{probability} \mid \text{algorithm}) > P(\text{probability})$, violating the assumption. Despite this "flaw", the Naive Bayesian Classifier works remarkably well on many problems. If one employs the common approach of converting a document into bigrams (pairs of words instead of single words), then you can capture a good deal of this correlation indirectly.
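To make the "just multiply the probabilities together" idea concrete, here is a small sketch of a Naive Bayes text classifier. It is not taken from the episode: the function names, the toy training messages, the "spam"/"ham" labels, and the use of add-one (Laplace) smoothing are all assumptions chosen for illustration. Logarithms are summed instead of multiplying raw probabilities, which is the same computation but avoids numerical underflow.

```python
import math
from collections import Counter


def train(docs):
    """docs is a list of (tokens, label) pairs; returns the counts the classifier needs."""
    class_docs = Counter()   # number of documents seen per label
    word_counts = {}         # label -> Counter of token occurrences
    vocab = set()
    for tokens, label in docs:
        class_docs[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab


def log_score(tokens, label, model):
    """log P(label) plus the sum of log P(token | label), treating tokens as independent."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    total_words = sum(word_counts[label].values())
    score = math.log(class_docs[label] / total_docs)
    for tok in tokens:
        # Add-one smoothing (an assumption here) keeps unseen words from zeroing the product.
        score += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
    return score


def classify(tokens, model):
    """Return the label with the highest log score."""
    return max(model[0], key=lambda label: log_score(tokens, label, model))


# Toy, made-up training data.
training = [
    ("win free prize now".split(), "spam"),
    ("free money win".split(), "spam"),
    ("meeting notes attached".split(), "ham"),
    ("lunch meeting tomorrow".split(), "ham"),
]
model = train(training)
print(classify("free prize".split(), model))
print(classify("meeting tomorrow".split(), model))
```

Training is only a matter of counting, which is why fitting a separate model per user is cheap. Feeding in bigram tokens instead of single words requires no change to the classifier itself.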

In the final leg of the discussion, we explore the question of whether or not a Naive Bayesian Classifier would be a good choice for detecting fake news.

