Data Skeptic, 27 Jul 2018

Spam Filtering with Naive Bayes

Today's spam filters are advanced, data-driven tools. They rely on a variety of techniques to separate junk email from good email effectively and often seamlessly.

Whitelists, blacklists, traffic analysis, network analysis, and a variety of other tools are probably employed by most major players in this area. Naturally, content analysis can be an especially powerful tool for detecting spam.

Given the binary nature of the problem (spam or not spam), it's clear that this is a great problem to solve with machine learning. To apply machine learning, you first need a labeled training set. Thankfully, many standard corpora of labeled spam data are readily available. Further, if you're working for a company with a spam filtering problem, asking users to self-moderate or flag things as spam can often be an effective way to generate a large number of labels for "free".

With a labeled dataset in hand, a data scientist working on spam filtering must next do feature engineering. This should be done with consideration of the algorithm that will be used. The Naive Bayesian Classifier has been a popular choice for detecting spam because it tends to perform pretty well on high-dimensional data, unlike a lot of other ML algorithms. It is also very efficient to compute, making it possible to train a per-user classifier if one wished. While we might do some basic NLP tricks, for the most part we can turn each word in a document (or perhaps each bigram or n-gram in a document) into a feature.
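The word-and-bigram features described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular spam filter: the function name `extract_features`, the lowercase regex tokenizer, and the choice to represent features as counts are all assumptions made for the example.

```python
import re
from collections import Counter


def extract_features(text, use_bigrams=True):
    """Turn a document into counts of words and, optionally, bigrams.

    Hypothetical helper: the tokenizer is a simple lowercase regex,
    standing in for whatever "basic NLP tricks" a real pipeline applies.
    """
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    features = Counter(tokens)  # one feature per distinct word
    if use_bigrams:
        # Pair each token with its successor to form bigram features.
        features.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return features
```

Each distinct word or bigram becomes its own feature, which is why the feature space grows so quickly and why an algorithm that copes with high-dimensional data is attractive here.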

The Naive part of the Naive Bayesian Classifier stems from the naive assumption that all features in one's analysis are independent. If $A$ and $B$ are known to be independent, then $P(A \cap B) = P(A)\,P(B)$. In other words, you just multiply the probabilities together. Shh, don't tell anyone, but this assumption is actually wrong! Certainly, if a document contains the word "algorithm", it's more likely to contain the word "probability" than some randomly selected document. Thus, $P(\text{algorithm} \cap \text{probability}) \neq P(\text{algorithm})\,P(\text{probability})$, violating the assumption. Despite this "flaw", the Naive Bayesian Classifier works remarkably well on many problems. If one employs the common approach of converting a document into bigrams (pairs of words instead of single words), then you can capture a good deal of this correlation indirectly.
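To make the "just multiply the probabilities" idea concrete, here is a small from-scratch sketch of a Naive Bayes text classifier. Several details are assumptions not taken from the episode: the function names `tokenize`, `train`, and `classify`, the four-message toy dataset, the use of add-one (Laplace) smoothing so that unseen words do not zero out the product, and the use of summed log probabilities in place of a literal product to avoid numerical underflow.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Simple lowercase word tokenizer (an assumption for this sketch).
    return re.findall(r"[a-z0-9']+", text.lower())


def train(labeled_docs):
    """Count documents and words per label from (text, label) pairs."""
    word_counts = {}        # label -> Counter of word occurrences
    doc_counts = Counter()  # label -> number of documents
    for text, label in labeled_docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the label with the highest naive Bayes score."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        # Start from the log prior, log P(label).
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # ignore words never seen in training
            # Independence assumption: each word contributes its own
            # P(word | label); add-one smoothing avoids zero probabilities.
            p = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(p)  # adding logs == multiplying probabilities
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy training set, for illustration only.
toy_data = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]
model = train(toy_data)
```

Calling `classify("free money", model)` scores each label by its prior times the per-word likelihoods and returns the larger one. Training is a single counting pass over the data, which is the efficiency that makes a per-user classifier plausible.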

In the final leg of the discussion, we explore the question of whether or not a Naive Bayesian Classifier would be a good choice for detecting fake news.

