
ContentMine
ContentMine is a project which provides the tools and workflow to convert scientific literature into machine readable and machine interpretable data in order to facilitate better and more effective access to the accumulated knowledge of human kind. The program's founder Peter Murray-Rust joins us this week to discuss ContentMine. Our discussion covers the project, the scientific publication process, copywrite, and several other interesting topics.
28 Aug 201553min
![[MINI] Structured and Unstructured Data](https://cdn.podme.com/podcast-images/9D1D3140321C522F3AACFE7871FE0308_small.jpg)
[MINI] Structured and Unstructured Data
Today's mini-episode explains the distinction between structured and unstructured data, and debates which of these categories best describe recipes.
21 Aug 201513min

Measuring the Influence of Fashion Designers
Yusan Lin shares her research on using data science to explore the fashion industry in this episode. She has applied techniques from data mining, natural language processing, and social network analysis to explore who are the innovators in the fashion world and how their influence effects other designers. If you found this episode interesting and would like to read more, Yusan's papers Text-Generated Fashion Influence Model: An Empirical Study on Style.com and The Hidden Influence Network in the Fashion Industry are worth reading.
14 Aug 201524min
![[MINI] PageRank](https://cdn.podme.com/podcast-images/060A068348B45EA62D0B42E4A8C350E1_small.jpg)
[MINI] PageRank
PageRank is the algorithm most famous for being one of the original innovations that made Google stand out as a search engine. It was defined in the classic paper The Anatomy of a Large-Scale Hypertextual Web Search Engine by Sergey Brin and Larry Page. While this algorithm clearly impacted web searching, it has also been useful in a variety of other applications. This episode presents a high level description of this algorithm and how it might apply when trying to establish who writes the most influencial academic papers.
7 Aug 20158min

Data Science at Work in LA County
In this episode, Benjamin Uminsky enlightens us about some of the ways the Los Angeles County Registrar-Recorder/County Clerk leverages data science and analysis to help be more effective and efficient with the services and expectations they provide citizens. Our topics range from forecasting to predicting the likelihood that people will volunteer to be poll workers. Benjamin recently spoke at Big Data Day LA. Videos have not yet been posted, but you can see the slides from his talk Data Mining Forecasting and BI at the RRCC if this episode has left you hungry to learn more. During the show, Benjamin encouraged any Los Angeles residents who have some time to serve their community consider becoming a pollworker.
29 Juli 201541min
![[MINI] k-Nearest Neighbors](https://cdn.podme.com/podcast-images/F335620D9E26AF3276FA79BAFF590EEA_small.jpg)
[MINI] k-Nearest Neighbors
This episode explores the k-nearest neighbors algorithm which is an unsupervised, non-parametric method that can be used for both classification and regression. The basica concept is that it leverages some distance function on your dataset to find the $k$ closests other observations of the dataset and averaging them to impute an unknown value or unlabelled datapoint.
24 Juli 20158min

Crypto
How do people think rationally about small probability events? What is the optimal statistical process by which one can update their beliefs in light of new evidence? This episode of Data Skeptic explores questions like this as Kyle consults a cast of previous guests and experts to try and answer the question "What is the probability, however small, that Bigfoot is real?"
17 Juli 20151h 24min
![[MINI] MapReduce](https://cdn.podme.com/podcast-images/99E5B4C49CC9487AB4880B5C8DF050F0_small.jpg)
[MINI] MapReduce
This mini-episode is a high level explanation of the basic idea behind MapReduce, which is a fundamental concept in big data. The origin of the idea comes from a Google paper titled MapReduce: Simplified Data Processing on Large Clusters. This episode makes an analogy to tabulating paper voting ballets as a means of helping to explain how and why MapReduce is an important concept.
10 Juli 201512min




















