Annotator Bias
Data Skeptic23 Marras 2019

Annotator Bias

The modern deep learning approaches to natural language processing are voracious in their demands for large corpora to train on. Folk wisdom estimates used to be around 100k documents were required for effective training. The availability of broadly trained, general-purpose models like BERT has made it possible to do transfer learning to achieve novel results on much smaller corpora.

Thanks to these advancements, an NLP researcher might get value out of fewer examples since they can use the transfer learning to get a head start and focus on learning the nuances of the language specifically relevant to the task at hand. Thus, small specialized corpora are both useful and practical to create.

In this episode, Kyle speaks with Mor Geva, lead author on the recent paper Are We Modeling the Task or the Annotator? An Investigation of Annotator Bias in Natural Language Understanding Datasets, which explores some unintended consequences of the typical procedure followed for generating corpora.

Source code for the paper available here: https://github.com/mega002/annotator_bias

Jaksot(590)

[MINI] Gini Coefficients

[MINI] Gini Coefficients

The Gini Coefficient (as it relates to decision trees) is one approach to determining the optimal decision to introduce which splits your dataset as part of a decision tree. To pick the right feature to split on, it considers the frequency of the values of that feature and how well the values correlate with specific outcomes that you are trying to predict.

18 Marras 201615min

Unstructured Data for Finance

Unstructured Data for Finance

Financial analysis techniques for studying numeric, well structured data are very mature. While using unstructured data in finance is not necessarily a new idea, the area is still very greenfield. On this episode,Delia Rusu shares her thoughts on the potential of unstructured data and discusses her work analyzing Wikipedia to help inform financial decisions. Delia's talk at PyData Berlin can be watched on Youtube (Estimating stock price correlations using Wikipedia). The slides can be found here and all related code is available on github.

11 Marras 201633min

[MINI] AdaBoost

[MINI] AdaBoost

AdaBoost is a canonical example of the class of AnyBoost algorithms that create ensembles of weak learners. We discuss how a complex problem like predicting restaurant failure (which is surely caused by different problems in different situations) might benefit from this technique.

4 Marras 201610min

Stealing Models from the Cloud

Stealing Models from the Cloud

Platform as a service is a growing trend in data science where services like fraud analysis and face detection can be provided via APIs. Such services turn the actual model into a black box to the consumer. But can the model be reverse engineered? Florian Tramèr shares his work in this episode showing that it can. The paper Stealing Machine Learning Models via Prediction APIs is definitely worth your time to read if you enjoy this episode. Related source code can be found in https://github.com/ftramer/Steal-ML.

28 Loka 201637min

[MINI] Calculating Feature Importance

[MINI] Calculating Feature Importance

For machine learning models created with the random forest algorithm, there is no obvious diagnostic to inform you which features are more important in the output of the model. Some straightforward but useful techniques exist revolving around removing a feature and measuring the decrease in accuracy or Gini values in the leaves. We broadly discuss these techniques in this episode.

21 Loka 201613min

NYC Bike Share Rebalancing

NYC Bike Share Rebalancing

As cities provide bike sharing services, they must also plan for how to redistribute bicycles as they inevitably build up at more popular destination stations. In this episode, Hui Xiong talks about the solution he and his colleagues developed to rebalance bike sharing systems.

14 Loka 201629min

[MINI] Random Forest

[MINI] Random Forest

Random forest is a popular ensemble learning algorithm which leverages bagging both for sampling and feature selection. In this episode we make an analogy to the process of running a bookstore.

7 Loka 201612min

Election Predictions

Election Predictions

Jo Hardin joins us this week to discuss the ASA's Election Prediction Contest. This is a competition aimed at forecasting the results of the upcoming US presidential election competition. More details are available in Jo's blog post found here. You can find some useful R code for getting started automatically gathering data from 538 via Jo's github and official contest details are available here. During the interview we also mention Daily Kos and 538.

30 Syys 201621min

Suosittua kategoriassa Tiede

rss-mita-tulisi-tietaa
utelias-mieli
tiedekulma-podcast
rss-lihavuudesta-podcast
hippokrateen-vastaanotolla
sotataidon-ytimessa
rss-radplus
filocast-filosofian-perusteet
menologeja-tutkimusmatka-vaihdevuosiin
rss-poliisin-mieli
docemilia
rss-bios-podcast
rss-totta-vai-tuubaa
rss-duodecim-lehti
rss-ammamafia
rss-astetta-parempi-elama-podcast
rss-ilmasto-kriisissa
rss-tervetta-skeptisyytta
rss-metsanomistaja-podcast