DataBytes Podcast

Episode 49: Extreme Classification: Going at MACH Speed (Part 1)

Fri, 17 Jan 2020 00:00:00 +0000

In this episode, Dr. Derek Feng drops by to chat about a recent paper on a divide-and-conquer approach (Merged-Averaged Classifiers via Hashing, or MACH) to massive classification problems. In part 1 (of 2 episodes), we describe the general problem that MACH solves and the strategy it takes, wherein the original large classification problem is broken down into smaller classification problems. Next week, in the second episode, we talk about more technical details of how the division of labor works, and why it works.
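For readers who want a feel for the divide-and-conquer idea, here is a minimal sketch of hashing many classes into a few buckets and then merging the answers of several small classifiers by averaging. It is not the paper's implementation: the hash function, the sizes (1000 classes, 4 repetitions, 32 buckets), the function names, and the stand-in "perfect" small classifiers are all assumptions made for illustration.

```python
import hashlib


def bucket_of(class_id, rep, num_buckets):
    """Deterministically hash a class into a bucket for one repetition (assumed hash)."""
    digest = hashlib.sha256(f"{rep}:{class_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def build_tables(num_classes, num_reps, num_buckets):
    """tables[r][c] is the bucket that class c falls into in repetition r."""
    return [[bucket_of(c, r, num_buckets) for c in range(num_classes)]
            for r in range(num_reps)]


def merge_scores(bucket_probs, tables):
    """Average, for each class, the probability given to its bucket in every repetition."""
    num_reps = len(tables)
    num_classes = len(tables[0])
    return [sum(bucket_probs[r][tables[r][c]] for r in range(num_reps)) / num_reps
            for c in range(num_classes)]


# Hypothetical sizes: one 1000-class problem becomes four 32-class problems.
NUM_CLASSES, NUM_REPS, NUM_BUCKETS = 1000, 4, 32
tables = build_tables(NUM_CLASSES, NUM_REPS, NUM_BUCKETS)

# Stand-in for trained small classifiers: each one puts all of its
# probability on the bucket that contains the true class.
true_class = 123
bucket_probs = [[1.0 if b == tables[r][true_class] else 0.0
                 for b in range(NUM_BUCKETS)]
                for r in range(NUM_REPS)]

scores = merge_scores(bucket_probs, tables)
print(scores[true_class])
```

In this toy setup the true class receives the top score because every small classifier voted for its bucket. Another class can only match it by landing in the same bucket in every repetition, which becomes less likely as repetitions are added.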

Sources

NeurIPS 2019: MACH paper

Episode 48: Where Moneyball Meets Footy

Fri, 13 Dec 2019 00:00:00 +0000

We’ve long heard about the waves that statistics has made in baseball. But what about soccer? In this episode, we summarize a few applications of statistics in European football (or American soccer).

Sources

Episode 47: Domoic Acid Testing -- A Crabshoot?

Sat, 30 Nov 2019 00:00:00 +0000

Domoic acid has plagued shellfish and other wildlife along the Pacific coastline in recent years. Testing for domoic acid concentration in crabs on a regular basis has become important for determining when crabs and their viscera can be safely consumed. Unlike many other common hypothesis tests, the setup used for domoic acid testing is based on the sample maximum rather than the sample mean. In this episode, we critique the testing methodology.
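One reason a test based on the sample maximum deserves scrutiny is that the maximum depends strongly on how many crabs are sampled. The sketch below assumes, purely for illustration, that each crab independently exceeds the limit with the same probability; the 5% figure and the sample sizes are hypothetical and are not taken from any actual testing protocol.

```python
def prob_max_exceeds(p_single, n):
    """Chance that at least one of n independent crabs exceeds the limit.

    The sample maximum exceeds the limit exactly when at least one
    crab does, so this is 1 - P(no crab exceeds the limit).
    """
    return 1.0 - (1.0 - p_single) ** n


# Hypothetical exceedance probability for a single crab.
P_SINGLE = 0.05
for n in (1, 3, 6, 12):
    print(n, round(prob_max_exceeds(P_SINGLE, n), 4))
```

Under these assumptions the chance of a failed test rises with the sample size even though the crab population is unchanged, which is not how a test based on the sample mean behaves.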

Sources

Episode 46: Finding Your (Niche) Board Games

Fri, 08 Nov 2019 00:00:00 +0000

In this episode, we discuss how two statisticians used data from BoardGameGeek.com to put together their own board game recommendation engine, specifically designed to stay away from mainstream recommendations.

Sources

Episode 45: Learning Publicly, with Private Data

Fri, 01 Nov 2019 00:00:00 +0000

In this episode, Dr. Derek Feng discusses the general issue of data privacy in the age of big data, including topics of differential privacy and federated learning.

Sources

Episode 44: A Conversation with Jon Krohn

Fri, 25 Oct 2019 00:00:00 +0000

We sit down with Dr. Jon Krohn to chat about his work as a Chief Data Scientist at untapt, his newly published bestseller “Deep Learning Illustrated”, and his teaching/research.

Link

Deep Learning Illustrated website

Episode 43: To Google and Back

Fri, 04 Oct 2019 00:00:00 +0000

In this episode, Professor Albert Y. Kim of Smith College describes his post-PhD journey, which included a stint at Google Adwords before academic posts at Reed College, Middlebury College, Amherst College, and Smith College.

Episode 42: Black in the Box

Fri, 27 Sep 2019 00:00:00 +0000

Dr. Derek Feng joins us again to discuss the two metrics by which we align all statistical/machine learning methods – interpretability versus predictive ability. In a world where black box methods reign supreme, what does learning mean?

Sources

Episode 41: What to do with Outliers

Fri, 20 Sep 2019 00:00:00 +0000

Guest Dylan O’Connell joins us today to talk about a recent surprising but legitimate Democratic primary poll result from Monmouth University. We discuss different perspectives on how to approach a data point that doesn’t fit in with the others.

Sources

Episode 40: Making a DIY ML-Controlled Cat Door

Fri, 13 Sep 2019 00:00:00 +0000

Outdoor-cat owners know all too well the unpleasantness of dealing with what the cat dragged in. A self-proclaimed machine learning novice proves that you don’t need to be a pro to set up a smart cat door that prevents the cat from bringing prey into your home.

Sources

Episode 39: Rolling in the Deep Patient

Fri, 06 Sep 2019 00:00:00 +0000

We take a deep dive into the poster child for black-box machine learning methods, namely Deep Patient: an unsupervised learning method that uses denoising auto-encoders as the means for extracting salient features in electronic health records, which in turn can then be used to predict health outcomes. We do our best to explain what on earth the previous sentence meant.

Sources

Deep Patient article

Episode 38: The Misuse of Statistics in Court

Fri, 30 Aug 2019 00:00:00 +0000

In this episode, we talk about how a statistical concept that you would learn about in an introductory course was misused in court. The error led to dire consequences in the case of Sally Clark, who was charged in the deaths of two of her children.

Sources

Episode 37: Susan Starts a New Job

Fri, 23 Aug 2019 00:00:00 +0000

In this episode, we talk about Susan’s new job as a Data Scientist! She recently transitioned from academia to industry and we discuss her experience with searching for positions, interviewing, and her first few weeks in her new role.

Episode 36: What's New in Machine Learning Startups

Fri, 16 Aug 2019 00:00:00 +0000

In this episode, we talk about some machine learning startups to pay attention to this year.

Sources

Episode 35: You Look How You Sound

Fri, 09 Aug 2019 00:00:00 +0000

Deep learning has been useful for lots of applications when it comes to prediction. Yet another is the use of a short sound clip of speech to predict the face of the speaker.

Sources

Episode 34: Protecting Kids' Digital Privacy

Fri, 02 Aug 2019 00:00:00 +0000

In this episode, we talk about protecting kids’ digital privacy.

Sources

Episode 33: Statisticians Hate Post-Hoc Power

Fri, 26 Jul 2019 00:00:00 +0000

Statistics is key to demonstrating the effectiveness of new advancements in science and medicine, but when statistical significance is not achieved, is post-hoc power a valid justification?

Sources

Episode 32: Amazon's 3D Body Scan Study

Fri, 19 Jul 2019 00:00:00 +0000

In this episode, we talk about Amazon’s 3D body scan study.

Sources

Episode 31: What Data Visualizations Do You Care About? It's Personal

Fri, 12 Jul 2019 00:00:00 +0000

In this episode, we talk about how data are personal for those in a rural Pennsylvania community.

Sources

Episode 30: Some Like It Hot -- What Gender Reveals About Our Temperature Preferences

Fri, 05 Jul 2019 00:00:00 +0000

Word on the street is that women prefer warmer temperatures than men do. Researchers designed an experiment to investigate whether this is actually true, specifically, considering how men and women perform on various cognitive tasks under different temperature scenarios. In this episode, we dissect the study so you can judge whether you believe the results.

Sources

PLOS ONE article: Battle for the thermostat: Gender and the effect of temperature on cognitive performance

Episode 29: Jeopardy! Meets Statistics

Fri, 28 Jun 2019 00:00:00 +0000

Jeopardy! is a weeknightly televised trivia game show. In recent months, one player, James Holzhauer, has taken the Jeopardy! fandom by storm with his unusual style of play and his long run of big wins. In this episode, we discuss how statistics can help explain his betting tactics, and we discuss how some other Jeopardy! players have used statistics to help up their game.

Sources

Episode 28: Facial Recognition Technology Update and Rating Trustworthiness of AI-Generated Airbnb Profiles

Fri, 21 Jun 2019 00:00:00 +0000

In this episode, we discuss a number of miscellaneous news updates regarding facial recognition technology (concerning San Francisco, Amazon, and pandas!). And then, we talk about how much we trust AI-generated profiles for Airbnb.

Sources

Episode 27: Does Uber/Lyft Help Or Hurt Traffic Congestion and Machine Learning Interpretability

Fri, 14 Jun 2019 00:00:00 +0000

In this episode, we look at a study about whether ride-sharing services contribute to increased or decreased traffic congestion in San Francisco. We then discuss some strategies to build interpretable machine learning models.

Sources

Episode 26: Household Electronics That See and Google's Reservation AI

Fri, 07 Jun 2019 00:00:00 +0000

In this episode, we talk about a new innovation that enables household electronics to see what’s around them. We then discuss Google Duplex, an AI designed to happily make reservations and appointments for you.

Sources

Episode 25: DataFest 2019 and Measuring Migrations from Hurricane Maria

Fri, 31 May 2019 00:00:00 +0000

Susan recently served as a judge at a local DataFest competition (a weekend-long data competition for undergraduates). She shares her experiences and recommendations for future contestants. We then discuss how Facebook data might be helpful for counting the number of people who migrated from Puerto Rico to the mainland U.S. as a result of Hurricane Maria.

Sources

Episode 24: Predictive Power of Early Polling and Did a TV Show Result in Higher Teenage Suicides?

Fri, 24 May 2019 00:00:00 +0000

In this episode, we discuss FiveThirtyEight.com’s analysis of primary election polling over the past 40 years. In particular, we consider whether early polling is helpful for predicting election outcomes. And then, we talk about a study that potentially blames Netflix for a surge in teenage suicides in 2017.

Sources

Episode 23: Offline Song Identification and Perceptions about AI

Fri, 17 May 2019 00:00:00 +0000

In this episode, we discuss how Google’s Now Playing feature can identify songs that are playing around you, using embeddings. We then talk about a study that reports on America’s perceptions about artificial intelligence – who can we trust to develop AI responsibly?

Sources

Episode 22: Betting on the Game of Thrones and the Misfortune of Lefthandedness

Fri, 10 May 2019 00:00:00 +0000

In this episode, we discuss how bookmakers price/take bets on outcomes in the Game of Thrones. We then discuss a study that claimed that lefthanded people have shorter life expectancies than righthanded people. Spoiler alert: lefthanders have nothing to worry about!

Sources

Episode 21: Pitch Call Accuracy and Predicting the Outcome of the Champions League

Fri, 03 May 2019 00:00:00 +0000

Buckle up for a sports-filled episode! We discuss a study that analyzes the accuracy of umpire calls about strikes vs. balls and take a deep dive into FiveThirtyEight.com’s statistical methods for predicting the winner of the Champions League.

Sources

Episode 20: Thinking Like Computers and Text Mining the Mueller Report

Fri, 26 Apr 2019 00:00:00 +0000

In this episode, we discuss a study that recruits human researchers to try to predict how computers classify images. We then highlight a number of examples of natural language processing techniques applied to the Mueller Report.

Sources

Episode 19: Seeing with AI and Detecting Exoplanets

Fri, 19 Apr 2019 00:00:00 +0000

In this episode, we discuss Microsoft’s handy phone application for scanning and reporting on our surroundings, as a way of helping vision-impaired individuals better interact with the world around them. We then talk about how AI can be useful in detecting exoplanets (or extrasolar planets).

Sources

Episode 18: Statistical Anxiety and the Fight Against Statistical Significance

Fri, 12 Apr 2019 00:00:00 +0000

We discuss a survey designed to analyze the extent and root cause of statistical anxiety in the classroom, discussing the methods/limitations of the study. We then talk about yet another crusade against hypothesis testing, this time around the concept of “statistical significance”.

Sources

Episode 17: How Theranos Sinned Statistically

Fri, 05 Apr 2019 00:00:00 +0000

In this episode, Susan Wang is joined by guest Natalie Doss to consider the statistical sins committed by Theranos, the former blood testing unicorn. From arbitrary data manipulation to inappropriate data aggregation, we discuss what they did and why these practices were particularly bad. Then, we weigh in on how Theranos could have done worse, making it harder for the public to find out about their faulty tests.

Sources

Bad Blood

Episode 16: Machine-Generated Faces/Text, and Relating Health Outcomes to Skin Tone

Fri, 15 Mar 2019 00:00:00 +0000

We discuss NVIDIA’s AI-generated faces that look incredibly authentic, and relatedly, OpenAI’s text generator that is so capable that it has to be kept under wraps. We then assess the study design of a recent research article that considered how health outcomes vary amongst African Americans of different skin tones.

Sources

Episode 15: Deep Learning to Fold Proteins and Automated Journalism

Fri, 08 Mar 2019 00:00:00 +0000

We discuss opportunities for machines and humans in the prediction of protein structures, a necessary task in new drug discovery. Google’s DeepMind has taken the prize in the recent iteration of CASP, a protein folding prediction challenge. We also discuss how AI has begun to revolutionize journalism.

Sources

Episode 14: A Personality Test that Makes Sense and What Does Spotify Know?

Fri, 01 Mar 2019 00:00:00 +0000

FiveThirtyEight.com has provided a free, online personality test that might make more sense than your typical online clickbaity quiz. We talk about why it calls itself the only personality test that isn’t junk science. We then discuss the results of a recent study on Spotify data. Does it know too much about us (and you)? We’ll let you know.

Sources

Episode 13: IBM's Debate Machine and Adopting a 'Data Culture' in Companies

Fri, 22 Feb 2019 00:00:00 +0000

On February 11, IBM showcased its Project Debater in a face-off against debate champion Harish Natarajan. We talk about how this machine vs. human competition went. Then, we discuss a Harvard Business Review article citing a survey that discovered companies are not becoming data-oriented quickly enough.

Sources

Episode 12: Super Bowl Stats, Confidence Intervals, and Data Sources

Fri, 15 Feb 2019 00:00:00 +0000

Three topics are featured in this episode: first, statistics about Super Bowl LIII, including what was in the bowls as the game happened; second, a fun activity for teaching confidence intervals; finally, we present some online sources for data.

Sources

Questions for Confidence Interval Activity

The questions below were asked in the podcast. The answers are provided in line. Answers that are not immediately obvious through a Google search are linked to their sources.

What’s the average distance from the Earth to Mars in kilometers? 225 million km
What’s the height of Denali (formerly Mt. McKinley) in feet? 20,310 ft to 20,320 ft
What’s the minimum number of moves required to solve any Rubik’s cube? 20.
What year was the first toothpaste tube invented? 1873
How many men signed the Declaration of Independence? 56
How many milligrams of caffeine on average are in a shot of Starbucks espresso? 89 mg
What percentage of American adults is estimated to own a smartphone, as of 2018? 81%
What is the greatest amount of snow to fall in a single US location over a 24-hour period, in inches? 75.8”
In 2017, how much beef did Americans consume per person on average (in lbs)? 56.9 lbs
As of Feb 1, 2019, how many bitcoins are there in circulation? 17,516,000

Episode 11: How Machines Might be Biased and the Job Market for Data Scientists

Fri, 08 Feb 2019 00:00:00 +0000

AI and ML algorithms are growing popular – but they can actually perpetuate cognitive biases in our daily lives. We discuss the state of the problem and possible solutions. We also present a favorable job outlook for aspiring (or continuing!) data scientists.

Sources

Episode 10: AI in Medicine and Racial Bias in College Admissions

Fri, 01 Feb 2019 00:00:00 +0000

Artificial intelligence is starting to make waves in medicine; we look at how technology might potentially change how medical testing works. We also bring in some statistical reasoning in the debate of whether or not there is racial discrimination in Harvard’s college admissions process.

Sources

Episode 9: Lessons Learned from Making a Fitbit Data Visualization Shiny App

Fri, 25 Jan 2019 00:00:00 +0000

Dynamic data visualization widgets can be pretty cool, but it takes more than just statistical chops to build an online visualization app that supports input data from users. In this episode, we describe the journey that Susan took to build a visualization app for Fitbit data. The app can be found at http://fitbitvizwiz.ddns.net.

Sources

Fitbit Viz Wiz Shiny App

Episode 8: The French Revolution and the Challenge of Reproducibility

Fri, 18 Jan 2019 00:00:00 +0000

What can machine learning tell us about the French Revolution? This episode gives a brief history lesson in the digital humanities. Then, why do we constantly hear the word “reproducibility” in the context of scientific research? We’ll explore what this means and why the issue keeps coming up.

Sources

Episode 7: The Virtual Maestro and the Most Influential Movie

Fri, 11 Jan 2019 00:00:00 +0000

Have you ever wanted to try your hand at conducting an orchestra? Now you can, with Google’s Semi-Conductor online app. We’ll talk about how this browser-based app analyzes your body positions in real time to translate your actions into Mozart music. We also talk about a way to use network analysis to determine the most influential movie ever made to date. Be sure to tune in to find out which movie takes the prize.

Sources

Episode 6: Probability Games and Amazon's Own Self-Driving Car

Fri, 04 Jan 2019 00:00:00 +0000

What are the odds that a toss of a 10-sided die, followed by a toss of a 20-sided die, and then a toss of a 30-sided die land in increasing order? If you know the answer within a few seconds, you might have an edge in Borel, a game that is all about probability. We’ll also talk about DeepRacer, Amazon’s soon-to-come programmable self-driving car.
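If you would rather check your answer than compute it in your head, the question can be settled by brute force. The sketch below enumerates every possible outcome of the three dice and counts those that land in strictly increasing order; reading "increasing" as strictly increasing (no ties) is an assumption, and the function name is made up for this example.

```python
from fractions import Fraction
from itertools import product


def prob_strictly_increasing(sides):
    """Exact probability that dice with the given numbers of sides,
    rolled in this order, show strictly increasing values."""
    total = 0
    favorable = 0
    for roll in product(*(range(1, s + 1) for s in sides)):
        total += 1
        # Compare each roll with the one that follows it.
        if all(a < b for a, b in zip(roll, roll[1:])):
            favorable += 1
    return Fraction(favorable, total)


answer = prob_strictly_increasing((10, 20, 30))
print(answer, float(answer))
```

Of the 6,000 equally likely outcomes, 2,470 are strictly increasing, so the probability is 247/600, or roughly 41%.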

Sources

Episode 5: The Do's and Don'ts of Data Visualization

Fri, 28 Dec 2018 00:00:00 +0000

Data visualization is an integral precursor to data analysis, providing a way to visually inspect the data for surprising trends and uncover potential errors in variable coding. In this episode, we cover some guiding principles of data visualization. A brief summary (including promised links to examples) is included below.

A summary of do's and don'ts:

Do: Label everything, concisely. (This includes specifying the units!)
Do: Make sure the plot suits the kind of variable(s). Univariate plots -- histograms or barplots. Bivariate plots -- boxplots, scatterplots, stacked barplots, etc.
Do: Report/show plots that tell you something interesting -- and then say why it is interesting in words. Maybe it uncovers an unusual feature of the data, or outliers, or it suggests some kind of underlying associations that warrant further investigation.
Do: Carefully choose how many bins to use in your histograms. Or use density plots. (Example of how bin width choice matters)
Don't: Use too many colors (see bad pie chart below).
Don't: Use smoothing procedures with more flexibility than you can accommodate with the number of observations that you have.

Example mentioned in the podcast.

To the right, we show another example taken from a Fitbit app screenshot (from one of our co-hosts). Note how the absence of data points between 2016 and 2018 (shown in the white line) does not prevent the app from interpolating with some parabolic spline in the region (blue curve). The blue curve fabricates an upward trend from 2016 to 2017 and then a downward trend from 2017 to 2018.
Do: Jitter scatterplots to minimize overplotting. Or use translucency features of your plotting tool. (Fixing overplotting in Python | Fixing overplotting in R)
Do: Sort your categorical variables (barplot categories based on heights or boxplot categories based on medians). (Example)
Don't: Use pie charts. More on that below!
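
The point in the list above about choosing histogram bins carefully can be seen with a few lines of code. This is a minimal sketch on made-up data: the values, the range of 0 to 10, and the helper function are all invented for illustration.

```python
def histogram_counts(values, num_bins, lo, hi):
    """Count how many values fall into each of num_bins equal-width bins on [lo, hi]."""
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # Values equal to hi are placed in the last bin.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    return counts


# Made-up data with two clusters and nothing in between.
values = [1, 1.5, 2, 2, 2.5, 3, 7, 7.5, 8, 8, 8.5, 9]

coarse = histogram_counts(values, 2, 0, 10)
fine = histogram_counts(values, 5, 0, 10)
print(coarse)
print(fine)
```

With two bins the data look evenly spread across the range, while five bins reveal an empty stretch in the middle that separates two clusters.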

The Issue with Pie Charts

Source: WTF Visualizations.

This image just about encapsulates everything that is wrong with pie charts.

There are too many categories, making it difficult to discern which category is associated with which slice.
The angling of the pie makes it hard to visually process the relative sizes of each slice.
The pie chart occupies a lot of space for the information it is trying to convey.

You can find multiple posts online that expand on why pie charts are often bad ways of visualizing data -- yes, there are longer rants on pie charts than the one we've given, such as the one here. We particularly like an example of why even simple pie charts (with few categories displayed) can fail:

The example provided suggests that A, B, and C are pie charts that represent polling results for a 5-candidate race in a local election, done at 3 different time points. Can you tell who's in the lead at each time point?

The bar charts here represent the same data. Not only can you tell who's in the lead at any of the 3 time points, but you can also easily tell the trajectory of how each candidate fared as time progressed.

This leads us to our favorite pie chart:

Source: imgur.

The Naughty List

Now we present a list of some other truly horrific charts (that are not pie charts).

Source: The 9 Worst Data Visualizations Ever Created

Pie charts aren't the only culprits. Here, the y-axis scale is deceiving.
Source: Junk Charts

This plot, cut out of the Wall Street Journal, requires prolonged staring to realize how the bar lengths might possibly reflect any of the proportions provided. There's conditioning involved, in which case a mosaic plot will probably do the data better justice, as suggested at the source website.
Source: Bad Infographics: 11 Mistakes You Never Want to Make

Too many colors can make it challenging to differentiate various categories.

The Nice List

Finally, this discussion wouldn’t be complete without some stellar examples of thoughtful graphics.

Source: Reddit post

This plot showcases how American pastimes differ by household income, using the American Time Use Survey dataset.
Source: Fivethirtyeight.com

Ah, a plot of residuals (points per shot against expectations) to showcase the superiority of basketball players. Minor quibble: Not too sure why Curry, Thompson, Barnes, and Green have the same colors whereas other highlighted players are in different colors.
Source: Gapminder
Indeed, this example, from the late great Hans Rosling's organization, Gapminder, is an example of where arguably a lot of information is fit into a small space. However, this is one of those plots that is actually meant to encourage zooming in and taking a good look (the source link above will lead you to the full pdf image). Even in this shrunken view, the color-coding by region helps provide a big-picture look at the relationship between health and wealth of nations around the world. The sizes of the circles additionally showcase the population size of each nation.

Looking for more?

There’s a whole subreddit (/r/dataisbeautiful) full of them. We use these sometimes as inspiration – but of course, because Reddit is a free-for-all platform, not every post will be golden.

Episode 4: Meet the Co-hosts (Part 2)

Fri, 21 Dec 2018 00:00:00 +0000

This week, we learn about Jessi Cisewski-Kehe’s background to find out how she went from a Math major to an actuarial analyst, then to grad school in statistics, followed by a three-year visiting assistant professor position at Carnegie Mellon where she got into Astrostatistics, and finally to her current position as an assistant professor at Yale.

Episode 3: Meet the Co-hosts (Part 1)

Wed, 12 Dec 2018 00:00:00 +0000

This week, we learn about Susan Wang’s background to find out how she went from an Applied Math major to actuarial consulting, then to a weather derivatives start-up firm, then to grad school in statistics, finally landing at Yale as a lecturer.

Episode 2: Biometric Technology at Airports, Google Smart Replies, Bestselling Books

Tue, 04 Dec 2018 00:00:00 +0000

In this episode, we discuss biometric technology used at airports, Google Smart Replies (and letting AI compose our emails/texts for us), and an analysis of New York Times Bestsellers list data.

Sources

Episode 1: Thanksgiving, College Football, International Prize in Statistics

Thu, 29 Nov 2018 00:00:00 +0000

The first episode of the DataBytes Podcast where we discuss popular topics related to data, statistics, data science, machine learning, artificial intelligence. In this episode, we discuss Thanksgiving food, the College Football Playoff selection, and the winner of the International Prize in Statistics.

Sources
