Ethics, stories, and curiosity
Content for Thursday, April 20, 2023
Readings
This looks like a lot, but most of these are quite short!
Causal inference and data science
- Miguel A. Hernán, “The C-Word: Scientific Euphemisms Do Not Improve Causal Inference From Observational Data” (Hernán 2018)
- Hannah Fresques and Meg Marco, “‘Your Default Position Should Be Skepticism’ and Other Advice for Data Journalists From Hadley Wickham,” ProPublica, June 10, 2019
Storytelling
- Chapter 14 in Impact Evaluation in Practice (Gertler et al. 2016)
- Martin Krzywinski and Alberto Cairo, “Storytelling”
- Ben Wellington, “Making data mean more through storytelling”
- Will Schoder, “Every Story is the Same”
Ethics
Keep in mind throughout all these readings that an “algorithm” in these contexts is typically some fancy type of regression model. The outcome variable is something binary like “Safe babysitter/unsafe babysitter,” “Gave up seat in past/didn’t give up seat in past,” or “Violated probation in past/didn’t violate probation in past,” and the explanatory variables are hundreds of pieces of data that might predict those outcomes (social media history, flight history, race, etc.).
Data scientists build a (sometimes proprietary and complex) model based on existing data, plug in values for any given new person, multiply that person’s values by the coefficients in the model, and get a final score in the end for how likely someone is to be a safe babysitter or how likely someone is to return to jail.
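That scoring process can be sketched with a small example. This is not any company’s actual model; the variables, coefficients, and data below are all invented for illustration, and a simple logistic regression stands in for whatever proprietary model a company might use.

```r
library(tidyverse)

set.seed(1234)

# Invented historical data: a binary outcome and a couple of predictors
applicants <- tibble(
  posts_per_week = rpois(500, lambda = 10),
  years_experience = runif(500, 0, 15)
) %>%
  mutate(prob_safe = plogis(-1 + 0.2 * years_experience - 0.05 * posts_per_week),
         safe_babysitter = rbinom(500, 1, prob_safe))

# Build a model on the existing data
scoring_model <- glm(safe_babysitter ~ posts_per_week + years_experience,
                     data = applicants, family = binomial)

# Plug in a new person's values to get their final score
# (a predicted probability between 0 and 1)
new_person <- tibble(posts_per_week = 25, years_experience = 2)
new_score <- predict(scoring_model, newdata = new_person, type = "response")
new_score
```

The real-world versions differ mainly in scale (hundreds of predictors, fancier model families), but the logic is the same: fit on past data, multiply a new person’s values by the coefficients, and output a score.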
- DJ Patil, “A Code of Ethics for Data Science” (if you’re interested in this, also check out Mike Loukides, Hilary Mason, and DJ Patil, Ethics and Data Science)
- “AI in 2018: A Year in Review”
- “How Big Data Is ‘Automating Inequality’”
- “In ‘Algorithms of Oppression,’ Safiya Noble finds old stereotypes persist in new media”
- 99% Invisible, “The Age of the Algorithm”: Note that this is a podcast, a 20ish-minute audio story. Listen to it.
- On the Media, “Biased Algorithms, Biased World”
- “Wanted: The ‘perfect babysitter.’ Must pass AI scan for respect and attitude.”
- “Companies are on the hook if their hiring algorithms are biased”
- “Courts use algorithms to help determine sentencing, but random people get the same results”
- David Heinemeier Hansson’s rant on the Apple Card
Slides
The slides for today’s lesson are available online as an HTML file. Use the buttons below to open the slides either as an interactive website or as a static PDF (for printing or storing for later). You can also click in the slides below and navigate through them with your left and right arrow keys.
Fun fact: If you type ? (or shift + /) while going through the slides, you can see a list of special slide-specific commands.
Videos
Videos for each section of the lecture are available at this YouTube playlist.
- Introduction
- What did we just learn?
- Ethics of data analytics (a)
- Ethics of data analytics (b)
- Ethics of data analytics (c)
- Ethics of storytelling (a)
- Ethics of storytelling (b)
- Ethics of storytelling (c)
- Ethics of storytelling (d)
- Curiosity
You can also watch the playlist (and skip around to different sections) here:
Synthetic data code from class
library(tidyverse)
library(broom)
n_people <- 5003

fake_data <- tibble(id = 1:n_people) %>%
  # Make exogenous stuff
  mutate(time_of_day = sample(c("Morning", "Evening"), n_people, replace = TRUE)) %>%
  # Make endogenous things
  mutate(prob_atlanta = 50 + ifelse(time_of_day == "Morning", 13, 0),
         prob_charlotte = 15 + ifelse(time_of_day == "Morning", 36, 0),
         prob_philly = 20 + ifelse(time_of_day == "Evening", 60, 10),
         prob_la = 15 + ifelse(time_of_day != "Morning", -5, 20)) %>%
  rowwise() %>%
  mutate(city = sample(c("Atlanta", "Charlotte", "Philadelphia", "Los Angeles"),
                       1, replace = TRUE,
                       prob = c(prob_atlanta, prob_charlotte, prob_philly, prob_la))) %>%
  ungroup() %>%
  mutate(liking_blue_baseline = rnorm(n_people, mean = 0.5, sd = 0.05),
         liking_blue_morning_boost = ifelse(time_of_day == "Morning", 0.2, 0),
         prob_liking_blue = liking_blue_baseline + liking_blue_morning_boost) %>%
  rowwise() %>%
  mutate(like_blue = sample(c(TRUE, FALSE), 1, replace = TRUE,
                            prob = c(prob_liking_blue, 1 - prob_liking_blue))) %>%
  ungroup() %>%
  mutate(cookies_baseline = rnorm(n_people, 3, 1),
         cookie_time_effect = ifelse(time_of_day == "Morning", 2, 0),
         cookie_blue_effect = ifelse(like_blue == TRUE, rnorm(n_people, 3, 0.5), 0),
         cookie_city_effect = case_when(
           city == "Atlanta" ~ 5,
           city == "Philadelphia" ~ 1,
           city == "Los Angeles" ~ -3,
           city == "Charlotte" ~ 0
         ),
         cookies = cookies_baseline + cookie_time_effect +
           cookie_blue_effect + cookie_city_effect) %>%
  mutate(happiness_baseline = rnorm(n_people, mean = 47, sd = 9),
         happiness_time_effect = case_when(
           time_of_day == "Morning" ~ rnorm(n_people, 0, 1),
           time_of_day == "Evening" ~ rnorm(n_people, 8, 1)
         ),
         happiness_blue_effect = ifelse(like_blue == TRUE, 4, 0),
         happiness_cookie_effect = cookies * 1.3,
         happiness_city_effect = case_when(
           city == "Atlanta" ~ 10,
           city == "Philadelphia" ~ 30,
           city == "Los Angeles" ~ -10,
           city == "Charlotte" ~ 5
         ),
         happiness = happiness_baseline + happiness_time_effect + happiness_blue_effect +
           happiness_cookie_effect + happiness_city_effect)

real_data <- fake_data %>%
  select(id, city, time_of_day, like_blue, cookies, happiness) %>%
  mutate(cookies_rounded = round(cookies, 0),
         happiness = round(happiness, 0))

ggplot(fake_data, aes(x = like_blue)) +
  geom_bar()

real_data %>%
  group_by(city) %>%
  summarize(avg_cookies = mean(cookies),
            avg_happiness = mean(happiness))

ggplot(fake_data, aes(x = happiness_baseline)) +
  geom_density()

ggplot(real_data, aes(x = cookies, y = happiness)) +
  geom_point() +
  geom_smooth(method = "lm")

model_thing <- lm(happiness ~ cookies + time_of_day + like_blue, data = real_data)
tidy(model_thing)