I love the Grateful Dead. As a musician, I think they’re absolutely brilliant, but mostly I love rabbit holes. The conventional wisdom that “every show is different” misses the mark, I think—the fun of the Dead is seeing how songs evolve between shows, over years. How many times did they play this one two-chord sequence in Dark Star? (Many times.) What about when that riff shows up randomly in a different track the next show? You can kind of triangulate what’s going on in their minds, get into Jerry’s head a little bit. The online community of deadheads is also a blast. Here’s just the tip of the iceberg of how much work archivists put into that. I’ve read this post a dozen times, and a dozen more just like it. It’s a great thing to hyperfixate on, and a real goldmine when you find the one random recording from a random place with a minute-long gold nugget in the middle of an otherwise underwhelming 45 minute jam.
There’s also a pandemic happening. I’m working from home and usually have one show or another on in the background. I try and go on a nice walk every day and usualy have a show, or bits and pieces that I cycle through. I’m consuming more music now than I ever have.
The hallmark of every Grateful Dead lyrical composition is when they scream the title, full volume, slightly off-kilter during the chorus. The only song I knew of that didn’t do this, off the top of my head, was The Eleven (which feels like cheating—the song name is the key signature. Does that count?) This led me to my research question: In each Grateful Dead song, how many times do they say the song title?
Well, dear reader, let’s find out. I scraped all of the 300ish listed lyrics from Mark Leone’s lyrics archive and sanitized them.
Let’s scrape all the lyrics:
# Get the index page
page <- readLines('https://www.cs.cmu.edu/~mleone/dead-lyrics.html')
# Remove lines that aren't links to pages
page <- str_subset(page, pattern = "A HREF=\"gdead/dead-lyrics/")
# Convert to a two-column df, one column for song name and one for the link address
page <- tibble(
song_name = str_extract(page, regex('(?<=\\">)(.*)(?=</A>)')),
link = str_extract(page, regex('(?<=HREF=\")(.*)(?=\\">)'))
) %>%
mutate(link = paste0('https://www.cs.cmu.edu/~mleone/', link))
# Retrieve the lyrics from each page
for (i in 1:nrow(page)) {
message("Retriving lyrics for ", pull(page[i, "song_name"]), ", ", i , " of ", nrow(page))
page[i, "lyrics"] <- paste(readLines(pull(page[i, "link"])), collapse = " ")
}
# Basic processing: make everything lowercase, remove punctuation, remove 0-9, remove parentheticals
page <- page %>%
mutate(
song_name_clean = tolower(song_name),
song_name_clean = str_remove_all(song_name_clean, "(.*?)"), # remove parentheticals
song_name_clean = str_remove_all(song_name_clean, "[^\\w]"), # remove nonalphanumeric
) %>%
mutate(
lyrics_clean = tolower(lyrics),
lyrics_clean = str_remove_all(lyrics_clean, "(.*?)"), # remove parentheticals
lyrics_clean = str_remove_all(lyrics_clean, "[^\\w]") # remove nonalphanumeric
)
# Count the number of times the song title appears in the lyrics
page <- page %>%
rowwise() %>%
mutate(song_name_count = str_count(lyrics_clean, song_name_clean)- 1,
song_name_count_fuzzy = list(agrep(lyrics_clean, song_name_clean)))
dead_lyrics <- page
rm(page)
Here’s a histogram of the number of times they say the song name in each song.
dead_lyrics %>%
ggplot() +
geom_histogram(aes(song_name_count), binwidth = 1)
Is that a poisson distribution I see?
ll <- function(lambda) {-sum(dpois(dead_lyrics$song_name_count, lambda, log = TRUE))}
p <- optim(par = 3, f = ll, lower = 0)
ggplot() +
geom_histogram(data = dead_lyrics, mapping = aes(song_name_count), binwidth = 1) +
geom_line(aes(x = 0:30, y=nrow(dead_lyrics)*dpois(0:30, lambda = p$par)), col = "red")
Woah woah woah. I think this would be a great time to do a totally nuts regression. Like… one for that website on spurious correlations? (a negative binomial distribution actually fit better, but that’s not the matter now)
What if we look at the number of times each song was played? I love deadheads so much—it’s already been tabulated. I used the “fuzzyjoin” package and it worked super well:
play_freq <- read_csv("dead_play_freq.csv")
play_freq <- play_freq %>%
mutate(
song_name_clean = tolower(`SONG TITLE`),
song_name_clean = str_remove_all(song_name_clean, "(.*?)"), # remove parentheticals
song_name_clean = str_remove_all(song_name_clean, "[^\\w]"), # remove nonalphanumeric
) %>%
select(song_name_clean, times_played = `Times\nPlayed`)
agrepl(dead_lyrics$song_name_clean, play_freq$song_name_clean)
dead_lyrics <- stringdist_inner_join(dead_lyrics, play_freq, by = "song_name_clean", max_dist = 10, distance_col ="distance_col") %>%
group_by(song_name) %>%
filter(distance_col == min(distance_col)) %>%
filter(n() ==1)
Fit a hilariously bad poisson distribution:
ll <- function(lambda) {-sum(dpois(dead_lyrics$times_played, lambda, log = TRUE))}
p <- optim(par = 3, f = ll, lower = 0)
ggplot() +
geom_histogram(data = dead_lyrics, mapping = aes(times_played), binwidth = 10) +
geom_line(aes(x = 0:600, y=nrow(dead_lyrics)*dpois(0:600, lambda = p$par)), col = "red")
Take a look at the data:
ggplot(dead_lyrics) +
geom_point(aes(song_name_count, times_played))
And then do a poisson regression on the number of times each song was played and the number of times they say the song title in the lyrics:
reg <- glm(times_played ~ song_name_count, data = dead_lyrics, family = poisson(link = "log"))
summary(reg)
##
## Call:
## glm(formula = times_played ~ song_name_count, family = poisson(link = "log"),
## data = dead_lyrics)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -17.974 -13.832 -6.283 8.314 29.603
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.921637 0.006819 721.75 <2e-16 ***
## song_name_count 0.016880 0.001141 14.79 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 39957 on 242 degrees of freedom
## Residual deviance: 39755 on 241 degrees of freedom
## AIC: 41183
##
## Number of Fisher Scoring iterations: 5
Nuts! Holy signficance, batman! We can conclude that the number of times Jerry or Bobby says the song name in the song lyrics CAUSES it to be played more… right? And since we used a fancy poisson regression, it must be true ;)
I also ran a GAM and a random forest but they weren’t signficant, so we can discard them.
Note to future employers: this is all sarcastic. Please!
Back to the question at hand. What is the song that says the name the most number of times?
dead_lyrics %>%
dplyr::select(song_name, song_name_count, times_played) %>%
arrange(-song_name_count) %>%
filter(song_name_count > 10) %>%
knitr::kable()
song_name | song_name_count | times_played |
---|---|---|
Might As Well | 34 | 111 |
Good Lovin’ | 25 | 428 |
To Lay Me Down | 20 | 63 |
He’s Gone | 18 | 328 |
Money, Money | 15 | 3 |
Kansas City | 14 | 334 |
Lazy Lightnin’ | 14 | 111 |
Sugaree | 14 | 357 |
Wake Up Little Susie | 13 | 14 |
Deal | 12 | 423 |
Pretty Peggy O | 12 | 265 |
Ship of Fools | 12 | 225 |
Heaven Help The Fool | 11 | 7 |
Looks like it is Might as Well, which, yeah. But that is a late-discog add. Here’s a thought: the song title that has been said the most by the band, i.e, the number of times the name is said per song * the number of plays.
dead_lyrics <- dead_lyrics %>%
mutate(total_name_times = song_name_count * times_played)
dead_lyrics %>%
dplyr::select(song_name, song_name_count, times_played, total_name_times) %>%
arrange(-total_name_times) %>%
head(15) %>%
knitr::kable()
song_name | song_name_count | times_played | total_name_times |
---|---|---|---|
Good Lovin’ | 25 | 428 | 10700 |
He’s Gone | 18 | 328 | 5904 |
Deal | 12 | 423 | 5076 |
Sugaree | 14 | 357 | 4998 |
Kansas City | 14 | 334 | 4676 |
Truckin’ | 9 | 519 | 4671 |
Might As Well | 34 | 111 | 3774 |
The Harder They Come | 6 | 583 | 3498 |
Pretty Peggy O | 12 | 265 | 3180 |
Run for the Roses | 5 | 583 | 2915 |
Mama Tried | 9 | 302 | 2718 |
Ship of Fools | 12 | 225 | 2700 |
Not Fade Away | 5 | 531 | 2655 |
Goin’ Down The Road Feelin’ Bad | 9 | 293 | 2637 |
Tennessee Jed | 6 | 433 | 2598 |
Amazing. Over a 30-year touring history, the phrase “Good Lovin’” is said over 10,000 times. I’m so glad we know this now.
I started this because I was genuinely curious what the most-said Dead song title was, but this became a neat little exercise in model fitting. Just because something is poisson distributed doesn’t mean fitting a poisson linear regression is the right move, especially if you can’t identify a causal mechanism there. I fall into this trap all the time with spatial data and more ‘serious’ statistical things. But I had fun here. Thanks Amanda for helping me.