Deconstructing Humor using Text Mining on /r/Jokes

I recently turned in a project for an elective class I’m taking on humor where I analyzed a bunch of jokes posted to the /r/Jokes subreddit. I thought that some of the results were applicable to this blog, so I’ll summarize the interesting ones here:

1) Donald Trump jokes are funnier than average

Jokes that included the word “Trump” received an average of 141 more net upvotes than jokes that didn’t involve the name (mean score 258 vs. 117, p=0.0016). Moreover, jokes that used both the word “Trump” and the word “orange” in the same post scored on average 837 more net upvotes than jokes with just “Trump,” but this result wasn’t statistically significant because only 57 jokes used both words (mean score 1074 vs. 237, p=0.27).

For comparison, Obama jokes are not significantly more upvoted than average (mean score 149 vs. 118, p=0.45, n=588).
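
The post doesn't say which test produced these p-values, so here is just one plausible way to run this kind of comparison: a two-sided permutation test on the difference in mean score. The function name and the toy scores are made up for illustration; they are not the /r/Jokes data.

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, n_permutations=5_000, seed=0):
    """Two-sided permutation test for a difference in mean score."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        # Shuffle the labels: under the null, group membership is arbitrary.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_permutations + 1)

# Toy scores only (hypothetical), standing in for net upvotes.
with_keyword = [310, 12, 540, 95, 1200, 48, 77, 260]
without_keyword = [15, 90, 4, 230, 61, 8, 120, 33, 19, 75]

p = permutation_p_value(with_keyword, without_keyword)
print(f"difference in means: {mean(with_keyword) - mean(without_keyword):.1f}")
print(f"permutation p-value: {p:.4f}")
```

A permutation test makes no assumption about the shape of the score distribution, which seems sensible for upvote counts that are dominated by a handful of viral posts.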

2) Chickens, deer, turkeys, cows, and elephants are the funniest animals

I chose these by comparing how often each animal appears in jokes against how often it appears in everyday English. Frogs, monkeys, and ducks tend to appear in jokes pretty often as well.
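
To make the frequency comparison concrete, here is a minimal sketch of the idea: divide a word's share of all joke words by its share of everyday English. The three toy jokes and the reference frequencies are invented for illustration, and the tokenizer is deliberately naive (it doesn't merge "chicken" with "chickens", for example).

```python
from collections import Counter
import re

def relative_frequency(texts):
    """Map each lowercase word to its share of all words in `texts`."""
    words = [w for text in texts for w in re.findall(r"[a-z']+", text.lower())]
    counts = Counter(words)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def overrepresentation(joke_freq, reference_freq, candidates):
    """Ratio of joke frequency to reference frequency, largest first."""
    ratios = {}
    for word in candidates:
        if word in joke_freq and word in reference_freq:
            ratios[word] = joke_freq[word] / reference_freq[word]
    return dict(sorted(ratios.items(), key=lambda kv: kv[1], reverse=True))

# Hypothetical inputs: three toy "jokes" and invented reference frequencies.
jokes = [
    "Why did the chicken cross the road?",
    "A chicken and a dog walk into a bar.",
    "The dog chased the cat.",
]
reference_freq = {"chicken": 0.001, "dog": 0.004, "cat": 0.003}

joke_freq = relative_frequency(jokes)
ratios = overrepresentation(joke_freq, reference_freq, ["chicken", "dog", "cat"])
for animal, ratio in ratios.items():
    print(f"{animal}: {ratio:.1f}x more common in jokes than in the reference")
```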

3) “He” and “man” are substantially more common than “she” and “woman,” but “wife” and “girlfriend” appear more than “husband” and “boyfriend”

So men are the subject of more jokes, unless the joke happens to be about a man-woman relationship?

Also, jokes using female-gendered words tend to repeat them more times than jokes using male-gendered words.

The data for this analysis came from Taivo Pungas’ public GitHub repository. My code is available here. You can download the report as a PDF too, if you’re interested in other findings or methodological details.

Corruption of the Youth! Profanity in Music, 1958–2019

There are many neat studies on the internet of lyrical content in popular songs. People have claimed that music is getting more repetitive and sexual, among many other things. Here’s a cool analysis of word sophistication, n-grams, and lyric sentiments.

Wanting to do something similar, I scraped the weekly Billboard 100 list of songs since its 1958 conception, and then—through many hours, request throttling configurations, and changed IP addresses—scraped the lyrics to most of these songs from three different popular lyrics websites.

There’s a lot more to mine from the data I collected, but what would be an interesting and quick first analysis? What sorts of words or lyrical trends would show the most dramatic changes between the early rock-and-roll of the ’60s and today?

Hint: to keep this site kid-friendly, I included only the first letter of those six key words in the visualizations below.

By comparing the two graphs, we can also see which words tend to be repeated multiple times within songs that use them!

Contact me if you’re interested in using the data or code from this analysis! I’m not posting them publicly for lyric copyright reasons.

A Timeline Visualization Tool from the Future

Quick follow-up to my last post: my father’s been developing a professional tool called TimeStory, built for visualizing timelines easily and aesthetically. It’s perfect for the Billboard analysis I was doing:

Timelines represent the span between artists’ first and last weeks on the Billboard 100, up to last month. Artists are ranked by their total presence on the Billboard, which makes Drake’s reign at the top even more impressive considering how short a time he’s been around.

If you have TimeStory, you can download the file from my Google Drive and scroll through the list of all top 200 artists to compare. If you don’t, you can still see or download the raw data as a spreadsheet.

Lil Nas X vs. The Beatles? The Billboard 100’s Most Successful Artists

There are many ways to judge an artist, and the best ways are probably entirely nonstatistical. But add up all these millions of ways that different people rate music, and you get something pretty close to metrics like total album sales, YouTube streams, and other quantitative popularity measures.

In this post, and hopefully in a few follow-up posts where I’ll dive deeper into this wealth of data, I’m looking at the Billboard Top 100 as the “most popular” songs each week. The Billboard Top 100 ranks songs based on sales, radio time, and streaming volume. I downloaded the Top 100 for every week between the weeks of August 4, 1958 (presumably when the Billboard started?) and September 28, 2019 (a week ago, at the time of writing).

The question of how to rank artists based on this data remains thorny: do we count the number of Number 1 hits they produced? The number of weeks any of their songs were at Number 1? The total number of Top-100 songs? The total number of song-weeks in the Top 100? I started with just these four, to keep it simple. (You could devise many other systems, probably more valuable ones, e.g. by weighting higher-placed hits more than lower ones, etc.)
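
As a rough sketch of how those four measures could be tallied, the snippet below assumes the chart data is a list of (week, position, song, artist) rows. That layout, the function name, and the placeholder artists are all assumptions for illustration; the code behind this post isn't published.

```python
from collections import defaultdict

def rank_metrics(chart_rows):
    """Compute four per-artist metrics from (week, position, song, artist) rows."""
    number_one_songs = defaultdict(set)   # distinct songs that reached #1
    weeks_at_one = defaultdict(int)       # weeks with a song at #1
    charted_songs = defaultdict(set)      # distinct songs anywhere on the chart
    song_weeks = defaultdict(int)         # one count per song per week charted

    for week, position, song, artist in chart_rows:
        charted_songs[artist].add(song)
        song_weeks[artist] += 1
        if position == 1:
            number_one_songs[artist].add(song)
            weeks_at_one[artist] += 1

    return {
        artist: {
            "number_one_hits": len(number_one_songs[artist]),
            "weeks_at_number_one": weeks_at_one[artist],
            "charted_songs": len(charted_songs[artist]),
            "song_weeks": song_weeks[artist],
        }
        for artist in charted_songs
    }

# Placeholder rows (hypothetical artists and songs).
rows = [
    ("2019-09-14", 1, "Song X", "Artist A"),
    ("2019-09-14", 2, "Song Y", "Artist B"),
    ("2019-09-14", 3, "Song Z", "Artist A"),
    ("2019-09-21", 1, "Song Y", "Artist B"),
    ("2019-09-21", 2, "Song X", "Artist A"),
    ("2019-09-28", 1, "Song Y", "Artist B"),
    ("2019-09-28", 5, "Song X", "Artist A"),
]

metrics = rank_metrics(rows)
by_weeks_at_one = sorted(metrics, key=lambda a: metrics[a]["weeks_at_number_one"], reverse=True)
by_song_weeks = sorted(metrics, key=lambda a: metrics[a]["song_weeks"], reverse=True)
print(by_weeks_at_one, by_song_weeks)
```

Even in this tiny example the two rankings disagree: one artist leads on weeks at #1 while the other leads on total song-weeks, which is exactly why the choice of metric matters.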

The following graphs rank the top 30 most successful Billboard artists, by time spent on top of the Billboard at #1 (left) and by total time anywhere on the Billboard (right).

And, if you’re wondering what songs launched some of these artists to the top, here are the top 30 songs, both by time at #1 and time anywhere on the Billboard 100:

If you ask me, this suggests that Drake is the greatest artist in history, and clearly he’s not done yet. I have to say, I’m disappointed in the legends of old who I kept telling people would be far more popular than today’s celebrities, although artists like Aretha and the Supremes do pretty well for themselves. But I wonder how Billboard-dominant Elvis would be with today’s internet/streaming/viralness?

Some questions: Is there a difference between fanbases of various genres? How powerful is “meme” value? How much of this is driven by clubs, movies, etc.? Factors that inform the trajectory of songs and artists are super interesting and unclear.

Complete tables of ranked artists and songs, beyond the top 30, are available here in Google Sheets, along with the full raw data I used as a CSV download. Contact me if you’re interested in the code I used for data collection or visualization; it’s not published here for lyrics copyright reasons.

Keep an eye out for future posts on this topic—I’ve got a few ideas bouncing around my head about studying what characteristics might make a song or artist popular!

Gender Diversity Among Professional Musicians

As a younger student I spent two years in a group of dancing violinists called Allegro!!!. Each year the directors tried to recruit exactly 8 boys and 8 girls to the ensemble, but somehow, it always seemed a challenge to find enough willing and able boys. In my second year with the group, those numbers were relaxed to become 4 boys (including me) and 12 girls. I suppose I’ve always tacitly assumed that this imbalance was due to the dancing aspect—but what if it was the violins that really did the discriminating?

Fast-forward to my fourth year in my university’s symphony orchestra, and I’ve noticed that, of the 8 or so randomly assigned stand partners I’ve had, only one has identified as male. At a glance, the violin sections seem majority women, and this makes me wonder: Are parents more likely to sign their daughters up for violin class than their sons? Are women simply more likely to stick with it? In a world where women are underrepresented in many artistic and other disciplines, is the reverse true among professional orchestra players? And is this a phenomenon of violinists in particular, or do other instruments follow similar patterns?

To tackle this question, I looked up the rosters of 8 of the top professional symphony orchestras in the US. I recorded the names of all 797 permanent members of these orchestras, identifying their gender from the pronouns in their bios. (The data I collected is here, if you’re interested.) Counts are presented below:

As it turns out, top violinists are more likely to be women, although this appears to be pretty unique among instruments (I left out the ones with fewer than 8 total musicians, such as keyboard or piccolo). Men tend to dominate most sections, particularly among the double bass, clarinet, brass, and percussion.

For the more statistically-minded, here’s another way of visualizing the proportion of each instrument section made up by women:

In the above plot, error bars signify 95% confidence intervals, calculable via your favorite statistical software or online calculator (see below for note on multiple testing adjustment). Loosely speaking, under the reasonable assumption that my collected data is demographically representative of all top orchestral musicians in the US, the red bars indicate that we can be at least 95% confident that these instruments are mostly played by men in top professional orchestras. The violin is the sole instrument with a significant women majority; the black bars represent cases where the data isn’t conclusive. The observed percentages of women in our sample of 8 orchestras are shown by magenta dots. In our sample, women outnumber men only in the violins, flutes, and harps.
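
For anyone who would rather compute such an interval than look it up, here is a minimal sketch using the Wilson score interval for a proportion. The post doesn't say which interval formula was used, so treat Wilson as an assumption. The example numbers are the overall tally from this post (295 women among 797 musicians), not any single instrument section.

```python
from math import sqrt
from statistics import NormalDist

def wilson_interval(successes, n, confidence=0.95):
    """Wilson score confidence interval for a binomial proportion."""
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

women, total = 295, 797
low, high = wilson_interval(women, total)
print(f"observed share of women: {women / total:.3f}")
print(f"95% interval: ({low:.3f}, {high:.3f})")
print("interval lies entirely below 50%:", high < 0.5)
```

An interval that sits entirely below 50% corresponds to a "red bar" in the plot's terms, and one that straddles 50% corresponds to an inconclusive "black bar".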

Finally, I did a quick plot of the 8 orchestras I used, to see the overall representation of women in each:

The New York Phil seems to be doing the best at equal gender representation, although it doesn’t take much to out-represent the others. In total, of the 797 musicians in these orchestras, 502 are men and 295 are women.

What do these figures mean? Gender discrimination is unfortunately nothing new when it comes to hiring, but I would probably put the imbalances here down to earlier stages in life simply based on my own anecdotal experiences of youth/school orchestras and their similar ratios. I know some exceptional women double bass players, but I’m willing to bet that in society men get encouraged to pick up giant upright basses or heavy, lung-intensive trombones far more than women do. On the flip side, people might perceive violins or flutes to be lighter and prettier, finesse instruments rather than power ones, and subconsciously direct young women more often toward those tracks. But there are myriad possible explanations.

To me, this analysis raises several questions worth further study. Some can be answered with data: Are principal-stand musicians more likely to be men than the rest of their sections? (Some suggest so.) Do men get paid more than women even in women-heavy violin sections? Do professional orchestras have similar gender distributions to music conservatories, or even youth orchestras? If young students start out equally represented, when in the talent pipeline does the balance shift? Other questions maybe can’t be answered with data: To what extent do societal perceptions cement these patterns vs. the other way around? Why exactly are certain instruments more male-dominated than others? How does this relate to representations of gender in other fields?

My data and code to produce the plots are available here.

Bonus Note on Multiple Hypotheses, for Those Interested

There is good reason to be skeptical any time multiple confidence intervals are all presented together and a few are singled out as “significant” while others are left as “insignificant.” While the confidence interval is a powerful and valid tool for any particular hypothesis, selecting the significant intervals from a list of them is a statistical fallacy for two main reasons:

  1. In theory, confidence intervals are usually built so they have equal probabilities of under- and overestimation of the true value. However, selected (i.e. “significant”) intervals are more likely to be overestimates rather than underestimates, simply because underestimates are less likely to be statistically significant. Therefore the selected intervals will be biased upward (away from the true value).
  2. If the true effect size is small—e.g. if in reality, 52% of violinists are women vs. the 50% we might expect as a null baseline—then “correct” confidence intervals are also likely to contain the null value. Thus, by selecting the significant intervals (the ones that don’t contain the null value), we’re ensuring that these small effect sizes tend to be less covered by the selected intervals.

All this results in the following phenomenon: Say we choose the standard significance level of α=0.05, i.e. we expect 95% of our confidence intervals to include the true value (this is the meaning of “95% confident”). If we then select the intervals that come out significant (i.e. those that do not contain the null value of 50%), we should expect fewer than 95% of those intervals to contain the true value.

There are several ways to correct for this, but here I used the simplest way: just make the intervals wider. This means it’s harder to be significant, so it seems like we’re losing statistical power, but ultimately we can be assured that (over many trials) our 95% confidence intervals are truly 95%. I performed the analog to a Bonferroni correction for multiple hypothesis testing, where instead of each interval being constructed at the 95% level, I used (1-α/n)-level confidence intervals. This ensures that the “false coverage proportion” equals 0 with probability at least 1-α (i.e. 95%), regardless of how the confidence intervals are selected from among each other. Thus, the expected “false coverage proportion” is bounded above by α (i.e. 5%), so the procedure is valid.

This method is perhaps over-conservative, but with n=16 instrument sections it doesn’t make a big difference here.
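
Here is a minimal sketch of what the widening looks like in practice, assuming a simple normal-approximation interval (the post doesn't state which interval formula it used). The 33-of-50 section is a made-up example, not a row from the orchestra data.

```python
from math import sqrt
from statistics import NormalDist

def z_multiplier(alpha, n_intervals=1):
    """Two-sided normal critical value at confidence level 1 - alpha / n_intervals."""
    return NormalDist().inv_cdf(1 - alpha / (2 * n_intervals))

def normal_interval(successes, n, z):
    """Normal-approximation interval for a proportion, clipped to [0, 1]."""
    p_hat = successes / n
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return max(0.0, p_hat - margin), min(1.0, p_hat + margin)

alpha, n_sections = 0.05, 16
z_plain = z_multiplier(alpha)              # unadjusted 95% interval
z_bonf = z_multiplier(alpha, n_sections)   # each interval at level 1 - alpha/16

# Hypothetical section: 33 women among 50 players.
plain = normal_interval(33, 50, z_plain)
adjusted = normal_interval(33, 50, z_bonf)

print(f"critical values: plain {z_plain:.3f}, Bonferroni {z_bonf:.3f}")
print(f"plain interval:    ({plain[0]:.3f}, {plain[1]:.3f})")
print(f"adjusted interval: ({adjusted[0]:.3f}, {adjusted[1]:.3f})")
```

In this made-up case the unadjusted interval excludes 50% while the adjusted one does not, which is the price of the correction: a borderline section can lose its "significant" label.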

Does Data Outperform Intuition in Fantasy Football?

This is one I might regret, come May and the end of the season.

First of all, this post is about soccer football, not football football (which I don’t know the first thing about). My friends and I play the popular online Fantasy Premier League game for English soccer teams. I’m a pretty avid fan of the league, at least as Americans go—but every year, my fantasy team ends up doing much worse than my passion and intellect deserve. This year, I’m putting hard data to the test in the age-old question: Is modern data analysis really better than old-fashioned experience and human intuition?

Here’s the scoop. The Fantasy Premier League (FPL) website gives you £100.0 at the start of the season to build a roster of 15 players, 11 starters and 4 subs. Better players cost more; worse players are cheaper. Points are assigned to each player after every matchday for things like scoring goals, getting assists, keeping clean sheets, or being man of the match. The key to this game (and the “human intuition” argument) is to identify the “hidden gems,” the undervalued players ready to take the league by storm and start banging in goals when no one expected them to.

I created two teams this year, my intuition team and my data-optimized team. Due to laziness and procrastination, both were created late: my intuition team after the first matchday and my data team after the second matchday (so the data team has some catching-up to do, already). I downloaded all of last year’s players and their final point tallies and prices, and let my optimization algorithm (see below) do the work. Then I went to China for two weeks with my family. When I got back, it was matchday 5 and I reprogrammed the algorithm to use point values amassed during the current season instead. Each week this season, I plan to run the optimization and update my data team according to its dictates; I will also maintain my intuition team as independently as I can, as a best-attempt at a control.

Interesting Results

The most interesting results on this one might have to wait until the end of the season, when I will surely post an update here. However, even my initial optimized team was strongly defender-focused, a very surprising result for me. Traditional wisdom generally favors as attacking a team as possible, with formations like 3-4-3 or 3-5-2, to maximize the team’s lucrative goals count. Yet my initial optimizations returned teams in 5-4-1 or 4-5-1 formations, highly defensive organizations. It appears that FPL assigns higher prices per point to attackers, and the real play might be to go for solid, dependable defenders. Or maybe it was just a fluke? My sample size is, after all, 1—for the time being.

My code and data can be found here. Feel free to use it.

For the Interested: How the Optimization Works

Okay, I don’t believe that this operation actually requires such a heavy-duty optimization algorithm, or maybe even any sort of optimization algorithm: you could probably achieve the same result by just calculating a points-per-pound ratio for each player and choosing the top 11 that satisfy the relatively simple requirements. But where’s the fun in that? Here’s what I implemented instead.

Any optimization algorithm needs some value to optimize: to compare better vs. worse. In this case, it’s pretty simple: total points of the starting 11 players. What could make this a nontrivial optimization problem is the set of restrictions placed upon rosters: (1) Total cost must not exceed £100.0; (2) Roster must consist of 2 goalkeepers, 5 defenders, 5 midfielders, and 3 attackers; (3) No more than 3 players from the same real-life team; and (4) The starting 11 (a.k.a. the point-scoring subset of the team) must be one of 7 valid formations (e.g. you can’t play 2 defenders, 5 mids, and 3 forwards).
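
The four restrictions translate fairly directly into code. The sketch below is an illustration, not the code linked above: the `Player` type, the function names, and the toy squad are all hypothetical. The post doesn't list its 7 valid formations, so `FORMATIONS` here is an incomplete placeholder holding only the four formations mentioned in this post.

```python
from collections import Counter
from typing import NamedTuple

class Player(NamedTuple):
    name: str
    position: str   # "GK", "DEF", "MID", or "FWD"
    club: str
    cost: float
    points: int

BUDGET = 100.0
SQUAD_SHAPE = {"GK": 2, "DEF": 5, "MID": 5, "FWD": 3}
MAX_PER_CLUB = 3
# Placeholder (assumption): only the formations named in the post, as (DEF, MID, FWD).
FORMATIONS = {(3, 4, 3), (3, 5, 2), (5, 4, 1), (4, 5, 1)}

def is_valid_squad(squad):
    """Check restrictions (1) to (3) on the full 15-player roster."""
    if dict(Counter(p.position for p in squad)) != SQUAD_SHAPE:
        return False
    if sum(p.cost for p in squad) > BUDGET:
        return False
    return max(Counter(p.club for p in squad).values()) <= MAX_PER_CLUB

def is_valid_lineup(starters, formations=FORMATIONS):
    """Check restriction (4): one goalkeeper plus an allowed formation."""
    by_pos = Counter(p.position for p in starters)
    shape = (by_pos["DEF"], by_pos["MID"], by_pos["FWD"])
    return len(starters) == 11 and by_pos["GK"] == 1 and shape in formations

def lineup_points(starters):
    """The value being optimized: total points of the starting 11."""
    return sum(p.points for p in starters)

def toy_squad():
    """Build a made-up 15-player squad spread over six fictional clubs."""
    squad = []
    for position, count in SQUAD_SHAPE.items():
        for i in range(count):
            club = f"Club {len(squad) % 6}"
            squad.append(Player(f"{position}{i + 1}", position, club, 6.0, 10 + i))
    return squad

def pick(squad, position, k):
    """First k players at a position (good enough for a demo)."""
    return [p for p in squad if p.position == position][:k]

squad = toy_squad()
starters = pick(squad, "GK", 1) + pick(squad, "DEF", 4) + pick(squad, "MID", 5) + pick(squad, "FWD", 1)
print(is_valid_squad(squad), is_valid_lineup(starters), lineup_points(starters))
```

Keeping the roster check separate from the lineup check mirrors the rules themselves: restrictions (1) to (3) apply to all 15 players, while (4) only concerns the 11 who actually score points.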

Checking all possible rosters for the absolute best one would require 8.4*10^26 iterations. Even at a million iterations a second, that would take over 26 trillion years to complete. This is where optimization comes in: instead of checking each possibility, start with a random roster, then swap one or more random players in the team for random new players. If the new team is better than the old one, keep the swap; otherwise, revert back to the old team. Do this a few million times, and you should get pretty close to the best possible team.
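
The snippet below first double-checks the brute-force arithmetic, then sketches the swap-and-keep loop on a deliberately simplified problem: pick 5 players from a pool of 30 under a budget, with no positions or clubs. The pool, the names, and the choice to start from the cheapest team (rather than a random roster, so the starting point is guaranteed valid) are all assumptions for illustration.

```python
import random

# Brute force: 8.4e26 rosters at a million checks per second.
seconds = 8.4e26 / 1_000_000
years = seconds / (365.25 * 24 * 3600)
print(f"{years / 1e12:.1f} trillion years")

def team_cost(pool, team):
    return sum(pool[i][0] for i in team)

def team_points(pool, team):
    return sum(pool[i][1] for i in team)

def hill_climb(pool, team_size, budget, n_steps=20_000, seed=0):
    """Swap one random player at a time, keeping only valid improvements.

    `pool` is a list of (cost, points) pairs; returns indices into the pool.
    """
    rng = random.Random(seed)
    # Start from the cheapest players so the first team is within budget.
    team = sorted(range(len(pool)), key=lambda i: pool[i][0])[:team_size]
    if team_cost(pool, team) > budget:
        raise ValueError("even the cheapest team is over budget")
    for _ in range(n_steps):
        bench = [i for i in range(len(pool)) if i not in team]
        trial = team.copy()
        trial[rng.randrange(team_size)] = rng.choice(bench)
        # Keep the swap only if it is affordable and strictly better.
        if team_cost(pool, trial) <= budget and team_points(pool, trial) > team_points(pool, team):
            team = trial
    return sorted(team)

# Deterministic made-up pool of 30 players: (cost, points).
pool = [(4.0 + (i * 7) % 9, 20 + (i * 37) % 101) for i in range(30)]
start = sorted(range(len(pool)), key=lambda i: pool[i][0])[:5]
best = hill_climb(pool, team_size=5, budget=40.0)
print("start points:", team_points(pool, start))
print("final points:", team_points(pool, best), "cost:", team_cost(pool, best))
```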

This is the basic idea behind the simplest “hill-climbing” optimization. For this project I implemented something closer to a Metropolis-coupled Markov chain Monte Carlo method (better known as MCMCMC, or MC3). There is a large and highly unnavigable body of literature on these types of optimization methods, and I am severely unqualified to represent it; however, by my understanding, MCMCMC performs this simple optimization procedure multiple times in a “chain” of parallel rosters, with varying “temperatures” or degrees of acceptance of worse outcomes. So the first roster in the chain would be the “cold” one and only ever accept swaps that yield a better team. The subsequent rosters in the chain would be progressively hotter, meaning they have higher probabilities of accepting a swap that yields a worse team. The last and hottest roster, for instance, might just be swapping completely randomly, regardless of whether each swap yields a better or worse team. Then, if ever a hotter roster has a better team than a colder one, the chain is reorganized “bucket-brigade” style and the better roster gets passed down the temperature scale.

Why do this? Having a bunch of hotter rosters to back up the cold one enables it to get out of local maxima, or positions where no single swap yields a better team (so the optimization is “stuck”), but it’s still not the overall best team it could be. By enabling varying temperatures, some rosters are allowed to “regress” a little bit from the local maximum and “restart” from a different, worse point, but which may yield better results in the end.
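
To make the temperature idea concrete, here is a minimal sketch that follows the description in this post rather than the textbook MC3 algorithm (which decides chain swaps probabilistically). It reuses a simplified toy problem: 5 players from a made-up pool of 30 under a budget. All names and numbers are hypothetical.

```python
import math
import random

def accept(delta, temperature, rng):
    """Decide whether to keep a swap that changes the score by `delta`."""
    if delta > 0:
        return True                  # improvements are always kept
    if temperature == 0:
        return False                 # the cold roster never accepts a non-improvement
    if math.isinf(temperature):
        return True                  # the hottest roster swaps completely at random
    return rng.random() < math.exp(delta / temperature)

def tempered_search(pool, team_size, budget, temperatures, n_steps=5_000, seed=0):
    """`temperatures` runs from coldest (index 0) to hottest."""
    rng = random.Random(seed)

    def cost(team):
        return sum(pool[i][0] for i in team)

    def score(team):
        return sum(pool[i][1] for i in team)

    start = sorted(range(len(pool)), key=lambda i: pool[i][0])[:team_size]
    if cost(start) > budget:
        raise ValueError("even the cheapest team is over budget")
    chain = [start.copy() for _ in temperatures]

    for _ in range(n_steps):
        # Each roster proposes one random swap and judges it at its own temperature.
        for k, temperature in enumerate(temperatures):
            team = chain[k]
            bench = [i for i in range(len(pool)) if i not in team]
            trial = team.copy()
            trial[rng.randrange(team_size)] = rng.choice(bench)
            if cost(trial) <= budget and accept(score(trial) - score(team), temperature, rng):
                chain[k] = trial
        # Bucket brigade: pass better rosters down toward the cold end.
        for k in range(len(chain) - 1, 0, -1):
            if score(chain[k]) > score(chain[k - 1]):
                chain[k], chain[k - 1] = chain[k - 1], chain[k]

    return sorted(chain[0]), score(chain[0])

# Deterministic made-up pool of 30 players: (cost, points).
pool = [(4.0 + (i * 7) % 9, 20 + (i * 37) % 101) for i in range(30)]
best_team, best_score = tempered_search(pool, 5, 40.0, temperatures=[0, 2, 8, math.inf])
print(best_team, best_score)
```

Because the cold roster only ever changes for the better, whether through its own swaps or a hand-off from a hotter neighbor, its score never decreases. The hotter rosters do the wandering on its behalf.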

Why Data?

Welcome to my new blog!

Beginning today I will be posting short, highly approachable stories about how I use large datasets in different ways to approach my everyday activities. Starting out, my hopes for this blog are threefold: (1) to share creative ways to look at publicly available data, (2) to improve my own understanding of techniques and of the world, and (3) to make data manipulation and visualization more accessible to everyone.

I work in a public health sciences lab, and spend most of my time there looking at large datasets. But why bring that home? In short, because “big data” isn’t just available in pre-canned fashion to scientists; it’s capable of revealing interesting things without needing PhD-trained researchers and rigorous statistics. While I don’t condone bad statistics, my mission is to show that interesting data lies all around us, and that you can learn a lot simply by asking engaging questions and picking up a few handy skills at the keyboard.

Finally, please do contact me at any time with questions, insights, suggestions, errors with my work, etc. Be sure to check out the rest of my website as well! I look forward to hearing from you!