Quick thought: Letter frequencies in crossword puzzles

Just a quick follow-up on my last post about crossword puzzles and our new website, CrossWorthy.net. (New puzzles there every Sunday! Check it out if you haven’t had a chance.)

Our crossword-filling algorithm picks each word so as to maximize the possibilities left for the words that cross it, based on the letters in that word. For example, the three-letter “APE” would be favored over “AXE”, given the choice, because P is a more versatile letter than X. Thus, common letters would be favored, while uncommon letters (and therefore the words that contain them, like JACK or OOZE) would be left out.

My questions are: (1) does this really happen in my algorithm? and (2) do professional crosswords (e.g. NY Times crosswords) suffer from this bias too? Here’s a bar chart:

The red is a corpus of the 20,000 most-used English words, a “reference point” as if all those words were used uniformly in crossword generation. The green is from 34,505 letters from complete, nonsense-free crosswords from our CrossWorthy algorithm. The blue is from 848 NY Times crosswords (183k letters) since 2018.
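For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of the letter-share calculation. The two word lists are tiny made-up stand-ins, not the 20,000-word corpus or our actual fills, and `letter_shares` is just an illustrative name.

```python
from collections import Counter
import string

def letter_shares(words):
    """Return each letter's share (0..1) of all A-Z letters in `words`."""
    counts = Counter(
        ch for word in words for ch in word.upper() if ch in string.ascii_uppercase
    )
    total = sum(counts.values())
    return {letter: counts[letter] / total for letter in string.ascii_uppercase}

# Hypothetical toy inputs standing in for the corpus and the crossword fills.
corpus = ["APE", "AXE", "JACK", "OOZE"]
fills = ["APE", "SEA", "ASS", "ERA"]

corpus_shares = letter_shares(corpus)
fill_shares = letter_shares(fills)
for letter in "ASX":
    print(letter, round(corpus_shares[letter], 3), round(fill_shares[letter], 3))
```

Plotting the shares side by side for each source gives a grouped bar chart like the one described above.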

Apparently, the CrossWorthy algorithm heavily over-favors S and A. (It would be interesting to come up with a “versatility” metric for letters, with a different implication than “commonness”. A and S must be quite versatile. Compare A to I, for instance, which is just as common in the 20k word corpus. My gut says this has to do with the positions in a word where the letter can occur: A and I are both common as second or third letters, but A is an easier first letter than I. Even the blue NY Times bars seem affected by this letter-versatility thing, at first glance, though not as badly.) As expected, our algorithm also under-utilizes rare letters: Z, Q, J, X, V, W, K, F, Y. The interesting part is that the NY Times crosswords don’t seem to underuse rare letters compared to the corpus… I guess the pros just come up with more interesting crossword words!

Crossworthy Combinations

Wow, it’s been busy lately! May and I are in the middle of moving across the country to Berkeley, California, where I’ll start a new job in an entirely new industry. We’re also finalizing May’s immigration paperwork, while I’m trying to get a last paper out the door to close out my public health research. Amid everything, May and I have decided to become our own cruciverbalists of sorts, launching a new website – crossworthy.net – where anyone can play our original crossword puzzles!

We’ve been interested in crosswords for a while, now, and May even produced a new mini (5×5) crossword for every day in the month of May. I became invested in the project when I realized how fascinatingly difficult it was to create crossword boards. Minis were hard enough, but “Midis” (7×7) and full-size boards (15×15) became nigh impossible.

Take this average, empty crossword grid, for instance:

There are a couple 13-letter words to fill and a couple 12-letter words to fill, so this definitely isn’t a trivial board. And from a large corpus of words I got from several sources (dictionaries, phrases, celebrity names, etc.), there are 1223 three-letter words, 2043 four-letter words, 2734 five-letter words, and so on… meaning there are about 1.7 × 10^100 possible ways to arrange the horizontal words only, or more than the number of atoms in the universe (around 10^80, apparently).
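To get a feel for how a count like that comes about, here is a rough sketch that multiplies the number of available words across slots (done in log10 to keep the numbers readable). The word counts are the ones quoted above; the list of slot lengths is hypothetical, not the grid pictured, and the sketch ignores the rule that a word shouldn't appear twice.

```python
import math

# Word counts by length quoted in the post; longer lengths are omitted here.
WORDS_BY_LENGTH = {3: 1223, 4: 2043, 5: 2734}

def log10_arrangements(slot_lengths, words_by_length):
    """log10 of the number of ways to fill every slot independently."""
    return sum(math.log10(words_by_length[n]) for n in slot_lengths)

# Hypothetical set of horizontal slots (not the grid from the post).
slots = [3, 3, 4, 5, 5, 4, 3, 3]
print(f"about 10^{log10_arrangements(slots, WORDS_BY_LENGTH):.1f} horizontal fills")
```

Even eight short slots already give a number with more than 25 digits, so a full 15×15 board reaching the hundreds of digits isn't surprising.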

Given that only a relative handful of these would also give sensible vertical words, the chances of filling a proper crossword board seem pretty slim. It makes sense to start by inserting words in the hardest spots – otherwise, by the time we get to them, we may be completely out of luck. Most professionals will fill in the longest words first, then build around them. But, as it happens, there are fewer candidate words for the 3-letter slots than for these long ones, so the first (and “hardest”, in a sense) word I’d choose to fill is just:

CSS, short for “cascading style sheets,” a ubiquitous web design language.

Why did I choose CSS as opposed to HAM, PJS, or another of the 1,223 three-letter words available? As it turns out, using CSS means that the three vertical, intersecting columns can still be filled by 823 different words in the database, more than if any of the other 1,222 entries were picked.

Now, the word-slot with fewest remaining possibilities is 5-Down, or “C _ _ _”: only 147 words match this pattern. So the process repeats, and we write in:

because this affords the 3 empty, horizontal “cross”-words 509 total possibilities, more than any other option.
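The two steps walked through so far (pick the slot with the fewest matching words, then pick the word that leaves its crossing slots the most options) might be sketched like this. It is a simplified illustration and not our production code: the function names, the grid-as-dictionary representation, and the ten-word list are all assumptions made for the example, and it skips details such as avoiding repeated words.

```python
def candidates(slot, grid, words):
    """Words that fit `slot` given the letters already placed in `grid`."""
    return [
        w for w in words
        if len(w) == len(slot)
        and all(grid.get(cell, ch) == ch for cell, ch in zip(slot, w))
    ]

def most_constrained_slot(slots, grid, words):
    """Pick the unfilled slot with the fewest matching words."""
    open_slots = [s for s in slots if any(cell not in grid for cell in s)]
    return min(open_slots, key=lambda s: len(candidates(s, grid, words)))

def best_word(slot, slots, grid, words):
    """Pick the word that leaves the crossing slots the most total options."""
    crossing = [s for s in slots if s != slot and set(s) & set(slot)]

    def remaining(word):
        trial = {**grid, **dict(zip(slot, word))}
        return sum(len(candidates(s, trial, words)) for s in crossing)

    return max(candidates(slot, grid, words), key=remaining)

# Toy 3x3 example with a made-up word list (not our real database).
WORDS = ["APE", "AXE", "AIM", "ANT", "PAN", "PEN", "PIT", "ERA", "EAT", "XIS"]
across = tuple((0, c) for c in range(3))
downs = [tuple((r, c) for r in range(3)) for c in range(3)]
slots = [across] + downs

slot = most_constrained_slot(slots, {}, WORDS)  # all tied here, so the first slot wins
word = best_word(slot, slots, {}, WORDS)
print(word)  # APE leaves the three columns 4 + 3 + 2 = 9 options; AXE leaves only 7
```

If a slot ever has no candidates at all, `best_word` raises a `ValueError`, which corresponds to a board getting stuck.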

Continuing in this vein, we can automate this algorithm pretty easily – with a few nuances, like giving more weight to longer words, thematic words, etc. Unfortunately, even with some extra optimizations tacked onto the end, the whole process only works about 2% of the time. The other 98% of attempts get stuck with a few word-slots that just can’t be filled (see the histogram below). Coded up in Python, the whole process takes about 10 seconds per attempted board, working out to an average of one good board every 8 minutes of running time or so.
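The "one good board every 8 minutes" figure follows directly from the two numbers above, assuming each attempt succeeds or fails independently of the others:

```python
success_rate = 0.02        # about 2 percent of attempts succeed (from the post)
seconds_per_attempt = 10   # roughly 10 seconds per attempted board (from the post)

# Assuming independent attempts, the number of tries until the first success
# is geometric, so its mean is 1 / success_rate.
expected_attempts = 1 / success_rate
expected_minutes = expected_attempts * seconds_per_attempt / 60
print(f"{expected_attempts:.0f} attempts, about {expected_minutes:.1f} minutes per good board")
```

That works out to 50 attempts on average, or a little over 8 minutes.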

Then, of course, comes the clue-writing, and then, of course, the playing!

Aside from our website crossworthy.net, we’re also @CrossWorthy on Twitter, and you can also sign up to get an email for every new crossword (every Sunday).

We would love to hear feedback on any of the boards or clues we generate, suggestions, and anything else! We’ve got a special email, crossworthypuzzles@gmail.com, where you can send any thoughts, or, as always, you can get in touch by any other method.

Slow and steady… gets published eventually?

Lately, I feel like I’ve been talking about fracking nonstop. I’ve been researching fracking as a public health scientist since January 2018, but in this transitional time for me—graduation looming, the job search ever continuing—it seems that the last two and a half years of fracking research are all flashing by one last (?) time. In job interviews, I frequently find myself pontificating about fracking, sometimes formalized in slide decks. Then there’s my undergraduate honors thesis, which I just defended this morning in front of a faculty committee, where I describe my newest evidence that fracking activity may increase hospitalization rates in local communities.

So there’s a good chance I was speaking about fracking during an interview yesterday when an email popped into my inbox, declaring that my first paper—on fracking, of course—was finally published. (Even better, it’s open access, meaning you can read it without fancy university credentials; that privilege only cost our research fund $5000.) It mostly follows from the realization that fracking companies need not disclose the chemicals they inject into the earth to stimulate greater oil production. My co-authors and I investigated a prominent effort to improve public disclosure of said chemicals, but we concluded that it ultimately fails.

The funny thing is that the work described in the paper is mostly stuff I did two years ago or more. The paper itself was then written, rejected, reworked, rejected, entirely rewritten, rejected, reformatted, rejected, revised, then ultimately accepted. It went through five different journals, the last one requiring three rounds of back-and-forth peer review. Along the way there were such frustrations as:

  • having to redo all the analyses because the submission & revision process had outlasted an entire year of new data (for a while, I was afraid I’d have to do this a second time)
  • duking it out with adamant reviewers who had unbridgeable differences with us about our methodology
  • completely forgetting why I’d done some small detail in a particular manner, despite my best efforts to keep good notes (this one’s a real head-banger!)

The process was so slow and excruciating that, by the final few steps, it was the part of my job I dreaded the most. There I was, having to dredge up two-year-old code, data, and thought processes, just to satisfy some reviewer’s particular inquiry. Now that the paper’s out I do feel some pride, but even that feels rather muted because it’s no longer super relevant to my current research. By the time there was any light on this paper’s publication horizon, I had long since moved on to far more interesting projects. I wonder when those will be published.

I suppose it’s the old 80/20 rule – final delivery is the hardest part! But academia is a particularly slow-paced environment. That’s its great value: researchers should be able to afford a careful, focused, methodical approach with thoughtful feedback cycles, away from the pressures and influences of the corporate world. But to a college senior eager to launch a career, it all just feels so tortoiselike sometimes!

Some AI transcriptions of popular song lyrics

Amazon Transcribe is a Siri-like tool that can write down text from an audio recording. It’s probably useful for closed captioning, etc.

I thought I’d have some fun by feeding it a few famous songs:

Shake It Off (Taylor Swift)

“people, people, but people, people. Hey, just think you’ve been getting down and out about the liars and the dirty, dirty cheats of the world. You could have been getting down to this sick beat his new girlfriend. She’s like, Oh my God, I’m just shaking to go there with you. Come on over, baby with shake, shake, shake”

Piano Man (Billy Joel)

“It’s nine o’clock on Saturday. Oh, regular crowd shuffles. There’s an old man sitting next to me making love to his tonic and gin, he says. You play me a memory. I’m not really sure, but it’s sad and it’s sweet. And I knew it. I wouldn’t worry. Sing us, Sing us well around in the booth. You got a spill. It all right now, John at the bar is a friend of mine. It gets me my drinks for free, and it’s quick with a joke on light Up Your Smoke. But there’s someplace that he’d rather bill. I believe this is killing me. Smile. Run away from his face. My place. Now Paul is a real estate novelist who never had time for a while, and he’s talking with David, was still in the Navy and probably will be practicing politics. Businessmen. So get stuff, Yes, but it’s drinking. It’s a pretty good crowd for a Saturday, and the manager gives me a smile because he knows that it’s me they’ve been coming to see, to forget about life for a while. Sales like, What are you doing”

Lose Yourself (Eminem)

“one moment, sweaty in these weak on heavy. There’s vomit on his sweater already. Mom’s spaghetti. He’s nervous, but on the surface he looks calm and ready to drop bombs. But he keeps on forgetting what he wrote down. The whole crowd goes so loud he opens his mouth. But the words won’t come out jumping. How everybody’s token Down with reality. Oh, there goes, gravity goes so you won’t give up that easy. No backsies. It don’t matter. He knows that he’s so sad that he knows when he goes back to this mobile phone booth. Amusing. Wait, escaping through this hole that is gaping. This world is mine for the taking. Make me King as we move toward a new world order. A normal life is for the post mortem. It only grows. Homie grows hotter, blows no, goes home and barely knows his own nose. See, it goes cold. Cold products booth. The music. Mr No way game’s changed with rage. Cage. I was playing in the beginning, the mood on changed to spit out stays, but I can’t prime step right in the next life. I believe somebody’s paying right people. Life for my family. This man stands for And, you know, I think my life in these times. So it’s getting even. Wanna see being a father and a prima baby Mama Drama. Damn him like a snail of guns Formulated. This has got to go. I cannot wait. And the music”

Thriller (Michael Jackson)

“it’s me. Wait, Thio, You wait Creature froth without the solos All getting down Stand with inside of a corpse Shell way, Way with 40,000 years and ghouls from every tomb you fight to stay alive for yeah”

Where Is the Love? (Black Eyed Peas)

“What’s wrong with the world, Mama? People living like Thank God. Oh Mama, I think the whole world’s addicted to the drama only attracted to things and bring the trauma overseas trying to stop terrorism. But we still gotta tell risk here, living in the U. S. Saying a pigsty, I pleasant quips and K k k. But if you have love for your own race, you’re only space ruminate and discriminate only generates hate on when your hate and your band against Ray what you demonstrate. And that’s exactly angle works. Operates. You gotta have loved to set a straight ticket told mind. Meditate so gravitates in love way Same always changed New days of strangers in love and peace is strong while pieces a pump that don’t belong nations dropping bombs, chemical gases, filling lungs of little ones with ongoing suffering as the young. So ask yourself is a loving, really, really what is going wrong in this world that we live in. People keep on making wrong decisions. Only visions of the Nativity respecting each other wars going on. But the reasons under cover the truth is kept secret, swept a little drunk and you never know you never know. I was waiting weight of the world on my shoulder. As I’m getting older, your people get colder. Most of us only care about money making and selfishness. Gotta follow with the wrong direction. Wrong information always shown by the media Negative images is the main criteria. Infecting the young man’s passage from bacteria gets one act like what they see. Whatever happened to the values of humanity? Whatever happened to the fairness and equality instead of spreading them, were spreading animosity, lack of understanding, leading with community. The reason why? Sometimes I really wonder if that’s the reason why. Sometimes I’m feeling down. It’s no wonder why. Sometimes I feel it under cash Way only got wait.”

Carrot Stix (Yours Truly)

“Karen Beans. I turn it green light Brooklyn since my teens, but I ain’t got no means. I ain’t got no money to buy all of my 13. How much was to eat a balanced style with the greens? 40 years of beaten and not a single salad. They see my face is pallet. Well, I say they’re facing invalid but three cardiac procedures and triple bypass since go go. Way to make a man convinced in the garden I’m a king and you can’t take that away And I got no job. Does that prove my food all day and getting me a thing? But it’s on my resume. I don’t get it. I got my environment. So I want to eat some veggies. But I can report that when they tell me. Take it easy, but they know they don’t go slows. Grow him on my own. So I pull a garden. You pull up with hope hope the Garden of Eden in spring, every seating and summer. I’m sweating, but I’m out here with my parents are here needing my breeding, my feeding. So I’m proceeding to 20 me bleed and I’m exceeding a speeding, deceiving and pee in the weeds and at a receding. And nothing can stop me from succeeding because I’ve been reading it. I put in my globe picking my spade. People responds better. Those parts put me up every like a piece of paper on my part was the prince. But I’ll unpopular properly planting my property Probably pretty soon, pulling the piece in the pods he’s produced pie. What are the odds? The bees, I’m afraid, Work to the 10 then in the stands, planting peas like Gregor and Garden King. And you can take that away. I got no because I grew my food all day and getting me a thing. But it’s on my resume. I don’t get a dollar, but I got mine of Ivan. I got the horses in the bag and the horses get me back. Three tons of solid horse manure. I put it on my back. I drag it to the garden for the plants. It’s like the armor. Yeah, just a garden. And I’m a full on farmer. I put in my glove. Oh, this was my labor of love. Shit like heaven above. Don’t eat from the stone. I don’t eat that achieve what I grow. Don’t make me. 
But you don’t. If you don’t like my garden, Brody And you were straight A meaning. They tell me rice, but I eat.”

SEAGULLS! (Stop It Now) (Bad Lip Reading)

“a penny for your thoughts. I hate Brenda and a bad guy hit me in the shit and I peed on my pants. Uh, it’s nothing a little music can now down to the beach. I’m strong in what she goes poking. My son said she goes, Stop it now. Everyone not to stroll on that beach said, You guys gonna come in and wait? When these Persian When I tried to run this way, Way back by you proven Booth. Hey, show you some dance moves night. Want to Joe Dante? There on the beach. Run those birds. Your psycho wiener. Let me grab my Peter, please. Come on, man. Quick that that’s bank. Put a fish in our basket. You owe me an apology. Just hold your breath, T One time Frank back, you’ll be back by train from Hera. I got your back. Quiet. I understand. On one candidate box. Your special gift. That’s good. Uh huh. One day I was walking into, found the lock and I rolled the log over underneath. It was a time stick, and I was like, Someday when you are older, you duties. Stop it now. Yeah, whatever. You’re sort of pitchy. Do you like? Listen, man, I’m not your friend, Loom, Don’t fall asleep. Stop it now.”

Coronavirus, Disease Burden, and the New York Times

So, UChicago just canceled their entire spring term, migrating all classes to online platforms and sending students home after we take winter quarter finals next week. Suddenly, I have one week to say bye to all the friends and acquaintances I’ve made over the last four years. And graduation? Senior week? The concerts and shows we’ve put together? Friends with on-campus jobs? Friends who can’t fly home? The coronavirus is deadly, but right now it’s hard for students to look past their own uncertain situation.

The most difficult part of comprehension is contextualization. You can’t open the news without being blasted by the hundreds of thousands of cases and thousands of deaths. I think we tend to judge a particular threat based on how frequently we see it in the news. Globally, diabetes causes over 1.5 million deaths yearly, road traffic deaths exceed 1.2 million, and suicides account for nearly 800,000. Coronavirus is at 5,000 now, and growing. Based on exposure in the news, would you have been able to rank these by their burden?

The university and global communities should be laser-focused on containing COVID-19, but I’m disappointed we don’t see more widely publicized efforts to improve road safety or mental health during “normal times.” A fraction of the bandwidth that the coronavirus receives could revolutionize our awareness of issues that burden our world. A fraction of the containment efforts would save thousands of lives.

I used the awesome New York Times Article Search API to search for different keywords relevant to major causes of death. For each keyword, the number of matching NYT articles (since 1851) is plotted against an estimate of the global deaths from that cause.
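My own code for this was in R, but the idea translates to a few lines of Python. The sketch below is an illustration only: the helper names are made up, you need your own API key, and the location of the hit count in the response (`response.meta.hits`) is an assumption worth checking against the current API documentation.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"

def build_query_url(keyword, api_key):
    """Build an Article Search request URL for one keyword."""
    params = urllib.parse.urlencode({"q": keyword, "api-key": api_key})
    return f"{BASE_URL}?{params}"

def hit_count(payload):
    """Read the total number of matching articles from a parsed response.

    Assumption: the count lives at response.meta.hits; field names can
    change, so verify this against the API docs before relying on it.
    """
    return payload["response"]["meta"]["hits"]

def fetch_hits(keyword, api_key):
    """Fetch and return the hit count for `keyword` (needs a real API key)."""
    with urllib.request.urlopen(build_query_url(keyword, api_key)) as resp:
        return hit_count(json.load(resp))
```

Calling `fetch_hits` once per keyword, with a pause between requests to respect rate limits, gives one axis of the plot; the death estimates come from separate sources.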

There is absolutely nothing scientific about this. I only wanted to gain an intuition for the scale of disease deaths vs. news presence. (Remember that the coronavirus has only been on the scene for months, while other keywords have had decades to build up news hits.)

Data and R code are here.

Are We Good at Being Random?

One day in high school, my statistics teacher announced an activity: flip a mental coin 25 times (I don’t recall the actual number) and write down the results. Be as random as you can. Then, flip an actual coin 25 times and write down those results too.

Then she walked out of the room. She clearly thought we’d be bad at being random, and so we set about proving her wrong. When she re-entered, every student presented their two sequences of ‘heads’ and ‘tails’:


The teacher then strolled around the classroom, stopping at every student’s desk and identifying which sequence was from the true coin and which was from the student’s mind. She did this without fail.

Four years on, this feat still impresses me. How hard is it, really, for humans to be random? And how easily can we tell if randomness is fake? So I sent out this form to some contacts, asking them to submit a sequence of mental coin flips. The theory to test: randomness is more “streaky” than we think. For instance, “HHHHH” is more likely to appear in a truly random sequence of coin flips than it would in a mental-random sequence.

If this theory is true, the probability of switching between Heads and Tails is higher for humans than for random coins. For the 11 responses I received, the red points below mark each respondent’s probability of switching between Heads and Tails from one mental flip to the next. The white blobs are null distributions: indications of where these dots tend to fall after simulating 10,000 random “coins” on the computer. As you can tell, humans tend to have substantially higher rates of H/T switching than random generators.
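Here is a minimal sketch of that comparison in Python. It is not the code behind the plot: the function names are illustrative, the `mental` sequence is a made-up response, and the p-value is simply the share of simulated coins that switch at least as often as the respondent did.

```python
import random

def switch_rate(flips):
    """Fraction of adjacent pairs where the outcome changes (H->T or T->H)."""
    pairs = list(zip(flips, flips[1:]))
    return sum(a != b for a, b in pairs) / len(pairs)

def null_switch_rates(n_flips, n_sims=10_000, seed=0):
    """Switch rates for `n_sims` simulated fair-coin sequences of length `n_flips`."""
    rng = random.Random(seed)
    return [
        switch_rate([rng.choice("HT") for _ in range(n_flips)])
        for _ in range(n_sims)
    ]

def upper_p_value(observed, null_values):
    """One-sided p-value: share of simulated values at least as large as observed."""
    return sum(v >= observed for v in null_values) / len(null_values)

mental = "HTHTTHTHHTHTHTTHTHHT"  # hypothetical response, not real survey data
null = null_switch_rates(len(mental))
print(switch_rate(mental), upper_p_value(switch_rate(mental), null))
```

A fair coin switches about half the time, so a respondent whose switch rate sits well above 0.5 lands in the upper tail of the null distribution.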

Second, the longest runs of continuous Heads or continuous Tails might be shorter in our sequences than in truly random sequences. Blue dots show the longest continuous run for each of the respondents, while the null distributions are again generated by 10,000 simulated random flips:

Hmm, our longest runs seem substantially shorter. We tend to not report any continuous runs longer than “HHHHH” or “TTTTT”, but real random generators tend to spit out runs of up to 7 or 10 continuous Heads. (And the more total flips you do, the longer your longest sequence might be.)
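The longest-run check can be sketched the same way. Again this is an illustration under assumptions: the 25-flip length is borrowed from the classroom activity, the `mental` sequence is invented, and the final number is the share of simulated coins whose longest run is no longer than the respondent's.

```python
from itertools import groupby
import random

def longest_run(flips):
    """Length of the longest streak of identical consecutive outcomes."""
    return max(len(list(group)) for _, group in groupby(flips))

rng = random.Random(1)
simulated = [
    longest_run([rng.choice("HT") for _ in range(25)])  # 25 flips per simulated coin
    for _ in range(10_000)
]
mental = "HTHHTHTTHTHHTTHTHTTHHTHTH"  # hypothetical 25-flip response
share_as_short = sum(run <= longest_run(mental) for run in simulated) / len(simulated)
print(longest_run(mental), share_as_short)
```

A sequence that never repeats an outcome more than twice, like this one, is matched by only a tiny fraction of real 25-flip coins.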

So, next time you’re trying to be random, don’t be afraid of streaks!

Here’s a table of unadjusted one-sided p-values obtained from the simulated null distributions for each respondent:

Respondent Number | # Flips Given | P-Value for Pr(Switch H/T) | P-Value for Longest Run

The data and code for this post are available here. If you’d like to add your own data to the pile, you can submit your mental-random flips here.

To see how streaky random coins are, here are the first 5 random sequences I generated:


Statistics, or Stories?

Politicians make me queasy. So do self-help books, Twitter wars, linguistics professors, even TED talks. It’s not that they lack value—Twitter excepted, of course—it’s that they rely too heavily on examples and stories to persuade.

Why? It’s effective. As Chip and Dan Heath put it in Made to Stick (one of those self-help books), anecdotes (simple, concrete, and relatable) “have the amazing dual power to simulate and to inspire.” But such a quote isn’t as persuasive as the real examples they offer in the book, like how Steve Denning managed to convince the World Bank senior leadership to restructure the organization by sharing the story of a single Zambian health worker searching for malaria treatment info. Stories are more convincing than statistics.

But with billions of people and anecdotes in the world, you can cherry-pick one for almost any argument you’d like. So for every story Elizabeth Warren tells about an immigrant who lost everything to Trumpian cruelty, I’m sure Donald Trump could find a loyal storyteller who lost everything to immigrants. It breaks my heart that stories outpersuade statistics: in other words, that people are human.

The issue, to me, is that people (a) don’t remember numbers, and (b) distrust statistics. Point (a) is probably an irreversible pillar of human nature that will always sway the persuasion pendulum in favor of stories. For point (b), anecdotes often seem more trustworthy because they happened to real, imaginable people, while we generally don’t know where most of the stats we consume come from. The Heath brothers put it better than I can: “tinkering with statistics provides lucrative employment for untold numbers of issue advocates. Ethically challenged people with lots of analytical smarts can, with enough contortions, make almost any case from a given set of statistics.” But the same thing happens with stories! It seems that the problem with statistics is their deceptive pretense to be fact, while apparently stories don’t start off with this claim in the first place.

Of course, it would be impossible and non-productive to qualify every argument with accurate statistics. I didn’t begin this post with, “Seventy-two percent of politicians make me queasy (standard error of 8.2 percentage-points).” But education in statistics and statistical ethics can make us better listeners and communicators. Our statistics should be prospective and inclusive, and we must have the flexibility to accept and share statistics that don’t support our arguments.

If you’re trying to convince someone, stories are your best bet. But to understand an issue, statistics can be skyscrapers built of story-bricks. For some final advice, I’ll turn to the Heath brothers one last time: “when it comes to statistics, our best advice is to use them as input, not output. Use them to make up your mind on an issue. Don’t make up your mind and then go looking for the numbers to support yourself—that’s asking for temptation and trouble.” But don’t do that with stories, either.

Quotes are cherry-picked from the popular book by Chip and Dan Heath, Made to Stick: Why some ideas take hold and others come unstuck (Penguin Random House, 2007), pages 237 and 147.

Deconstructing Humor using Text Mining on /r/Jokes

I recently turned in a project for an elective class I’m taking on humor where I analyzed a bunch of jokes posted to the /r/Jokes subreddit. I thought that some of the results were applicable to this blog, so I’ll summarize the interesting ones here:

1) Donald Trump jokes are funnier than average

Jokes that included the word “Trump” received an average of 141 more net upvotes than jokes that didn’t involve the name (mean score 258 vs. 117, p=0.0016). Moreover, jokes that used both the word “Trump” and the word “orange” in the same post scored on average 837 more net upvotes than jokes with just “Trump,” but this result wasn’t statistically significant because only 57 jokes used both words (mean score 1074 vs. 237, p=0.27).
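One way to put a p-value on a difference in mean scores like this is a permutation test, sketched below. This is an illustration and not necessarily the test used in my report; the scores are invented, and `permutation_p_value` is a hypothetical helper.

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, n_perm=5_000, seed=0):
    """Two-sided permutation p-value for the difference in mean scores."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign scores to the two groups at random
        diff = abs(mean(pooled[:len(group_a)]) - mean(pooled[len(group_a):]))
        hits += diff >= observed
    return hits / n_perm

# Hypothetical scores, not the /r/Jokes data.
trump_scores = [900, 12, 40, 310, 5, 280]
other_scores = [100, 3, 25, 60, 8, 150, 30, 70]
print(mean(trump_scores) - mean(other_scores), permutation_p_value(trump_scores, other_scores))
```

With only a handful of jokes in one group, a few viral posts can produce a large gap in means that shuffling reproduces fairly often, which is how a big difference can still fail to reach significance.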

For comparison, Obama jokes are not significantly more upvoted than average (mean score 149 vs. 118, p=0.45, n=588).

2) Chickens, deer, turkeys, cows, and elephants are the funniest animals

These are chosen by comparing their frequency in jokes against their frequencies in everyday English. Frogs, monkeys, and ducks tend to appear in jokes pretty often as well.
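A simple version of that comparison is a ratio of relative frequencies: how often a word shows up in jokes divided by how often it shows up in ordinary text. The sketch below uses two made-up sentences as stand-ins for the joke and reference corpora, and the add-one smoothing is my own assumption to avoid dividing by zero.

```python
from collections import Counter

def overrepresentation(joke_tokens, reference_tokens, vocabulary, smoothing=1):
    """Ratio of each word's relative frequency in jokes to that in reference text."""
    jokes, reference = Counter(joke_tokens), Counter(reference_tokens)
    n_jokes, n_ref = len(joke_tokens), len(reference_tokens)
    return {
        word: ((jokes[word] + smoothing) / n_jokes)
        / ((reference[word] + smoothing) / n_ref)
        for word in vocabulary
    }

# Hypothetical toy corpora, not the real joke or reference data.
joke_tokens = "why did the chicken cross the road the chicken saw a dog".split()
reference_tokens = "the dog chased the cat while the dog barked at a chicken".split()

ratios = overrepresentation(joke_tokens, reference_tokens, ["chicken", "dog", "cat"])
for animal, ratio in sorted(ratios.items(), key=lambda item: -item[1]):
    print(animal, round(ratio, 2))
```

Animals with a ratio well above 1 are the ones that appear in jokes far more than everyday usage would predict.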

3) “He” and “man” are substantially more common than “she” and “woman,” but “wife” and “girlfriend” appear more than “husband” and “boyfriend”

So men are the subject of more jokes, unless the joke happens to be about a man-woman relationship?

Also, jokes using female-gendered words tend to repeat them more times than jokes using male-gendered words.

The data for this analysis came from Taivo Pungas’ public GitHub repository at https://github.com/taivop/joke-dataset. My code is available here. You can download the report as a PDF too, if you’re interested in other findings or methodological details.

Corruption of the Youth! Profanity in Music, 1958–2019

There are many neat studies on the internet of lyrical content in popular songs. People have claimed that music is getting more repetitive and sexual, among many other things. Here’s a cool analysis of word sophistication, n-grams, and lyric sentiments.

Wanting to do something similar, I scraped the weekly Billboard Hot 100 list of songs since its 1958 inception, and then, after many hours, several request-throttling configurations, and a few changed IP addresses, scraped the lyrics to most of these songs from three different popular lyrics websites.

There’s a lot more to mine from the data I collected, but what would be an interesting and quick first analysis? What sorts of words or lyrical trends would show the most dramatic changes between the early rock-and-roll of the 60’s and today?

Hint: to keep this site kid-friendly, I included only the first letter of those six key words in the visualizations below.

By comparing the two graphs, we can also see which words tend to be repeated multiple times within songs that use them!

Contact me if you’re interested in using the data or code from this analysis! I’m not posting them publicly for lyric copyright reasons.

A Timeline Visualization Tool from the Future

Quick follow-up to my last post: my father’s been developing a professional tool called TimeStory, built for visualizing timelines easily and aesthetically. It’s perfect for the Billboard analysis I was doing:

Timelines represent the span between artists’ first and last weeks on the Billboard Hot 100, up to last month. Artists are ranked by their total presence on the chart, which makes Drake’s reign at the top even more impressive considering how short a time he’s been around.
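The timeline data boils down to three numbers per artist: first week, last week, and total presence. Here is a rough sketch of that summary, assuming the scraped chart is a list of (week, artist) rows with ISO-formatted dates and reading "total presence" as the number of distinct weeks on the chart; the rows and artist names below are invented.

```python
from collections import defaultdict

def artist_timelines(chart_rows):
    """Summarize (week, artist) chart rows into first week, last week, and total weeks."""
    weeks = defaultdict(set)
    for week, artist in chart_rows:
        weeks[artist].add(week)
    summary = [
        {"artist": a, "first": min(w), "last": max(w), "weeks_on_chart": len(w)}
        for a, w in weeks.items()
    ]
    return sorted(summary, key=lambda row: row["weeks_on_chart"], reverse=True)

# Hypothetical rows; ISO date strings sort correctly as plain text.
rows = [
    ("2019-01-05", "Artist A"), ("2019-01-12", "Artist A"), ("2019-01-12", "Artist A"),
    ("2018-06-02", "Artist B"), ("2019-01-05", "Artist B"), ("2019-02-02", "Artist B"),
]
timelines = artist_timelines(rows)
for row in timelines:
    print(row)
```

Counting every chart entry instead of distinct weeks would reward artists with several songs charting at once; either choice gives a usable ranking.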

If you have TimeStory, you can download the file from my Google Drive and scroll through the list of all top 200 artists to compare. If you don’t, you can still see or download the raw data as a spreadsheet.