Our crossword-filling algorithm chooses each word so as to maximize the remaining fill possibilities for the crossing slots, based on the letters that word contributes. For example, given the choice, the three-letter word “APE” would be favored over “AXE”, because P is a more versatile letter than X. Common letters are thus favored, while uncommon letters (and therefore the words that contain them, like JACK or OOZE) tend to get left out.
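A minimal sketch of the idea (not the actual CrossWorthy code): score each candidate word by how many corpus words each of its letters keeps alive for the crossing slots. The tiny word list and the `score` function here are hypothetical stand-ins for illustration.

```python
from math import prod

# Hypothetical mini-corpus of 3-letter words standing in for a real word list.
WORDS = ["APE", "AXE", "APT", "TAP", "PEA", "PET", "TEA", "EAT", "ATE", "OAT"]

def letter_options(letter: str) -> int:
    """How many corpus words contain `letter` -- a rough proxy for
    how many fills a crossing slot still has available."""
    return sum(1 for w in WORDS if letter in w)

def score(word: str) -> int:
    # Favor words whose every letter keeps many crossing candidates alive.
    return prod(letter_options(ch) for ch in word)

print(score("APE"), score("AXE"))  # APE scores higher: P beats X
```

Under this scoring, “APE” beats “AXE” because P appears in several corpus words while X appears in only one, which matches the bias described above.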
Two questions come to mind: (1) does this actually happen in my algorithm? And (2) do professional crosswords (e.g. NY Times crosswords) suffer from the same bias? Here’s a bar chart:
The red bars come from a corpus of the 20,000 most-used English words, a “reference point” showing what the distribution would look like if all those words were used uniformly in crossword generation. The green bars come from 34,505 letters in complete, nonsense-free crosswords produced by our CrossWorthy algorithm. The blue bars come from 848 NY Times crosswords (183k letters) published since 2018.
Apparently, the CrossWorthy algorithm heavily over-favors S and A. (It would be interesting to come up with a “versatility” metric for letters, distinct from mere “commonness”. A and S must be quite versatile. Compare A to I, for instance, which is just about as common in the 20k-word corpus. My gut says versatility has to do with the positions in a word where a letter can occur: both are common as second or third letters, but A makes for an easier first letter than I. Even the blue NY Times bars seem affected by this letter-versatility effect at first glance, though not as badly.) As expected, our algorithm also under-utilizes the rare letters: Z, Q, J, X, V, W, K, F, Y. The interesting part is that the NY Times crosswords don’t seem to underuse rare letters compared to the corpus… I guess the pros just come up with more interesting crossword words!
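One way to start on a positional versatility metric would be to tally, for each letter, how often it occurs at each index in the corpus. The toy word list below is hypothetical; in practice you would load the 20k-word corpus.

```python
from collections import Counter

# Toy word list for illustration (hypothetical, not the real corpus).
CORPUS = ["ABLE", "ACID", "IDEA", "AREA", "ICON", "ALSO", "INTO", "AWAY"]

def positional_counts(words):
    """Count how often each letter appears at each index."""
    counts = Counter()
    for w in words:
        for i, ch in enumerate(w):
            counts[(ch, i)] += 1
    return counts

counts = positional_counts(CORPUS)
# Compare A vs I as a first letter in this toy corpus:
print(counts[("A", 0)], counts[("I", 0)])  # → 5 3
```

A fuller metric might combine these per-position counts into a single score per letter, weighting positions by how often slots of each length occur in real grids.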