I was attempting to make an A-Z book for my son, using photos of him and the people and places he knows. For each photo, I wanted a sentence that would largely consist of words starting with a given letter, e.g. Dancing with Daddy, Eating an egg, Finding a frog, etc. Having read quite a few ABC books, I'd realised how lucky writers of such books are that xylophones and X-rays exist.
“Everyone uses ‘zebra’ as a ‘z for’ word. We’ll never be famous.”
“Let’s spell it with an ‘x’.”
- James Martin (@Pundamentalism), November 1, 2015
When trying to think of suitable words, I found that there are certain quite common words that start with relatively uncommon letters, words such as ‘just’, ‘quite’ and ‘your’. In fact, the only letters for which I really struggled to find words were X and Z. So that got me thinking: what are the most common words starting with each letter, and how common are they?
Getting the data
I got a list of words and their frequencies from [URL]. This has the counts of words taken from a corpus of 500 000 words from various sources. It includes quite a lot of misspellings and non-words, such as “’s” and “nhl”.
The first thing I did was to write a Python program to go through the list, collect up the words based on their first letter and print the most common 32 for each letter to a file. Then I had to go through and decide which ones I wanted to consider words.
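A minimal sketch of that grouping step, not the original program. It assumes the list has already been loaded into a dictionary mapping each word to its count (the actual file format isn't described here), and the function name `top_words_by_letter` is made up for illustration.

```python
from collections import defaultdict


def top_words_by_letter(counts, n=32):
    """Group words by first letter and keep the n most frequent for each.

    `counts` is assumed to be a dict of word -> count; how it is read
    from the source file is left out because the format is not known.
    """
    groups = defaultdict(list)
    for word, count in counts.items():
        if not word:
            continue
        first = word[0].lower()
        # Skip entries that do not start with a plain a-z letter.
        if first.isascii() and first.isalpha():
            groups[first].append((word, count))
    return {
        letter: sorted(words, key=lambda wc: wc[1], reverse=True)[:n]
        for letter, words in sorted(groups.items())
    }


# Toy input built from a few of the counts quoted in this post.
sample = {
    "x": 5361, "xi": 1106, "xii": 867, "Xavier": 1550,
    "just": 734686, "John": 135720, "James": 47127,
}
for letter, words in top_words_by_letter(sample, n=2).items():
    print(letter, words)
```

Names and proper nouns are grouped under the same letter as everything else, so the output still needs sorting through by hand.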
What is a word?
In most cases, the most common words, such as “she” and “the”, are quite obviously words. However, for the less common letters, such as X, the most common word was “x” (5361 counts in 500 000). I can imagine this might appear quite frequently in any text with mathematics, but I wouldn’t consider it a word. Also in the top ten X-words were xi (1106), xii (867) and xvi (831), which presumably are section markers or page numbers. Common X-words also included Xavier (1550), Xerox (1117), XP (1108), Xiaoping (535) and xBRL (356), which are all names of some sort. The top ten for J included John (135 720), James (47 127), Jack (40 624) and Jim (39 864).
I decided to rule out words that were obviously names and proper nouns, but even this isn’t so easy, since another top ten J-word was “Jobs” (46 409). This was probably used to refer to Steve Jobs some of the time and to jobs in general at other times. Here, I chose to leave the word in and just accept that counting words is never going to be perfect.
Another issue “jobs” raised was what to do about plurals. Originally I was going to combine the counts for singular and plural words. But then there was the question of whether to combine verb forms, such as “count” and “counts” (which could also be singular and plural nouns). If I chose to combine them, then for consistency I should also combine other verb forms such as “am” (the 32nd most common A-word with 108 653 counts) and “was” (the most common W-word with 3 074 262 counts), and I couldn’t think of a good way to combine them across letters. So in the end, for simplicity, I decided not to combine any words.
Plotting the data
Below is an SVG graph I generated with my Python SVG drawing library. It shows the counts for the four most common words for each letter on a log scale. You can mouse over the bars to see the words represented, from “the” (23 014 366 counts) to “xenophobia” (324 counts).
SVG chart here
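For anyone wanting to try something similar, here is a rough sketch of a log-scale bar chart written as plain SVG strings. It does not use the drawing library mentioned above; the function name `log_bar_svg` and all the sizes are assumptions for illustration, and it assumes every count is greater than 1 so the logarithms are positive.

```python
import math
from html import escape


def log_bar_svg(counts, bar_width=20, gap=4, height=200):
    """Return an SVG string with one bar per word, heights on a log10 scale.

    Assumes every count is greater than 1. Each bar carries a <title>
    child, which browsers show as a tooltip on mouse-over.
    """
    max_log = math.log10(max(counts.values()))
    bars = []
    for i, (word, count) in enumerate(counts.items()):
        bar_height = height * math.log10(count) / max_log
        x = i * (bar_width + gap)
        y = height - bar_height  # SVG y grows downwards
        bars.append(
            f'<rect x="{x}" y="{y:.1f}" width="{bar_width}" '
            f'height="{bar_height:.1f}">'
            f"<title>{escape(word)}: {count}</title></rect>"
        )
    width = len(counts) * (bar_width + gap)
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width}" height="{height}">' + "".join(bars) + "</svg>"
    )


# Two of the counts quoted in this post: the tallest and shortest bars.
svg = log_bar_svg({"the": 23014366, "xenophobia": 324})
print(svg)
```

On a linear scale the “xenophobia” bar would be invisible next to “the”, which is why a log scale is useful here.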
One of the most obvious, if unsurprising, aspects of the graph is that the X- and Z-words are way less common than the rest. The most common Z-word, “zone” (15 988), is about 8 times less common than the most common Q-word, “question” (132 806). The most common X-word, “x-ray” (3456), is 4.5 times less common again. To make things worse for the X-words, their frequency drops off rapidly, so the 4th most common X-word, “xenophobia”, is nearly 10 times less common than the 4th most common Z-word, “zoo” (324 vs. 3123).
The only other letter that shows a similar drop-off is J, for which “just” is the most common word (734 686), and “judge” (45 232) is the 4th most common, with 16 times fewer occurrences (compared to a 10-fold drop-off from “x-ray” to “xenophobia”). When comparing the most common words, J is the 11th least common letter; when comparing the 4th most common words, J is the 3rd least common letter.
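The drop-off figures are just ratios of the counts quoted above. A small sketch of the arithmetic, where the helper name `times_less_common` is made up for illustration:

```python
# Counts as quoted in the text.
counts = {
    "zone": 15988, "x-ray": 3456, "zoo": 3123, "xenophobia": 324,
    "just": 734686, "judge": 45232,
}


def times_less_common(rarer, commoner):
    """How many times fewer occurrences `rarer` has than `commoner`."""
    return counts[commoner] / counts[rarer]


print(round(times_less_common("x-ray", "zone"), 1))        # 4.6
print(round(times_less_common("xenophobia", "zoo"), 1))    # 9.6
print(round(times_less_common("judge", "just"), 1))        # 16.2
print(round(times_less_common("xenophobia", "x-ray"), 1))  # 10.7
```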
At the other end of the scale, T has the most common words, with “the”, “to”, “that” and “this” (23 014 366 to 1 987 129). The next three most common words all start with vowels: “and” (11 260 177), “of” (10 968 008), and “in” (7 661 696). While vowels are obviously common in words, they don’t tend to be common at the start. In fact, the most common vowel, E, has the 5th least common word: “even” (462 340). The I-words have the smallest drop-off with “in”, “I”, “it”, and “is” (less than halving from 7 661 696 to 3 878 929). When comparing the 4th most common words, I is the winner.
So was this any help with my photo book? Not really, but I now understand why so many alphabet books resort to “X as in the end of foX”. Alphabet book writers must have been so pleased with the invention of x-rays and the xylophone (112).