After building a Markov chain generator for English words using bigrams (see here), I wanted to improve it. The obvious improvement is to move to trigrams, but I wanted to try something more interesting. I noticed that certain groups of consonants are fine at the start of a word but never appear at the end, e.g. CR and DR, while other groups are fine at the end but not at the beginning, e.g. ND and LD. I wondered if I could identify these trends programmatically and use them to build a better Markov chain word generator.
To get a list of clusters, I went through each word in my list (the same one I used for the bigram analysis) and split it into groups of consecutive vowels or consecutive consonants. For example, the word "consonants" would get split into C, O, NS, O, N, A and NTS.
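The splitting step described above can be sketched in a few lines of Python. This is a minimal sketch, not the code used for the analysis: the name split_clusters is made up for illustration, and it assumes purely alphabetic words with Y treated as a consonant (the treatment of Y is revisited later in the post).

```python
from itertools import groupby

VOWELS = set("AEIOU")  # assumption: Y counts as a consonant here


def split_clusters(word):
    """Split a word into runs of consecutive vowels or consecutive consonants."""
    # groupby starts a new run each time the vowel/consonant flag flips
    return ["".join(run) for _, run in groupby(word.upper(), key=lambda ch: ch in VOWELS)]


print(split_clusters("consonants"))  # ['C', 'O', 'NS', 'O', 'N', 'A', 'NTS']
```

Grouping on a single true/false key means neighbouring letters of the same kind always land in the same cluster, which is exactly the rule described above.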
This method identified 1448 clusters from 62,702 words. Of these, 164 clusters comprised vowels, and 1284 clusters comprised consonants. The most common "cluster" was the single letter E. In fact, the seven most common "clusters" were all single letters. TH was the first true cluster.
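Counting the clusters and separating vowel clusters from consonant clusters might look something like the sketch below. The function names and the four-word sample list are hypothetical stand-ins for the real word list, and Y is again assumed to be a consonant.

```python
from collections import Counter
from itertools import groupby

VOWELS = set("AEIOU")  # assumption: Y counts as a consonant


def split_clusters(word):
    """Split a word into runs of consecutive vowels or consecutive consonants."""
    return ["".join(run) for _, run in groupby(word.upper(), key=lambda ch: ch in VOWELS)]


def count_clusters(words):
    """Count how often each cluster occurs across all the words."""
    counts = Counter()
    for word in words:
        counts.update(split_clusters(word))
    return counts


sample = ["the", "then", "other", "consonants"]  # hypothetical stand-in for the word list
counts = count_clusters(sample)

# A cluster is all vowels or all consonants, so checking its first letter is enough
vowel_clusters = {c for c in counts if c[0] in VOWELS}
consonant_clusters = set(counts) - vowel_clusters

print(len(counts), len(vowel_clusters), len(consonant_clusters))
print(counts.most_common(3))
```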
The most common double letter cluster is LL. The only letters that are not doubled are H, J, Q, X and Y. There are no triple letters in English.
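A quick way to check claims like these against a word list is to look at adjacent letters directly. The sketch below is an illustration only: the helper names and the tiny sample list are made up, and it looks at adjacent pairs anywhere in a word rather than at whole clusters.

```python
import string


def doubled_letters(words):
    """Return the set of letters that appear twice in a row in any word."""
    found = set()
    for word in words:
        w = word.upper()
        found.update(a for a, b in zip(w, w[1:]) if a == b)
    return found


def has_triple_letter(word):
    """True if the same letter appears three times in a row."""
    w = word.upper()
    return any(a == b == c for a, b, c in zip(w, w[1:], w[2:]))


sample = ["ball", "book", "aardvark", "cat"]  # hypothetical stand-in for the word list
never_doubled = set(string.ascii_uppercase) - doubled_letters(sample)
print(sorted(doubled_letters(sample)))
print(any(has_triple_letter(word) for word in sample))
```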
The main point of doing this analysis was to find the clusters that appear only at the beginnings or ends of words, so here we go.
This is a list of the ten most common clusters that appear at the start of a word, but never at the end. The last column shows how frequently they appear at the start of a word as opposed to somewhere in the middle.
In most cases, the cluster is predominantly found at the beginning of the word. When it isn't, it is generally found in a compound word, like somewhere or arrowhead.
It seems that in most cases, these clusters end in an R or an L.
These are all the consonant clusters that force a word to end. They often end with D or T.
LLY is also almost guaranteed to end a word since it's an adverb ending, and whilst a double L can follow a vowel, it can't start a word in English. Examples where LLY doesn't end a word are bullying and bellyache.
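The start-only and end-only lists could be produced by recording where in a word each cluster occurs. The following is a rough sketch under stated assumptions: position_counts and one_sided are hypothetical names, the sample words are made up, and Y is treated as a consonant so that LLY comes out as a single cluster.

```python
from collections import defaultdict
from itertools import groupby

VOWELS = set("AEIOU")  # assumption: Y counts as a consonant


def split_clusters(word):
    """Split a word into runs of consecutive vowels or consecutive consonants."""
    return ["".join(run) for _, run in groupby(word.upper(), key=lambda ch: ch in VOWELS)]


def position_counts(words):
    """Map each cluster to [start, middle, end] occurrence counts."""
    stats = defaultdict(lambda: [0, 0, 0])
    for word in words:
        clusters = split_clusters(word)
        last = len(clusters) - 1
        for i, cluster in enumerate(clusters):
            if i == 0:
                stats[cluster][0] += 1
            if i == last:  # a one-cluster word counts as both start and end
                stats[cluster][2] += 1
            if 0 < i < last:
                stats[cluster][1] += 1
    return stats


def one_sided(stats, side):
    """side='start': clusters never seen at the end; side='end': never seen at the start.

    Returns (cluster, count on that side, share of that side versus the middle),
    most frequent first.
    """
    rows = []
    for cluster, (start, middle, end) in stats.items():
        here, there = (start, end) if side == "start" else (end, start)
        if here > 0 and there == 0:
            rows.append((cluster, here, here / (here + middle)))
    return sorted(rows, key=lambda row: row[1], reverse=True)


sample = ["crab", "drop", "across", "hand", "cold", "bully", "bullying"]
stats = position_counts(sample)
print(one_sided(stats, "start")[:10])
print(one_sided(stats, "end")[:10])
```

In this sample, CR appears once at the start (crab) and once in the middle (across), so its start share is 0.5, while LLY appears once at the end (bully) and once in the middle (bullying).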
Longest consonant clusters
Here's a list of the ten longest clusters with one word in which each is found. There are another 26 clusters of five consonants, but ten seemed enough.
Previously I'd learnt that the longest cluster of consonants in English is GHTSBR in Knightsbridge, maybe because it's a proper noun. However, like Knightsbridge, pretty much all of these words consist of two other English words joined together.
If we allow Y to be a consonant, then we have even longer blocks, and even entire words, since rhythm is now allowed.
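To see how the treatment of Y changes the result, the vowel set can be made a parameter. This is only a sketch: longest_consonant_clusters is a made-up name, the sample words are illustrative, and ties in length are broken alphabetically as an arbitrary choice.

```python
from itertools import groupby


def split_clusters(word, vowels="AEIOU"):
    """Split a word into vowel and consonant runs for a given vowel set."""
    return ["".join(run) for _, run in groupby(word.upper(), key=lambda ch: ch in vowels)]


def longest_consonant_clusters(words, vowels="AEIOU", top=10):
    """Return up to `top` (cluster, example word) pairs, longest cluster first."""
    examples = {}
    for word in words:
        for cluster in split_clusters(word, vowels):
            if cluster[0] not in vowels:
                examples.setdefault(cluster, word)  # keep the first word seen
    ranked = sorted(examples.items(), key=lambda kv: (-len(kv[0]), kv[0]))
    return ranked[:top]


sample = ["knightsbridge", "rhythm", "queue"]  # hypothetical stand-in for the word list
print(longest_consonant_clusters(sample, vowels="AEIOU", top=3))   # Y as a consonant
print(longest_consonant_clusters(sample, vowels="AEIOUY", top=3))  # Y as a vowel
```

With Y as a consonant, rhythm is a single six-letter cluster; with Y as a vowel, it splits into RH, Y and THM.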
Longest vowel clusters
Vowel clusters tend to be smaller than consonant clusters, with four being the maximum. There are eleven of them.
Queue is an interesting word as it is pronounced the same if you remove all four vowels. There is another example of a word containing UEUE: Ueueteotl (a Mesoamerican deity).
Heiau is a Hawaiian temple - they just love vowels there.
If you include Y as a vowel, then you don't get any longer clusters, but you do get to include words like employee, layout, joyous, buoyant, voyeur, paraguayan, hooey, payee, mayeye, yaya and biyearly.