A bigram is basically a fancy word for a combination of two letters. Once you've looked at single-letter frequencies, it's common to look at combinations of letters.
Getting a word list
I started this analysis using the words in the Unix dictionary in /usr/share/dict/words. However, I later decided it would make sense to take into account how frequently words occur. So I used a list of ~500,000 word counts from contemporary American English sources (Davies, Mark. (2011) Word frequency data from the Corpus of Contemporary American English (COCA)), which I got from http://www.wordfrequency.info.
This list contains a lot of words scraped from internet forums, so it has a lot of acronyms and typos. So first I filtered it by removing anything that doesn't appear in the Unix dictionary list. I also removed anything containing non-letters such as apostrophes, hyphens or numbers. It still contains some pretty weird words (like tweeg?), but hopefully they occur so infrequently that they don't have much effect on the results. Ideally I should get hold of a dictionary used for spell checking.
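A minimal sketch of this kind of filtering, not the exact code behind the post. The function name `filter_counts` and the toy data are made up for illustration; in practice the dictionary would be read from /usr/share/dict/words and the counts from the frequency list.

```python
def filter_counts(word_counts, dictionary):
    """Keep only purely alphabetic words that also appear in the dictionary."""
    allowed = {w.lower() for w in dictionary}
    kept = {}
    for word, count in word_counts.items():
        w = word.lower()
        # isalpha() rejects apostrophes, hyphens and digits;
        # isascii() additionally rejects accented letters.
        if w.isalpha() and w.isascii() and w in allowed:
            kept[w] = kept.get(w, 0) + count
    return kept


# Toy stand-ins for the real frequency list and the Unix dictionary.
sample_counts = {"the": 1000, "lol": 50, "don't": 40, "co-op": 5, "tweeg": 1, "b2b": 3}
sample_dictionary = ["the", "tweeg", "cat"]

filtered = filter_counts(sample_counts, sample_dictionary)
print(filtered)
```

Note that an odd word like "tweeg" survives as long as the dictionary contains it, which is the limitation described above.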
Most common bigrams
These are the most common bigrams by my counts. "th" and "he" top the list, due to the commonness of words like "the", "then", "they", "there", "these", etc.
The other common bigrams either are common words (in, an, at, on, to), or are part of very common words (e.g. re in "are"; nd in "and").
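As a rough sketch of how such counts can be produced, the snippet below slides a two-letter window over each word and weights every bigram by the word's frequency. Weighting by word count is an assumption based on the decision to use a frequency list; `count_bigrams` and the toy counts are illustrative names and values only.

```python
from collections import Counter


def count_bigrams(word_counts):
    """Count adjacent letter pairs, weighting each by how often the word occurs."""
    counts = Counter()
    for word, n in word_counts.items():
        for a, b in zip(word, word[1:]):
            counts[a + b] += n
    return counts


# Toy word counts, not real corpus data.
sample = {"the": 10, "then": 3, "and": 5}
bigrams = count_bigrams(sample)
print(bigrams.most_common(3))
```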
Least common bigrams
These are the 10 least common bigrams in my list that occur at least once. Each occurs in a single, low-frequency word.
These bigrams are all pairs of consonants. If I count the occurrences of characters in the 40 least frequent bigrams I see:
- J 10 times
- Z 8 times
- V, K, L, P, W, M 5 times each
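A tally like the one above could be computed along these lines. This is a sketch under assumptions: `letters_in_rarest` is a hypothetical helper, the counts are invented, and a doubled bigram such as "zz" would count its letter twice.

```python
from collections import Counter


def letters_in_rarest(bigram_counts, n):
    """Tally the letters that appear in the n least frequent bigrams."""
    rarest = sorted(bigram_counts.items(), key=lambda kv: kv[1])[:n]
    tally = Counter()
    for bigram, _ in rarest:
        tally.update(bigram)  # adds one count per character in the bigram
    return tally


# Invented counts for illustration only.
sample = {"th": 500, "he": 450, "jd": 1, "zk": 1, "vk": 2, "jr": 2}
tally = letters_in_rarest(sample, 4)
print(tally.most_common())
```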
It's also pretty clear that some of these words, such as nejd, vajra and sovkhoz, aren't "really English". We could try to avoid words that aren't really English, but then it's not clear where to stop, since most English words come from other languages.
Some of the other words are compound words, which should arguably be two words, e.g. buffcoat, bushveld and upvalley.
Of the 676 possible bigrams, 99 didn't appear in my list. Of these, one third (33) contain a Q, which is unsurprising given that Q is almost always followed by a U, immediately removing 25 bigrams. There were only two bigrams containing vowels that didn't appear in my list and they were QE and QA.
The next most common letters among the missing bigrams were J (24), X (23), V (21) and Z (14). I was quite surprised that J was the second most common letter, but I suppose it doesn't play well with most consonants.
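Finding the bigrams that never occur is a set difference against all 26 × 26 = 676 possible pairs. The sketch below uses a contrived "seen" set (every bigram that doesn't start with Q, plus QU) purely to illustrate the point about Q removing 25 bigrams; the helper names are hypothetical.

```python
from collections import Counter
from itertools import product
from string import ascii_lowercase


def missing_bigrams(seen):
    """Return every possible two-letter pair that is absent from `seen`."""
    every = {a + b for a, b in product(ascii_lowercase, repeat=2)}
    return every - set(seen)


def letter_tally(bigrams):
    """Count how many of the given bigrams contain each letter."""
    tally = Counter()
    for bigram in bigrams:
        tally.update(set(bigram))  # a doubled letter such as "qq" counts once
    return tally


# Contrived data: pretend Q is only ever followed by U.
seen = {a + b for a, b in product(ascii_lowercase, repeat=2) if a != "q"} | {"qu"}
missing = missing_bigrams(seen)
tally = letter_tally(missing)
print(len(missing), tally["q"])
```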
Markov chain generator
By grouping bigrams that start with the same letter and then normalising, so that the counts in each group sum to one, I got the probability of each letter appearing in a word, given the previous letter. I also expanded the bigrams to include a character representing the start and end of the word, so I know which letters are likely to start and end a word.
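The normalisation step might look something like this. It is a sketch, not the original code: the marker characters `^` and `$`, the function name and the toy counts are all assumptions (the post only says that a character represents the start and end of a word).

```python
from collections import defaultdict

START, END = "^", "$"  # assumed boundary markers


def transition_probabilities(word_counts):
    """Estimate P(next letter | previous letter) from weighted word counts."""
    totals = defaultdict(lambda: defaultdict(float))
    for word, n in word_counts.items():
        padded = START + word + END
        for a, b in zip(padded, padded[1:]):
            totals[a][b] += n
    probs = {}
    for a, followers in totals.items():
        group_total = sum(followers.values())
        # Normalise so the probabilities for each first letter sum to one.
        probs[a] = {b: c / group_total for b, c in followers.items()}
    return probs


# Toy counts: "the" seen three times, "to" once.
probs = transition_probabilities({"the": 3, "to": 1})
print(probs["t"])
```

Here `probs["^"]` gives the distribution over first letters, and `probs[x]["$"]` is the chance that a word ends after the letter `x`.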
Now I can use these probabilities to generate English-like words with a Markov chain.
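A first-order Markov chain generator along these lines can be sketched as follows. The transition table is a tiny hand-made example rather than real data, and the `^`/`$` markers, `generate_word` and the `max_len` cap are assumptions added for illustration.

```python
import random

START, END = "^", "$"  # assumed boundary markers


def generate_word(probs, rng, max_len=20):
    """Walk the chain from START, sampling each letter given the previous one."""
    letters = []
    current = START
    while len(letters) < max_len:
        followers = probs[current]
        current = rng.choices(list(followers), weights=list(followers.values()))[0]
        if current == END:
            break
        letters.append(current)
    return "".join(letters)


# Hand-made toy probabilities, not derived from the corpus.
toy_probs = {
    "^": {"t": 0.6, "a": 0.4},
    "t": {"h": 0.5, "o": 0.2, "$": 0.3},
    "h": {"e": 1.0},
    "e": {"$": 0.7, "n": 0.3},
    "o": {"$": 1.0},
    "a": {"t": 0.5, "n": 0.5},
    "n": {"$": 0.6, "t": 0.4},
}

rng = random.Random(0)  # seeded so runs are repeatable
words = [generate_word(toy_probs, rng) for _ in range(5)]
print(words)
```

Because each step only looks at the previous letter, this toy table can emit "t" on its own: T may start a word and T may end one.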
Note that often "words" end up being a single letter or cluster of letters. This is because a Markov chain has no concept of where it is in the chain: lots of words start with T and lots of words end in T, so T on its own can come out as a word. I will improve how this works later.
Since I had a list of normalised probabilities, I took a look at which letter combinations follow a given letter with disproportionate frequency. Maybe you can guess the top result, though the others are less obvious.
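One simple way to read "disproportionate" is to rank every letter pair by its conditional probability. The sketch below assumes that interpretation and uses invented probabilities; `top_transitions` is a hypothetical helper.

```python
def top_transitions(probs, n):
    """Return the n (bigram, probability) pairs with the highest probability."""
    pairs = [
        (a + b, p)
        for a, followers in probs.items()
        for b, p in followers.items()
    ]
    return sorted(pairs, key=lambda kv: kv[1], reverse=True)[:n]


# Invented probabilities for illustration; "$" marks the end of a word.
toy = {
    "q": {"u": 0.9, "$": 0.1},
    "t": {"h": 0.4, "o": 0.3, "e": 0.3},
}
top = top_transitions(toy, 2)
print(top)
```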
I was surprised that QU didn't have a higher frequency. There is apparently a 6% chance that a word with a Q will end with that Q. This is based on three words: Iraq, Q and Tareq. I think the fact that Q is used to represent a question throws the statistics off a bit.
- Calculate disproportionate bigrams by difference from predicted frequency (chi-squared?)
- Merge bigrams and make new Markov chain generator
- Image and image generator showing probability for each letter given a letter.