Bigrams


22 Mar 2018

Bigram is a fancy word for a combination of two letters. Once you've looked at single-letter frequencies, it makes sense to look at combinations of letters.
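A minimal sketch of how bigram counts could be gathered from a word list. It assumes each word is weighted by how often it occurs in a corpus; the post does not spell out the exact weighting, and the function name and toy data are made up for illustration.

```python
from collections import Counter

def count_bigrams(word_freqs):
    """Count bigrams across words, weighting each by its word's frequency."""
    counts = Counter()
    for word, freq in word_freqs.items():
        word = word.lower()
        # Slide a two-letter window along the word.
        for i in range(len(word) - 1):
            counts[word[i:i + 2]] += freq
    return counts

# Hypothetical toy data: word -> number of occurrences in a corpus.
toy_freqs = {"the": 50, "then": 10, "hen": 2}
bigram_counts = count_bigrams(toy_freqs)

# Convert raw counts to percentages of all bigram occurrences.
total = sum(bigram_counts.values())
for bigram, n in bigram_counts.most_common():
    print(f"{bigram}: {100 * n / total:.1f}%")
```

Dividing each count by the total gives the kind of percentage that a frequency chart would plot.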

Most common bigrams

These are the most common bigrams in my list. th and he top the list, due to the commonness of words like the, then, they, there, these, etc.

The other common bigrams are either common words themselves (in, an, at, on) or part of very common words (e.g. re in are; nd in and).

[Figure: most common bigrams: th, he, in, er, an, re, at, on, nd, en, es. Axes: Bigrams and Frequency (%), scale 0.0 to 3.5.]

Below are graphs showing the most common bigrams consisting of two consonants and of two vowels. It makes sense that a lot of the most common bigrams consist of one consonant and one vowel since you need to have both to make something pronounceable.

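[Figure: most common consonant-consonant bigrams: th, nd, ng, st, nt, ll, ch, wh, rs, ns, ly. Axes: Bigrams and Frequency (%), scale 0.0 to 3.5.]
[Figure: most common vowel-vowel bigrams: ou, ea, io, ee, ai, ie, oo, ia, ei, ue, au. Axes: Bigrams and Frequency (%), scale 0.0 to 3.5.]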
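Splitting bigrams into these groups only needs a vowel test. The sketch below assumes y counts as a consonant, which is consistent with ly appearing in the consonant chart; the function name and labels are illustrative.

```python
VOWELS = set("aeiou")  # assumption: y is treated as a consonant

def bigram_kind(bigram):
    """Return 'VV', 'CC', or 'mixed' for a two-letter string."""
    vowel_flags = [ch in VOWELS for ch in bigram.lower()]
    if all(vowel_flags):
        return "VV"      # two vowels
    if not any(vowel_flags):
        return "CC"      # two consonants
    return "mixed"       # one of each

for b in ["th", "ou", "he", "ly"]:
    print(b, bigram_kind(b))
```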

Least common bigrams

These are the 10 least common bigrams that appear at least once in my list. They each occur in a single, low-frequency word. They are all pairs of consonants, except qi.

I'm surprised qi is so infrequent, as I thought Qi was a word, but it is apparently not allowed in the crossword list I'm using. Nor is Iraqi, since it's a proper noun. Qaid, qindar, qintar and qiviut are all allowed in the crossword list, but are so rare that they don't appear in COCA.

Some of these words (jnana, vizsla and sovkhoz) aren't really "English words". We could try to avoid non-English words, but it's not clear where to stop, since most English words come from other languages at some point.

Bigram | Word
xv | poxvirus
jn | jnana
zb | whizbang
vk | sovkhoz
fp | offprints
bz | subzones
zh | muzhik
zs | vizsla
vd | havdalah
qi | faqir

Other words are compound words, which should arguably be two words, e.g. poxvirus, offprints and subzones. But again, it's hard to figure out how to remove these, or whether we even should. We could try splitting each word at every position and checking whether both parts are words in their own right, but then we'd also flag "real words" like often (of + ten). It shouldn't make much difference to any later analysis anyway, since these words appear with such low frequency.
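The splitting idea can be sketched in a few lines, and the sketch also shows why it is too eager. The vocabulary here is a hypothetical stand-in for the real word list, and the minimum part length of two letters is an arbitrary choice.

```python
def compound_splits(word, vocabulary, min_part=2):
    """Return every (left, right) split where both parts are in vocabulary."""
    splits = []
    # Try each split point that leaves at least min_part letters on both sides.
    for i in range(min_part, len(word) - min_part + 1):
        left, right = word[:i], word[i:]
        if left in vocabulary and right in vocabulary:
            splits.append((left, right))
    return splits

# Hypothetical vocabulary, just large enough for the demonstration.
vocab = {"pox", "virus", "off", "prints", "sub", "zones", "of", "ten"}

print(compound_splits("poxvirus", vocab))  # [('pox', 'virus')]
print(compound_splits("often", vocab))     # [('of', 'ten')]
```

Both words come back with a split, so a filter built this way cannot tell a genuine compound from a coincidence.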

This table shows the letters that appear most often in the 40 least frequent bigrams. Unsurprisingly, they are all uncommon consonants. There is one letter surprisingly absent.

Letter | Count | Bigrams
j | 10 | jn, tj, lj, yj, mj, hj, jj, gj, pj
z | 9 | zb, bz, zh, zs, zc, wz, lz, sz, mz
v | 6 | xv, vk, vd, yv, tv, kv
m | 6 | mk, xm, mg, mj, mq, mz
x | 5 | xv, xn, xm, xg, xb
k | 5 | vk, mk, bk, kv, tk
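A tally like this can be sketched by counting every letter in every rare bigram. The sketch assumes a doubled bigram such as jj adds two to its letter's count, which would explain j scoring 10 from nine distinct bigrams; that reading is a guess, and the sample list is only a handful of bigrams taken from the table.

```python
from collections import Counter

def letter_counts(bigrams):
    """Count how many times each letter occurs across a list of bigrams."""
    counts = Counter()
    for bigram in bigrams:
        # Counter.update on a string adds one per character,
        # so "jj" adds two to the count for "j".
        counts.update(bigram)
    return counts

# A small sample of bigrams from the table, not the full set of 40.
sample = ["jn", "jj", "xv", "vk", "zb"]
counts = letter_counts(sample)
print(counts.most_common(3))
```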

Missing bigrams

Of the 676 possible bigrams (26 × 26), 114 didn't appear in my list. About a third of those contain a q. That's hardly surprising given that q is almost always followed by a u (it can be followed by i in faqir, as mentioned above, but that's it). The only missing bigrams that contain a vowel are qa, qe and qo, meaning that, apart from those three, a vowel can go before or after any letter in English.

Letter | Never before | Never after | Count
q | a, b, c, d, e, f, g, h, j, k, l, m, n, o, p, q, r, s, t, v, w, x, y, z | b, f, g, j, k, p, q, t, v, w, y, z | 35
j | b, c, d, f, g, h, k, l, m, p, q, r, s, t, v, w, x, y, z | c, q, v, w, x, z | 25
x | d, j, k, r, x, z | b, c, d, f, g, h, j, k, l, m, p, q, r, s, t, v, w, x, z | 24
v | b, c, f, g, h, j, l, m, n, p, q, t, w, x, z | c, f, g, h, j, p, q, w | 23
z | f, h, j, k, p, q, v, x | d, f, g, j, k, n, q, r, t, x | 18

I was quite surprised that j was the second most common letter among the missing bigrams, but I suppose it doesn't play well with most consonants, as you can see from it taking the top spot in the least common bigrams.
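Finding the missing bigrams amounts to enumerating every two-letter combination and removing the ones that were seen. In this sketch "never before X" is read as "the letter never precedes X", which matches the q row; the function names and the toy set of observed bigrams are hypothetical.

```python
from itertools import product
from string import ascii_lowercase

def missing_bigrams(seen):
    """Return every two-letter combination that is absent from `seen`."""
    every = {a + b for a, b in product(ascii_lowercase, repeat=2)}
    return every - set(seen)

def never_before(letter, missing):
    """Letters X for which the bigram letter+X is missing."""
    return sorted(b[1] for b in missing if b[0] == letter)

def never_after(letter, missing):
    """Letters X for which the bigram X+letter is missing."""
    return sorted(b[0] for b in missing if b[1] == letter)

# Hypothetical toy input: pretend only these bigrams were observed.
seen = {"qu", "qi", "th", "he"}
missing = missing_bigrams(seen)

print(", ".join(never_before("q", missing)))
```

With only qu and qi observed, the letters printed for q are everything except i and u.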

Markov chain generator

By grouping together the bigrams that start with the same letter and then normalising, so that the counts in each group sum to one, I got the probability of each letter appearing in a word given the previous letter. I also expanded the bigrams to include a character representing the start and end of the word, so I know which letters are likely to start and end a word.
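The grouping and normalising step might look like the sketch below. The marker characters ^ and $ for word start and end are an assumption made for illustration, as are the function name and the toy word frequencies.

```python
from collections import Counter, defaultdict

START, END = "^", "$"  # assumption: markers chosen for this sketch

def transition_probs(word_freqs):
    """Estimate P(next letter | previous letter) from weighted words."""
    counts = defaultdict(Counter)
    for word, freq in word_freqs.items():
        padded = START + word.lower() + END
        # Each adjacent pair, including the markers, is one bigram.
        for prev, nxt in zip(padded, padded[1:]):
            counts[prev][nxt] += freq
    probs = {}
    for prev, followers in counts.items():
        total = sum(followers.values())
        # Normalise so each group of bigrams sums to one.
        probs[prev] = {nxt: n / total for nxt, n in followers.items()}
    return probs

# Hypothetical toy data: word -> number of occurrences.
toy = {"the": 3, "then": 1}
probs = transition_probs(toy)
print(probs["e"])  # e is followed by the end marker 3 times out of 4
```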

Now I can use these probabilities to generate English-like words with a Markov chain.

Note that the "words" often end up being a single letter or a short cluster of letters. This is because a Markov chain has no concept of where it is in the word: it only sees the previous letter. Lots of words start with t, and lots of words end in t, so as far as the chain is concerned t on its own could be a word. I will improve how this works later.
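A minimal sketch of such a generator, and of why it can stop after one letter. The probability table is hand-made and hypothetical rather than taken from the real data, and the ^ and $ markers and the length cap are assumptions of the sketch.

```python
import random

START, END = "^", "$"  # assumption: markers chosen for this sketch

# Hypothetical hand-made probabilities, not the real bigram data.
toy_probs = {
    "^": {"t": 0.6, "a": 0.4},
    "t": {"h": 0.5, "$": 0.3, "a": 0.2},
    "h": {"e": 0.7, "a": 0.3},
    "e": {"$": 0.6, "t": 0.4},
    "a": {"t": 0.7, "$": 0.3},
}

def generate_word(probs, rng, max_len=20):
    """Walk the chain from START until END is drawn or max_len is reached."""
    letters = []
    current = START
    while len(letters) < max_len:
        followers = probs[current]
        # The next letter depends only on the current one.
        current = rng.choices(list(followers), weights=list(followers.values()))[0]
        if current == END:
            break
        letters.append(current)
    return "".join(letters)

rng = random.Random(0)  # seeded so runs are repeatable
words = [generate_word(toy_probs, rng) for _ in range(5)]
print(words)
```

In this toy table t can start a word and can also be followed directly by the end marker, so the walk is free to emit "t" as a complete word.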

Disproportionate bigrams

Since I had a list of normalised probabilities, I took a look at which letter combinations follow a given letter with disproportionate frequency.

It should come as no surprise that qu is at the top, making up a whopping 99.99896% of bigrams beginning with q (the remaining 0.00104% comes from a few occurrences of faqir).

When I ran the analysis with a less stringent definition of word, which allowed proper nouns like Iraq and Tareq, plus words like qi and the letter Q by itself to indicate a question, the percentage dropped to 92%.

[Figure: most disproportionate bigrams: qu, ve, ze, he, ju, ke, jo, th, be, in. Axes: Bigrams and Frequency (%), scale 0 to 80.]

Some of the other bigrams make sense, like ve and ze, which are uncommon letters in common word endings. I was a bit surprised by ju and jo, but this follows from the fact that not many letters follow j.
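Ranking bigrams this way only needs the conditional probabilities to be flattened into one list and sorted. The numbers below are invented for illustration and are not the measured values; the function name is hypothetical too.

```python
def rank_by_conditional(probs, top=10):
    """Rank bigrams by P(second letter | first letter), highest first."""
    rows = [
        (prev + nxt, p)
        for prev, followers in probs.items()
        for nxt, p in followers.items()
    ]
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:top]

# Hypothetical probabilities, not the real data.
toy_probs = {
    "q": {"u": 0.99, "i": 0.01},
    "j": {"u": 0.45, "o": 0.35, "a": 0.20},
    "t": {"h": 0.30, "e": 0.25, "o": 0.25, "i": 0.20},
}

ranked = rank_by_conditional(toy_probs, top=4)
for bigram, p in ranked:
    print(f"{bigram}: {100 * p:.0f}%")
```

A letter with few possible followers, like j in this toy table, places several bigrams near the top even though none of them is common overall.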

TODO

  • Calculate disproportionate bigrams by difference from predicted frequency (chi-squared?)
  • Merge bigrams and make new Markov chain generator
  • Image and image generator showing probability for each letter given a letter.