N-grams


21 Jan 2012

After looking at the most common words, I decided to look at the most common combinations of words that Sal uses. I expected that "and now we are done" would be common.

Bigrams

Bigrams are sequences of two adjacent words (or letters). The five most common bigrams Sal uses are shown below, with counts in parentheses. Before running my analysis, I had assumed all the most frequent bigrams would be common English word combinations, but actually, I think only "this is" and "of the" are common in most text (though I'd like to check this).

The bigram "going to" is probably quite common in English, but I suspect not as common as here. This must be because Sal frequently introduces what he is about to do before doing it. The bigrams "equal to" and "is equal" are probably relatively rare in English outside of mathematical discussions.

  • going to (11,939)
  • this is (11,182)
  • equal to (10,925)
  • of the (8,462)
  • is equal (8,068)
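
For anyone who wants to try this on their own text, here is a minimal sketch of how counts like these could be produced in Python. It is not the exact script behind the numbers above: the tokenising rule (lowercase, letters and apostrophes only) and the names `ngrams` and `most_common_ngrams` are my own assumptions for illustration, and the sample sentence is made up.

```python
from collections import Counter
import re


def ngrams(words, n):
    """Return every run of n consecutive words as a tuple."""
    # zip stops at the shortest slice, so only complete n-grams are produced.
    return zip(*(words[i:] for i in range(n)))


def most_common_ngrams(text, n, k=5):
    """Return the k most frequent n-grams in text as (phrase, count) pairs."""
    # Assumed tokenisation: lowercase, keeping apostrophes so "let's" is one word.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(" ".join(gram) for gram in ngrams(words, n))
    return counts.most_common(k)


# Made-up sample text, not taken from the transcripts.
sample = "this is going to be equal to x so this is equal to y"
print(most_common_ngrams(sample, 2, k=2))
```

Changing `n` to 3, 4 or 5 gives trigrams, 4-grams and 5-grams from the same function.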

Trigrams

When we extend our search to trigrams, most of the top bigrams get extended; only "of the" drops out of the top five.

You can see that the bigram "going to" is extended in both directions to "(is) going to (be)", while "this is" is extended to "(so) this is (the)". When we move to 4-grams we can see how "is equal to" fits in.

  • is equal to (7,990)
  • going to be (5,101)
  • is going to (3,108)
  • so this is (2,137)
  • this is the (1,958)

4-grams

More of the same, expanding the sentence fragments further. You can see how these could be pieced together to form the fragment "is going to be equal to the same thing as".

Other 4-grams I found interesting are "x is equal to" (1,015), "the square root of" (691), "in the last video" (379) and "with respect to x" (309).

  • is going to be (2,395)
  • to be equal to (1,148)
  • is equal to the (1,096)
  • the same thing as (1,064)
  • going to be equal (1,020)

5-grams

You can probably see where this is going now. Clearly, the clause "this is going to be equal to" is very common. Interestingly, "is the same thing as" is also very common, and essentially the same thing. The fragment "let's see if we can" is very much part of Sal's inclusive style of talking. Other 5-grams include "both sides of this equation" (209), "so let's say I have" (130), "the limit as x approaches" (109) and "let's say I have a" (107).

  • going to be equal to (932)
  • is going to be equal (681)
  • is the same thing as (647)
  • this is going to be (635)
  • let's see if we can (348)

To be continued...

Comments (1)

Benjamin Cuningham on 13 Jun 2012, 8:13 p.m.

Very cool!
http://www.khanacademy.org/profile/BenjaminCuningham/