Word counts

I've written a program that extracts all the individual words from the subtitle text I have and counts the frequency of each. By my count there are 1,798,565 words, which works out at an average speaking speed of ~150 words per minute. According to Wikipedia, the recommended rate of speech for audiobooks is 150-160 words per minute, so this seems reasonable.
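For anyone wanting to try something similar, here is a minimal sketch of how such a count could be done in Python. It is not the program I actually used: the tokenising rule (lowercase everything, keep apostrophes inside words), the function names and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed tokenising rule: runs of letters, optionally joined by
# apostrophes, so that "we're" stays a single word.
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")

def count_words(text):
    """Return a Counter mapping each lowercased word to its frequency."""
    return Counter(WORD_RE.findall(text.lower()))

def words_per_minute(total_words, total_minutes):
    """Average speaking rate, given a word count and a running time in minutes."""
    return total_words / total_minutes

# Made-up sample text, standing in for the real subtitle files.
sample = "So this is the thing. The thing we're looking at is this."
counts = count_words(sample)

print(sum(counts.values()))    # total number of words in the sample
print(counts.most_common(3))   # the three most frequent words
```

The same `Counter` gives both the overall total and the ranking of most frequent words, and dividing the total by the running time of the videos in minutes gives the speaking rate.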

Most frequent words

The most frequent words Sal uses in the Khan Academy videos are not particularly interesting; they are pretty much what you would expect from normal English speech. The top ten are: 'the', 'to', 'is', 'of', 'this', 'and', 'so', 'a', 'that' and 'we'. For comparison, the top ten most common words in English are generally considered to be 'the', 'be', 'to', 'of', 'and', 'a', 'in', 'that', 'have' and 'I' (note that these are lemmas, so 'be' includes all forms of the verb, e.g. 'is', 'are', 'were' etc.).


One interesting point I noticed when looking at the top ten most frequent words is that Sal uses the word 'we' more often than is usual in English (it ranks 10th rather than 27th). This makes sense to me as Sal's manner of speaking is generally very inclusive. So I thought I'd look at how frequently Sal uses each pronoun; here's a table of the results:

we 36561
I 32334
you 31640
he 813
she 160
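A table like this can be pulled out of the word counts with a simple lookup. The sketch below is only an illustration: the `lookup` helper is hypothetical, the counts in it are made up rather than taken from the subtitles, and it assumes the words were lowercased when counted (so 'I' is stored as 'i').

```python
from collections import Counter

def lookup(counts, words):
    """Return (word, count) pairs for the given words, most frequent first."""
    # Assumes the keys in `counts` are lowercase.
    pairs = [(w, counts[w.lower()]) for w in words]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

# Made-up counts for illustration only.
counts = Counter({"we": 5, "i": 3, "you": 4, "the": 9})

for word, n in lookup(counts, ["I", "you", "he", "she", "we"]):
    print(f"{word}\t{n}")
```

Words that never occur simply come back with a count of zero, because a `Counter` returns 0 for missing keys.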


I was trying to think of an interesting set of words to consider and hit upon colours after watching yet another Khan video. The counts somewhat reflect Sal's choice of colour when writing (which gives me an idea for more analysis), although obviously he will mention colours for other reasons. The number of times each of the colours I checked occurs (including both spellings, gray and grey) is shown below:


How frequently does "magenta" occur in Salman's speech? 
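Counting colour words needs one extra step when a colour has more than one spelling. Here is a minimal sketch, assuming the two spellings of gray/grey are added together (they could equally be reported separately); the `COLOUR_SPELLINGS` table, the `colour_totals` helper and the counts are all made up for illustration.

```python
from collections import Counter

# Hypothetical table: each colour maps to the spellings counted towards it.
COLOUR_SPELLINGS = {
    "magenta": ["magenta"],
    "gray": ["gray", "grey"],
}

def colour_totals(counts, spellings=COLOUR_SPELLINGS):
    """Sum the counts of every spelling of each colour."""
    return {
        name: sum(counts[form] for form in forms)
        for name, forms in spellings.items()
    }

# Made-up counts for illustration only.
counts = Counter({"magenta": 7, "gray": 2, "grey": 3, "the": 50})
totals = colour_totals(counts)
print(totals)
```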

Sal likes to use the word 'essentially' as well, don't you think?

