I've started to look at how frequently Sal uses words. Unsurprisingly, the word he uses most commonly is 'the', which makes up 4.6% of his words, which seems pretty standard. To find out what is standard, I'm using a list of ~500,000 word counts from contemporary American English sources [1], which I got from http://www.wordfrequency.info. According to that list, the word 'the' should actually make up 5.6% of words, so it seems Sal under-uses it by quite a margin.
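To put a number on "quite a margin", the two shares can be compared directly. This is only a small sketch of the arithmetic: the function names `shares` and `usage_ratio` are made up for illustration, and the toy counts are not taken from the subtitles.

```python
from collections import Counter

def shares(counts):
    """Turn raw word counts into each word's share of all words."""
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def usage_ratio(observed_share, reference_share):
    """Ratio below 1 means the word is used less often than the reference suggests."""
    return observed_share / reference_share

# The two percentages quoted above for 'the'.
ratio = usage_ratio(0.046, 0.056)
print(f"{ratio:.2f}")  # 0.82

# Toy example of computing a share from counts (illustrative numbers only).
toy = Counter({"the": 46, "other": 954})
print(shares(toy)["the"])
```

A ratio of about 0.82 means 'the' shows up roughly 18% less often than the reference list would predict.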
Before getting too far into an in-depth analysis of which words Sal uses, I thought it would be interesting to see which words he never uses. Clearly there will be many words he has never used in the ~1.8 million words in the subtitles, so I narrowed my search to words that you would expect, based on the "normal" word counts, to appear more than 300 times in 1.8 million words.
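The filter described above could be sketched roughly as follows. This is not the code used for the analysis: `count_words`, `missing_words` and the tokenising rule (lowercase runs of letters, with hyphenated words kept whole) are assumptions made for illustration, and the reference counts in the demo are invented.

```python
import re
from collections import Counter

def count_words(text):
    """Count lowercase words; hyphenated words stay whole (an assumed rule)."""
    return Counter(re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower()))

def missing_words(observed, reference, corpus_size, min_expected=300):
    """Words the reference predicts more than min_expected times but that never occur.

    observed:    Counter of words in the text being studied
    reference:   Counter of words in the reference list
    corpus_size: number of words in the text being studied
    """
    ref_total = sum(reference.values())
    found = []
    for word, ref_count in reference.items():
        expected = ref_count / ref_total * corpus_size
        if expected > min_expected and observed.get(word, 0) == 0:
            found.append((word, expected))
    # Most "surprising" absences first.
    return sorted(found, key=lambda pair: pair[1], reverse=True)

# Invented reference counts, just to exercise the function.
reference = Counter({"the": 5000, "american": 300, "among": 200, "rare": 1})
observed = count_words("The Americans and a German-American sat amongst the crowd")
for word, expected in missing_words(observed, reference, corpus_size=10_000):
    print(word, round(expected))
```

With the real data, `corpus_size` would be about 1.8 million and `reference` would hold the counts from the word frequency list.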
The most frequent "missing words" are all apostrophe "words", such as 's, 'll, 've, which don't show up in my counts because of how I've defined words (so maybe I should redefine them to be consistent). Below are some of the most frequent words that Sal never says in the videos in my analysis.
American
The most frequent real word that Sal never says in the subtitles I have is "American", which I admit I found quite surprising. However, I double-checked, and it seems to be true. He uses the words 'Americans' and 'German-American', but never plain 'American'. I find this quite pleasing, as the lessons are supposed to be for anyone in the world, and maths and the sciences should be the same everywhere. If the Finance and History lessons were included, then Sal would undoubtedly use the word 'American'. Also in the top ten "missing words" is the word 'America'.
Among
Many of the missing words are also highly context-specific, such as 'Bush', 'York', 'economic' and 'police'. However, one word that didn't follow this pattern was 'among', as it is a fairly everyday word. But I searched the captions, and indeed, Sal only ever uses the word 'amongst' (14 times as it happens), which is a valid alternative, though sometimes considered old-fashioned. This raises the possibility of identifying Sal-impersonators. Presumably it should be possible to generate a word-frequency fingerprint for everyone and see how likely it is that a given piece of text was written by him or her. I'm definitely curious to see what my fingerprint would look like.
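One simple way to try the fingerprint idea is to treat each person's word counts as a unigram model and ask which model gives a piece of text the highest probability. The sketch below is only one possible approach, assuming add-one smoothing so that unseen words don't produce a probability of zero; all the function names and the sample sentences are invented for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase words, with hyphenated words kept whole (an assumed rule)."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())

def fingerprint(text):
    """A person's fingerprint is just their word counts."""
    return Counter(tokenize(text))

def log_likelihood(sample, fp, vocab_size, alpha=1.0):
    """Log-probability of the sample under a smoothed unigram model of fp."""
    total = sum(fp.values())
    score = 0.0
    for word in tokenize(sample):
        # Additive smoothing: every word gets alpha pseudo-counts.
        score += math.log((fp[word] + alpha) / (total + alpha * vocab_size))
    return score

def most_likely_author(sample, fingerprints):
    """Name of the fingerprint under which the sample is most probable."""
    vocab = set(tokenize(sample))
    for fp in fingerprints.values():
        vocab.update(fp)
    return max(
        fingerprints,
        key=lambda name: log_likelihood(sample, fingerprints[name], len(vocab)),
    )

# Invented training text for two hypothetical speakers.
fingerprints = {
    "sal": fingerprint("amongst the numbers amongst the terms"),
    "other": fingerprint("among the numbers among the terms"),
}
print(most_likely_author("it is amongst the values", fingerprints))  # sal
```

A real fingerprint would need far more text per person, and would probably work better on relative differences from a reference list than on raw counts, but the 'among' versus 'amongst' habit is exactly the kind of signal this picks up.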
qwq
Another word that I found curious was 'qwq', which I initially took to be a mistake in the table, as I've never seen the "word" before. But according to Urban Dictionary, qwq is analogous to lol. Thank goodness Sal doesn't use either. It does make me wonder where this list of contemporary American English comes from. I suspect it involves a lot of text from the web (which makes sense, as it's easy to gather). Other slightly confusing words are 'san' (the Japanese name suffix, maybe), and 'wo', which I guess is an exclamation.
[1] Davies, Mark. (2011) Word frequency data from the Corpus of Contemporary American English (COCA).