Wednesday, 24th August 2011
Sal's missing words
I've started to look at the frequency with which Sal uses words. Unsurprisingly, the word he uses most commonly is 'the', which makes up 4.6% of his words and seems pretty standard. To find out what is standard I'm using a list of ~500,000 word counts from contemporary American English sources [1], which I got from http://www.wordfrequency.info. According to that list, the word 'the' should actually make up 5.6% of words, so it seems Sal under-uses it by quite a margin.
Before getting too far into an in-depth analysis of which words Sal uses, I thought it would be interesting to see which words he never uses. Clearly there will be many words he has never used in the ~1.8 million words in the subtitles, so I narrowed my search to words that you would expect, based on the "normal" word counts, to appear more than 300 times in 1.8 million words.
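The filter described above can be sketched in a few lines of Python. This is only a rough illustration, not the script I actually ran: the function names, the letters-only word definition and the toy numbers are all assumptions made up for the example. The idea is to scale each word's share of the reference list up to the size of Sal's corpus and keep the words that should appear more than 300 times but have a count of zero.

```python
import re
from collections import Counter


def count_words(text):
    """Count lower-cased runs of letters (an assumed, simplistic word definition)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def missing_words(observed, reference, corpus_size, min_expected=300):
    """Return (word, expected_count) pairs for words that never occur in
    `observed` but would be expected more than `min_expected` times in a
    corpus of `corpus_size` words, judging by the `reference` counts."""
    ref_total = sum(reference.values())
    missing = []
    for word, ref_count in reference.items():
        # Scale the word's share of the reference list to our corpus size.
        expected = ref_count / ref_total * corpus_size
        if expected > min_expected and observed.get(word, 0) == 0:
            missing.append((word, expected))
    # Most "surprising" absences first.
    return sorted(missing, key=lambda pair: pair[1], reverse=True)


# Toy data: these reference counts are invented purely for illustration.
reference = {"the": 5600, "of": 4359, "among": 40, "zebra": 1}
observed = count_words("The cat sat amongst the rest of the cats")
print(missing_words(observed, reference, corpus_size=1_800_000))
```

With the toy data only 'among' is reported: 'zebra' is also absent, but it would be expected fewer than 300 times, so it doesn't make the cut. Both sets of counts need to be normalised the same way (case, apostrophes and so on), otherwise words will look missing when they aren't.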
The most frequent "missing words" are all apostrophed "words", such as 's, 'll, 've, which don't show up in my counts because of how I've defined words (so maybe I should redefine them to be consistent). Below are some of the most frequent words that Sal never says in the videos in my analysis.
The most frequent real word that Sal never says in the subtitles I have is "American", which I admit I found quite surprising. However, I double-checked, and it seems to be true. He uses the words 'Americans' and 'German-American', but never plain 'American'. I actually find this quite pleasing, as the lessons are supposed to be for anyone in the world, and maths and the sciences should be the same everywhere. If the Finance and History lessons were included then Sal would undoubtedly use the word 'American'. Also in the top ten "missing words" is the word 'America'.
Many of the missing words are also highly context-specific, such as 'Bush', 'York', 'economic' and 'police'. However, one word that didn't follow this pattern was 'among', as it is a fairly everyday word. But I searched the captions, and indeed, Sal only ever uses the word 'amongst' (14 times as it happens), which is a valid alternative, though sometimes considered old-fashioned. This raises the possibility of identifying Sal-impersonators. Presumably it should be possible to generate a word-frequency fingerprint for everyone and see how likely it is that a given piece of text was written by him or her. I'm definitely curious to see what my fingerprint would look like.
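One simple way such a fingerprint might work, and this is just a guess at an approach rather than anything I've tried on the real data, is to treat each person's word counts as a probability model and ask which model makes a piece of text more likely. The sketch below uses add-one smoothing so that a word one person never uses doesn't get a probability of zero. The function names are made up, and apart from the 14 uses of 'amongst' the counts are invented for illustration.

```python
import math
from collections import Counter


def fingerprint(counts, vocabulary, alpha=1.0):
    """Turn raw word counts into smoothed log-probabilities over a shared
    vocabulary (add-alpha smoothing, so unseen words are rare, not impossible)."""
    total = sum(counts.get(w, 0) for w in vocabulary) + alpha * len(vocabulary)
    return {w: math.log((counts.get(w, 0) + alpha) / total) for w in vocabulary}


def log_likelihood(words, model):
    """Sum the log-probabilities of the words the model knows about."""
    return sum(model[w] for w in words if w in model)


def preference_score(words, counts_a, counts_b):
    """Positive if `words` look more like author A, negative if more like B."""
    vocabulary = set(counts_a) | set(counts_b)
    model_a = fingerprint(counts_a, vocabulary)
    model_b = fingerprint(counts_b, vocabulary)
    return log_likelihood(words, model_a) - log_likelihood(words, model_b)


# Invented counts, except for the 14 uses of 'amongst' mentioned above.
sal = Counter({"the": 100, "amongst": 14})
someone_else = Counter({"the": 100, "among": 14})

print(preference_score(["amongst", "the"], sal, someone_else))  # positive: more Sal-like
print(preference_score(["among", "the"], sal, someone_else))    # negative: less Sal-like
```

A single-word model like this ignores word order entirely, so it would only be a first pass, but it shows how a quirk such as 'amongst' versus 'among' can tip the balance.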
Another word that I found curious was 'qwq', which I initially took as a mistake in the table as I've never seen the "word" before. But according to Urban Dictionary, qwq is analogous to lol. Thank goodness Sal doesn't use either. It does make me wonder where this list of contemporary American English comes from. I suspect it involves a lot of text from the web (which makes sense as it's easy to gather). Other slightly confusing words are 'san' (the Japanese name suffix, maybe), and 'wo', which I guess is an exclamation.
[1] Davies, Mark. (2011) Word frequency data from the Corpus of Contemporary American English (COCA).