Salman says

21 Aug 2011


As I've previously written, I'm an avid consumer of videos and exercises at Khan Academy. It currently has over 2400 videos explaining various school- and college-level topics, from basic addition to linear algebra, finance, biology and chemistry.

After watching some 400 videos, I found myself idly wondering how often Sal uses certain phrases, such as "now I'll arbitrarily change colour", "I made a mistake there" and "and he has a hat". So, I thought it might be interesting to analyse the transcripts of his videos.

While looking to see how I could write some exercises for Khan here, I found what I hadn't realised I wanted: the subtitles for a number of his videos (1067 videos in 16 subjects, representing 200 hours' worth of talking) as text files.

There's no particular aim with this project, and it is somewhat pointless. But I'm quite interested in text analysis (or natural language processing, to give it its fancy title), and the work may turn out to be helpful for analysing Chinese text, which is another project I'm working on. I've also found that there is a significant crossover between text analysis and bioinformatics (which is essentially analysing strings of characters).
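
As a rough illustration of the phrase counting described above, here is a minimal Python sketch. It assumes the subtitles have already been read into plain strings; the function names (`normalise`, `count_phrases`) and the sample text are made up for illustration and are not taken from the actual project or the real transcripts.

```python
import re
from collections import Counter


def normalise(text):
    """Lower-case the text and reduce it to words separated by single spaces."""
    return " ".join(re.findall(r"[a-z0-9']+", text.lower()))


def count_phrases(transcripts, phrases):
    """Count whole-word occurrences of each phrase across all transcripts.

    transcripts: iterable of strings, one per video (assumed already loaded).
    phrases: iterable of phrases to look for.
    """
    counts = Counter()
    for phrase in phrases:
        # Match the phrase only when it is not glued to neighbouring words.
        pattern = re.compile(r"(?<!\S)" + re.escape(normalise(phrase)) + r"(?!\S)")
        for text in transcripts:
            counts[phrase] += len(pattern.findall(normalise(text)))
    return counts


if __name__ == "__main__":
    # Invented sample text standing in for real subtitle files.
    sample = [
        "Now I'll arbitrarily change colour. I made a mistake there.",
        "Oops, I made a mistake there, and I made a mistake there again.",
    ]
    wanted = ["I made a mistake there", "and he has a hat"]
    for phrase, n in count_phrases(sample, wanted).items():
        print(f"{n:3d}  {phrase}")
```

Normalising both the transcript and the phrase first means that differences in capitalisation and punctuation do not hide a match, at the cost of ignoring sentence boundaries.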