As I've previously written, I'm an avid consumer of videos and exercises at the Khan Academy. At the current count, it contains over 2400 videos explaining various school and college-level topics, from basic addition to linear algebra, finance, biology and chemistry. After watching some 400 videos, I found myself idly wondering how often Sal Khan uses certain phrases, such as "now I'll arbitrarily change colour", "I made a mistake there" and "and he has a hat". So, I thought it might be interesting to analyse the transcripts of his videos. While looking to see how I could write some exercises for Khan here, I found what I hadn't realised I wanted: the subtitles for a number of his videos (1067 in 16 subjects, representing 200 hours' worth of talking) as text files.
There's no particular aim with this project, and it is somewhat pointless, but I'm quite interested in text analysis (or natural language processing, to give it its fancy title), and it may turn out to be helpful for analysing Chinese text, which is another project I'm working on. I've also found that there is a significant crossover between text analysis and bioinformatics (which is essentially analysing strings of characters).
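To make the phrase-counting idea concrete, here is a minimal sketch of what it might look like in Python. It assumes the subtitles are plain `.txt` files sitting in one directory; the helper names (`normalise`, `count_phrases`, `read_transcripts`) and the sample sentences are made up for illustration and are not part of anything the Khan Academy provides.

```python
import re
from collections import Counter
from pathlib import Path


def normalise(text):
    """Lower-case the text and collapse runs of whitespace to single spaces."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def count_phrases(texts, phrases):
    """Count non-overlapping occurrences of each phrase across all texts.

    This is plain substring matching, so a phrase can also match inside a
    longer word; a real analysis would probably want word boundaries.
    """
    counts = Counter({phrase: 0 for phrase in phrases})
    for text in texts:
        clean = normalise(text)
        for phrase in phrases:
            counts[phrase] += clean.count(normalise(phrase))
    return counts


def read_transcripts(directory):
    """Yield the contents of every .txt file in a (hypothetical) directory."""
    for path in sorted(Path(directory).glob("*.txt")):
        yield path.read_text(encoding="utf-8", errors="replace")


if __name__ == "__main__":
    # Made-up sample text standing in for real subtitle files.
    sample = [
        "Oh, I made a mistake there.\nLet me change colour.",
        "I made a\nmistake again.",
    ]
    print(count_phrases(sample, ["i made a mistake", "change colour"]))
```

Normalising whitespace first matters because subtitle files tend to break a sentence across several lines, so a phrase may be split by a newline. With real files, `count_phrases(read_transcripts("subtitles"), phrases)` would do the same job, where `"subtitles"` is a placeholder directory name.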