Sunday, 21st August 2011
Simple sentence counts
I've started with some very basic analysis of the subtitle text. Firstly, the number of sentences per video for each of the different topics, which should give a rough measure of the relative length of videos in each topic. This information is actually available here, but I've recalculated it to test my SRT (SubRip file format) parsing and sentence-splitting (not as easy as you might imagine if you want to avoid splitting decimal numbers).
Mean number of sentences per video
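As a rough illustration of the decimal-number problem (this is a sketch, not the code used for these charts), the snippet below strips the cue numbers and timestamps from SRT text and then splits on sentence-ending punctuation only when it is followed by whitespace. The function names `srt_to_text` and `split_sentences` are made up for this example, and the rule is deliberately simple: it will still wrongly split after abbreviations such as "e.g. this".

```python
import re

def srt_to_text(srt):
    """Join the caption text of an SRT file, dropping cue numbers and timestamps."""
    parts = []
    # Cues are separated by blank lines.
    for block in re.split(r"\n\s*\n", srt.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            if "-->" in line:  # the timestamp line; caption text follows it
                parts.extend(l.strip() for l in lines[i + 1:] if l.strip())
                break
    return " ".join(parts)

# Split after ., ! or ? only when whitespace follows, so the "." in "3.5"
# (which is followed by a digit) is never treated as a sentence end.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def split_sentences(text):
    """Return the list of sentences in text using the simple rule above."""
    return [s for s in SENTENCE_END.split(text.strip()) if s]

sample = (
    "1\n00:00:01,000 --> 00:00:04,000\nThe answer is 3.5 metres.\n\n"
    "2\n00:00:04,000 --> 00:00:06,000\nIs that right? Yes, it\nis.\n"
)
sentences = split_sentences(srt_to_text(sample))
print(len(sentences), sentences)
```

A naive split on every full stop would cut the first sentence in two at "3.5"; requiring whitespace after the punctuation avoids that.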
Below is a bar chart of the data created with my Python DrawSVG module (with a manually-added tooltip). Mouseover a bar to see which subject it represents. I drew the graph with a black background to mimic the Khan video style, but I couldn't bring myself to use garish colours.
I was initially quite surprised to see that Developmental Math - the first bar - is so much shorter than the others, but I've looked at the videos and they are all very short, generally answering a single, simple maths question. The top three subjects are Linear Algebra, Chemistry and Biology, which makes sense as they are all relatively advanced topics with long videos. It's a bit odd that Organic Chemistry has noticeably fewer sentences and Arithmetic has a relatively high number of sentences per video. I should add some error bars so I can make better comparisons.
Mean number of words per sentence
So sentences per video gives us a very rough measure of subject complexity, but sentence length is perhaps more informative. Below is a chart of the mean number of words per sentence for each subject. For this analysis words such as "we're" are counted as a single word, but hyphenated words are counted as two words. The order of subjects is the same as in the graph above.
This shows us that, although Arithmetic tends to have many sentences per video (~155), the sentences tend to be quite short (<10 words per sentence). Conversely, Geometry has relatively few sentences per video (~115), but the sentences tend to be quite long (~15 words per sentence). This gives us a bit more insight into the different subjects, but doesn't really tell us anything interesting.
There are a number of other figures we could look at, such as the number of words per subject, but I'm not sure how useful that is (if you want the answer, multiply the values from the previous graphs). However, I have counted the total number of words in all subtitles and found 1,865,687, which will be useful for later analyses.
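To make the counting rule concrete, here is a minimal sketch of one way to implement it with a regular expression (an illustration only, not necessarily how the numbers in the chart were produced). The names `count_words` and `mean_words_per_sentence` are invented for the example, and treating a decimal number such as "3.5" as a single word is my own assumption.

```python
import re

# A "word" is either a number (optionally with a decimal part) or a run of
# letters that may contain apostrophes, so "we're" is one word. Hyphens are
# not part of a word, so "right-angled" is matched as two words.
WORD = re.compile(r"\d+(?:\.\d+)?|[A-Za-z]+(?:['’][A-Za-z]+)*")

def count_words(sentence):
    """Count the words in a single sentence using the rule above."""
    return len(WORD.findall(sentence))

def mean_words_per_sentence(sentences):
    """Mean number of words per sentence over a list of sentences."""
    return sum(count_words(s) for s in sentences) / len(sentences)

examples = [
    "We're drawing a right-angled triangle.",
    "It is 3.5 metres long.",
]
for s in examples:
    print(count_words(s), s)
print(mean_words_per_sentence(examples))
```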
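For example, multiplying the two approximate Geometry values read off the charts gives a rough number of words per video. The helper name below is just for illustration, and the result is only as precise as the "~" figures that go into it.

```python
def words_per_video(sentences_per_video, words_per_sentence):
    """Estimate words per video from the two per-subject means."""
    return sentences_per_video * words_per_sentence

# Approximate Geometry values quoted in the text: ~115 sentences, ~15 words each.
geometry_estimate = words_per_video(115, 15)
print(geometry_estimate)
```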