Simple sentence counts


21 Aug 2011

Introduction

I've started with some very basic analysis of the subtitle text. First, the number of sentences per video for each topic, which should give a rough measure of the relative length of videos in each topic. This information is actually available here, but I've recalculated it to test my SRT (SubRip file format) parsing and sentence-splitting (which is not as easy as you might imagine if you want to avoid splitting decimal numbers).
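As a rough illustration of the decimal-number problem, here is a minimal sketch of one way to strip an SRT file down to text and split it into sentences. It is not the parser used for these charts. The function names, the regular expressions and the sample subtitle are all assumptions made for the example. The key idea is to split only where sentence-ending punctuation is followed by whitespace, so the point in "3.5" is left alone.

```python
import re

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:04,000".
TIMING = re.compile(r"\d{2}:\d{2}:\d{2},\d{3}\s*-->\s*\d{2}:\d{2}:\d{2},\d{3}")

# A sentence ends at '.', '!' or '?' followed by whitespace. The point in a
# decimal such as "3.5" is followed by a digit, so it never matches.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def srt_to_text(srt):
    """Drop cue numbers and timing lines, then join the remaining text."""
    kept = []
    for line in srt.splitlines():
        line = line.strip()
        # Assumption: a line made only of digits is a cue number. A subtitle
        # line that is itself just a number would be dropped as well.
        if not line or line.isdigit() or TIMING.match(line):
            continue
        kept.append(line)
    return " ".join(kept)


def split_sentences(text):
    """Split text into sentences without breaking decimal numbers."""
    return [s for s in SENTENCE_END.split(text.strip()) if s]


# Hypothetical two-cue subtitle file, invented for this example.
sample = (
    "1\n"
    "00:00:01,000 --> 00:00:04,000\n"
    "The answer is 3.5 metres.\n"
    "\n"
    "2\n"
    "00:00:04,000 --> 00:00:06,000\n"
    "Is that right? Yes.\n"
)

sentences = split_sentences(srt_to_text(sample))
```

This simple rule still splits after abbreviations such as "e.g." when they are followed by a space, so a real splitter would need extra handling for those.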

Mean number of sentences per video

This is a bar chart of the data created with my Python DrawSVG module (with a manually-added tooltip). Mouse over a bar to see which subject it represents. I drew the graph with a black background to mimic the Khan video style, but I couldn't bring myself to use garish colours.

[Figure: Khan Academy: Mean sentences per video for each subject (Peter Collingridge). Bar chart with x-axis Subjects and y-axis Sentences / video, 0 to 200.]

I was initially quite surprised to see that Developmental Math - the first bar - is so much shorter than the others, but I've looked at the videos and they are all very short, generally answering a single, simple maths question. The top three subjects are Linear Algebra, Chemistry and Biology, which makes sense as they are all relatively advanced topics with long videos. It's a bit odd that Organic Chemistry has noticeably fewer sentences and Arithmetic has a relatively high number of sentences per video. I should add some error bars so I can make better comparisons.
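One common choice for such error bars is the standard error of the mean, computed per subject from the sentence counts of its videos. The sketch below assumes that choice. The function name and the sample counts are invented for illustration and are not taken from the data.

```python
import math
import statistics


def mean_and_standard_error(counts):
    """Return (mean, standard error of the mean) for a list of counts."""
    mean = statistics.mean(counts)
    if len(counts) < 2:
        # The sample standard deviation is undefined for a single value.
        return mean, 0.0
    sem = statistics.stdev(counts) / math.sqrt(len(counts))
    return mean, sem


# Hypothetical sentence counts for three videos in one subject.
example_counts = [100, 150, 200]
mean, sem = mean_and_standard_error(example_counts)
```

Each bar would then be drawn at the mean, with a whisker extending one standard error above and below it.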

Mean number of words per sentence

So sentences per video gives us a very rough measure of subject complexity, but perhaps more informative is sentence length. Below is a chart of the mean number of words per sentence for each subject. For this analysis words such as "we're" are counted as a single word, but hyphenated words are counted as two words. The order of subjects is the same as in the graph above.
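The counting rule described above can be sketched with a regular expression. This is only an illustration of the rule, not the code behind the chart. The pattern and function names are assumptions, as is the decision to treat a decimal number such as "3.5" as one word.

```python
import re

# A word is either a number (optionally with a decimal part) or a run of
# letters that may contain apostrophes, so "we're" is a single word.
# Hyphens are not part of a word, so "right-angled" counts as two words.
WORD = re.compile(r"\d+(?:\.\d+)?|[A-Za-z]+(?:['’][A-Za-z]+)*")


def count_words(sentence):
    """Count the words in one sentence using the rule above."""
    return len(WORD.findall(sentence))


def mean_words_per_sentence(sentences):
    """Average word count over a list of sentences (0.0 if the list is empty)."""
    if not sentences:
        return 0.0
    return sum(count_words(s) for s in sentences) / len(sentences)


# Hypothetical sentences, invented for this example.
examples = ["We're here.", "It is a right-angled triangle."]
average = mean_words_per_sentence(examples)
```

Here "We're here." counts as two words and "It is a right-angled triangle." as six, giving a mean of four words per sentence.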

[Figure: Khan Academy: mean words per sentence for each subject (Peter Collingridge). Bar chart with x-axis Subjects and y-axis Words / Sentence, 0 to 16.]

This shows us that, although arithmetic videos tend to have many sentences (~155), the sentences tend to be quite short (<10 words per sentence). Conversely, geometry has relatively few sentences per video (~115), but the sentences tend to be quite long (~15 words per sentence). This gives us a bit more insight into the different subjects, but doesn't really tell us anything interesting.

There are a number of other figures we could look at, such as the number of words per video for each subject, but I'm not sure how useful that is (if you want the answer, multiply the values from the previous two graphs). However, I have counted the total number of words in all subtitles and found 1,865,687, which will be useful for later analyses.