Predicting names


13 Feb 2010

In all the analysis that I've written about, one type of word consistently skews results and could lead to flawed conclusions: names. For example, in one hanzi tree 爱 has an unusual position because, in my texts, it is most commonly found in the name 爱丽丝 (Alice). Often when reading Chinese text, I will struggle with an unknown word only to find that it is a name (though I'm getting better at spotting names now). I had thought to create a list of names for the analyser to check, but that would mean reading all the texts first to find the names myself. I would therefore like my Chinese reader to predict which words in a given text are likely to be names.

There are several clues that should help identify names:

  1. Characters appear more frequently than you would expect. For example, though 爱 is a relatively common verb (especially in beginners' texts for some reason), in Alice in Wonderland it is the 10th most common character, making up 1.4% of characters, compared with 0.14% in all other texts.
  2. Characters consistently appear with other unusual characters. For example, in my analysis, 94% of occurrences of 丽 are preceded by 爱 and followed by 丝.
  3. Words appear before titles such as 先生 (Mr.) or 太太 (Mrs.). For example, 丁丁 is followed by 先生 43% of the time. Also, words that follow 小 (little) or 老 (old) could be surnames.
  4. In fact, rather than creating a fixed list of words associated with names, my analysis should be able to identify such words itself. For example, 丝 is followed by either 说 (say) or 想 (think) 24% of the time, so it is unlikely to mean silk in these circumstances.
  5. Finally, names often consist of phonetic characters, such as 克, 巴, 特 or 尔.
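
As a rough sketch, the first four clues could be measured along these lines in Python. Everything here is illustrative: the function names, the min_ratio threshold and the floor value are assumptions rather than part of the existing analyser, and the code works on raw characters without any word segmentation.

```python
from collections import Counter


def char_shares(text):
    """Return each character's share of all non-whitespace characters."""
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}


def overrepresented(text, reference, min_ratio=5.0, floor=1e-6):
    """Clue 1: characters whose share in `text` is at least `min_ratio`
    times their share in `reference` (both values are assumed defaults)."""
    target = char_shares(text)
    base = char_shares(reference)
    ratios = {
        # `floor` avoids dividing by zero for characters absent from `reference`.
        ch: share / max(base.get(ch, 0.0), floor)
        for ch, share in target.items()
    }
    return {ch: r for ch, r in ratios.items() if r >= min_ratio}


def neighbour_consistency(text, ch):
    """Clue 2: the most common (previous, next) pair around `ch` and the
    fraction of occurrences of `ch` that have exactly that pair."""
    contexts = Counter(
        (text[i - 1], text[i + 1])
        for i in range(1, len(text) - 1)
        if text[i] == ch
    )
    if not contexts:
        return None, 0.0
    pair, n = contexts.most_common(1)[0]
    return pair, n / sum(contexts.values())


def followed_by(text, word, followers):
    """Clues 3 and 4: fraction of occurrences of `word` that are
    immediately followed by any string in `followers`."""
    hits = total = 0
    pos = text.find(word)
    while pos != -1:
        total += 1
        end = pos + len(word)
        if any(text.startswith(f, end) for f in followers):
            hits += 1
        pos = text.find(word, pos + 1)
    return hits / total if total else 0.0
```

For instance, followed_by(text, "丁丁", ["先生", "太太"]) would give the share of 丁丁 occurrences that come right before a title, and neighbour_consistency(text, "丽") would show whether 丽 almost always sits between the same two characters. Clue 5 is left out because it needs a list of phonetic characters, which this sketch does not attempt to build.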