In all the analyses that I've written about, one type of word that consistently skews results and could lead to flawed conclusions is names. For example, in one hanzi tree, 爱 has an unusual position because it is most commonly found (in my texts) in the name 爱丽丝 (Alice). Often when reading Chinese text, I will struggle with an unknown word only to find that it is a name (though I'm getting better at spotting names now). I had thought to create a list of names for the analyser to check, but that would mean reading all the texts first to find the names myself. I would therefore like to have my Chinese reader predict which words in a given text are likely to be names.
There are several clues that should help identify names:
- Characters appear more frequently than you would expect. For example, though 爱 is a relatively common verb (especially in beginners' texts for some reason), in Alice in Wonderland it is the 10th most common character, making up 1.4% of characters, compared with 0.14% in all other texts.
- Characters consistently appear with other unusual characters. For example, in my analysis, 94% of occurrences of 丽 are preceded by 爱 and followed by 丝.
- Words appear before titles such as 先生 (Mr) or 太太 (Mrs). For example, 丁丁 is followed by 先生 43% of the time. Also, words that follow 小 or 老 could be surnames.
- In fact, rather than creating a list of specific words associated with names, my analysis should be able to identify such words itself. For example, 丝 is followed by either 说 (say) or 想 (think) 24% of the time, so it is unlikely to mean silk in these circumstances.
- Finally, names often consist of phonetic characters, such as 克, 巴, 特 or 尔.
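
The first two clues are simple enough to sketch in code. What follows is only a rough illustration, not the analyser itself: the function names, the `min_ratio` threshold and the `floor` value are all assumptions made for the example, and the sample strings are invented. The idea is to compare a character's share of one text with its share of a reference corpus (the 爱 figures above, 1.4% against 0.14%, give a ratio of about ten), and to measure how often a character sits between the same two neighbours.

```python
from collections import Counter


def char_frequencies(text):
    """Return each CJK character's share of all CJK characters in the text."""
    chars = [c for c in text if "\u4e00" <= c <= "\u9fff"]
    total = len(chars)
    if total == 0:
        return {}
    return {c: n / total for c, n in Counter(chars).items()}


def overrepresented(text, reference, min_ratio=5.0, floor=1e-6):
    """Return characters whose share in `text` is at least `min_ratio`
    times their share in `reference`, mapped to that ratio.

    `min_ratio` and `floor` are illustrative assumptions; `floor` stands in
    for the share of characters that never appear in the reference.
    """
    text_freq = char_frequencies(text)
    ref_freq = char_frequencies(reference)
    ratios = {
        c: share / max(ref_freq.get(c, 0.0), floor)
        for c, share in text_freq.items()
    }
    return {c: r for c, r in ratios.items() if r >= min_ratio}


def neighbour_share(text, char, before=None, after=None):
    """Return the fraction of occurrences of `char` that are immediately
    preceded by `before` and immediately followed by `after`.

    Either neighbour may be None, meaning "don't care".
    """
    positions = [i for i, c in enumerate(text) if c == char]
    if not positions:
        return 0.0
    hits = 0
    for i in positions:
        ok_before = before is None or text.endswith(before, 0, i)
        ok_after = after is None or text.startswith(after, i + 1)
        if ok_before and ok_after:
            hits += 1
    return hits / len(positions)


# Invented sample data, for illustration only.
story = "爱丽丝说爱丽丝想"
reference = "我爱你我想你他说你好我想他"
candidates = overrepresented(story, reference, min_ratio=3.0)

sample = "爱丽丝说你好。爱丽丝想回家。美丽的花。"
inside_name = neighbour_share(sample, "丽", before="爱", after="丝")
before_said = neighbour_share(sample, "丝", after="说")
```

In the invented sample, 丽 appears three times and twice inside 爱丽丝, so `inside_name` is two thirds. Passing a word such as 先生 as `after` gives the same kind of figure as the 丁丁 example, once the character-level search is adapted to whole words.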