Visualising character similarities

8 Feb 2010

In the previous update, I defined the distance between two characters. Using this metric, I could create a matrix of distances between a selection of characters. The problem was then how to display this information. My first thought was to create a diagram similar to a phylogenetic tree, which would have the benefit of forcing me to learn how these trees are generated. I have finally created a program that can create, and draw a tree based on a simple hierarchical clustering algorithm. However, while I was struggling with the inevitable recursion these trees require, I hit upon a much simpler method of displaying the relatedness of character: make them organise themselves on the screen.

To get the characters to spontaneously arrange themselves, I used a particle simulation that I’ve used to display networks. Between each pair of characters, I created an interaction (essentially a spring) with a length proportional to the square of the distance between the characters. Then I tweaked the parameters (such as the interaction strength and the scale of interaction lengths), and let of the simulation run. The characters jiggled and jostled and eventually arranged themselves in what appears to be quite a sensible fashion (see video below).

The characters included those that make up > 0.5% of single characters in the texts I have. Between them, these 31 characters make up just over 31% of characters in the text; the size of characters gives an approximate indication of their relative frequencies. Since every character interacts with every other, displaying the interactions makes it hard to see what's happening. I also prefer not to display the the circles that represent the particles, as in the screen shot. In the video I toggle between the interaction and particle displays. I also experimented with creating a icon for the program, but that’s by-the-by and didn’t work very well.

The characters form consistent, and logical clusters. The most obvious cluster is the pronouns (我, 他, 你 and 她), in the centre of the screen. These words are clearly very related and we would expect them to occur in the same parts of a sentence. Other than that, it’s reassuring to see the adjectives 大 and 小 , and the prepositions, 上 and 下 near one another. You can also see that the verbs are nearly all on the right of the image (one exception is 爱, which, in the image, is separate because the vast majority of the occurrences of 爱 are in the name 爱丽丝). The characters 个, 么 and 地 are on the outskirts and appear to represent characters that appear in unique parts of a sentence. I wonder if it might be easier to see patterns if I coloured characters based on their grammatical function (though deciding on how to categorise characters isn’t always easy).

Below is a video of the cloud once I had solved the problem with 爱丽丝. It shows the jelly-like network of characters arranging themselves. The clusters of pronouns very robust as can be seen when I move the 我 particle. The position of 地, on the other hand, is more flexible; it is on the outside of the cluster.

Analysing Chinese