Analysing Chinese

The idea for this project was born while I was working on my Chinese Reader. I was thinking about how to store words in a dictionary so that I could quickly identify compound characters in a text. I realised that by analysing the relationships between words in a wide range of texts, I might be able to achieve several goals:

To predict the correct pronunciation or meaning of a word that has more than one, based on context. For example, when is 地 pronounced de and when is it pronounced dì? When is 花 a noun meaning 'flower', and when is it a verb meaning 'to spend'?
To identify grammatical patterns, such as Verb-不-Verb, 越来越-Adjective, 连...都.
To identify patterns in words. For example, 人 often follows the name of a country and means a person from that country.
To predict when an unknown string of characters is a name.
To determine which verbs are associated with a given noun and vice versa. For example, if I know that the word for 'dream' is 梦 but don't know how to say 'I had a dream', then looking up 'had' or 'have' in the dictionary is unlikely to help. However, searching a corpus of text for verbs associated with the noun 梦 should tell me that in Chinese you can make (做) or see (看) dreams.
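
As a rough illustration of the last goal, here is a minimal Python sketch that counts which characters appear immediately before a given noun in a corpus. The four-sentence corpus is made up for the example, and the function name `neighbours_before` and its `window` parameter are hypothetical. A real version would need a large corpus and some way of telling verbs from other neighbouring characters, such as 的.

```python
from collections import Counter


def neighbours_before(corpus, noun, window=1):
    """Count the characters found within `window` positions before `noun`."""
    counts = Counter()
    for sentence in corpus:
        start = sentence.find(noun)
        while start != -1:
            # Collect the characters directly in front of this occurrence.
            for ch in sentence[max(0, start - window):start]:
                counts[ch] += 1
            start = sentence.find(noun, start + 1)
    return counts


# A tiny invented corpus, purely for illustration.
corpus = ["我昨天做梦了", "他常常做梦", "她做梦都想去北京", "我的梦很奇怪"]
counts = neighbours_before(corpus, "梦")
print(counts.most_common())
```

Even on this toy corpus the most frequent neighbour of 梦 is 做, which is the kind of association a dictionary lookup of 'have' would not reveal.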

In essence, what I want to create is an artificial intelligence that can learn to comprehend Chinese. This should be an interesting problem, especially given that I'm far from fully understanding the intricacies of Chinese myself. At the very least, I hope that in attempting to build such an AI, I will learn more about how Chinese sentences are organised.

Because I was reading An Introduction to Bioinformatics Algorithms whilst I was thinking about these problems, I ended up applying various bioinformatic algorithms to Chinese, with some success. I think bioinformatic algorithms are ideal since they are generally used to identify patterns in, or similarities between, sequences of symbols.
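
To give a flavour of what treating Chinese text as a sequence of symbols might look like, here is a small Python sketch of the longest common subsequence, a classic dynamic-programming technique for comparing sequences. This is only an assumed example of the kind of algorithm meant, not necessarily one of those used in this project. The two sentences are invented for illustration.

```python
def longest_common_subsequence(a, b):
    """Return one longest common subsequence of the strings a and b."""
    # table[i][j] holds the LCS length of a[:i] and b[:j].
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back through the table to recover the subsequence itself.
    i, j = len(a), len(b)
    out = []
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


# Two invented sentences that share the 越来越 pattern.
shared = longest_common_subsequence("天气越来越冷", "他越来越高")
print(shared)
```

Comparing many sentence pairs in this way is one conceivable route to surfacing recurring patterns such as 越来越-Adjective without knowing them in advance.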