Building a contextual dictionary

The text reader now displays known compounds of characters

Yet another version of my Chinese Reader App (version 1.0.4) is underway. The most obvious difference is the addition of a list that displays the compounds of any single hanzi queried. The second, apparently cosmetic, difference is that the title is now larger and centred. Both changes required some quite major changes to how the program functions, and as a result the code is now a bit more robust, cleaner and, pleasingly, shorter (though that’s partly due to shifting some of the work into a separate module). The app can now deal with compounds more efficiently and format text better, though work is still required to improve both. As a side note, the text in the screenshot is from chapter 0.5 of the excellent A Key to Chinese Speech and Writing, book 1.

One issue I've come across is how to deal with the situation where, for example, 中国 is added to the dictionary before 中 or 国 is added. In this case, neither 中 nor 国 should be coloured unless they appear together or are later added separately. The reason for this is that it's useful to learn compounds without necessarily knowing what the individual characters mean. Though in the case of 中国, it is probably useful to know what both 中 and 国 mean.

However, when learning, say, 自然, knowing what 自 and 然 mean individually is not very useful.

Instead, I would like to understand the use of 然 by building up a list of words containing it: 自然 (nature), 突然 (sudden), 虽然 (although), 当然 (of course) and 然后 (afterwards). Similarly, seeing that 国 is found in 美国 (America), 法国 (France), 中国 (China) and 国际 (international) should give a pretty good idea of what 国 might mean. This seems a more natural way of learning Chinese.
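As a rough illustration of the idea, here is a minimal Python sketch of an index from each hanzi to the known words that contain it. The names build_compound_index and known_words are made up for this example and are not taken from the app itself; the word list is just the one mentioned above.

```python
from collections import defaultdict


def build_compound_index(words):
    """Map each hanzi to the (word, gloss) pairs of known words containing it."""
    index = defaultdict(list)
    for word, gloss in words.items():
        # set() avoids listing a word twice under a character it repeats.
        for char in set(word):
            index[char].append((word, gloss))
    return index


# Hypothetical dictionary contents, taken from the examples in the text.
known_words = {
    "自然": "nature",
    "突然": "sudden",
    "虽然": "although",
    "当然": "of course",
    "然后": "afterwards",
    "美国": "America",
    "法国": "France",
    "中国": "China",
    "国际": "international",
}

index = build_compound_index(known_words)

# List every known word that contains 国.
for word, gloss in index["国"]:
    print(word, gloss)
```

Querying index["然"] in the same way would give the five 然 words, which is roughly what the new compound list in the reader is meant to show.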

This then raises the problem of when to consider a collection of hanzi as a compound and when as a string of individual hanzi. For example, 吃饭 could be seen as a verb (to eat) plus a noun (rice), or it could be seen as a compound meaning to eat. Similarly, 结婚 could be seen as a verb (to tie) plus a noun (marriage), or simply as a compound verb (to marry). As in the case of 自然, it seems sensible to learn the compounds. But it might also be useful to learn what other nouns can be used with 吃 or 结. Nouns could then be classified by what verbs they can be used with. (Another way to classify nouns would be by what measure word is used with them.) All this would require quite a lot of information to be attached to each word.
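One simple way to decide what counts as a compound when colouring text is greedy longest match: at each position, take the longest run of hanzi that has its own dictionary entry, and leave a character uncoloured if nothing matches. The sketch below is only an assumption about how this could be done, not the app's actual code; the function name segment and the max_len limit are invented for the example. It also respects the rule described earlier, that 中 on its own is not coloured when only 中国 is in the dictionary.

```python
def segment(text, dictionary, max_len=4):
    """Split text into (chunk, known) pairs using greedy longest match."""
    result = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + length]
            if chunk in dictionary:
                result.append((chunk, True))
                i += length
                break
        else:
            # Nothing in the dictionary starts here: emit one unknown character.
            result.append((text[i], False))
            i += 1
    return result


# Hypothetical dictionary: 中国 and 人 are known, but 中 and 国 are not.
dictionary = {"中国", "人"}
print(segment("中国人在中间", dictionary))
# → [('中国', True), ('人', True), ('在', False), ('中', False), ('间', False)]
```

Greedy matching will sometimes pick the wrong split when a sentence can be divided in more than one way, so this is a starting point rather than a solution to the compound question.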

As another example, should the dictionary store the words 中国人 and 美国人, or store the rule that 人 can follow a country to mean 'a person from that country'? (And in the latter case, how could this be recorded in a dictionary?) Steven Pinker writes about the different ways humans learn words (in particular, regular and irregular verbs) in his excellent book Words and Rules. Incidentally, in English we still have to learn lots of separate words (an 'American', a 'Finn', a 'Swede', a 'Netherlander', an 'English person'). In some respects, I think the hardest part of making a Chinese reader that can translate Chinese text to English might actually be dealing with English grammar.
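To make the words-versus-rules question concrete, here is one possible way the rule could sit alongside stored words. Everything in it (COUNTRIES, ENGLISH_DEMONYMS, gloss_person) is hypothetical and only meant as a sketch: the Chinese side is a single rule, while the irregular English forms are still looked up one by one, with a generic fallback when none is stored.

```python
# Stored words: countries known to the dictionary (hypothetical contents).
COUNTRIES = {"中国": "China", "美国": "America", "法国": "France"}

# Irregular English forms have to be listed individually.
ENGLISH_DEMONYMS = {"China": "a Chinese person", "America": "an American"}


def gloss_person(word):
    """Apply the rule 'country + 人 = a person from that country'."""
    if word.endswith("人") and word[:-1] in COUNTRIES:
        country = COUNTRIES[word[:-1]]
        # Prefer a stored English form, else fall back to the regular pattern.
        return ENGLISH_DEMONYMS.get(country, f"a person from {country}")
    return None  # the rule does not apply to this word


print(gloss_person("美国人"))  # → an American
print(gloss_person("法国人"))  # → a person from France
```

Words the rule does not cover, such as 中国 on its own, return None and would be looked up as ordinary entries.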

Thinking about these problems has given me another project: analysing Chinese text and looking for patterns.

On a very distantly related note, I got my user name and password for the government's data site (which you can read about here). It seems that the data is modelled using RDF, which I must admit I know nothing about, and can be searched using SPARQL, which I also know nothing about. Now might be a good time to learn, as it may prove useful for organising language data. I will certainly have to organise the data in a more sophisticated structure.

