Chinese Reader

The initial aim of this project was to create a program that displays Chinese text and lets me easily look words up and add them to a list to study. To create the app, I used Python and the Tkinter module, which helps create a graphical user interface. After getting a little way into the project, I started thinking about how to display all the possible meanings of a character, and whether I could get the dictionary to predict the correct meaning based on the context. This led to a side project analysing Chinese sentence structure and character frequencies.

Displaying text

I've started yet another potentially huge programming project, this one inspired by my return to learning Chinese (mainly on iKnow (aka Smart.fm) and Lingq). This program returns to one I started a while ago, which allowed me to search an open source Chinese-English dictionary (cedict) and select words to create a list of vocab. What I want to create now is a program that allows me to store all the Chinese words I "know" and quantify to what degree I know them (do I know the pinyin, the tone, a meaning, all the meanings in all contexts, etc.?). There are many online applications that do something similar (including iKnow and Lingq), but they never seem to function quite as I would like. Maybe that’s because I'm not sure exactly what I want.

One thing I’d like is to have a collection of sentences that I can attempt to read, and if I get stuck, I want to be able to find out what a word is quickly. But I would also like the opposite: if I look up a word, I would like to see all the sentences in my collection that contain it, thus showing the word in context. One idea I've been considering is to build a network of hanzi linked by relationships such as tone, pinyin (perhaps broken down into initial and ending), radical, grammatical function (e.g. noun, verb, adjective etc.), vocab type (e.g. animal, relative, food etc.), and where I have come across the word (e.g. iKnow, Harry Potter, a Go book etc.). This should help solidify the network that is being built in my brain. The network might also be useful in identifying words that I’m likely to confuse.
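The two-way lookup between words and sentences could be sketched as a small index from each character to the sentences containing it. This is only an illustration of the idea, not code from the app: the function names and the three sample sentences are made up for the example.

```python
from collections import defaultdict

def build_sentence_index(sentences):
    """Map each character to the indices of the sentences that contain it."""
    index = defaultdict(set)
    for i, sentence in enumerate(sentences):
        for char in sentence:
            index[char].add(i)
    return index

def sentences_containing(word, sentences, index):
    """Return the sentences that contain `word`, in their original order."""
    if not word:
        return []
    # A sentence can only contain the word if it contains every character of it.
    candidates = set.intersection(*(index.get(char, set()) for char in word))
    # The characters must also appear together, so check the full word.
    return [sentences[i] for i in sorted(candidates) if word in sentences[i]]

# Made-up sample collection, purely for illustration.
sentences = ["我是中国人", "他是美国人", "中国很大"]
index = build_sentence_index(sentences)
print(sentences_containing("中国", sentences, index))
```

The same index could later carry other links (tone, radical, source text) by storing richer records than a bare sentence number.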

Chinese Reader v1

Above is a screen shot of a Tkinter app that I've created. So far it just displays some text (in this case, the first three lines of a translation of Harry Potter) as an array of labels on a Tkinter grid. I’m not sure whether this is a particularly good idea and may switch to using Pygame to display the text. It should be relatively quick to add a display of word meanings (which are currently output to the command line) using cedict. I think one major challenge will be to get the program to identify names in the text, which again, will require some understanding of Chinese grammar.

 

Adding a dictionary

My text reader displaying the first few lines of Harry Potter

I have made some progress on my Chinese text reader program over the last couple of days. Now, when the mouse is moved over a character, that character's pinyin is displayed in the first box on the left and its meaning in the box below. The app also searches for compounds consisting of the selected character and the character following it. The screen shot shows the program identifying the word 先生 in the text. As a side note, I had to add an arrow to the screen shot of the application because the mouse pointer doesn’t appear when you use Print Screen. I got the image of the arrow from here, which was very helpful.

I realised that the previous idea of displaying characters as an array of labels was quite a stupid idea, so I had to think of another method. I wanted to display the hanzi in such a way that I could show information about a character when it was moused over or clicked. Despite being more familiar with Pygame, I decided to stick with Tkinter, because I want to add the ability to input text and perhaps have lists, both of which are much simpler in Tkinter. I first changed the program (version 1.01, I guess) to use an array of buttons, which output information about the hanzi they displayed when pressed, but again, this seemed an overly complicated way of doing things.

Finally, I discovered that I could bind events to individual items on a Tkinter canvas, so now the hanzi are drawn one by one onto a canvas. In the hope that this blog might be useful to someone, I've put a section of code below. The code shows two functions of my App object (which inherits from the Frame object). The first displays each of the characters that belong to the App's document object (originally called text, until I realised it overwrote a Tkinter object) on a canvas. When the mouse is moved over (which is the "<Enter>" event) any hanzi (which are tagged as 'hanzi'), the second looks that hanzi up in the App’s dictionary.

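from tkinter import Canvas  # the module is named Tkinter in Python 2

# Both functions below are methods of the App class.
def createCanvas(self):
    self.canvas = Canvas(self, width=self.width, height=self.height, bg='white')
    self.canvas.grid()
    x = y = 20

    # Draw each character as its own canvas text item, tagged 'hanzi'.
    for hanzi in self.document.characters:
        item = self.canvas.create_text((x, y), text=hanzi, font=("Arial", 12), tags='hanzi')

        # Move right; wrap at the right edge and stop at the bottom edge.
        x += self.char_width
        if x > self.width:
            x = 20
            y += self.char_height
            if y > self.height:
                break
    # Call mouseoverHanzi whenever the mouse enters an item tagged 'hanzi'.
    self.canvas.tag_bind('hanzi', "<Enter>", self.mouseoverHanzi)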

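def mouseoverHanzi(self, event):
    # Canvas item ids start at 1 and the hanzi are the only items drawn,
    # so (id - 1) is the index of the character in the document.
    n = event.widget.find_closest(event.x, event.y)[0] - 1
    hanzi = self.dictionary.search(self.document.characters[n])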

Things to improve

  • I'd like to be able to look up compounds of more than two characters, which I think will require improvements in the way the dictionary stores information, so it can retrieve compounds efficiently. Maybe I’ll learn how to use SQL or something, which I’ve been planning for a while now.
  • The program should also look up compounds in which the character under the mouse is not the first. For example, if a user mouses over (there must be a better verb) 生 in the image text, the program should offer 先生.
  • I'd like to add a visual aspect to the display, for example, highlighting the character or compound under selection, or by creating the option to display the pinyin under each hanzi.
  • A major functionality that I intend to add is the ability to update the dictionary, so for example, I could add 女贞 (which means Ligustrum lucidum, the Chinese Privet (Harry Potter lives on Privet Drive)), which isn’t in cedict. This again may require changing the way I store information in the dictionary.
  • In fact, I would like to have a separate list, which would contain words I know or would like to learn. The program could then highlight which words are not currently in my list.
  • I also need to fix an issue with words like 号, which have two readings and corresponding meanings. Currently, when 号 is selected, the program displays its pinyin as "hao2", and its meaning as "roar; cry", because this is the first entry in the dictionary, coming before the other pinyin and meaning (which is the correct one in this context), "hao4" and "day of a month; (ordinal) number". This could, yet again, be overcome by altering the way the dictionary stores information.
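Several of these points come down to how the dictionary is stored. A minimal sketch of one possibility, assuming an in-memory Python dict rather than SQL: each headword maps to a list of (pinyin, meaning) entries, so 号 keeps both readings, and a helper finds every dictionary word that covers a given position, whether or not the character under the mouse comes first. The names DICTIONARY, compounds_at and max_len are hypothetical, and the entries other than 号 are illustrative glosses rather than exact cedict lines.

```python
# Hypothetical structure: headword -> list of (pinyin, meaning) entries.
DICTIONARY = {
    "号": [("hao2", "roar; cry"),
           ("hao4", "day of a month; (ordinal) number")],
    "先生": [("xian1 sheng5", "Mr.; teacher")],   # illustrative gloss
    "生": [("sheng1", "to be born; life")],       # illustrative gloss
}

def compounds_at(text, n, dictionary, max_len=4):
    """Return (start, word) for every dictionary word covering position n.

    Longer words are listed first. max_len is an assumed upper bound on
    the length of a compound.
    """
    found = []
    for length in range(max_len, 0, -1):
        # Slide a window of this length over every start that includes n.
        first_start = max(0, n - length + 1)
        last_start = min(n, len(text) - length)
        for start in range(first_start, last_start + 1):
            candidate = text[start:start + length]
            if candidate in dictionary:
                found.append((start, candidate))
    return found

# Mousing over 生 (index 2) in a made-up string offers 先生 first, then 生.
print(compounds_at("王先生", 2, DICTIONARY))
```

Keeping all readings in a list leaves the choice between "hao2" and "hao4" to a later step, such as showing both or picking one from context.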

Adding some colours

I've made a bit of progress with my app; most of my effort went into trying to understand Unicode and generally getting confused by the various encodings (specifically, the difference between UTF-8 and Unicode). The upshot is that the app can now save lists of hanzi and handles text much better, not breaking when given text with English words, numbers or punctuation. In fact, the app now handles punctuation and formatting much better, so it doesn’t, for example, attempt to look up punctuation in the dictionary, and colours it differently (see image).
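The distinction that causes the confusion can be shown in a few lines. This is a sketch in Python 3 (where str is Unicode text and bytes holds the encoded form), not the app's own code. The is_hanzi test is an assumption: it only checks the main CJK Unified Ideographs block (U+4E00 to U+9FFF), which covers common characters but not rarer extension blocks.

```python
def is_hanzi(char):
    """True if `char` is in the main CJK Unified Ideographs block."""
    return "\u4e00" <= char <= "\u9fff"

def classify(text):
    """Label each character so hanzi can be looked up and the rest skipped."""
    return [(char, "hanzi" if is_hanzi(char) else "other") for char in text]

# UTF-8 is one way of writing Unicode text as bytes, e.g. in a saved file.
raw = "你好, Harry!".encode("utf-8")
text = raw.decode("utf-8")

# Each of these hanzi takes three bytes in UTF-8 but is one character.
print(len(raw), len(text))
print([char for char, kind in classify(text) if kind == "hanzi"])
```

Decoding once when a file is read, and encoding once when it is written, keeps everything in between working on whole characters.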

I worked out how to colour characters differently, so I can keep track of which hanzi I know (whatever 'know' means). Currently, I can save a new list of words by clicking on hanzi and then on the ‘Add to list’ button. Once added to the list, all instances of the hanzi change from red to blue. The pinyin and meaning boxes are now Tkinter Entry widgets, which means I can edit the pinyin and meaning before adding the hanzi to the list. I can also click and drag on the canvas to select multiple hanzi (which will be searched for in the dictionary), and then add this compound to my list. I could, if I so wished, also select an entire sentence, write its pinyin and meaning, then add this to the list.

Reader v1.03

However, there are a few problems. Because text is selected by clicking and dragging to create an invisible rectangle, sentences that span more than one line can't be selected properly. Moreover, it is possible to select the first hanzi of every line, in which case they will be concatenated into a single meaningless string. Also, colouring hanzi currently only works when single hanzi are added to the dictionary, not compounds or sentences. I’m still pondering how to handle single hanzi, compounds and sentences. Finally, I would like the app to load saved word lists when starting, and use these as a dictionary before it tries the main cedict-based one.
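One possible way around the rectangle problem is to select by position in the text rather than by area: record the index of the character where the drag starts and the index where it ends, then take everything between them in reading order. A minimal sketch, assuming the two indices are already known (for example from the canvas item under the mouse at press and release); select_span and the sample string are hypothetical.

```python
def select_span(characters, start, end):
    """Return the text between two character indices, inclusive.

    The indices may be given in either order, so dragging backwards
    selects the same text as dragging forwards.
    """
    first, last = sorted((start, end))
    return "".join(characters[first:last + 1])

# Made-up text; in the app this would be the document's character list.
characters = list("今天天气很好我们去公园")
print(select_span(characters, 3, 8))
```

Because the selection is a contiguous run of indices, it cannot pick up just the first hanzi of each line, and it continues across line breaks.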

Incidentally, the text in the image is from Ellie’s Secret Diary, which I got as a dual text from the excellent children’s section of Oxford’s public library. I’m slowly building up my own library of texts for my app to use. I’d like to further improve the way the app deals with formatting, so things such as titles can be displayed differently and I’d also like to create multiple pages for different chapters. I’ve added a scrollbar for the text, but it seems to confuse the app’s ability to locate text on the page, so mousing-over one character displays another. Finally, I need to think of a good way to efficiently work out how high the canvas should be. So quite a bit more still to be done.
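The canvas height could be worked out by running the same wrapping rule as the drawing loop without drawing anything. The sketch below assumes the layout used earlier (start at a 20-pixel margin, step right by the character width, wrap once x passes the canvas width); the function name layout and its parameters are made up for the example.

```python
def layout(n_chars, width, char_width, char_height, margin=20):
    """Return (positions, height) for n_chars laid out with wrapping.

    positions[i] is the (x, y) at which character i would be drawn;
    height is the last row's y plus a bottom margin.
    """
    positions = []
    x = y = margin
    for _ in range(n_chars):
        positions.append((x, y))
        x += char_width
        if x > width:          # same wrapping test as the drawing loop
            x = margin
            y += char_height
    last_y = positions[-1][1] if positions else margin
    return positions, last_y + margin

positions, height = layout(12, width=100, char_width=20, char_height=20)
print(height)
```

On the scrollbar issue, one likely cause is that mouse events report window coordinates while find_closest expects canvas coordinates; the Canvas methods canvasx and canvasy convert between the two.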

Building a contextual dictionary

The text reader now displays known compounds of characters

Yet another version of my Chinese Reader App (version 1.0.4) is underway. The most obvious difference is the addition of a list that displays the compounds of any single hanzi queried. The second, apparently cosmetic, difference is that the title is now larger and centred. Both these changes required some quite major changes in how the program functions, and as a result the code is now a bit more robust, cleaner, and pleasingly, shorter (though that’s partly due to shifting some of the work into a separate module). The app can now deal with compounds more efficiently and format text better, though work is still required to improve both. As a side note, the text in the screen shot is from chapter 0.5 of the excellent A Key to Chinese Speech and Writing, book 1.

One issue I've come across is how to deal with the situation where, for example, 中国 is added to the dictionary before 中 or 国 are added. In this case, 中 and 国 should not be coloured unless they appear together or are later added separately. The reason for this is that it's useful to learn compounds without necessarily knowing what the individual characters mean. Though in the case of 中国, it is probably useful to know what both 中 and 国 mean.
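The colouring rule could be sketched as: a position is coloured only if it lies inside an occurrence of a word from the known list. This is an illustration under that assumption, not the app's code; known_positions and the sample text are made up.

```python
def known_positions(text, known_words):
    """Return the set of indices in `text` covered by any known word."""
    covered = set()
    for word in known_words:
        if not word:
            continue
        # Find every occurrence of the word, including overlapping ones.
        start = text.find(word)
        while start != -1:
            covered.update(range(start, start + len(word)))
            start = text.find(word, start + 1)
    return covered

text = "中国人在中间"
# Only 中国 is known: the 中 in 中间 (index 4) stays uncoloured.
print(sorted(known_positions(text, {"中国"})))
# Once 中 is added on its own, index 4 is coloured as well.
print(sorted(known_positions(text, {"中国", "中"})))
```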

However, when learning, say, 自然, knowing what 自 and 然 mean individually is not very useful.

Instead, I would like to understand the use of 然 by building up a list of words containing it: 自然 (nature), 突然 (sudden), 虽然 (although), 当然 (of course) and 然后 (afterwards). Similarly, seeing that 国 is found in 美国 (America), 法国 (France), 中国 (China) and 国际 (international) should give a pretty good idea of what 国 might mean. This seems a more natural way of learning Chinese.
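Building such a list is a simple filter over the known words. A minimal sketch using the examples above; the WORDS dict and the function name compounds_containing are hypothetical stand-ins for the app's word list.

```python
# Word -> meaning, using the examples from the text.
WORDS = {
    "自然": "nature",
    "突然": "sudden",
    "虽然": "although",
    "当然": "of course",
    "然后": "afterwards",
    "美国": "America",
    "法国": "France",
    "中国": "China",
    "国际": "international",
}

def compounds_containing(char, words):
    """Return (word, meaning) for every multi-character word containing char."""
    return [(word, meaning) for word, meaning in words.items()
            if len(word) > 1 and char in word]

for word, meaning in compounds_containing("国", WORDS):
    print(word, meaning)
```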

This then raises the problem of when to consider a collection of hanzi as a compound or a string of individual hanzi. For example, 吃饭 could be seen as a verb (to eat) plus a noun (rice), or it could be seen as a compound meaning to eat. Similarly, 结婚 could be seen as a verb (to tie) + noun (marriage) or simply a compound verb (to marry). As in the case of 自然, it seems sensible to learn the compounds. But it also might be useful to learn what other nouns can be used with 吃 or 结. Nouns could then be classified by what verbs they can be used with. (Another way to classify nouns would be by what measure word is used with them). All this would require quite a lot of information to be attached to each word.

As another example, should the dictionary store the words 中国人 and 美国人, or store the rule that 人 can follow a country to mean 'a person from that country'? (And in the latter case, how could this be recorded in a dictionary?) Steven Pinker writes about the different ways humans learn words (in particular, regular and irregular verbs) in his excellent book Words and Rules. Incidentally, English still requires us to learn lots of separate words (an 'American', a 'Finn', a 'Swede', a 'Netherlander', an 'English person'). In some respects, I think the hardest part of making a Chinese reader that can translate Chinese text to English might actually be dealing with English grammar.
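One way the rule option might be recorded is as a function that is tried when a word is not found directly: if the word ends in 人 and the rest is a known country, build the meaning from the country's entry. This is a sketch of the idea only; COUNTRIES and translate_person are hypothetical names, and the English output deliberately sidesteps the irregular forms mentioned above.

```python
# Countries taken from the examples in the text.
COUNTRIES = {"中国": "China", "美国": "America", "法国": "France"}

def translate_person(word, countries):
    """Apply the rule 'country + 人 = a person from that country'.

    Returns None when the rule does not apply, so the caller can fall
    back to an ordinary dictionary lookup.
    """
    suffix = "人"
    stem = word[:-len(suffix)]
    if word.endswith(suffix) and stem in countries:
        return "a person from " + countries[stem]
    return None

print(translate_person("法国人", COUNTRIES))
# 大人 also ends in 人, but 大 is not a country, so the rule does not fire.
print(translate_person("大人", COUNTRIES))
```

Storing one rule plus a list of countries covers every country at once, at the cost of needing a separate list for exceptions.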

Thinking about these problems has given me another project: analysing Chinese text to look for patterns.

On a very distantly related note, I got my user name and password for the government's data site (which you can read about here). It seems that the data is modelled using RDF, which I must admit I know nothing about, and can be searched using SPARQL, which I also know nothing about. Now might be a good time to learn, as it may prove useful for organising language data. I will certainly have to organise the data in a more sophisticated structure.