
Effectiveness of Automatic Sentence Structure Analysis in Chinese Comprehension

Xiaoling Mo and Masato Hagiwara

Presented at the 2011 International Conference on Chinese Textbook and New Teaching Resources, Columbia University, New York, May 20-21, 2011


Introductory learners often find it difficult to identify words and proper nouns when reading Chinese sentences, which makes it hard for them even to look words up in a dictionary. Even when they can successfully identify words, it is often difficult for them to understand the sentence structure accurately.

There are two approaches to this problem. One is the traditional way, i.e., through teachers’ explanations. The other is to make the most of computer-based aids.

A large number of computer-assisted language learning (CALL) tools have been developed for Chinese, some of which are summarized in the following table.

Tool (author) | Segmentation | Dictionary lookup
(Silicon Valley Language Technologies, LLC.)
Chinese Annotation Tool (Peterson, Eric)
Chinese Annotation Tool
Popup Chinese
GoChinese Online Tool
DimSum: Chinese Reading Assistant and Dictionary (Peterson, Eric)
Chinese Toolbox 2011 (Aaron Todd Sherrill)
(ctrans Takeuchi)
Words-Chinese Pinyin Dictionary 1.6 (Firefox add-ons, HuangJason)
Asunaro (あすなろ), a multilingual Japanese reading support system | Y | Y

Previous research has also examined how computers can help second-language reading. Kanda et al. (2007) [1] demonstrated that presenting English sentences in “chunks” can increase learners’ reading speed. Li (2008) [2] also demonstrated that the “phrase theory” can help English learners with passage reading comprehension.

The purpose of this study is, therefore, to determine which automatic linguistic analysis techniques, combined with which visualization styles, are effective for Chinese reading assistance.


To answer this question, we focus mainly on two kinds of grammatical elements: word boundaries (word segmentation), and predicate phrases and verbs, because we assume they are the most important for comprehending sentence structure.

The above figure illustrates the overall architecture of the Chinese reading assistant tool we are developing. The Web server responsible for linguistic analysis works as a back end: it receives plain text, analyzes it, and sends the result back to the client’s browser. The browser then renders the result in a graphical interface.

The linguistic analysis here has three layers: plain text, word segmentation, and syntactic parse tree. For word segmentation we used the Stanford Chinese Word Segmenter [3], which uses a CRF (conditional random field) as its algorithm and the Chinese Penn Treebank standard as its segmentation standard. For parsing we used the Stanford statistical parser [4], which is based on PCFG (probabilistic context-free grammar) and the factored grammar.
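The request flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the tool's actual code: the function names, the JSON field names (`text`, `plain`, `segmented`, `parse`), and the toy analyzers are all hypothetical, and the real segmenter and parser are replaced by stand-in callables so the sketch runs on its own.

```python
import json


def analyze(text, segment, parse):
    """Build the three-layer analysis for one sentence.

    `segment` and `parse` are placeholders for the real word segmenter
    and syntactic parser (hypothetical interface).
    """
    tokens = segment(text)
    tree = parse(tokens)
    return {"plain": text, "segmented": tokens, "parse": tree}


def handle_request(body, segment, parse):
    """Back-end entry point: JSON request in, JSON response out."""
    text = json.loads(body)["text"]
    # ensure_ascii=False keeps Chinese characters readable in the response.
    return json.dumps(analyze(text, segment, parse), ensure_ascii=False)


# Toy stand-ins so the sketch is runnable without the Stanford tools.
def toy_segment(text):
    return text.split()


def toy_parse(tokens):
    return ["ROOT"] + [["X", token] for token in tokens]


response = handle_request(json.dumps({"text": "人才 市场"}), toy_segment, toy_parse)
```

The browser side would then choose which of the three layers in the response to render.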

Below is an example of these three-layered outputs:

Plain text:
一般来说,人才市场一年来有两个旺季。

After word segmentation:
一般  来说  ,  人才  市场  一  年  来  有  两  个  旺季  。

Syntactic parse:
    [“NP”, [“NN”, “一般”]],
    [“LC”, “来说”]],
  [“PU”, “,”],
  [“NP”, [“NN”, “人才”], [“NN”, “市场”]]
      [“QP”, [“CD”, “一”], [“CLP”, [“M”, “年”]]],
      [“LC”, “来”]
    [“VP”, [“VE”, “有”],
        [“QP”, [“CD”, “两”], [“CLP”, [“M”, “个”]]],
        [“NP”, [“NN”, “旺季”]]

The above analysis results are rendered by the assistant tool in the following three visualization styles:

Plain text:
一般来说,人才市场一年来有两个旺季。

After word segmentation:
一般  来说  ,  人才  市场  一  年  来  有  两  个  旺季  。

Syntactic parse:
一般  来说  ,  人才  市场  ( 一  年  来  _有_  两  个  旺季 )  。

The positions of parentheses and underlines are determined from the syntactic parse as follows. The outermost VPs (verb phrases), i.e., those shallowest in the parse tree counting from the root, are parenthesized, unless the whole sentence is a VP or the sentence itself is too short, as with the answer choices for questions. Verbs (marked by tags starting with the letter “V”) are also underlined, unless their ancestors include an NP (noun phrase).
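The parenthesizing and underlining rules can be sketched as below. This is only an illustration under several assumptions, not the tool's actual code: the nested-list tree is an assumed complete parse of the example sentence (the labels ROOT, IP, and LCP are guesses), `render` and `min_words` are hypothetical names, "the whole sentence is a VP" is approximated as a VP that covers every non-punctuation word, and underscores stand in for underlining.

```python
def is_leaf(node):
    # A preterminal looks like ["NN", "市场"]: a tag followed by one word.
    return len(node) == 2 and isinstance(node[1], str)


def content_words(node):
    """Return the words under `node`, skipping punctuation (tag PU)."""
    if is_leaf(node):
        return [] if node[0] == "PU" else [node[1]]
    return [w for child in node[1:] for w in content_words(child)]


def render(tree, min_words=4):
    """Parenthesize outermost VPs and mark verbs that are not inside an NP."""
    total = len(content_words(tree))
    out = []

    def walk(node, in_vp, in_np):
        label = node[0]
        if is_leaf(node):
            word = node[1]
            if label.startswith("V") and not in_np:
                word = "_" + word + "_"  # stands in for underlining
            out.append(word)
            return
        # Outermost VP only; skip short sentences and sentence-wide VPs.
        bracket = (label == "VP" and not in_vp and total >= min_words
                   and len(content_words(node)) < total)
        if bracket:
            out.append("(")
        for child in node[1:]:
            walk(child, in_vp or label == "VP", in_np or label == "NP")
        if bracket:
            out.append(")")

    walk(tree, False, False)
    return "  ".join(out)


# Assumed full parse of the example sentence (illustrative only).
TREE = ["ROOT", ["IP",
    ["LCP", ["NP", ["NN", "一般"]], ["LC", "来说"]],
    ["PU", ","],
    ["NP", ["NN", "人才"], ["NN", "市场"]],
    ["VP",
        ["LCP", ["QP", ["CD", "一"], ["CLP", ["M", "年"]]], ["LC", "来"]],
        ["VP", ["VE", "有"],
            ["QP", ["CD", "两"], ["CLP", ["M", "个"]]],
            ["NP", ["NN", "旺季"]]]],
    ["PU", "。"]]]

rendered = render(TREE)
```

Tracking `in_vp` and `in_np` while descending keeps the check to a single pass: a nested VP is never bracketed again, and a verb tag anywhere below an NP is left unmarked.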


In this section, we describe how we measured the effectiveness of visually showing grammatical elements to Chinese learners. To test this, we set up the “HSK self-check tool” (模拟HSK的自测工具) [5]. With this tool, which simulates the real HSK test, users answer a total of 50 questions and are then told which HSK grade their Chinese reading skill corresponds to. The test consists of two parts: synonym distinction (Part 1, 20 questions) and passage comprehension (Part 2, 30 questions). We recorded the total number of questions each user answered correctly and the time taken to finish the test.

The sentences in Part 1 are always shown in plain text, so that the subjects’ language skills can be compared on equal terms. The sentences in Part 2, on the other hand, are shown in one of three formats (plain text, word segmentation, or syntactic parse), chosen at random with probability 1/3 each. In this way, we can measure how the three visualization styles affect users’ comprehension without letting them notice that the tool itself is being tested.
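The random assignment can be sketched as follows. This is a minimal illustration with hypothetical names; it assumes one style is drawn per test session, which the text does not state explicitly, and it uses the style labels “plain,” “seg,” and “bracket” from the results figure.

```python
import random
from collections import Counter

STYLES = ("plain", "seg", "bracket")


def assign_style(rng):
    """Pick one visualization style uniformly at random (probability 1/3 each)."""
    return rng.choice(STYLES)


# Simulate many sessions to check that the three groups come out roughly equal.
rng = random.Random(2011)
counts = Counter(assign_style(rng) for _ in range(3000))
```

Passing the random generator in explicitly makes the assignment reproducible when a seed is fixed.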

The following figure shows the results:

The three groups of bars correspond to the three visualization styles: “plain,” “seg,” and “bracket” stand for plain text, word segmentation, and syntactic parse, respectively. The blue bars (Part 1), red bars (Part 2), and green bars (speed) represent the average Part 1 score, the average Part 2 score, and the reading speed. The speed is calculated as the number of questions a user can answer within 20 minutes, which is proportional to the inverse of the time taken to finish the test.
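The speed measure can be written out as a short calculation. This is a sketch under one assumption: that the time recorded covers all 50 questions of the test (the text does not say whether only Part 2 is timed). The function and parameter names are hypothetical.

```python
def reading_speed(minutes_taken, total_questions=50, window_minutes=20):
    """Questions answerable within `window_minutes` at the observed pace.

    Assumes the recorded time covers all `total_questions` questions.
    """
    if minutes_taken <= 0:
        raise ValueError("minutes_taken must be positive")
    return total_questions * window_minutes / minutes_taken


speed_fast = reading_speed(20)  # finished in exactly the 20-minute window
speed_slow = reading_speed(25)  # took longer, so the speed is lower
```

Because the numerator is constant, halving the time taken doubles the speed, which is the inverse proportionality mentioned above.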

We can observe the general trend that the richer the visualization styles are, the faster and more accurate users’ reading comprehension is.

Notice that the average Part 1 score for the plain group is slightly lower than for the other two groups. This is mainly due to the relatively small number of samples we collected for this test.


We have shown that richer visualization of Chinese syntactic structure can help users comprehend sentences faster and more accurately. Also, most previous research focused only on word segmentation, whereas this study shows the effectiveness of visualizing the syntactic parse for the first time, at least to the authors’ knowledge.

Feedback from users includes:

  • The questions taken from the HSK intermediate level are relatively difficult for beginners
  • Insertion of parentheses and underlines can negatively affect highly advanced learners
  • Parenthesized phrases could be mistaken for supplementary information
  • The number of samples is relatively small and strongly biased toward Japanese speakers

We are planning to improve the tool based on this feedback and to release it to the public.


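[1] 神田明延, 湯舟英一, 田淵龍二, 鈴木政浩: ソフトウェアのチャンク提示法による速読訓練の効果 [Effects of speed-reading training using software-based chunk presentation]. 第47回LET全国研究大会発表要項 [Proceedings of the 47th LET National Conference], 2007.
[2] 李薇薇: 组块理论在大学生英语阅读理解中作用的研究 [A study of the role of chunking theory in college students’ English reading comprehension]. In 黑龙江科技信息 [Heilongjiang Science and Technology Information], 2008.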
[3] http://nlp.stanford.edu/software/segmenter.shtml
[4] http://nlp.stanford.edu/software/lex-parser.shtml
[5] http://www.asianharmony.net/hsk-self-check/
