Abstract
In Chinese, many characters are similar in shape, which often induces writing errors. Shape similarity measurement is an important part of automatic spelling correction, yet it remains a challenging problem. To address it, we propose a component-tree-based method built on the hypothesis that “characters are similar if their construction and components are both similar”. First, we recursively decompose each character into a tree whose root node is the character and whose leaf nodes are atomic parts, called strokes. Then, we align each pair of trees using their minimal super-tree and calculate their similarity bottom-up based on weighted edit distance. Finally, cognitive prominence is used to adjust the similarity scores. In text proofreading experiments, our method achieved 97% precision and 95.6% recall, which suggests that it can be applied in practical systems.
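The bottom-up idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: it replaces minimal super-tree alignment with a plain weighted edit distance over each node's ordered children, and it omits the cognitive-prominence adjustment. The `Node` class, the `layout` tag, the 0.5 layout penalty, and the stroke decompositions of the two example characters are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: str                      # character, component, or stroke name
    layout: str = ""                # hypothetical construction tag, e.g. "left-right"
    children: List["Node"] = field(default_factory=list)


def similarity(a: Node, b: Node) -> float:
    """Bottom-up similarity in [0, 1] between two component trees."""
    # Two leaves (strokes): identical labels match fully, others not at all.
    if not a.children and not b.children:
        return 1.0 if a.label == b.label else 0.0
    # A stroke compared with a composite node: treated as dissimilar here.
    if not a.children or not b.children:
        return 0.0

    n, m = len(a.children), len(b.children)
    # Weighted edit distance over the child sequences:
    # insertion/deletion cost 1, substitution cost 1 - similarity(children).
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = float(i)
    for j in range(1, m + 1):
        d[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 1.0 - similarity(a.children[i - 1], b.children[j - 1])
            d[i][j] = min(d[i - 1][j] + 1.0,      # delete a child of a
                          d[i][j - 1] + 1.0,      # insert a child of b
                          d[i - 1][j - 1] + sub)  # substitute
    child_sim = 1.0 - d[n][m] / max(n, m)

    # Assumed penalty when the construction (layout) differs.
    layout_sim = 1.0 if a.layout == b.layout else 0.5
    return layout_sim * child_sim


def strokes(*names: str) -> List[Node]:
    return [Node(name) for name in names]


# Illustrative decompositions (stroke names in pinyin; assumed, not from the paper).
ri = Node("日", children=strokes("shu", "hengzhe", "heng", "heng"))
yue = Node("月", children=strokes("pie", "hengzhegou", "heng", "heng"))
ming = Node("明", layout="left-right", children=[ri, yue])
peng = Node("朋", layout="left-right", children=[yue, yue])

print(similarity(ming, peng))
```

Under these assumed decompositions, 日 and 月 share two of four strokes, so 明 and 朋 score higher than their differing components do on their own; a real system would draw decompositions from a component standard and tune the costs empirically.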
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Cao, Y., Wang, S., Cao, C. (2015). Tree Based Shape Similarity Measurement for Chinese Characters. In: Zhang, S., Wirsing, M., Zhang, Z. (eds) Knowledge Science, Engineering and Management. KSEM 2015. Lecture Notes in Computer Science, vol 9403. Springer, Cham. https://doi.org/10.1007/978-3-319-25159-2_26
DOI: https://doi.org/10.1007/978-3-319-25159-2_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25158-5
Online ISBN: 978-3-319-25159-2
eBook Packages: Computer Science, Computer Science (R0)