Language Resources and Evaluation, Volume 49, Issue 3, pp 487–519

Vietnamese treebank construction and entropy-based error detection

  • Phuong-Thai Nguyen
  • Anh-Cuong Le
  • Tu-Bao Ho
  • Van-Hiep Nguyen
Original Paper

Abstract

Treebanks, especially the Penn Treebank for English, play an essential role in both research on and applications of natural language processing (NLP). However, many languages still lack treebanks, and building one can be complicated and difficult. This work has a twofold objective. The first is to share our results in constructing a large Vietnamese treebank (VTB) with three levels of annotation: word segmentation, part-of-speech tagging, and syntactic analysis. We describe the major steps of the construction process with particular regard to specific properties of Vietnamese, such as the lack of word delimiters and its isolating character. These properties make sentences highly syntactically ambiguous, which makes it difficult to ensure a high level of agreement among annotators. Various studies of Vietnamese syntax were used not only to define the annotations but also to deal with ambiguities systematically. Annotators were supported by automatic labelling tools, based on statistical machine learning methods, for sentence pre-processing, and by a tree editor for manual annotation. As a result, an annotation agreement of around 90% was achieved. Our second objective is to present a method for automatically finding errors and inconsistencies in treebank corpora, together with its application to the construction of the VTB. The method uses the Shannon entropy measure on the premise that the more the entropy is reduced, the more errors have been corrected in the treebank. It ranks error candidates with a scoring function based on conditional entropy. Our experiments showed that the method detected high-error-density subsets of the original error candidate sets, and that corpus entropy was significantly reduced after error correction. These subsets were only about one third the size of the whole candidate set, yet they contained 80–90% of the total errors. The method can also be applied to languages similar to Vietnamese.
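To make the idea of entropy-based candidate ranking concrete, the following is a minimal sketch, not the scoring function used in the paper (the abstract does not specify it). It assumes a corpus reduced to hypothetical (context, label) pairs, computes the conditional entropy H(label | context) in bits, and scores each minority-labelled instance by how much the corpus entropy would drop if it were relabelled with the majority label of its context. The function names `conditional_entropy` and `rank_candidates` and the toy data are illustrative assumptions.

```python
from collections import Counter, defaultdict
from math import log2


def conditional_entropy(pairs):
    """Return H(label | context) in bits for a list of (context, label) pairs."""
    n = len(pairs)
    by_context = defaultdict(Counter)
    for context, label in pairs:
        by_context[context][label] += 1
    h = 0.0
    for labels in by_context.values():
        total = sum(labels.values())
        for count in labels.values():
            # p(context, label) * log2 p(label | context)
            h -= (count / n) * log2(count / total)
    return h


def rank_candidates(pairs):
    """Rank minority-labelled instances by assumed entropy reduction.

    Returns a list of (index, score) sorted by descending score, where score
    is the drop in conditional entropy obtained by relabelling that single
    instance with the majority label of its context. This is an illustrative
    stand-in for a conditional-entropy-based scoring function.
    """
    by_context = defaultdict(Counter)
    for context, label in pairs:
        by_context[context][label] += 1
    majority = {c: labels.most_common(1)[0][0] for c, labels in by_context.items()}

    base = conditional_entropy(pairs)
    scored = []
    for i, (context, label) in enumerate(pairs):
        if label == majority[context]:
            continue  # only instances that disagree with the majority are candidates
        relabelled = list(pairs)
        relabelled[i] = (context, majority[context])
        scored.append((i, base - conditional_entropy(relabelled)))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy data (hypothetical): context "a" is labelled X three times and Y once.
toy = [("a", "X")] * 3 + [("a", "Y")] + [("b", "Z")] * 2
for index, score in rank_candidates(toy):
    print(index, round(score, 4))
```

Recomputing the entropy for every candidate is quadratic in corpus size; it is kept here for clarity, and an incremental update per context would be the natural refinement.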

Keywords

Treebank, Error detection, Entropy

Notes

Acknowledgments

This paper is supported by the project QGTĐ.12.21 funded by Vietnam National University, Hanoi. We would like to express special thanks to the other members of the treebank development team, Xuan-Luong Vu and Dr. Thi-Minh-Huyen Nguyen, and to the linguistic annotators Minh-Thu Dao, Thi-Minh-Ngoc Nguyen, Kim-Ngan Le, and Mai-Van Nguyen, for their effective cooperation. We would also like to thank Assoc. Prof. Dinh Dien for his comments and discussions during the early stages of the treebank development.


Copyright information

© Springer Science+Business Media Dordrecht 2015

Authors and Affiliations

  • Phuong-Thai Nguyen (1)
  • Anh-Cuong Le (1)
  • Tu-Bao Ho (2)
  • Van-Hiep Nguyen (3)

  1. University of Engineering and Technology, Vietnam National University, Hanoi, Vietnam
  2. Japan Advanced Institute of Science and Technology, Nomi, Japan
  3. Institute of Linguistics, Vietnam Academy of Social Sciences, Hanoi, Vietnam
