Language Resources and Evaluation, Volume 52, Issue 1, pp 269–315

Ensuring annotation consistency and accuracy for Vietnamese treebank

  • Quy T. Nguyen
  • Yusuke Miyao
  • Ha T. T. Le
  • Nhung T. H. Nguyen
Original Paper

Abstract

Treebanks are important resources for researchers in natural language processing. They provide training and testing materials so that different algorithms can be compared. However, constructing a high-quality treebank is not a trivial task. A low-resource language such as Vietnamese has so far lacked a proper treebank, which has probably lowered the performance of Vietnamese language processing. To address this, we have been building a consistent and accurate Vietnamese treebank. Our treebank is annotated with three layers: word segmentation, part-of-speech tagging, and bracketing. We developed detailed annotation guidelines for each layer by presenting Vietnamese linguistic issues as well as methods of addressing them. Here, we also describe approaches to controlling annotation quality while maintaining a reasonable annotation speed. Specifically, we designed an appropriate annotation process and an effective process for training annotators. In addition, we implemented several support tools to improve annotation speed and to control the consistency of the treebank. Experimental results showed that both inter-annotator agreement and accuracy were higher than 90%, which indicates that the treebank is reliable.

Keywords

  • Vietnamese treebank
  • Quality control
  • Consistent annotation
  • Linguistic challenges

Notes

Acknowledgements

We would like to thank Assoc. Prof. Dien Dinh and Dr. Ngan L.T. Nguyen for their comments and the discussions we had with them during the early stages of developing the guidelines. We also would like to thank our annotators for their cooperation.

Copyright information

© Springer Science+Business Media B.V. 2017

Authors and Affiliations

  1. SOKENDAI (The Graduate University for Advanced Studies), Kanagawa, Japan
  2. National Institute of Informatics, Tokyo, Japan
  3. University of Social Sciences and Humanities, Ho Chi Minh City, Vietnam
  4. University of Science, Ho Chi Minh City, Vietnam
