A Multilingual Named Entity Recognition System Using Boosting and C4.5 Decision Tree Learning Algorithms

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4265)

Abstract

In this paper we introduce a multilingual Named Entity Recognition (NER) system that uses statistical modeling techniques. The system identifies and classifies named entities (NEs) in Hungarian and English by applying AdaBoostM1 and the C4.5 decision tree learning algorithm. We focused on building as large a feature set as possible, and used a split-and-recombine technique to exploit its full potential. This methodology allowed us to train several independent decision tree classifiers on different subsets of the features and to combine their decisions in a majority voting scheme. The corpus made for the CoNLL 2003 conference and a segment of the Szeged Corpus were used for training and validation. Both consist entirely of newswire articles. Our system remains portable across languages without requiring any major modification; it slightly outperforms the best system of CoNLL 2003 on English and achieves a 94.77% F measure for Hungarian. The real value of our approach lies in its different basis compared to other top-performing models for English, which makes our system extremely successful when used in combination with CoNLL models.
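The split-and-recombine idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: a toy lookup-table classifier stands in for the boosted C4.5 trees, and the feature names (`capitalized`, `in_gazetteer`, `sentence_start`), the round-robin split, and the tiny training set are all hypothetical. Only the overall shape follows the abstract: split the feature set into subsets, train one classifier per subset, and combine the per-token decisions by majority vote.

```python
from collections import Counter, defaultdict


def split_features(feature_names, n_subsets):
    """Deal feature names round-robin into n_subsets groups (assumed split rule)."""
    return [feature_names[i::n_subsets] for i in range(n_subsets)]


class LookupClassifier:
    """Toy stand-in for a boosted decision tree.

    It memorises the most frequent label for each combination of values
    of its own feature subset and falls back to the overall most frequent
    label for unseen combinations.
    """

    def __init__(self, features):
        self.features = features
        self.table = {}
        self.default = None

    def _key(self, token):
        return tuple(token[f] for f in self.features)

    def fit(self, tokens, labels):
        counts = defaultdict(Counter)
        for token, label in zip(tokens, labels):
            counts[self._key(token)][label] += 1
        self.table = {k: c.most_common(1)[0][0] for k, c in counts.items()}
        self.default = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, token):
        return self.table.get(self._key(token), self.default)


def majority_vote(classifiers, token):
    """Return the label predicted by the largest number of classifiers."""
    votes = Counter(clf.predict(token) for clf in classifiers)
    return votes.most_common(1)[0][0]


# Hypothetical token-level features and labels (PER = person, O = outside).
FEATURES = ["capitalized", "in_gazetteer", "sentence_start"]
train_tokens = [
    {"capitalized": True, "in_gazetteer": True, "sentence_start": False},
    {"capitalized": False, "in_gazetteer": False, "sentence_start": False},
    {"capitalized": True, "in_gazetteer": False, "sentence_start": True},
    {"capitalized": True, "in_gazetteer": True, "sentence_start": True},
]
train_labels = ["PER", "O", "O", "PER"]

# One classifier per feature subset, each trained on the same tokens.
classifiers = [
    LookupClassifier(subset).fit(train_tokens, train_labels)
    for subset in split_features(FEATURES, 3)
]

known_name = {"capitalized": True, "in_gazetteer": True, "sentence_start": False}
plain_word = {"capitalized": False, "in_gazetteer": False, "sentence_start": True}
print(majority_vote(classifiers, known_name))
print(majority_vote(classifiers, plain_word))
```

Using an odd number of voters reduces the chance of ties in a two-label vote; with more labels, a real system would need an explicit tie-breaking rule, which this sketch leaves to the insertion order of `Counter`.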

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Szarvas, G., Farkas, R., Kocsor, A. (2006). A Multilingual Named Entity Recognition System Using Boosting and C4.5 Decision Tree Learning Algorithms. In: Todorovski, L., Lavrač, N., Jantke, K.P. (eds) Discovery Science. DS 2006. Lecture Notes in Computer Science, vol 4265. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11893318_27

  • DOI: https://doi.org/10.1007/11893318_27

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-46491-4

  • Online ISBN: 978-3-540-46493-8

  • eBook Packages: Computer Science, Computer Science (R0)
