Dealing with Small, Noisy and Imbalanced Data

Machine Learning or Manual Grammars?

  • Conference paper
Text, Speech and Dialogue (TSD 2008)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5246)

Abstract

This paper deals with the task of definition extraction when the training corpus is small, highly noisy and heavily imbalanced. A previous approach, based on manually constructed shallow grammars, turns out to be hard to outperform, even with such robust classifiers as SVMs, AdaBoost and simple ensembles of classifiers. However, a linear combination of several such classifiers with the manual grammars significantly improves on the results of the grammars alone.
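
As a rough illustration of what a linear combination of classifiers and a manual grammar can look like, the sketch below sums weighted per-component decisions for one sentence and compares the total with a threshold. The abstract does not specify the components, weights or threshold actually used, so the names `linear_combination` and `is_definition`, the three components and all numeric values are assumptions chosen for illustration only.

```python
from typing import Sequence


def linear_combination(votes: Sequence[float], weights: Sequence[float]) -> float:
    """Return the weighted sum of per-component decisions for one sentence."""
    if len(votes) != len(weights):
        raise ValueError("votes and weights must have the same length")
    return sum(w * v for w, v in zip(weights, votes))


def is_definition(votes: Sequence[float], weights: Sequence[float],
                  threshold: float) -> bool:
    """Label the sentence a definition if the weighted sum reaches the threshold."""
    return linear_combination(votes, weights) >= threshold


# Hypothetical components, in order: manual grammar, SVM, AdaBoost.
# The weights and threshold are illustrative, not taken from the paper.
weights = [0.5, 0.25, 0.25]
threshold = 0.5

grammar_only = [1, 0, 0]        # only the grammar fires
classifiers_only = [0, 1, 1]    # both classifiers fire, the grammar does not
single_classifier = [0, 1, 0]   # one classifier fires alone

for votes in (grammar_only, classifiers_only, single_classifier):
    print(votes, is_definition(votes, weights, threshold))
```

With these assumed values, the grammar alone or two agreeing classifiers are enough to accept a sentence, while a single classifier is not. In practice the weights and threshold would have to be tuned on held-out data, which matters when positive examples are rare.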



Editor information

Petr Sojka, Aleš Horák, Ivan Kopeček, Karel Pala

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Przepiórkowski, A., Marcińczuk, M., Degórski, Ł. (2008). Dealing with Small, Noisy and Imbalanced Data. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech and Dialogue. TSD 2008. Lecture Notes in Computer Science, vol 5246. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-87391-4_23

  • DOI: https://doi.org/10.1007/978-3-540-87391-4_23

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-87390-7

  • Online ISBN: 978-3-540-87391-4

  • eBook Packages: Computer Science, Computer Science (R0)
