Abstract
This paper addresses definition extraction when the training corpus is small, noisy and heavily imbalanced. A previous approach, based on manually constructed shallow grammars, turns out to be hard to beat, even with robust classifiers such as SVMs, AdaBoost and simple ensembles of classifiers. However, a linear combination of several such classifiers with the manual grammars significantly improves on the results of the grammars alone.
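As an illustration of what a linear combination of classifiers and grammars can look like, the following minimal sketch sums weighted binary votes and compares the total with a threshold. It is not the authors' implementation: the abstract does not give the weights, the threshold or the component classifiers, so the function name `linear_combination`, the three toy components and all numeric values below are hypothetical placeholders.

```python
from typing import Callable, Sequence

# A component takes a sentence and returns 1 (definition) or 0 (not a definition).
Classifier = Callable[[str], int]


def linear_combination(
    classifiers: Sequence[Classifier],
    weights: Sequence[float],
    threshold: float,
) -> Classifier:
    """Build a classifier that thresholds the weighted sum of component votes."""
    if len(classifiers) != len(weights):
        raise ValueError("need exactly one weight per classifier")

    def combined(sentence: str) -> int:
        score = sum(w * clf(sentence) for clf, w in zip(classifiers, weights))
        return 1 if score >= threshold else 0

    return combined


# Hypothetical stand-ins for a manual grammar and two trained classifiers.
def toy_grammar(sentence: str) -> int:
    # Fires on a simple copular pattern.
    return 1 if " is a " in sentence or " is an " in sentence else 0


def toy_classifier_a(sentence: str) -> int:
    # Fires on sentences of at least five tokens.
    return 1 if len(sentence.split()) >= 5 else 0


def toy_classifier_b(sentence: str) -> int:
    # Fires on sentences that start with an upper-case letter.
    return 1 if sentence[:1].isupper() else 0


# Assumed weights and threshold, chosen only for the example.
combined = linear_combination(
    [toy_grammar, toy_classifier_a, toy_classifier_b],
    weights=[0.5, 0.25, 0.25],
    threshold=0.75,
)
```

With these assumed weights the grammar alone cannot trigger a positive decision, and neither can the two toy classifiers together; a sentence is accepted only when the grammar and at least one classifier agree. In practice the weights and threshold would be tuned on held-out data.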
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Przepiórkowski, A., Marcińczuk, M., Degórski, Ł. (2008). Dealing with Small, Noisy and Imbalanced Data. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech and Dialogue. TSD 2008. Lecture Notes in Computer Science, vol 5246. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-87391-4_23
DOI: https://doi.org/10.1007/978-3-540-87391-4_23
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-87390-7
Online ISBN: 978-3-540-87391-4
eBook Packages: Computer Science, Computer Science (R0)