Improving Supervised Classification Using Information Extraction

  • Mian Du
  • Matthew Pierce
  • Lidia Pivovarova
  • Roman Yangarber
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9103)


We explore supervised learning for multi-class, multi-label text classification, focusing on real-world settings in which the distribution of labels changes dynamically over time. We use the PULS Information Extraction system to collect information about the distribution of class labels over named entities found in text. We then combine a knowledge-based rote classifier with statistical classifiers to obtain better performance than either classification method alone. The resulting classifier yields a significant improvement in macro-averaged F-measure compared to the state of the art, while maintaining a comparable micro-averaged F-measure.
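The two averages named in the abstract weight labels differently, which matters when label frequencies are skewed. The following is a minimal sketch of the standard definitions for multi-label output, not the evaluation code used in the paper; the function names, the label names, and the toy data are all hypothetical.

```python
from collections import defaultdict


def f1(tp, fp, fn):
    """F1 from raw counts; taken to be 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def micro_macro_f1(gold, predicted):
    """gold, predicted: lists of label sets, one set per document."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for g, p in zip(gold, predicted):
        for label in g & p:
            counts[label]["tp"] += 1
        for label in p - g:
            counts[label]["fp"] += 1
        for label in g - p:
            counts[label]["fn"] += 1
    if not counts:
        return 0.0, 0.0
    # Macro: average the per-label F1 scores, so every label counts equally.
    macro = sum(f1(**c) for c in counts.values()) / len(counts)
    # Micro: pool the counts over all labels, so frequent labels dominate.
    pooled = {k: sum(c[k] for c in counts.values()) for k in ("tp", "fp", "fn")}
    micro = f1(**pooled)
    return micro, macro


# Hypothetical toy data: a frequent label ("oil") and a rare one ("mining").
gold = [{"oil"}, {"oil"}, {"oil"}, {"oil", "mining"}]
predicted = [{"oil"}, {"oil"}, {"oil"}, {"oil"}]
micro, macro = micro_macro_f1(gold, predicted)
print(f"micro-F1 = {micro:.3f}, macro-F1 = {macro:.3f}")
```

In this toy case the classifier misses the rare label entirely: the micro-average stays high because the frequent label dominates the pooled counts, while the macro-average drops to the mean of one perfect and one zero per-label score.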


Statistical Classifier; Information Extraction; Feature Selection Method; Industry Sector; Name Entity Recognition



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Mian Du (1)
  • Matthew Pierce (1)
  • Lidia Pivovarova (1)
  • Roman Yangarber (1)
  1. Department of Computer Science, University of Helsinki, Helsinki, Finland
