Cited text spans identification with an improved balanced ensemble model

Scientometrics

Abstract

Scientific summarization aims to provide a condensed summary of the important contributions of a scientific paper. The problem has been explored extensively, and recent work has turned to cited text spans as a basis for generating summaries. Cited text spans are the passages in the cited paper that most accurately reflect a given citation. They can be viewed as the important aspects of the cited paper, as annotated by the academic community. Identifying cited text spans is therefore vital for providing this alternative form of scientific summarization. In this paper, we explore three potential improvements to our previous work, a two-layer ensemble model for the cited text spans identification problem. First, we treat cited text spans identification as an imbalanced classification problem and compare preprocessing methods for handling the imbalanced dataset. Second, we propose the RANdom Sampling Aggregating (RANSA) algorithm to train the classifiers in the first layer of the ensemble. Finally, we apply an improved stacking framework, Hybrid-Stacking, to combine the first-layer models. The new ensemble model overcomes flaws of the previous work and shows improved performance on cited text spans identification.
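To make the idea of aggregating classifiers trained on random samples of an imbalanced dataset more concrete, the following is a minimal sketch. The abstract does not give the details of RANSA, so this is a generic balanced-undersampling ensemble with majority voting and should not be read as the authors' algorithm. The nearest-centroid base learner, all function names, and the toy data are assumptions introduced purely for illustration.

```python
import random
from collections import Counter


def centroid(rows):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def train_nearest_centroid(X, y):
    """Hypothetical stand-in base learner: store one centroid per class."""
    by_label = {}
    for features, label in zip(X, y):
        by_label.setdefault(label, []).append(features)
    return {label: centroid(rows) for label, rows in by_label.items()}


def predict_nearest_centroid(model, features):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(features, c))
    return min(model, key=lambda label: sq_dist(model[label]))


def balanced_sample(X, y, rng):
    """Randomly undersample every class down to the size of the smallest class."""
    by_label = {}
    for i, label in enumerate(y):
        by_label.setdefault(label, []).append(i)
    k = min(len(idx) for idx in by_label.values())
    chosen = []
    for idx in by_label.values():
        chosen.extend(rng.sample(idx, k))
    return [X[i] for i in chosen], [y[i] for i in chosen]


def train_ensemble(X, y, n_models=11, seed=0):
    """Train one base learner per independently drawn balanced sample."""
    rng = random.Random(seed)
    return [train_nearest_centroid(*balanced_sample(X, y, rng))
            for _ in range(n_models)]


def predict_ensemble(models, features):
    """Aggregate the base learners by majority vote."""
    votes = Counter(predict_nearest_centroid(m, features) for m in models)
    return votes.most_common(1)[0][0]


# Toy imbalanced data (assumed): 200 negatives near (0, 0), 10 positives near (5, 5).
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(200)]
X += [[data_rng.gauss(5, 1), data_rng.gauss(5, 1)] for _ in range(10)]
y = [0] * 200 + [1] * 10

models = train_ensemble(X, y)
print(predict_ensemble(models, [5.0, 5.0]))
```

Each base learner sees a different random subset of the majority class, so the ensemble as a whole uses more of the majority data than a single undersampled classifier would, while every individual learner still trains on balanced classes.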

Notes

  1. https://tac.nist.gov/2014/BiomedSumm/index.html.

  2. http://wing.comp.nus.edu.sg/cl-scisumm2016/.

  3. http://wing.comp.nus.edu.sg/~cl-scisumm2017/.

  4. http://wing.comp.nus.edu.sg/~cl-scisumm2018/.

  5. http://clair.eecs.umich.edu/aan/index.php.

  6. http://www.nltk.org/.

  7. https://pypi.org/project/imbalanced-learn/.

  8. http://scikit-learn.org/stable/.

  9. http://aan.how/index.php/home/download.

  10. https://radimrehurek.com/gensim/.

  11. The embeddings are trained with the setting of vector size 400, negative sampling, windows size of 5, minimum count of 5.

  12. BM25 showed the highest performance among all the information retrieval models.
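For readers unfamiliar with the BM25 ranking function mentioned in note 12, here is a minimal sketch of Okapi BM25 scoring. How the authors apply BM25 is not described in this excerpt. The scenario below (a citing sentence used as the query, sentences of the cited paper used as documents), the parameter values `k1=1.5` and `b=0.75`, the smoothed IDF variant, and the function name `bm25_scores` are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        # Length normalisation: longer-than-average documents are penalised.
        norm = k1 * (1 - b + b * len(doc) / avg_len)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Smoothed, non-negative IDF variant (an assumed choice).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores


# Hypothetical example: rank sentences of a cited paper against a citing sentence.
sentences = [
    "we propose a stacking ensemble for classification".split(),
    "the dataset is highly imbalanced".split(),
    "results are reported in the last section".split(),
]
citance = "they use a stacking ensemble".split()
scores = bm25_scores(citance, sentences)
best = max(range(len(sentences)), key=scores.__getitem__)
print(best, scores)
```

Sentences that share no term with the query receive a score of zero, so in practice the ranking is usually combined with preprocessing such as stemming or stop-word removal.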

Acknowledgements

Funding was provided by the National Natural Science Foundation of China (Grant Nos. 61303190, 61272146, 61472436, 61532001) and the National Key Research and Development Program of China (Grant No. 2018YFB1004502).

Author information

Corresponding author

Correspondence to Shasha Li.

About this article

Cite this article

Wang, P., Li, S., Zhou, H. et al. Cited text spans identification with an improved balanced ensemble model. Scientometrics 120, 1111–1145 (2019). https://doi.org/10.1007/s11192-019-03167-z

  • DOI: https://doi.org/10.1007/s11192-019-03167-z
