Improvement of TextRank Based on Co-occurrence Word Pairs and Context Information

  • Yang Wang
  • Hua Yin
  • Minwei He
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11344)

Abstract

TextRank, a widely used keyword extraction algorithm, models the relationships between words as a graph. However, high-frequency words have more opportunities to co-occur with other words, so extracting keywords from co-occurrence relationships alone overlooks some less frequent words. TextRank also builds its graph from a single document, which makes it less effective on related documents because it misses the context information in the document collection. This paper proposes an improved TextRank algorithm. First, to introduce external document features and account for the relationships between documents, all co-occurrence word pairs in the document collection are extracted by association rule mining. Then the co-occurrence frequency in the TextRank score formula is replaced with the mutual information between the co-occurring word pairs, which takes less frequently co-occurring pairs into account. In addition, the context entropy of each word in the collection is calculated. Finally, a new TextRank score formula is constructed in which the context entropy is added to the modified score formula, each part with its own weight. To test its effectiveness, an experiment with five combinations of scoring weights compares the improved algorithm with the original TextRank and TF-IDF on two different types of datasets (a public Chinese dataset and a financial dataset crawled from the internet). The experimental results show that, when the two parts are weighted equally, the improved TextRank algorithm outperforms the others.
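The abstract does not give the exact formulas, so the following Python sketch is only one plausible reading of the pipeline it describes. Several details are assumptions made for illustration and are not taken from the paper: word pairs are collected with a simple sliding window rather than association rule mining, negative mutual information is clipped to zero so that edge weights stay non-negative, context entropy is the mean of left- and right-neighbour entropy, both parts are normalised by their maximum, and the weighted sum is applied after TextRank converges. The function names and the parameters `window`, `alpha` and `beta` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_pmi(docs, window=2):
    """Clipped pointwise mutual information of word pairs, counted over all docs."""
    words, pairs = Counter(), Counter()
    for tokens in docs:
        words.update(tokens)
        for i, w in enumerate(tokens):
            for v in tokens[i + 1:i + window]:
                if v != w:
                    pairs[frozenset((w, v))] += 1
    n_words, n_pairs = sum(words.values()), sum(pairs.values())
    pmi = {}
    for pair, count in pairs.items():
        a, b = tuple(pair)
        p_ab = count / n_pairs
        p_a, p_b = words[a] / n_words, words[b] / n_words
        # Assumption: clip negatives so the values can serve as edge weights.
        pmi[pair] = max(math.log(p_ab / (p_a * p_b)), 0.0)
    return pmi


def context_entropy(docs):
    """Mean of left- and right-neighbour entropy for each word in the collection."""
    left, right = defaultdict(Counter), defaultdict(Counter)
    for tokens in docs:
        for prev, nxt in zip(tokens, tokens[1:]):
            right[prev][nxt] += 1
            left[nxt][prev] += 1

    def entropy(counter):
        total = sum(counter.values())
        if total == 0:
            return 0.0
        return -sum(c / total * math.log(c / total) for c in counter.values())

    vocab = {w for tokens in docs for w in tokens}
    return {w: 0.5 * (entropy(left[w]) + entropy(right[w])) for w in vocab}


def weighted_textrank(tokens, weights, window=2, d=0.85, iters=50):
    """Standard weighted TextRank on one document, with the given edge weights."""
    graph = defaultdict(dict)
    for i, w in enumerate(tokens):
        for v in tokens[i + 1:i + window]:
            wt = weights.get(frozenset((w, v)), 0.0)
            if v != w and wt > 0:
                graph[w][v] = graph[v][w] = wt
    score = {w: 1.0 for w in set(tokens)}
    out_sum = {w: sum(graph[w].values()) for w in score}
    for _ in range(iters):
        score = {
            w: (1 - d) + d * sum(wt / out_sum[v] * score[v]
                                 for v, wt in graph[w].items())
            for w in score
        }
    return score


def improved_scores(tokens, docs, alpha=0.5, beta=0.5):
    """Weighted sum of normalised TextRank score and normalised context entropy."""
    tr = weighted_textrank(tokens, cooccurrence_pmi(docs))
    ent = context_entropy(docs)
    tr_max = max(tr.values())
    ent_max = max(ent.values()) or 1.0  # avoid dividing by zero
    return {w: alpha * tr[w] / tr_max + beta * ent.get(w, 0.0) / ent_max
            for w in tr}


# Toy collection of pre-tokenised documents (made-up data, not from the paper).
docs = [
    ["stock", "market", "risk"],
    ["stock", "price", "risk"],
    ["market", "price", "index"],
]
ranking = sorted(improved_scores(docs[0], docs).items(), key=lambda kv: -kv[1])
```

Setting `alpha` equal to `beta` corresponds to the equal-weight configuration that the abstract reports as performing best; the other weight combinations tested in the paper are not listed in the abstract.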

Keywords

TextRank, Keyword extraction, Co-occurrence word pairs, Mutual information, Context entropy

Notes

Acknowledgments

This work was supported by the Science and Technology Program of Guangzhou, China (No. 201707010495), the Project supported by Guangdong Province Universities, China (No. 2015KTSCX046), the Foundation for Technology Innovation in Higher Education of Guangdong Province, China (No. 2013KJCX0085), and the Foundation for Distinguished Young Talents in Higher Education of Guangzhou, China (No. 2013LYM0032).

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Information School, Guangdong University of Finance and Economics, Guangzhou, China
