Automatic keyphrase extraction: a survey and trends

  • Zakariae Alami Merrouni
  • Bouchra Frikh
  • Brahim Ouhbi

Abstract

Due to the exponential growth of textual data and web sources, an automatic mechanism is required to identify the relevant information embedded within them. Automatic Keyphrase Extraction (AKPE) is highly useful in this respect: it is widely adopted in Information Retrieval (IR), Natural Language Processing (NLP) and Text Mining (TM) applications, and it has the potential to ease the extraction of valuable information. In recent years, a wide range of AKPE techniques has been proposed; however, they are still impaired by low accuracy and moderate performance. This paper provides a comprehensive review of recent research on the AKPE task and its related techniques. More concretely, we describe the common process of this task and illustrate the approaches used (supervised, unsupervised, and Deep Learning) along with the techniques released for each. We investigate the major challenges these techniques face and the specific complexities they address. In addition, we provide a comparative study of the best-performing techniques, discuss why some perform better than others, and propose recommendations to improve each stage of the AKPE process.

Keywords

Information retrieval; Natural language processing; Text mining; Automatic keyphrase extraction; Supervised approaches; Unsupervised approaches; Deep learning

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. TTI Laboratory, Higher School of Technology (EST), Sidi Mohamed Ben Abdellah University, Fez, Morocco
  2. Mathematical Modeling & Computer Laboratory (LM2I), National Higher School of Arts and Crafts (ENSAM), Moulay Ismail University (UMI), Meknes, Morocco
