Improved Automatic Keyword Extraction Given More Semantic Knowledge

Yang, Kai; Chen, Zhenhong; Cai, Yi; Huang, DongPing; Leung, Ho-fung

doi:10.1007/978-3-319-32055-7_10

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9645)

Included in the following conference series: International Conference on Database Systems for Advanced Applications

Abstract

Graph-based ranking algorithms such as TextRank are remarkably effective at keyword extraction. However, these algorithms build their graphs by considering only the lexical sequence of the documents. Hence, the graphs they generate cannot reflect the semantic relationships between documents. In this paper, we demonstrate that information is lost in the graph-building process that turns textual documents into graphs. This loss leads to misjudgments by the algorithm. To solve this problem, we propose a new approach called Topic-based TextRank. Unlike the traditional algorithm, our approach takes the lexical meaning of the text units (i.e., words and phrases) into account. Our experimental results show that the proposed algorithm can outperform state-of-the-art algorithms.
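
For orientation, the kind of purely lexical graph construction the abstract criticizes can be sketched as follows. This is a minimal illustration of a plain TextRank-style baseline (an unweighted co-occurrence graph ranked with a PageRank-style update), not the Topic-based TextRank proposed in the paper. The function name `textrank_keywords` and the parameters `window`, `damping`, `iterations` and `top_k` are assumptions made for this example.

```python
from collections import defaultdict


def textrank_keywords(tokens, window=2, damping=0.85, iterations=50, top_k=3):
    """Rank words with a PageRank-style update on a co-occurrence graph.

    Hypothetical helper for illustration only: edges come solely from word
    order, so no semantic knowledge enters the graph.
    """
    # Link every pair of distinct words that co-occur within `window`
    # consecutive tokens (window=2 links adjacent words only).
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Iterate S(v) = (1 - d) + d * sum(S(u) / deg(u)) over neighbours u of v.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping)
            + damping * sum(scores[u] / len(neighbors[u]) for u in neighbors[word])
            for word in neighbors
        }

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k], scores


if __name__ == "__main__":
    text = "graph ranking graph keyword graph extraction"
    top, scores = textrank_keywords(text.split())
    print(top)
```

Because the edges depend only on which words sit next to each other, two semantically related words that never co-occur within the window remain unconnected, which is the information loss the abstract refers to.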

Acknowledgement

This work is supported by the National Natural Science Foundation of China (project no. 61300137) and by the NEMODE Network Pilot Study: A Computational Taxonomy of Business Models of the Digital Economy, P55805.

Author information

Authors and Affiliations

School of Software Engineering, South China University of Technology, Guangzhou, China
Kai Yang, Zhenhong Chen, Yi Cai & DongPing Huang
Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, China
Ho-fung Leung

Corresponding author

Correspondence to Yi Cai.

Editor information

Editors and Affiliations

Harbin Institute of Technology, Harbin, China
Hong Gao
Kangwon National University, Kangwon, Korea (Republic of)
Jinho Kim
Kumamoto University, Kumamoto-shi, Japan
Yasushi Sakurai

About this paper

Cite this paper

Yang, K., Chen, Z., Cai, Y., Huang, D., Leung, Hf. (2016). Improved Automatic Keyword Extraction Given More Semantic Knowledge. In: Gao, H., Kim, J., Sakurai, Y. (eds) Database Systems for Advanced Applications. DASFAA 2016. Lecture Notes in Computer Science, vol 9645. Springer, Cham. https://doi.org/10.1007/978-3-319-32055-7_10

DOI: https://doi.org/10.1007/978-3-319-32055-7_10
Published: 12 April 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-32054-0
Online ISBN: 978-3-319-32055-7
eBook Packages: Computer Science, Computer Science (R0)
