Abstract
Constructing a theme-related word set is a basic task in building theme-oriented information retrieval systems. Most previous studies focus on identifying the representative words of a specific document, and few address the construction of a word set related to a theme. Building on an analysis of existing keyword extraction methods, this paper proposes a method that automatically constructs a theme-related word set from a primary theme-related word set given by domain experts and from well-known websites related to the theme. First, the method uses existing information extraction techniques to obtain documents from the websites and a keyword set for each document. It then calculates the correlation degree between the known theme-related word set and each document's keyword set, uses this document-theme relevance to derive the set of the document's words that are related to the theme, and merges that set into the theme-related word set. In this way, the theme-related word set is supplemented iteratively from the documents obtained from the theme-related websites. Because little research has focused on this problem and no relevant experimental data set exists, we apply the proposed method to construct theme-related word sets for two themes, “electricity” and “college entrance examination”, and invite domain experts to evaluate the resulting word sets. The results show that the method yields a relatively complete theme-related word set, which demonstrates its feasibility.
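The iterative expansion described in the abstract can be illustrated with a minimal sketch. The abstract does not give the correlation formula, the acceptance rule, or the stopping condition, so everything below is an assumption made for illustration: `relevance` uses a simple overlap ratio, `threshold` and `max_rounds` are hypothetical parameters, and document keyword sets are assumed to have been extracted already by some existing method.

```python
from typing import Iterable, List, Set


def relevance(theme_words: Set[str], doc_keywords: Set[str]) -> float:
    """Hypothetical document-theme relevance: the share of a document's
    keywords that already belong to the theme word set (an assumption,
    not the formula used in the paper)."""
    if not doc_keywords:
        return 0.0
    return len(theme_words & doc_keywords) / len(doc_keywords)


def expand_theme_words(
    seed_words: Iterable[str],
    doc_keyword_sets: Iterable[Iterable[str]],
    threshold: float = 0.5,   # assumed cut-off for "related to the theme"
    max_rounds: int = 10,     # assumed safety bound on the iteration
) -> Set[str]:
    """Grow the expert-provided seed set by merging in the keywords of
    documents judged relevant, repeating until nothing new is added."""
    theme: Set[str] = set(seed_words)
    docs: List[Set[str]] = [set(k) for k in doc_keyword_sets]

    for _ in range(max_rounds):
        new_words: Set[str] = set()
        for keywords in docs:
            # Relevance is measured against the set as it stood at the
            # start of this round; additions take effect next round.
            if relevance(theme, keywords) >= threshold:
                new_words |= keywords - theme
        if not new_words:
            break  # fixed point: no document contributes new words
        theme |= new_words
    return theme


seed = {"power", "grid"}
documents = [
    {"power", "grid", "transformer"},
    {"transformer", "substation"},
    {"football", "goal"},
]
result = expand_theme_words(seed, documents)
```

In this toy run the first document is accepted in the first round and contributes "transformer", which makes the second document relevant in the next round; the unrelated third document is never merged. A real system would need a stricter relevance measure to limit topic drift across iterations.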
Acknowledgement
This research was supported by the Training plan of Tianjin University Innovation Team (No. TD13-5025), the Natural Science Foundation of Tianjin (No. 15JCYBJC46500) and the Major Project of Tianjin Smart Manufacturing (No. 15ZXZNCX00050).
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Wu, Y., Li, Y., Hao, G. (2018). A Web-Based Theme-Related Word Set Construction Algorithm. In: U, L., Xie, H. (eds) Web and Big Data. APWeb-WAIM 2018. Lecture Notes in Computer Science, vol 11268. Springer, Cham. https://doi.org/10.1007/978-3-030-01298-4_17
DOI: https://doi.org/10.1007/978-3-030-01298-4_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-01297-7
Online ISBN: 978-3-030-01298-4
eBook Packages: Computer Science, Computer Science (R0)