, Volume 114, Issue 3, pp 1031–1068 | Cite as

A domain keyword analysis approach extending Term Frequency-Keyword Active Index with Google Word2Vec model

  • Kai Hu
  • Huayi Wu
  • Kunlun Qi
  • Jingmin Yu
  • Siluo Yang
  • Tianxing Yu
  • Jie Zheng
  • Bo Liu


In bibliometric research, keyword analysis of publications provides an effective way not only to investigate the knowledge structure of research domains, but also to explore the developing trends within domains. To identify the most representative keywords, many approaches have been proposed. Most of them focus on using statistical regularities, syntax, grammar, or network-based characteristics to select representative keywords for the domain analysis. In this paper, we argue that the domain knowledge is reflected by the semantic meanings behind keywords rather than the keywords themselves. We apply the Google Word2Vec model, a model of a word distribution using deep learning, to represent the semantic meanings of the keywords. Based on this work, we propose a new domain knowledge approach, the Semantic Frequency-Semantic Active Index, similar to Term Frequency-Inverse Document Frequency, to link domain and background information and identify infrequent but important keywords. We adopt a semantic similarity measuring process before statistical computation to compute the frequencies of “semantic units” rather than keyword frequencies. Semantic units are generated by word vector clustering, while the Inverse Document Frequency is extended to include the semantic inverse document frequency; thus only words in the inverse documents with a certain similarity will be counted. Taking geographical natural hazards as the domain and natural hazards as the background discipline, we identify the domain-specific knowledge that distinguishes geographical natural hazards from other types of natural hazards. We compare and discuss the advantages and disadvantages of the proposed method in relation to existing methods, finding that by introducing the semantic meaning of the keywords, our method supports more effective domain knowledge analysis.


Keyword extraction Word2Vec Semantic clustering Semantic similarity Frequency Domain knowledge 



This work was supported by the National Natural Science Foundation of China (No. 41371372). Thanks Mr. Stephen C. McClure for helping us with the English revisions.


  1. Bird, S. (2006). NLTK: The natural language toolkit. In Proceedings of the COLING/ACL on interactive presentation sessions, 2006 (pp. 69–72). Association for Computational Linguistics.Google Scholar
  2. Borgatti, S. P. (2005). Centrality and network flow. Social networks, 27(1), 55–71. Scholar
  3. Chen, C. (2006). CiteSpace II: Detecting and visualizing emerging trends and transient patterns in scientific literature. Journal of the American Society for Information Science and Technology, 57(3), 359–377.CrossRefGoogle Scholar
  4. Chen, G., & Xiao, L. (2016). Selecting publication keywords for domain analysis in bibliometrics: A comparison of three methods. Journal of Informetrics, 10(1), 212–223.CrossRefGoogle Scholar
  5. Chen, G., Xiao, L., Hu, C.-P., & Zhao, X.-Q. (2015). Identifying the research focus of Library and Information Science institutions in China with institution-specific keywords. Scientometrics, 103(2), 707–724.CrossRefGoogle Scholar
  6. Der Maaten, L. V., & Hinton, G. E. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9, 2579–2605.MATHGoogle Scholar
  7. Ding, Y., Chowdhury, G. G., & Foo, S. (2001). Bibliometric cartography of information retrieval research by using co-word analysis. Information Processing and Management, 37(6), 817–842.CrossRefMATHGoogle Scholar
  8. Feng, J., Zhang, Y. Q., & Zhang, H. (2017). Improving the co-word analysis method based on semantic distance. Scientometrics, 111(3), 1521–1531.CrossRefGoogle Scholar
  9. Handler, A. (2014). An empirical study of semantic similarity in WordNet and Word2Vec. Citeseer.Google Scholar
  10. Huang, E. H., Socher, R., Manning, C. D., & Ng, A. Y. (2012). Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th annual meeting of the association for computational linguistics: Long papersVolume 1, 2012 (pp. 873–882): Association for Computational Linguistics.Google Scholar
  11. Meng, L., Huang, R., & Gu, J. (2013). A review of semantic similarity measures in wordnet. International Journal of Hybrid Information Technology, 6(1), 1–12.Google Scholar
  12. Mihalcea, R., Corley, C., & Strapparava, C. (2006). Corpus-based and knowledge-based measures of text semantic similarity. In AAAI, 2006 (Vol. 6, pp. 775–780).Google Scholar
  13. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013a). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
  14. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013b). Distributed representations of words and phrases and their compositionality. In Neural information processing systems (pp. 3111–3119).Google Scholar
  15. Miller, G. A. (1995). WordNet: A lexical database for English. Communications of the ACM, 38(11), 39–41.CrossRefGoogle Scholar
  16. Newman, M. E. (2008). The mathematics of networks. The New Palgrave Encyclopedia of Economics, 2(2008), 1–12.Google Scholar
  17. Quoniam, L., Balme, F., Rostaing, H., Giraud, E., & Dou, J. M. (1998). Bibliometric law used for information retrieval. [journal article]. Scientometrics, 41(1), 83–91. Scholar
  18. Rosenblatt, M. (1956). Remarks on some nonparametric estimates of a density function. Annals of Mathematical Statistics, 27(3), 832–837.MathSciNetCrossRefMATHGoogle Scholar
  19. Salton, G., & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5), 513–523.CrossRefGoogle Scholar
  20. Su, H.-N., & Lee, P.-C. (2010). Mapping knowledge structure by keyword co-occurrence: A first look at journal papers in Technology Foresight. Scientometrics, 85(1), 65–79. Scholar
  21. Wang, Z.-Y., Li, G., Li, C.-Y., & Li, A. (2012). Research on the semantic-based co-word analysis. Scientometrics, 90(3), 855–875.CrossRefGoogle Scholar
  22. Yang, S., Han, R., Wolfram, D., & Zhao, Y. (2016). Visualizing the intellectual structure of information science (2006–2015): Introducing author keyword coupling analysis. Journal of Informetrics, 10(1), 132–150.CrossRefGoogle Scholar
  23. Zhao, R., & Wang, J. (2010). Visualizing the research on pervasive and ubiquitous computing. Scientometrics, 86(3), 593–612.MathSciNetCrossRefGoogle Scholar

Copyright information

© Akadémiai Kiadó, Budapest, Hungary 2017

Authors and Affiliations

  1. 1.The State Key Laboratory of Information Engineering in Surveying, Mapping and Remote SensingWuhan UniversityWuhanChina
  2. 2.Collaborative Innovation Center of Geospatial TechnologyWuhan UniversityWuhanChina
  3. 3.Faculty of Information EngineeringChina University of Geosciences (Wuhan)WuhanChina
  4. 4.Changjiang Spatial Information Technology Engineering CO., LTDWuhanChina
  5. 5.School of Information ManagementWuhan UniversityWuhanChina
  6. 6.Faculty of GeomaticsEast China Institute of TechnologyNanchangChina

Personalised recommendations