Overlapping thematic structures extraction with mixed-membership stochastic blockmodel
It is increasingly important to identify thematic structures automatically from the massive scientific literature. Because of interdisciplinarity, thematic structures have no natural boundaries. In this work, the identification of thematic structures is treated as an overlapping community detection problem on a large-scale citation-link network. A mixed-membership stochastic blockmodel, equipped with a stochastic variational inference algorithm, is used to detect the overlapping thematic structures. To enhance readability, each theme is labeled with several topical terms by a soft mutual information based method. Extensive experimental results on the astro dataset indicate that the mixed-membership stochastic blockmodel primarily uses local information and allows for pervasive overlaps, but it favors similar-sized themes, which disqualifies this approach from being used to extract thematic structures from scientific literature. In addition, the thematic structures from the bibliographic coupling network are similar to those from the co-citation network.
Keywords: Overlapping thematic structure; Mixed-membership stochastic blockmodel; Stochastic variational inference; Soft mutual information; Cluster labeling
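As an illustration of why a mixed-membership stochastic blockmodel permits overlapping themes, the following is a minimal sketch of the standard generative process for such a model, not the inference procedure or the exact variant used in this study. The function name `sample_mmsb`, the parameter values, and the three-theme setup are hypothetical choices made only for this example.

```python
import numpy as np


def sample_mmsb(n_nodes, alpha, block_matrix, seed=0):
    """Sample a directed graph from a standard MMSB generative process.

    All names and defaults here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    n_blocks = len(alpha)
    # Each node (paper) gets a mixed-membership vector over the themes,
    # so one paper can belong to several themes at once.
    memberships = rng.dirichlet(alpha, size=n_nodes)
    adjacency = np.zeros((n_nodes, n_nodes), dtype=int)
    for i in range(n_nodes):
        for j in range(n_nodes):
            if i == j:
                continue  # no self-links
            # For this ordered pair, each endpoint picks one theme
            # according to its own membership vector.
            z_send = rng.choice(n_blocks, p=memberships[i])
            z_recv = rng.choice(n_blocks, p=memberships[j])
            # A link appears with the theme-to-theme probability.
            adjacency[i, j] = rng.binomial(1, block_matrix[z_send, z_recv])
    return memberships, adjacency


# Hypothetical setup: 3 themes, dense links within a theme, sparse across.
alpha = np.full(3, 0.5)
block_matrix = np.full((3, 3), 0.01) + np.eye(3) * 0.5
memberships, adjacency = sample_mmsb(30, alpha, block_matrix)
```

Each row of `memberships` sums to one and may spread its weight over several themes, which is what makes the detected communities overlap. Inference, such as the stochastic variational inference mentioned in the abstract, would estimate these rows from an observed citation-link network.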
The present study is an extended version of an article (Xu et al. 2017) presented at the 16th International Conference on Scientometrics and Informetrics, Wuhan (China), 16–20 October 2017. The clustering results from this work have been deposited with the other astro-dataset results. Our gratitude also goes to the anonymous reviewers and the editor for their valuable comments. This work was partially supported by the Social Science Foundation of Beijing (Grant No. 17GLB074), the Science and Technology Project of Guangdong Province (Grant No. 2017A030303065), and the National Natural Science Foundation of China (Grant Nos. 71403255 and 71473237).
- Abbe, E. & Sandon, C. (2015). Community detection in general stochastic block models: Fundamental limits and efficient algorithms for recovery. In Proceedings of the 56th IEEE annual symposium on foundations of computer science (pp. 670–688). Washington, DC: IEEE Computer Society. https://doi.org/10.1109/FOCS.2015.47.
- Ananiadou, S. (1994). A methodology for automatic term recognition. In Proceedings of the 15th international conference on computational linguistics (pp. 1034–1038). Stroudsburg, PA: Association for Computational Linguistics. https://doi.org/10.3115/991250.991317.
- Bastian, M., Heymann, S., & Jacomy, M. (2009). Gephi: An open source software for exploring and manipulating networks. In Proceedings of the 3rd international AAAI conference on weblogs and social media (pp. 361–362).
- Chen, P.-Y., & Hero, A. O., III. (2015). Universal phase transition in community detectability under a stochastic block model. Physical Review E: Statistical Physics, Plasmas, Fluids, and Related Interdisciplinary Topics, 91(3), 032804. https://doi.org/10.1103/PhysRevE.91.032804.
- Dhillon, I. S. (2001). Co-clustering documents and words using bipartite spectral graph partitioning. In Proceedings of the 7th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 269–274). New York, NY: ACM. https://doi.org/10.1145/502512.502550.
- Goswami, S., Murthy, C. A., & Das, A. K. (2016). Sparsity measure of a network graph: Gini index. eprint arXiv:1612.07074.
- Mei, Q., Shen, X., & Zhai, C. (2007). Automatic labeling of multinomial topic models. In Proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 490–499). https://doi.org/10.1145/1281192.1281246.
- Park, Y., Byrd, R. J., & Boguraev, B. K. (2002). Automatic glossary extraction: Beyond terminology identification. In Proceedings of the 19th international conference on computational linguistics, Taipei, Taiwan (pp. 1–7).
- Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010). In M. W. Berry & J. Kogan (Eds.), Text mining: Application and theory (pp. 1–20). Hoboken: Wiley.
- Sclano, F., & Velardi, P. (2007). Termextractor: A web application to learn the common terminology of interest groups and research communities. In Proceedings of the 3rd international conference on interoperability for enterprise software and applications.
- Shi, Q., Qiao, X., Xu, S., & Nong, G. (2013). Author-topic evolution model and its application in analysis of research interests evolution. Journal of the China Society for Scientific and Technical Information, 32(9), 912–919.
- Xu, S., Liu, J., & Wang, Z. (2017). Overlapping thematic structures extraction with mixed-membership stochastic blockmodel. In Proceedings of ISSI 2017—the 16th international conference on scientometrics & informetrics (pp. 1007–1012).
- Zhang, Z., Gao, J., & Ciravegna, F. (2016). JATE 2.0: Java automatic term extraction with Apache Solr. In Proceedings of the 10th language resources and evaluation conference (pp. 2262–2269).
- Zhang, Z., Iria, J., Brewster, C., & Ciravegna, F. (2008). A comparative evaluation of term recognition algorithms. In Proceedings of the 6th international conference on language resources and evaluation, Marrakech, Morocco (pp. 2108–2113).