Abstract
Developers today can build their own applications by reusing existing software systems, but a lack of documentation hinders such reuse. We examine the problem of mining topics (i.e., topic extraction) from source code, which can facilitate comprehension of software systems. We propose a topic extraction method, Embedded Topic Extraction (EmbTE), that leverages word embedding techniques to take word semantics into account, something previous approaches to mining topics from source code have not done. We also adopt Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF) to extract topics from source code. Moreover, we propose an automated term selection algorithm that identifies the terms in source code that contribute most to the topic extraction task. Empirical studies on Java projects from GitHub (https://github.com/) show that EmbTE produces more coherent topics than the other methods. The results also indicate that method names, method comments, class names and class comments are the types of terms that contribute most to source code topic extraction.
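To make the term types named in the abstract concrete, the following minimal sketch pulls class-name, method-name and comment terms out of a Java snippet. It is an illustration only, not the term selection algorithm or the preprocessing used in the paper: the snippet `JAVA_SOURCE`, the helper names `split_identifier` and `extract_terms`, and the regular expressions are all assumptions, and the regexes are a crude stand-in for a real Java parser. Class and method comments are also lumped together here, whereas the paper treats them as separate term types.

```python
import re

# Hypothetical Java snippet, used only for illustration.
JAVA_SOURCE = '''
/** Parses HTTP request headers. */
public class HttpRequestParser {
    // Read the next header line from the stream.
    public String readHeaderLine(InputStream in) { return null; }
}
'''

# Matches /* ... */ block comments and // line comments.
COMMENT_RE = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)


def split_identifier(name):
    """Split a camelCase or snake_case identifier into lowercase words."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]


def extract_terms(source):
    """Group candidate terms by where they occur in the source.

    Assumption: simple regexes approximate Java syntax; a call such as
    `new Foo(` would be mistaken for a method declaration.
    """
    comments = COMMENT_RE.findall(source)
    code = COMMENT_RE.sub(" ", source)  # strip comments before matching names

    class_names = re.findall(r"\bclass\s+(\w+)", code)
    method_names = re.findall(r"\b\w+\s+(\w+)\s*\(", code)

    return {
        "class_names": [w for n in class_names for w in split_identifier(n)],
        "method_names": [w for n in method_names for w in split_identifier(n)],
        "comments": [w.lower() for c in comments
                     for w in re.findall(r"[A-Za-z]+", c)],
    }


terms = extract_terms(JAVA_SOURCE)
for term_type, words in terms.items():
    print(term_type, words)
```

Term lists of this kind, one per source file, could then serve as the documents given to a topic model such as LDA or NMF.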
Copyright information
© 2016 Springer International Publishing AG
About this paper
Cite this paper
Zhang, W.E., Sheng, Q.Z., Abebe, E., Babar, M.A., Zhou, A. (2016). Mining Source Code Topics Through Topic Model and Words Embedding. In: Li, J., Li, X., Wang, S., Li, J., Sheng, Q. (eds) Advanced Data Mining and Applications. ADMA 2016. Lecture Notes in Computer Science, vol 10086. Springer, Cham. https://doi.org/10.1007/978-3-319-49586-6_47
DOI: https://doi.org/10.1007/978-3-319-49586-6_47
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-49585-9
Online ISBN: 978-3-319-49586-6
eBook Packages: Computer Science (R0)