Abstract
In software maintenance, defining effective approaches to partition a software system into meaningful subsystems is a longstanding and relevant research topic. Such techniques can significantly support maintainers in their tasks by grouping related entities of a large system into smaller, easier-to-comprehend subsystems.
In this paper we investigate the effectiveness of combining information retrieval and machine learning techniques to exploit the lexical information provided by programmers for software clustering. In particular, unlike any related work, we employ indexing techniques to explore the contribution of the combined use of six different dictionaries, corresponding to the six parts of the source code where programmers introduce lexical information: class, attribute, method, and parameter names, comments, and source code statements. Moreover, the relevance of each dictionary is estimated on the basis of the project characteristics by applying a machine learning approach based on a probabilistic model and on the Expectation-Maximization algorithm. To group source files accordingly, two clustering algorithms are compared, K-Medoids and Group Average Agglomerative Clustering, and the investigation is conducted on a dataset of 9 open source Java software systems.
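The pipeline outlined above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy files, the uniform zone weights, the use of plain term-frequency cosine similarity, and the deterministic medoid initialisation are all assumptions made for illustration. In particular, the paper estimates the zone weights with the Expectation-Maximization algorithm, whereas here they are fixed placeholders.

```python
import math
from collections import Counter

# The six zones named in the abstract where programmers introduce lexical information.
ZONES = ["class", "attribute", "method", "parameter", "comment", "statement"]


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def combined_similarity(f1, f2, weights):
    """Weighted sum of per-zone cosine similarities between two files."""
    return sum(
        weights[z] * cosine(Counter(f1.get(z, [])), Counter(f2.get(z, [])))
        for z in ZONES
    )


def assign(dist, medoids):
    """Label each item with the index of its nearest medoid."""
    return [
        min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
        for i in range(len(dist))
    ]


def k_medoids(dist, k, max_iter=100):
    """Alternating K-Medoids on a precomputed distance matrix.

    Assumption: the first k items serve as initial medoids, which keeps the
    sketch deterministic but is not a recommended initialisation in general.
    """
    medoids = list(range(k))
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if members:
                # The new medoid minimises the total distance to its cluster members.
                best = min(members, key=lambda m: sum(dist[m][j] for j in members))
            else:
                best = medoids[c]  # keep the old medoid if the cluster emptied
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return assign(dist, medoids), medoids


# Hypothetical toy corpus: each file maps a zone to the terms extracted from it.
files = {
    "Parser.java": {"class": ["parser"], "method": ["parse", "token"],
                    "comment": ["parse", "input", "token"]},
    "Socket.java": {"class": ["socket"], "method": ["connect", "send"],
                    "comment": ["network", "connection"]},
    "Lexer.java": {"class": ["lexer"], "method": ["next", "token"],
                   "comment": ["split", "input", "token"]},
    "Client.java": {"class": ["client"], "method": ["connect", "receive"],
                    "comment": ["network", "client"]},
}

# Placeholder weights (uniform); the paper learns these per project via EM.
weights = {z: 1.0 / len(ZONES) for z in ZONES}

names = list(files)
dist = [
    [0.0 if a == b else 1.0 - combined_similarity(files[a], files[b], weights)
     for b in names]
    for a in names
]

labels, medoids = k_medoids(dist, k=2)
for name, label in zip(names, labels):
    print(name, "-> cluster", label)
```

On this toy input the two parsing-related files share one cluster and the two networking-related files share the other. Raising the weight of a zone makes vocabulary shared in that zone count more toward the grouping, which is the effect the learned weights are meant to capture.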
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Corazza, A., Di Martino, S., Maggio, V., Scanniello, G. (2012). Combining Machine Learning and Information Retrieval Techniques for Software Clustering. In: Moschitti, A., Scandariato, R. (eds) Eternal Systems. EternalS 2011. Communications in Computer and Information Science, vol 255. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28033-7_5
DOI: https://doi.org/10.1007/978-3-642-28033-7_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28032-0
Online ISBN: 978-3-642-28033-7
eBook Packages: Computer Science (R0)