Combining Machine Learning and Information Retrieval Techniques for Software Clustering

Corazza, Anna; Di Martino, Sergio; Maggio, Valerio; Scanniello, Giuseppe

doi:10.1007/978-3-642-28033-7_5

Part of the book series: Communications in Computer and Information Science (CCIS, volume 255)

Included in the following conference series:

International Workshop on Eternal Systems

Abstract

In software maintenance, defining effective approaches to partition a software system into meaningful subsystems is a longstanding and relevant research topic. Such approaches can significantly support maintainers in their tasks by grouping related entities of a large system into smaller, easier-to-comprehend subsystems.

In this paper we investigate the effectiveness of combining information retrieval and machine learning techniques to exploit the lexical information provided by programmers for software clustering. In particular, unlike any related work, we employ indexing techniques to explore the contribution of the combined use of six different dictionaries, corresponding to the six parts of the source code where programmers introduce lexical information: class, attribute, method, and parameter names, comments, and source code statements. Moreover, the relevance of each dictionary is estimated on the basis of the project characteristics, by applying a machine learning approach based on a probabilistic model and on the Expectation-Maximization algorithm. To group source files accordingly, two clustering algorithms are compared, K-Medoids and Group Average Agglomerative Clustering, and the investigation was conducted on a dataset of 9 open source Java software systems.
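
To make the pipeline described above more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes each source file has already been split into bags of terms for the six zones, combines per-zone cosine similarities with fixed weights, and then groups files by greedily maximizing the average pairwise similarity of the merged cluster (one common reading of group-average agglomerative clustering). The weight values, the toy files, and all function names are hypothetical; in the chapter the dictionary relevance is estimated with a probabilistic model and Expectation-Maximization, which is not reproduced here.

```python
import math
from collections import Counter
from itertools import combinations

# The six zones named in the abstract.
ZONES = ("class", "attribute", "method", "parameter", "comment", "statement")


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def combined_similarity(f1: dict, f2: dict, weights: dict) -> float:
    """Weighted sum of the per-zone cosine similarities of two files."""
    return sum(
        weights[z] * cosine(Counter(f1.get(z, [])), Counter(f2.get(z, [])))
        for z in ZONES
    )


def group_average_clustering(files: dict, weights: dict, k: int) -> list:
    """Greedily merge clusters until k remain, choosing at each step the merge
    whose resulting cluster has the highest average pairwise similarity."""
    names = sorted(files)
    sim = {
        frozenset((a, b)): combined_similarity(files[a], files[b], weights)
        for a, b in combinations(names, 2)
    }
    clusters = [[n] for n in names]

    def group_average(cluster):
        pairs = list(combinations(cluster, 2))
        return sum(sim[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > k:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: group_average(clusters[p[0]] + clusters[p[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return clusters


# Hypothetical zone weights (assumption: hand-picked, NOT learned via EM).
weights = {"class": 0.3, "attribute": 0.1, "method": 0.25,
           "parameter": 0.05, "comment": 0.2, "statement": 0.1}

# Toy, made-up files with pre-tokenized terms per zone.
files = {
    "Parser.java": {"class": ["parser"], "method": ["parse", "token"],
                    "comment": ["parse", "source", "token"]},
    "Lexer.java": {"class": ["lexer"], "method": ["next", "token"],
                   "comment": ["split", "source", "token"]},
    "Button.java": {"class": ["button"], "method": ["paint", "click"],
                    "comment": ["draw", "widget"]},
    "Panel.java": {"class": ["panel"], "method": ["paint", "layout"],
                   "comment": ["draw", "widget", "container"]},
}

clusters = group_average_clustering(files, weights, k=2)
print(clusters)
```

On this toy input the parsing-related files and the UI-related files end up in separate clusters. Raw term counts are used only to keep the sketch short; a real index would typically apply identifier splitting, stop-word removal, and a term-weighting scheme before computing similarities.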

Author information

Authors and Affiliations

University of Naples “Federico II”, Italy
Anna Corazza, Sergio Di Martino & Valerio Maggio
Università della Basilicata, Potenza, Italy
Giuseppe Scanniello

Editor information

Editors and Affiliations

Department of Information Engineering and Computer Science, University of Trento, Via Sommarive 14, 38123, Povo, Italy
Alessandro Moschitti
Department of Computer Science, Katholieke Universiteit Leuven, Celestijnenlaan 200A, 3001, Heverlee, Belgium
Riccardo Scandariato

About this paper

Cite this paper

Corazza, A., Di Martino, S., Maggio, V., Scanniello, G. (2012). Combining Machine Learning and Information Retrieval Techniques for Software Clustering. In: Moschitti, A., Scandariato, R. (eds) Eternal Systems. EternalS 2011. Communications in Computer and Information Science, vol 255. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28033-7_5

DOI: https://doi.org/10.1007/978-3-642-28033-7_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28032-0
Online ISBN: 978-3-642-28033-7
eBook Packages: Computer Science, Computer Science (R0)
