Using Machine Learning and Information Retrieval Techniques to Improve Software Maintainability

Corazza, Anna; Di Martino, Sergio; Maggio, Valerio; Moschitti, Alessandro; Passerini, Andrea; Scanniello, Giuseppe; Silvestri, Fabrizio

doi:10.1007/978-3-642-45260-4_9


Part of the book series: Communications in Computer and Information Science (CCIS, volume 379)

Included in the following conference series:

International Workshop on Eternal Systems

Abstract

In this paper, we investigate ideas based on Machine Learning, Natural Language Processing, and Information Retrieval to outline possible research directions in software architecture recovery and clone detection. In particular, after an extensive review of related work, we illustrate two proposals for addressing these two problems, both of which are active research topics in Software Maintenance. Both proposals use Kernel Methods to exploit structural representations of source code, automating the detection of clones and the recovery of the architecture actually implemented in a subject software system.
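The abstract gives no implementation details, so the following is only a minimal sketch of the general idea of a kernel over a structural representation of source code, and not the authors' method. It assumes Python source as input, uses the standard `ast` module to obtain syntax trees, and counts complete subtrees that two fragments share while ignoring identifier names. The function names `subtree_signatures`, `subtree_kernel`, and `similarity` are hypothetical and chosen for illustration.

```python
import ast
import math
from collections import Counter


def subtree_signatures(source: str) -> Counter:
    """Count every complete AST subtree, keyed by its node-type structure.

    Identifier names and literal values are ignored (an assumption of this
    sketch), so consistently renamed fragments yield identical counts.
    """
    counts: Counter = Counter()

    def visit(node: ast.AST) -> str:
        children = ",".join(visit(child) for child in ast.iter_child_nodes(node))
        signature = f"{type(node).__name__}({children})"
        counts[signature] += 1
        return signature

    visit(ast.parse(source))
    return counts


def subtree_kernel(a: Counter, b: Counter) -> int:
    """Dot product of the two subtree-count vectors."""
    return sum(count * b[signature] for signature, count in a.items())


def similarity(source_a: str, source_b: str) -> float:
    """Kernel value normalized by the two self-similarities."""
    a = subtree_signatures(source_a)
    b = subtree_signatures(source_b)
    return subtree_kernel(a, b) / math.sqrt(subtree_kernel(a, a) * subtree_kernel(b, b))


if __name__ == "__main__":
    original = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
    renamed = "def add_all(items):\n    acc = 0\n    for item in items:\n        acc += item\n    return acc\n"
    print(similarity(original, renamed))
```

Because the kernel is a dot product of count vectors, it can be plugged into any kernel-based learner or used directly with a threshold to flag candidate clones. A practical detector would need a richer kernel, for example one that also matches partial subtrees, which this sketch does not attempt.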

The research described in this paper has been partially supported by the European Community’s Seventh Framework Programme (FP7/2007-2013) under the grants #247758: EternalS – Trustworthy Eternal Systems via Evolving Software, Data and Knowledge, and #288024: LiMoSINe – Linguistically Motivated Semantic aggregation engiNes.

Author information

Authors and Affiliations

University of Naples “Federico II”, Italy
Anna Corazza, Sergio Di Martino & Valerio Maggio
University of Trento, Italy
Alessandro Moschitti & Andrea Passerini
University of Basilicata, Italy
Giuseppe Scanniello
ISTI Institute - CNR, Italy
Fabrizio Silvestri

Editor information

Editors and Affiliations

Department of Information Engineering and Computer Science, University of Trento, Via Sommarive 5, 38123, Povo, Trento, Italy
Alessandro Moschitti & Barbara Plank

About this paper

Cite this paper

Corazza, A. et al. (2013). Using Machine Learning and Information Retrieval Techniques to Improve Software Maintainability. In: Moschitti, A., Plank, B. (eds) Trustworthy Eternal Systems via Evolving Software, Data and Knowledge. EternalS 2012. Communications in Computer and Information Science, vol 379. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-45260-4_9

DOI: https://doi.org/10.1007/978-3-642-45260-4_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-45259-8
Online ISBN: 978-3-642-45260-4
eBook Packages: Computer Science, Computer Science (R0)