Abstract
In this paper, we investigate ideas based on Machine Learning, Natural Language Processing, and Information Retrieval to outline possible research directions in software architecture recovery and clone detection. In particular, after an extensive review of related work, we illustrate two proposals for addressing these two problems, both of which are active topics in Software Maintenance. Both proposals use Kernel Methods to exploit structural representations of source code and to automate the detection of clones and the recovery of the architecture actually implemented in a subject software system.
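To make the idea of a kernel over structural representations of source code concrete, the following is a minimal sketch, not the method proposed in the paper. It assumes Python's standard `ast` module as the parser and uses a plain subtree-counting kernel: two fragments are compared by the dot product of their subtree-count vectors, where each subtree is reduced to its node types so that identifier names and literal values are ignored. The function names `subtree_signatures`, `subtree_kernel`, and `similarity` are hypothetical and chosen for illustration only.

```python
import ast
from collections import Counter
from math import sqrt


def subtree_signatures(source: str) -> Counter:
    """Count a structural signature for every AST subtree in `source`.

    A signature is built from node type names only, so renamed
    identifiers and changed literals yield the same signature.
    """
    counts: Counter = Counter()

    def visit(node: ast.AST) -> str:
        children = [visit(child) for child in ast.iter_child_nodes(node)]
        signature = type(node).__name__ + "(" + ",".join(children) + ")"
        counts[signature] += 1
        return signature

    visit(ast.parse(source))
    return counts


def subtree_kernel(a: str, b: str) -> int:
    """Dot product of the two subtree-count vectors (illustrative kernel)."""
    counts_a, counts_b = subtree_signatures(a), subtree_signatures(b)
    return sum(n * counts_b[sig] for sig, n in counts_a.items())


def similarity(a: str, b: str) -> float:
    """Kernel value normalized to [0, 1] by the self-similarities."""
    return subtree_kernel(a, b) / sqrt(subtree_kernel(a, a) * subtree_kernel(b, b))


if __name__ == "__main__":
    # Assumed toy fragments: the first two differ only in identifier names.
    area = "def area(w, h):\n    return w * h\n"
    size = "def size(x, y):\n    return x * y\n"
    total = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
    print(similarity(area, size))
    print(similarity(area, total))
```

In this sketch, a pair of fragments whose normalized score exceeds a chosen threshold would be reported as a clone candidate. The threshold, the choice of kernel, and the parser are all assumptions of the example and would need to be adapted to the language of the subject system.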
The research described in this paper has been partially supported by the European Community’s Seventh Framework Programme (FP7/2007-2013) under the grants #247758: EternalS – Trustworthy Eternal Systems via Evolving Software, Data and Knowledge, and #288024: LiMoSINe – Linguistically Motivated Semantic aggregation engiNes.
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Corazza, A. et al. (2013). Using Machine Learning and Information Retrieval Techniques to Improve Software Maintainability. In: Moschitti, A., Plank, B. (eds) Trustworthy Eternal Systems via Evolving Software, Data and Knowledge. EternalS 2012. Communications in Computer and Information Science, vol 379. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-45260-4_9
DOI: https://doi.org/10.1007/978-3-642-45260-4_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-45259-8
Online ISBN: 978-3-642-45260-4
eBook Packages: Computer Science (R0)