An automated approach to assess the similarity of GitHub repositories

Nguyen, Phuong T.; Di Rocco, Juri; Rubei, Riccardo; Di Ruscio, Davide

doi:10.1007/s11219-019-09483-0

Published: 15 February 2020

Volume 28, pages 595–631 (2020)

Software Quality Journal

Phuong T. Nguyen¹, Juri Di Rocco¹, Riccardo Rubei¹ & Davide Di Ruscio¹ (ORCID: orcid.org/0000-0002-5077-6793)

Abstract

Open source software (OSS) allows developers to study, change, and improve code free of charge. Many high-quality OSS projects deliver stable and well-documented products. Most OSS forges sustain active user and expert communities, which in turn provide decent levels of support, both in answering user questions and in repairing reported software bugs. Code reuse is an intrinsic feature of OSS: developing a new system by leveraging existing open source components can reduce development effort, which benefits at least two phases of the software life cycle, namely implementation and maintenance. However, to improve software quality, it is essential to develop a system by learning from well-defined, mature projects. In this sense, the ability to find similar projects that facilitate ongoing development activities is of high importance. In this paper, we address the issue of mining open source software repositories to detect similar projects, which can eventually be reused by developers. We propose CrossSim as a novel approach to model the OSS ecosystem and to compute similarities among software projects. An evaluation on a dataset collected from GitHub shows that our proposed approach outperforms three well-established baselines.
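
The abstract does not spell out how similarities are computed, so the following is only an illustrative sketch of one graph-based option: the naive iterative form of SimRank (Jeh & Widom 2002, listed in the references). It is not the CrossSim implementation. The toy graph, the node names, the edge direction (from a library to each repository that uses it), the decay factor 0.8, and the iteration count are all assumptions made for the example.

```python
from itertools import product


def simrank(in_neighbors, decay=0.8, iterations=10):
    """Naive iterative SimRank over a small graph.

    in_neighbors maps every node to the set of nodes with an edge into it.
    Every node that appears in a set must also be a key of the mapping.
    """
    nodes = list(in_neighbors)
    # Start from the identity: each node is only similar to itself.
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, repeat=2)}
    for _ in range(iterations):
        new_sim = {}
        for a, b in product(nodes, repeat=2):
            if a == b:
                new_sim[(a, b)] = 1.0
                continue
            ia, ib = in_neighbors[a], in_neighbors[b]
            if not ia or not ib:
                # Nodes without incoming edges have no evidence of similarity.
                new_sim[(a, b)] = 0.0
                continue
            # Average pairwise similarity of the in-neighbors, scaled by decay.
            total = sum(sim[(i, j)] for i, j in product(ia, ib))
            new_sim[(a, b)] = decay * total / (len(ia) * len(ib))
        sim = new_sim
    return sim


# Hypothetical toy data: edges point from a library to the repositories using it.
graph = {
    "junit": set(),
    "guava": set(),
    "numpy": set(),
    "repo_a": {"junit", "guava"},
    "repo_b": {"junit", "guava"},
    "repo_c": {"numpy"},
}
scores = simrank(graph)
```

In this toy graph, `repo_a` and `repo_b` share both of their libraries and therefore receive a positive score, while `repo_c` shares nothing with either and scores zero against both. The naive formulation visits every node pair in each iteration, so it is only practical for small graphs.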

Notes

https://www.crossminer.org
SourceForge: https://sourceforge.net/
About GitHub: https://github.com/about
https://www.w3.org/TR/2014/REC-rdf11-concepts-20140225/
The file pom.xml and files with the extension .gradle are used to manage dependencies with Maven (https://maven.apache.org/) and Gradle (https://gradle.org/), respectively.
GitHub Rate Limit: https://developer.github.com/v3/rate_limit/
JUnit: http://junit.org/junit5/
https://github.com/yunzhang28/RepoPal/blob/master/1000repo.xlsx
For the sake of clarity, in this paper we give names to algorithms that were not originally named.
https://developer.android.com/guide/topics/manifest/manifest-intro.html
https://github.com/crossminer/CrossSim/blob/master/user-study/

References

Bagnato, A, Barmpis, K, Bessis, N, Cabrera-Diego, LA, Di Rocco, J, Di Ruscio, D, Gergely, T, Hansen, S, Kolovos, D, Krief, P, Korkontzelos, I, Laurière, S, Lopez de la Fuente, JM, Maló, P, Paige, RF, Spinellis, D, Thomas, C, Vinju, J. (2018). Developer-centric knowledge mining from large open-source software repositories (crossminer). In Seidl, M, & Zschaler, S (Eds.) Software technologies: applications and foundations (pp. 375–384). Cham: Springer International Publishing.
Baltes, S., Dumani, L., Treude, C., Diehl, S. (2018). SOTorrent: reconstructing and analyzing the evolution of stack overflow posts. In: MSR.
Behnamghader, P., Alfayez, R., Srisopha, K., Boehm, B. (2017). Towards better understanding of software quality evolution through commit-impact analysis. In 2017 IEEE International conference on software quality, reliability and security (QRS) (pp. 251–262).
Bhandari, U, Sugiyama, K, Datta, A, Jindal, R. (2013). Serendipitous recommendation for mobile apps using item-item similarity graph. In Banchs, RE, Silvestri, F, Liu, T-Y, Zhang, M, Gao, S, Lang, J (Eds.) AIRS, volume 8281 of lecture notes in computer science (pp. 440–451): Springer.
Bizer, C., Heath, T., Berners-Lee, T. (2009). Linked data - the story so far. International Journal on Semantic Web and Information Systems, 5(3), 1–22.
Blondel, V.D., Gajardo, A., Heymans, M., Senellart, P., Dooren, P.V. (2004). A measure of similarity between graph vertices: applications to synonym extraction and web searching. SIAM Review, 46(4), 647–666.
Borges, H., Hora, A., Valente, M.T. (2016). Understanding the factors that impact the popularity of github repositories. In 2016 IEEE International conference on software maintenance and evolution (ICSME) (pp. 334–344).
Chen, N., Hoi, S.C., Li, S., Xiao, X. (2015). SimApp: a framework for detecting similar mobile applications by online kernel learning. In Proceedings of the eighth ACM international conference on web search and data mining, WSDM ’15 (pp. 305–314). New York: ACM.
CLAN evaluation dataset. (2018). http://www.cs.wm.edu/semeru/clan/CaseStudyMaterials.zip. Last access 16.10.2018.
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12, 2493–2537.
Coutinho, A.E.V.B., Cartaxo, E.G., de Lima Machado, P.D. (2014). Analysis of distance functions for similarity-based test suite reduction in the context of model-based testing. Software Quality Journal, 24, 407–445.
Crussell, J., Gibler, C., Chen, H. (2013). AnDarwin: scalable detection of semantically similar android applications. In Computer security - ESORICS 2013 - 18th European symposium on research in computer security, Egham, UK, September 9-13, 2013. Proceedings (pp. 182–199).
Di Noia, T., Mirizzi, R., Ostuni, V.C., Romito, D., Zanker, M. (2012). Linked open data to support content-based recommender systems. In Proceedings of the 8th international conference on semantic systems, I-SEMANTICS ’12 (pp. 1–8). New York: ACM.
Evans, W.S., Fraser, C.W., Ma, F. (2009). Clone detection via structural abstraction. Software Quality Journal, 17(4), 309–330.
Garg, P.K., Kawaguchi, S., Matsushita, M., Inoue, K. (2004). MUDABlue: an automatic categorization system for open source repositories. In 11th Asia-Pacific software engineering conference (APSEC) (pp. 184–193).
Ghose, S., & Lowengart, O. (2001). Taste tests: impacts of consumer perceptions and preferences on brand positioning strategies. Journal of Targeting, Measurement and Analysis for Marketing, 10(1), 26–41.
Gitchell, D., & Tran, N. (1999). Sim: a utility for detecting similarity in computer programs. In The proceedings of the thirtieth SIGCSE technical symposium on computer science education, SIGCSE ’99 (pp. 266–270). New York: ACM.
Jaccard, P. (1912). The distribution of the flora in the alpine zone. New Phytologist, 11(2), 37–50.
Jeh, G., & Widom, J. (2002). Simrank: a measure of structural-context similarity. In Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’02 (pp. 538–543). New York: ACM.
Jiang, J., Lo, D., He, J., Xia, X., Kochhar, P.S., Zhang, L. (2017). Why and how developers fork what from whom in github. Empirical Software Engineering, 22(1), 547–578.
Kendall, M.G. (1938). A new measure of rank correlation. Biometrika, 30(1/2), 81–93.
Khan, S.U.R., Lee, S.P., Ahmad, R.W., Akhunzada, A., Chang, V. (2016). A survey on test suite reduction frameworks and tools. International Journal of Information Management, 36(6), 963–975.
Kobilarov, G., Scott, T., Raimond, Y., Oliver, S., Sizemore, C., Smethurst, M., Bizer, C., Lee, R. (2009). Media meets semantic web – how the bbc uses dbpedia and linked data to make connections. In Aroyo, L., Traverso, P., Ciravegna, F., Cimiano, P., Heath, T., Hyvönen, E., Mizoguchi, R., Oren, E., Sabou, M., Simperl, E. (Eds.) The semantic web: research and applications (pp. 723–737). Berlin: Springer.
Kollias, G., Sathe, M., Schenk, O., Grama, A. (2014). Fast parallel algorithms for graph similarity and matching. Journal of Parallel and Distributed Computing, 74 (5), 2400–2410.
Landauer, T.K. (2006). Latent semantic analysis. Wiley Online Library.
Landauer, T., Foltz, P., Laham, D. (1998). An introduction to latent semantic analysis. Discourse Processes, 25, 259–284.
Leitão, A.M. (2004). Detection of redundant code using r2d2. Software Quality Journal, 12(4), 361–382.
Linares-Vasquez, M., Holtzhauer, A., Poshyvanyk, D. (2016). On automatically detecting similar android apps. In 2016 IEEE 24th International conference on program comprehension (ICPC) (pp. 1–10).
Liu, C., Chen, C., Han, J., Yu, P.S. (2006). GPLAG: detection of software plagiarism by program dependence graph analysis. In Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’06 (pp. 872–881). New York: ACM.
Lo, D., Jiang, L., Thung, F. (2012). Detecting similar applications with collaborative tagging. In Proceedings of the 2012 IEEE international conference on software maintenance (ICSM), ICSM ’12 (pp. 600–603). Washington, DC: IEEE Computer Society.
Maarek, Y.S., Berry, D.M., Kaiser, G.E. (1991). An information retrieval approach for automatically constructing software libraries. IEEE Transactions on Software Engineering, 17(8), 800–813.
McMillan, C., Grechanik, M., Poshyvanyk, D. (2012). Detecting similar software applications. In Proceedings of the 34th international conference on software engineering, ICSE ’12 (pp. 364–374). Piscataway: IEEE Press.
Miller, G.A. (1995). Wordnet: a lexical database for english. Communications of the ACM, 38(11), 39–41.
Nassar, H., Veldt, N., Mohammadi, S., Grama, A., Gleich, D.F. (2018). Low rank spectral network alignment. In Proceedings of the 2018 World Wide Web conference, WWW ’18 (pp. 619–628). Republic and Canton of Geneva: International World Wide Web Conferences Steering Committee.
Nguyen, P.T., Tomeo, P., Di Noia, T., Di Sciascio, E. (2015). An evaluation of SimRank and personalized PageRank to build a recommender system for the web of data. In Proceedings of the 24th international conference on World Wide Web, WWW ’15 Companion (pp. 1477–1482). New York: ACM.
Nguyen, P.T., Di Rocco, J., Di Ruscio, D. (2018a). Knowledge-aware recommender system for software development. In Proceedings of the 1st Workshop on Knowledge-aware and Conversational Recommender System, KaRS, Vol. 2018. New York: ACM.
Nguyen, P.T., Di Rocco, J., Di Ruscio, D. (2018b). Mining software repositories to support OSS developers: a recommender systems approach. In Proceedings of the 9th Italian information retrieval workshop, Rome, Italy, May, 28-30, 2018.
Nguyen, P.T., Di Rocco, J., Rubei, R., Di Ruscio, D. (2018c). CrossSim: exploiting mutual relationships to detect similar OSS projects. In 2018 44th Euromicro conference on software engineering and advanced applications (SEAA) (pp. 388–395).
Nguyen, P.T., Di Rocco, J., Rubei, R., Di Ruscio, D. (2018d). CrossSim tool and evaluation data. https://github.com/crossminer/CrossSim.
Nguyen, P.T., Di Rocco, J., Di Ruscio, D. (2019a). Enabling heterogeneous recommendations in OSS development: what’s done and what’s next in CROSSMINER. In Proceedings of the evaluation and assessment on software engineering, EASE ’19 (pp. 326–331). New York: ACM.
Nguyen, P.T., Di Rocco, J., Di Ruscio, D., Ochoa, L., Degueule, T., Di Penta, M. (2019b). FOCUS: a recommender system for mining API function calls and usage patterns. In Proceedings of the 41st international conference on software engineering, ICSE ’19 (pp. 1050–1060). Piscataway: IEEE Press.
Pettigrew, S., & Charters, S. (2008). Tasting as a projective technique. Qualitative Market Research: An International Journal, 11(3), 331–343.
Ponzanelli, L., Bavota, G., Di Penta, M., Oliveto, R., Lanza, M. (2014). Mining StackOverflow to turn the IDE into a self-confident programming prompter. In Proceedings of MSR 2014 (pp. 102–111): ACM.
Ragkhitwetsagul, C., Krinke, J., Clark, D. (2018a). A comparison of code similarity analysers. Empirical Software Engineering, 23(4), 2464–2519.
Ragkhitwetsagul, C., Krinke, J., Marnette, B. (2018b). A picture is worth a thousand words: code clone detection based on image similarity. In 2018 IEEE 12th International workshop on software clones (IWSC) (pp. 44–50).
Rattan, D., Bhatia, R., Singh, M. (2013). Software clone detection: a systematic review. Information and Software Technology, 55(7), 1165–1199.
Schafer, J.B., Frankowski, D., Herlocker, J., Sen, S. (2007). Collaborative filtering recommender systems. In The adaptive web (pp. 291–324). Berlin: Springer.
Spearman, C. (1904). The proof and measurement of association between two things. The American Journal of Psychology, 15(1), 72–101.
Spinellis, D., & Szyperski, C. (2004). How is open source affecting software development? IEEE Software, 21(1), 28–33.
Stadler, C., Lehmann, J., Höffner, K., Auer, S. (2012). LinkedGeoData: a core for a web of spatial open data. Semantic Web, 3, 333–354.
Thung, F., Lo, D., Lawall, J. (2013). Automated library recommendation. In 2013 20th Working conference on reverse engineering (WCRE) (pp. 182–191).
Tiarks, R., Koschke, R., Falke, R. (2011). An extended assessment of type-3 clones as detected by state-of-the-art tools. Software Quality Journal, 19(2), 295–331.
Turney, P.D., & Pantel, P. (2010). From frequency to meaning: vector space models of semantics. Journal of Artificial Intelligence Research, 37(1), 141–188.
Tversky, A. (1977). Features of similarity. Psychological Review, 84(4), 327–352.
Ugurel, S., Krovetz, R., Giles, C.L. (2002). What’s the code?: automatic classification of source code archives. In Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’02 (pp. 632–638). New York: ACM.
Walenstein, A., El-Ramly, M., Cordy, J.R., Evans, W.S., Mahdavi, K., Pizka, M., Ramalingam, G., von Gudenberg, J.W. (2006). Similarity in programs. In Duplication, redundancy, and similarity in software, 23.07. - 26.07.2006.
Wang, H., Guo, Y., Ma, Z., Chen, X. (2015a). WuKong: a scalable and accurate two-phase approach to android App clone detection. In Proceedings of the 2015 international symposium on software testing and analysis, ISSTA 2015 (pp. 71–82). New York: ACM.
Wang, M., Wang, C., Yu, J.X., Zhang, J. (2015b). Community detection in social networks: an in-depth benchmarking study with a procedure-oriented framework. Proceedings of the VLDB Endowment, 8(10), 998–1009.
Xia, X., Lo, D., Wang, X., Zhou, B. (2013). Tag recommendation in software information sites. In Proceedings of the 10th Working Conference on Mining Software Repositories, MSR ’13 (pp. 287–296). Piscataway: IEEE Press.
Zhang, Y., Lo, D., Kochhar, P.S., Xia, X., Li, Q., Sun, J. (2017). Detecting similar repositories on GitHub. In 2017 IEEE 24th International conference on software analysis, evolution and reengineering (SANER) (pp. 13–23).

Acknowledgements

The research described in this paper has been carried out as part of the CROSSMINER Project, EU Horizon 2020 Research and Innovation Programme, grant agreement No. 732223. We thank our project partners for their help with the user evaluation presented in this paper. Furthermore, we thank the anonymous reviewers for their valuable comments and suggestions, which helped us improve the paper.

Author information

Authors and Affiliations

Department of Information Engineering, Computer Science and Mathematics, Università degli Studi dell’Aquila, Via Vetoio 2, 67100, L’Aquila, Italy
Phuong T. Nguyen, Juri Di Rocco, Riccardo Rubei & Davide Di Ruscio

Corresponding author

Correspondence to Davide Di Ruscio.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

1.1 Questionnaire

This is the questionnaire sent to the developers who took part in our user evaluation. We adopted most of its content from the CLAN evaluation dataset (2018) and McMillan et al. (2012).

1.2 Materials

We uploaded the materials created during the user evaluation to GitHub for future reference: https://github.com/crossminer/CrossSim/blob/master/user-study/

About this article

Cite this article

Nguyen, P.T., Di Rocco, J., Rubei, R. et al. An automated approach to assess the similarity of GitHub repositories. Software Qual J 28, 595–631 (2020). https://doi.org/10.1007/s11219-019-09483-0

Issue Date: June 2020
