
An automated approach to assess the similarity of GitHub repositories



Open source software (OSS) allows developers to study, change, and improve code free of charge, and many high-quality OSS projects deliver stable and well-documented products. OSS forges typically sustain active user and expert communities, which in turn provide decent levels of support, both in answering user questions and in repairing reported software bugs. Code reuse is an intrinsic feature of OSS, and developing a new system by leveraging existing open source components can reduce development effort; it can thus benefit at least two phases of the software life cycle, namely implementation and maintenance. However, to improve software quality, it is essential to develop a system by learning from well-defined, mature projects. In this sense, the ability to find similar projects that facilitate ongoing development activities is of high importance. In this paper, we address the issue of mining open source software repositories to detect similar projects, which can eventually be reused by developers. We propose CrossSim, a novel approach to model the OSS ecosystem and to compute similarities among software projects. An evaluation on a dataset collected from GitHub shows that our proposed approach outperforms three well-established baselines.
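To illustrate the kind of graph-based similarity the abstract refers to, the following sketch implements SimRank (Jeh and Widom 2002), one of the measures discussed in the paper, over a toy, hypothetical project-dependency graph. It is computed over out-neighbors (the libraries a project points to) so that projects sharing dependencies score as similar; this is only a minimal illustration, not the actual CrossSim implementation, whose graph model of the OSS ecosystem is considerably richer.

```python
def simrank(graph, c=0.8, iterations=10):
    """SimRank-style scores over out-neighbors for a directed graph
    given as {node: set of successor nodes}."""
    nodes = list(graph)
    # Base case: every node is maximally similar to itself.
    sim = {(a, b): (1.0 if a == b else 0.0) for a in nodes for b in nodes}
    for _ in range(iterations):
        new = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[(a, b)] = 1.0
                    continue
                sa, sb = graph[a], graph[b]
                if not sa or not sb:
                    # No out-links on one side: no evidence of similarity.
                    new[(a, b)] = 0.0
                    continue
                # Average pairwise similarity of the two successor sets,
                # damped by the decay factor c.
                total = sum(sim[(x, y)] for x in sa for y in sb)
                new[(a, b)] = c * total / (len(sa) * len(sb))
        sim = new
    return sim

# Hypothetical toy ecosystem: three projects and the libraries they use.
g = {
    "projA": {"junit", "log4j"},
    "projB": {"junit", "log4j"},
    "projC": {"guava"},
    "junit": set(), "log4j": set(), "guava": set(),
}
scores = simrank(g)
# projA and projB share all their dependencies, so their score is high
# (0.4 with c=0.8), while projA and projC share none and score 0.0.
```

Running the measure over out-links rather than the in-links of the original SimRank definition is a deliberate choice here: projects are sources in a dependency graph, so only their outgoing edges carry the shared-context signal.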


Figs. 1–8


  2. SourceForge: https://sourceforge.net/

  3. About GitHub: https://github.com/about

  5. The file pom.xml and files with the .gradle extension relate to the management of dependencies by means of Maven (https://maven.apache.org/) and Gradle (https://gradle.org/), respectively.

  6. GitHub Rate Limit: https://developer.github.com/v3/rate_limit/

  7. JUnit: http://junit.org/junit5/

  9. For the sake of clarity, in this paper we give names to the algorithms that were not originally named.

  1. Bagnato, A, Barmpis, K, Bessis, N, Cabrera-Diego, LA, Di Rocco, J, Di Ruscio, D, Gergely, T, Hansen, S, Kolovos, D, Krief, P, Korkontzelos, I, Laurière, S, Lopez de la Fuente, JM, Maló, P, Paige, RF, Spinellis, D, Thomas, C, Vinju, J. (2018). Developer-centric knowledge mining from large open-source software repositories (crossminer). In Seidl, M, & Zschaler, S (Eds.) Software technologies: applications and foundations (pp. 375–384). Cham: Springer International Publishing.

  2. Baltes, S., Dumani, L., Treude, C., Diehl, S. (2018). SOTorrent: reconstructing and analyzing the evolution of stack overflow posts. In: MSR.

  3. Behnamghader, P., Alfayez, R., Srisopha, K., Boehm, B. (2017). Towards better understanding of software quality evolution through commit-impact analysis. In 2017 IEEE International conference on software quality, reliability and security (QRS) (pp. 251–262).

  4. Bhandari, U, Sugiyama, K, Datta, A, Jindal, R. (2013). Serendipitous recommendation for mobile apps using item-item similarity graph. In Banchs, RE, Silvestri, F, Liu, T-Y, Zhang, M, Gao, S, Lang, J (Eds.) AIRS, volume 8281 of lecture notes in computer science (pp. 440–451): Springer.

  5. Bizer, C., Heath, T., Berners-Lee, T. (2009). Linked data - the story so far. International Journal on Semantic Web and Information Systems, 5(3), 1–22.

  6. Blondel, V.D., Gajardo, A., Heymans, M., Senellart, P., Dooren, P.V. (2004). A measure of similarity between graph vertices: applications to synonym extraction and web searching. SIAM Review, 46(4), 647–666.

  7. Borges, H., Hora, A., Valente, M.T. (2016). Understanding the factors that impact the popularity of github repositories. In 2016 IEEE International conference on software maintenance and evolution (ICSME) (pp. 334–344).

  8. Chen, N., Hoi, S.C., Li, S., Xiao, X. (2015). SimApp: a framework for detecting similar mobile applications by online kernel learning. In Proceedings of the eighth ACM international conference on web search and data mining, WSDM ’15 (pp. 305–314). New York: ACM.

  9. CLAN evaluation dataset. (2018). http://www.cs.wm.edu/semeru/clan/CaseStudyMaterials.zip. Last access 16.10.2018.

  10. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12, 2493–2537.

  11. Coutinho, A.E.V.B., Cartaxo, E.G., de Lima Machado, P.D. (2014). Analysis of distance functions for similarity-based test suite reduction in the context of model-based testing. Software Quality Journal, 24, 407–445.

  12. Crussell, J., Gibler, C., Chen, H. (2013). AnDarwin: scalable detection of semantically similar android applications. In Computer security - ESORICS 2013 - 18th European symposium on research in computer security, Egham, UK, September 9-13, 2013. Proceedings (pp. 182–199).

  13. Di Noia, T., Mirizzi, R., Ostuni, V.C., Romito, D., Zanker, M. (2012). Linked open data to support content-based recommender systems. In Proceedings of the 8th international conference on semantic systems, I-SEMANTICS ’12 (pp. 1–8). New York: ACM.

  14. Evans, W.S., Fraser, C.W., Ma, F. (2009). Clone detection via structural abstraction. Software Quality Journal, 17(4), 309–330.

  15. Garg, P.K., Kawaguchi, S., Matsushita, M., Inoue, K. (2004). MUDABlue: an automatic categorization system for open source repositories. In 11th Asia-Pacific software engineering conference (APSEC 2004) (pp. 184–193).

  16. Ghose, S., & Lowengart, O. (2001). Taste tests: impacts of consumer perceptions and preferences on brand positioning strategies. Journal of Targeting, Measurement and Analysis for Marketing, 10(1), 26–41.

  17. Gitchell, D., & Tran, N. (1999). Sim: a utility for detecting similarity in computer programs. In The proceedings of the thirtieth SIGCSE technical symposium on computer science education, SIGCSE ’99 (pp. 266–270). New York: ACM.

  18. Jaccard, P. (1912). The distribution of the flora in the alpine zone. New Phytologist, 11(2), 37–50.

  19. Jeh, G., & Widom, J. (2002). Simrank: a measure of structural-context similarity. In Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’02 (pp. 538–543). New York: ACM.

  20. Jiang, J., Lo, D., He, J., Xia, X., Kochhar, P.S., Zhang, L. (2017). Why and how developers fork what from whom in github. Empirical Software Engineering, 22(1), 547–578.

  21. Kendall, M.G. (1938). A new measure of rank correlation. Biometrika, 30(1/2), 81–93.

  22. Khan, S.U.R., Lee, S.P., Ahmad, R.W., Akhunzada, A., Chang, V. (2016). A survey on test suite reduction frameworks and tools. International Journal of Information Management, 36(6), 963–975.

  23. Kobilarov, G., Scott, T., Raimond, Y., Oliver, S., Sizemore, C., Smethurst, M., Bizer, C., Lee, R. (2009). Media meets semantic web – how the bbc uses dbpedia and linked data to make connections. In Aroyo, L., Traverso, P., Ciravegna, F., Cimiano, P., Heath, T., Hyvönen, E., Mizoguchi, R., Oren, E., Sabou, M., Simperl, E. (Eds.) The semantic web: research and applications (pp. 723–737). Berlin: Springer.

  24. Kollias, G., Sathe, M., Schenk, O., Grama, A. (2014). Fast parallel algorithms for graph similarity and matching. Journal of Parallel and Distributed Computing, 74 (5), 2400–2410.

  25. Landauer, T.K. (2006). Latent semantic analysis. Wiley Online Library.

  26. Landauer, T., Foltz, P., Laham, D. (1998). An introduction to latent semantic analysis. Discourse Processes, 25, 259–284.

  27. Leitão, A.M. (2004). Detection of redundant code using r2d2. Software Quality Journal, 12(4), 361–382.

  28. Linares-Vasquez, M., Holtzhauer, A., Poshyvanyk, D. (2016). On automatically detecting similar android apps. 2016 IEEE 24th International Conference on Program Comprehension (ICPC), 00, 1–10.

  29. Liu, C., Chen, C., Han, J., Yu, P.S. (2006). GPLAG: detection of software plagiarism by program dependence graph analysis. In Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’06 (pp. 872–881). New York: ACM.

  30. Lo, D., Jiang, L., Thung, F. (2012). Detecting similar applications with collaborative tagging. In Proceedings of the 2012 IEEE international conference on software maintenance (ICSM), ICSM ’12 (pp. 600–603). Washington, DC: IEEE Computer Society.

  31. Maarek, Y.S., Berry, D.M., Kaiser, G.E. (1991). An information retrieval approach for automatically constructing software libraries. IEEE Transactions on Software Engineering, 17(8), 800–813.

  32. McMillan, C., Grechanik, M., Poshyvanyk, D. (2012). Detecting similar software applications. In Proceedings of the 34th international conference on software engineering, ICSE ’12 (pp. 364–374). Piscataway: IEEE Press.

  33. Miller, G.A. (1995). Wordnet: a lexical database for english. Communications of the ACM, 38(11), 39–41.

  34. Nassar, H., Veldt, N., Mohammadi, S., Grama, A., Gleich, D.F. (2018). Low rank spectral network alignment. In Proceedings of the 2018 World Wide Web conference, WWW ’18 (pp. 619–628). Republic and Canton of Geneva: International World Wide Web Conferences Steering Committee.

  35. Nguyen, P.T., Tomeo, P., Di Noia, T., Di Sciascio, E. (2015). An evaluation of SimRank and personalized PageRank to build a recommender system for the web of data. In Proceedings of the 24th international conference on World Wide Web, WWW ’15 Companion (pp. 1477–1482). New York: ACM.

  36. Nguyen, P.T., Di Rocco, J., Di Ruscio, D. (2018a). Knowledge-aware recommender system for software development. In Proceedings of the 1st Workshop on Knowledge-aware and Conversational Recommender System, KaRS, Vol. 2018. New York: ACM.

  37. Nguyen, P.T., Di Rocco, J., Di Ruscio, D. (2018b). Mining software repositories to support OSS developers: a recommender systems approach. In Proceedings of the 9th Italian information retrieval workshop, Rome, Italy, May, 28-30, 2018.

  38. Nguyen, P.T., Di Rocco, J., Rubei, R., Di Ruscio, D. (2018c). CrossSim: exploiting mutual relationships to detect similar OSS projects. In 2018 44th Euromicro conference on software engineering and advanced applications (SEAA) (pp. 388–395).

  39. Nguyen, P.T., Di Rocco, J., Rubei, R., Di Ruscio, D. (2018d). CrossSim tool and evaluation data. https://github.com/crossminer/CrossSim.

  40. Nguyen, P.T., Di Rocco, J., Di Ruscio, D. (2019a). Enabling heterogeneous recommendations in OSS development: what’s done and what’s next in CROSSMINER. In Proceedings of the evaluation and assessment on software engineering, EASE ’19 (pp. 326–331). New York: ACM.

  41. Nguyen, P.T., Di Rocco, J., Di Ruscio, D., Ochoa, L., Degueule, T., Di Penta, M. (2019b). FOCUS: a recommender system for mining API function calls and usage patterns. In Proceedings of the 41st international conference on software engineering, ICSE ’19 (pp. 1050–1060). Piscataway: IEEE Press.

  42. Pettigrew, S., & Charters, S. (2008). Tasting as a projective technique. Qualitative Market Research: An International Journal, 11(3), 331–343.

  43. Ponzanelli, L., Bavota, G., Di Penta, M., Oliveto, R., Lanza, M. (2014). Mining StackOverflow to turn the IDE into a self-confident programming prompter. In Proceedings of MSR 2014 (pp. 102–111): ACM.

  44. Ragkhitwetsagul, C., Krinke, J., Clark, D. (2018a). A comparison of code similarity analysers. Empirical Software Engineering, 23(4), 2464–2519.

  45. Ragkhitwetsagul, C., Krinke, J., Marnette, B. (2018b). A picture is worth a thousand words: code clone detection based on image similarity. In 2018 IEEE 12th International workshop on software clones (IWSC) (pp. 44–50).

  46. Rattan, D., Bhatia, R., Singh, M. (2013). Software clone detection: a systematic review. Information and Software Technology, 55(7), 1165–1199.

  47. Schafer, J.B., Frankowski, D., Herlocker, J., Sen, S. (2007). The adaptive web. Chapter collaborative filtering recommender systems, (pp. 291–324). Berlin: Springer.

  48. Spearman, C. (1904). The proof and measurement of association between two things. The American Journal of Psychology, 15(1), 72–101.

  49. Spinellis, D., & Szyperski, C. (2004). How is open source affecting software development? IEEE Software, 21(1), 28–33.

  50. Stadler, C., Lehmann, J., Höffner, K., Auer, S. (2012). LinkedGeoData: a core for a web of spatial open data. Semantic Web, 3, 333–354.

  51. Thung, F., Lo, D., Lawall, J. (2013). Automated library recommendation. In 2013 20th Working conference on reverse engineering (WCRE) (pp. 182–191).

  52. Tiarks, R., Koschke, R., Falke, R. (2011). An extended assessment of type-3 clones as detected by state-of-the-art tools. Software Quality Journal, 19(2), 295–331.

  53. Turney, P.D., & Pantel, P. (2010). From frequency to meaning: vector space models of semantics. Journal of Artificial Intelligence Research, 37(1), 141–188.

  54. Tversky, A. (1977). Features of similarity. Psychological Review, 84(4), 327–352.

  55. Ugurel, S., Krovetz, R., Giles, C.L. (2002). What’s the code?: automatic classification of source code archives. In Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’02 (pp. 632–638). New York: ACM.

  56. Walenstein, A., El-Ramly, M., Cordy, J.R., Evans, W.S., Mahdavi, K., Pizka, M., Ramalingam, G., von Gudenberg, J.W. (2006). Similarity in programs. In Duplication, redundancy, and similarity in software, 23.07. - 26.07.2006.

  57. Wang, H., Guo, Y., Ma, Z., Chen, X. (2015a). WuKong: a scalable and accurate two-phase approach to android App clone detection. In Proceedings of the 2015 international symposium on software testing and analysis, ISSTA 2015 (pp. 71–82). New York: ACM.

  58. Wang, M., Wang, C., Yu, J.X., Zhang, J. (2015b). Community detection in social networks: an in-depth benchmarking study with a procedure-oriented framework. Proceedings of the VLDB Endowment, 8(10), 998–1009.

  59. Xia, X., Lo, D., Wang, X., Zhou, B. (2013). Tag recommendation in software information sites. In Proceedings of the 10th Working Conference on Mining Software Repositories, MSR ’13 (pp. 287–296). Piscataway: IEEE Press.

  60. Zhang, Y., Lo, D., Kochhar, P.S., Xia, X., Li, Q., Sun, J. (2017). Detecting similar repositories on GitHub. 2017 IEEE 24th International Conference on Software Analysis, Evolution and Reengineering (SANER), 00, 13–23.



The research described in this paper has been carried out as part of the CROSSMINER Project, EU Horizon 2020 Research and Innovation Programme, grant agreement No. 732223. We thank our project partners for their help with the user evaluation presented in this paper. Furthermore, we thank the anonymous reviewers for their valuable comments and suggestions that helped us improve the paper.

Author information

Correspondence to Davide Di Ruscio.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.




This is the questionnaire sent to the developers who took part in our user evaluation. We adopted most of the content from the CLAN evaluation dataset (2018) and McMillan et al. (2012).



We uploaded the materials created from the user evaluation to GitHub for future reference (see footnote 11).


Cite this article

Nguyen, P.T., Di Rocco, J., Rubei, R. et al. An automated approach to assess the similarity of GitHub repositories. Software Qual J (2020). https://doi.org/10.1007/s11219-019-09483-0



Keywords

  • Mining software repositories
  • Software similarity
  • Software quality
  • SimRank