SimAndro: an effective method to compute similarity of Android applications

  • Masoud Reyhani Hamedani
  • Gyoosik Kim
  • Seong-je Cho


As the number of Android applications (apps) increases dramatically, users find it increasingly difficult to locate apps relevant to their needs. There is therefore a strong demand for app search engines and recommendation services, for which developing an accurate similarity method is a challenging issue. In contrast to malware detection, far fewer efforts have been devoted to computing the similarity of apps. Furthermore, all existing methods use features obtained only from app stores, such as descriptions and ratings, which can be inaccurate, vary across stores, and be affected by language barriers; they entirely neglect useful information that clearly captures an app’s functionalities and behaviors and that can be mined from the apps themselves, such as API calls and manifest information. In this paper, we propose an effective method called SimAndro to compute the similarity of apps. It extracts features based only on information obtained from the apps themselves and the Android platform, without using information from third-party sources such as app stores. SimAndro performs both feature extraction and similarity computation, using API calls, manifest information, package names, and strings as features. To compute the similarity score of an app pair, a separate similarity score is computed for each feature, and a weighted linear combination of these four scores is taken as the final similarity score, with the weights set by an automatic weighting scheme based on TreeRankSVM. The results of extensive experiments with three real-world datasets and a dataset constructed by human experts demonstrate the effectiveness of SimAndro.
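The scoring step described in the abstract can be illustrated with a minimal sketch. It makes several assumptions that are not taken from the paper: each feature is represented as a set of tokens, the per-feature score is a Jaccard similarity (the abstract does not name the per-feature measure), and the weights are hypothetical fixed values standing in for those that a TreeRankSVM-based weighting scheme would learn. The names `jaccard`, `combined_similarity`, and `FEATURES` are illustrative only.

```python
from typing import Dict, Set

# The four feature types named in the abstract.
FEATURES = ("api_calls", "manifest", "package_name", "strings")


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two sets (assumed per-feature measure).

    Defined here as 0.0 when both sets are empty.
    """
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def combined_similarity(
    app_a: Dict[str, Set[str]],
    app_b: Dict[str, Set[str]],
    weights: Dict[str, float],
) -> float:
    """Weighted linear combination of the four per-feature scores."""
    return sum(weights[f] * jaccard(app_a[f], app_b[f]) for f in FEATURES)


# Hypothetical weights; in the paper they are set automatically,
# not chosen by hand as they are here.
weights = {"api_calls": 0.4, "manifest": 0.3, "package_name": 0.1, "strings": 0.2}

# Hypothetical feature sets for two navigation-style apps.
app_a = {
    "api_calls": {"getLastKnownLocation", "sendTextMessage", "openConnection"},
    "manifest": {"INTERNET", "ACCESS_FINE_LOCATION"},
    "package_name": {"com", "example", "maps"},
    "strings": {"route", "map"},
}
app_b = {
    "api_calls": {"getLastKnownLocation", "openConnection"},
    "manifest": {"INTERNET", "ACCESS_FINE_LOCATION"},
    "package_name": {"com", "example", "navi"},
    "strings": {"route", "traffic", "map", "eta"},
}

score = combined_similarity(app_a, app_b, weights)
print(f"similarity(app_a, app_b) = {score:.3f}")
```

Because the weights in this sketch sum to 1 and each per-feature score lies in [0, 1], the combined score also lies in [0, 1], and an app compared with itself scores 1.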


Similarity, Android apps, Feature extraction, Automatic weighting



This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (no. 2015R1D1A1A02061946), and by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT (no. 2018R1A2B2004830).

Compliance with ethical standards

Conflict of interest

The authors declare that they have no conflict of interest.



Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2019

Authors and Affiliations

  1. Department of Computer and Software, Center for Creative Convergence Education, Hanyang University, Seoul, Korea
  2. Department of Computer Science and Engineering, Dankook University, Yongin, Korea
