Computing a Similarity Coefficient for Mining Massive Data Sets

  • M. Coşulschi
  • M. Gabroveanu
  • A. Sbîrcea
Part of the Studies in Computational Intelligence book series (SCI, volume 627)


Large amounts of data are generated today in every domain by processes such as e-commerce transactions, banking and credit card transactions, and web navigation sessions (recorded in web server logs). Developing and running algorithms that can process such volumes has become more affordable thanks to cloud computing and the MapReduce programming model, which in turn enabled open-source frameworks such as Apache Hadoop. In this paper we compute Jaccard similarity coefficients for two very large graphs and use the resulting values to analyse the connections and influence that certain nodes have over other nodes. We also illustrate how the Apache Hadoop framework and the MapReduce programming model can be used to carry out a large volume of computations.
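As a rough illustration of the kind of computation summarised above, the following minimal Python sketch evaluates the Jaccard coefficient |A ∩ B| / |A ∪ B| for pairs of graph nodes in a map/shuffle/reduce style. It is not the chapter's implementation: it runs in a single process rather than on Hadoop, it assumes that the similarity of two nodes is measured on their sets of out-neighbours, and the names `map_phase`, `shuffle` and `pairwise_jaccard` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    """Jaccard coefficient of two sets; defined here as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def map_phase(edges):
    """Map step: emit (target, source) so sources sharing a target are grouped together."""
    for src, dst in edges:
        yield dst, src


def shuffle(pairs):
    """Shuffle step: group emitted values by key, as a MapReduce framework would."""
    groups = defaultdict(set)
    for key, value in pairs:
        groups[key].add(value)
    return groups


def pairwise_jaccard(edges):
    """Reduce step: count shared out-neighbours per node pair, then normalise.

    Assumption: similarity is taken over out-neighbour sets. Only pairs with at
    least one shared neighbour are returned; all other pairs have coefficient 0.
    """
    edges = list(edges)  # the edge list is traversed twice
    neighbours = defaultdict(set)
    for src, dst in edges:
        neighbours[src].add(dst)

    shared = defaultdict(int)
    for sources in shuffle(map_phase(edges)).values():
        for u, v in combinations(sorted(sources), 2):
            shared[(u, v)] += 1

    # |A ∪ B| = |A| + |B| - |A ∩ B|
    return {
        (u, v): count / (len(neighbours[u]) + len(neighbours[v]) - count)
        for (u, v), count in shared.items()
    }


if __name__ == "__main__":
    sample_edges = [("a", "x"), ("a", "y"), ("b", "y"),
                    ("b", "z"), ("c", "x"), ("c", "y")]
    for pair, score in sorted(pairwise_jaccard(sample_edges).items()):
        print(pair, round(score, 3))
```

Grouping by shared neighbour avoids comparing every pair of nodes directly, which is the usual motivation for expressing this computation as map and reduce steps on very large graphs.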


Big data, Virtualization, Hadoop, MapReduce, Jaccard similarity



Copyright information

© Springer-Verlag Berlin Heidelberg 2016

Authors and Affiliations

  1. Department of Computer Science, University of Craiova, Craiova, Romania
