Multimedia Tools and Applications, Volume 69, Issue 2, pp 513–537

Distributed media indexing based on MPI and MapReduce

  • Hisham Mohamed
  • Stéphane Marchand-Maillet


Web-scale digital assets comprise millions or billions of documents. At this scale, sequential algorithms cannot cope with the data, and parallel and distributed computing becomes the solution of choice. MapReduce is a programming model proposed by Google for scalable data processing and is mainly suited to data-intensive algorithms. In contrast, the Message Passing Interface (MPI) is suited to high-performance algorithms. This paper proposes an adapted structure of the MapReduce programming model using MPI for multimedia indexing. Experiments are conducted on various multimedia applications to validate our model. They indicate that the proposed model achieves good speedup compared to the original sequential versions, Hadoop, and earlier versions of MapReduce using MPI.
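As a rough illustration of the MapReduce pattern applied to inverted indexing, the sketch below runs the map, shuffle, and reduce phases in a single Python process. It is not the model proposed in the paper: the function names (`map_phase`, `shuffle`, `reduce_phase`, `build_inverted_index`), the byte-sum partitioning rule, and the toy documents are all assumptions made for illustration, and the shuffle step only imitates the key-based exchange that an MPI implementation would perform between processes.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map: emit one (term, doc_id) pair per distinct term in a document."""
    return [(term, doc_id) for term in sorted(set(text.lower().split()))]


def shuffle(pairs, num_workers):
    """Shuffle: route every pair to a worker chosen from its key.

    In a distributed setting this would be an all-to-all exchange;
    here it is simulated with one dictionary per hypothetical worker.
    """
    buckets = [defaultdict(list) for _ in range(num_workers)]
    for term, doc_id in pairs:
        # Deterministic partitioning rule (an assumption of this sketch).
        worker = sum(term.encode("utf-8")) % num_workers
        buckets[worker][term].append(doc_id)
    return buckets


def reduce_phase(bucket):
    """Reduce: turn the grouped doc ids of each term into a sorted posting list."""
    return {term: sorted(doc_ids) for term, doc_ids in bucket.items()}


def build_inverted_index(docs, num_workers=2):
    """Run map, shuffle, and reduce in sequence and merge the partial indexes."""
    pairs = []
    for doc_id, text in docs.items():
        pairs.extend(map_phase(doc_id, text))
    index = {}
    for bucket in shuffle(pairs, num_workers):
        index.update(reduce_phase(bucket))
    return index


# Toy collection, invented for this example.
docs = {
    1: "mpi message passing",
    2: "mapreduce data processing",
    3: "mpi mapreduce indexing",
}
index = build_inverted_index(docs)
```

Because every occurrence of a term is routed to the same worker, each worker can build complete posting lists for its share of the vocabulary without further coordination, which is the property that makes the pattern easy to distribute.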


Keywords

  • Distributed multimedia indexing
  • MPI
  • MapReduce
  • Distributed inverted indexing
  • Permutation-based indexes
  • Distributed approximate similarity search



This work is jointly supported by the Swiss National Science Foundation (SNSF) via the Swiss National Center of Competence in Research (NCCR) on Interactive Multimodal Information Management (IM2) and the European COST Action on Multilingual and Multifaceted Interactive Information Access (MUMIA) via the Swiss State Secretariat for Education and Research (SER).



Copyright information

© Springer Science+Business Media New York 2012

Authors and Affiliations

  1. Viper Group, Computer Vision and Multimedia Laboratory, University of Geneva, Geneva, Switzerland
