Multimedia Tools and Applications, Volume 51, Issue 1, pp 247–277

Personalization in multimedia retrieval: A survey

  • Yijuan Lu
  • Nicu Sebe
  • Ross Hytnen
  • Qi Tian


With the explosive growth of multimedia (text documents, images, video, etc.) in our lives, the ability to annotate, search, index, browse, and relate various forms of information efficiently is becoming increasingly important. Relating these challenges to user preference and customization complicates the matter further. The goal of this survey is to give an overview of the current state of the research branches involved in annotating, relating, and presenting multimedia to a user according to preference. This paper presents current models and techniques being researched to model ontology, preference, context, and presentation, and brings them together in a chain of ideas that leads from raw, uninformed data to a usable user interface that adapts to user preference and customization.


Personalization, Information Access, Multimedia



We would like to thank Dick Bulterman, Stavros Christodoulakis, Chabane Djeraba, Daniel Gatica-Perez, Thomas Huang, Alex Jaimes, Ramesh Jain, Mike Lew, Andy Rauber, Pasquale Savino, Arnold Smeulders, and the whole FACS consortium for excellent suggestions and discussions. The work of Nicu Sebe has been supported by the FP7 IP GLOCAL European project and by the FIRB S-PATTERN project. The work of Yijuan Lu was supported in part by the Research Enhancement Program (REP) and start-up funding from Texas State University.



Copyright information

© Springer Science+Business Media, LLC 2010

Authors and Affiliations

  1. Department of Computer Science, Texas State University, San Marcos, USA
  2. Department of Information Engineering and Computer Science, University of Trento, Trento, Italy
  3. Computer Science Department, University of Texas at San Antonio, San Antonio, USA
