Multimedia Tools and Applications, Volume 75, Issue 24, pp 17035–17057

Cognition inspired format for the expression of computer vision metadata

  • H. Castro
  • J. Monteiro
  • A. Pereira
  • D. Silva
  • G. Coelho
  • P. Carvalho


Over the last decade, noticeable progress has occurred in the automated computer interpretation of visual information. Computers running artificial intelligence algorithms are increasingly capable of extracting perceptual and semantic information from images and registering it as metadata. There is also a growing body of manually produced image annotation data. All of this data is of great importance for scientific purposes as well as for commercial applications. Optimizing the usefulness of this information, whether manually or automatically produced, requires its precise and adequate expression at its different logical levels, making it easily accessible, manipulable and shareable, and calls for the development of associated manipulation tools. However, the expression and manipulation of computer vision results has received less attention than the actual extraction of such results, and has consequently advanced less. Existing metadata tools are poorly structured in logical terms, as they intermix the declaration of visual detections with that of the observed entities, events and the comprising context. This poor structuring renders such tools rigid, limited and cumbersome to use. Moreover, they are unprepared for more advanced situations, such as the coherent expression of information extracted from, or annotated onto, multi-view video resources. The work presented here specifies an advanced XML-based syntax for the expression and processing of computer vision metadata. The proposal takes inspiration from the natural cognition process for the adequate expression of this information, with a particular focus on scenarios involving varying numbers of sensory devices, notably multi-view video.
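The structural claim at the heart of the abstract, that sensor-level detections should be declared separately from the scene-level entities and events they evidence, can be illustrated with a minimal sketch. The element and attribute names below (SceneDescription, Entity, Event, View, Detection, entityRef, and so on) are hypothetical assumptions for illustration only, not the syntax actually defined in the paper; the fragment merely shows how a multi-view detection layer could reference, rather than embed, the declaration of an observed entity.

    <!-- Hypothetical sketch only: all element and attribute names are assumed
         for illustration and are not the syntax defined in the paper. -->
    <SceneDescription>
      <!-- Scene level: entities and events are declared once,
           independently of any particular sensor. -->
      <Entities>
        <Entity id="person-01" type="Person"/>
      </Entities>
      <Events>
        <Event id="event-01" type="Enter" subject="person-01" time="00:00:12.400"/>
      </Events>
      <!-- Sensor level: per-view detections reference, rather than embed,
           the scene-level declarations above. -->
      <Views>
        <View id="cam-A">
          <Detection entityRef="person-01" frame="310" confidence="0.87">
            <BoundingBox x="122" y="48" width="64" height="170"/>
          </Detection>
        </View>
        <View id="cam-B">
          <Detection entityRef="person-01" frame="309" confidence="0.91">
            <BoundingBox x="401" y="55" width="58" height="166"/>
          </Detection>
        </View>
      </Views>
    </SceneDescription>

Under this assumed layering, the entity is declared once and referenced from both camera views, so the description stays coherent as views are added or removed, which is the multi-view flexibility the abstract attributes to the cognition-inspired structuring.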


Keywords: Metadata · Multi-view video · Multimedia annotation · Computer vision · Cognition



The work was largely developed in the context of: project Media Arts and Technologies (MAT), NORTE-07-0124-FEDER-000061, financed by the North Portugal Regional Operational Programme (ON.2 – O Novo Norte), under the National Strategic Reference Framework (NSRF), through the European Regional Development Fund (ERDF), and by national funds, through the Portuguese funding agency, Fundação para a Ciência e a Tecnologia (FCT); project QREN 23277 RETAIL PRO, a co-promotion R&D project funded by the European Regional Development Fund (ERDF) through ON.2 as part of the National Strategic Reference Framework (NSRF), and managed by Agência de Inovação (ADI); and project QREN 33910 ARENA, an R&D project funded by the European Regional Development Fund (ERDF) through ON.2 as part of the National Strategic Reference Framework (NSRF), and managed by IAPMEI - Agência para a Competitividade e Inovação, I.P.



Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  • H. Castro (1)
  • J. Monteiro (1)
  • A. Pereira (1)
  • D. Silva (1)
  • G. Coelho (1)
  • P. Carvalho (1, 2)

  1. INESC TEC, Campus da FEUP, Porto, Portugal
  2. Instituto Superior de Engenharia do Porto, Porto, Portugal
