Multimedia Analysis in Police–Citizen Communication: Supporting Daily Policing Tasks

  • Peter Leškovský
  • Santiago Prieto
  • Aratz Puerto
  • Jorge García
  • Luis Unzueta
  • Nerea Aranjuelo
  • Haritz Arzelus
  • Aitor Álvarez
Part of the Security Informatics and Law Enforcement book series (SILE)


This chapter describes an approach to improved multimedia analysis as part of an ICT-based tool for community policing. It covers technologies for the automatic processing of audio, image and video content that citizens send to the police as evidence. In addition to the technical details of their development, the chapter presents and discusses their performance in initial pilots simulating near-real crime situations.



This work has been supported by the EU project INSPEC2T under the H2020-FCT-2014 programme (GA 653749).



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Peter Leškovský (1)
  • Santiago Prieto (1)
  • Aratz Puerto (1)
  • Jorge García (1)
  • Luis Unzueta (1)
  • Nerea Aranjuelo (1)
  • Haritz Arzelus (1)
  • Aitor Álvarez (1)

  1. Vicomtech, San Sebastian, Spain
