On Shape Recognition and Language

  • Petros Maragos
  • Vassilis Pitsikalis
  • Athanasios Katsamanis
  • George Pavlakos
  • Stavros Theodorakis
Conference paper
Part of the Mathematics and Visualization book series (MATHVISUAL)

Abstract

Shapes convey meaning. Language is efficient in expressing and structuring meaning. The main thesis of this chapter is that integrating shape with linguistic information can improve shape recognition performance. The chapter broadens the concept of shape to visual shapes, which include both geometric and optical information, and explores ways in which additional linguistic information may help with shape recognition. Towards this goal, it briefly describes some shape categories that have the potential for better recognition via language, with emphasis on gestures and the moving shapes of sign language, as well as on cross-modal relations between vision and language in videos. It also draws inspiration from psychological studies that explore connections between gestures and human languages. Afterwards, it focuses on the broad class of multimodal gestures that combine spatio-temporal visual shapes with audio information. In this area, an approach is reviewed that significantly improves multimodal gesture recognition by fusing 3D shape information from the motion and position of the gesturing hands/arms and spatio-temporal handshapes in the color and depth visual channels with audio information in the form of acoustically recognized sequences of gesture words.
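To make the fusion idea concrete, the following is a minimal sketch, not the authors' implementation: it assumes each modality (skeleton motion-position, handshape, audio) assigns a log-score to every candidate gesture-word sequence in a shared N-best list, and combines them by a weighted log-linear rescoring. All function names, variable names, and the toy gesture labels and numbers below are hypothetical illustrations; in practice the stream weights would be tuned on held-out data.

```python
# Minimal sketch of multi-hypothesis rescoring for multimodal gesture
# recognition. Hypothetical setup: every modality has already scored each
# candidate gesture-word sequence in a shared N-best list.

def rescore_nbest(nbest, modality_scores, weights):
    """Rescore N-best hypotheses with a weighted log-linear fusion.

    nbest           -- list of candidate gesture-word sequences
    modality_scores -- dict: modality name -> {hypothesis: log-score}
    weights         -- dict: modality name -> stream weight (tuned on dev data)
    """
    fused = {}
    for hyp in nbest:
        # Weighted sum of per-modality log-scores for this hypothesis.
        fused[hyp] = sum(
            weights[m] * modality_scores[m][hyp] for m in modality_scores
        )
    # Return the hypotheses sorted best-first by fused score.
    return sorted(nbest, key=lambda h: fused[h], reverse=True)

# Toy usage: three modalities scoring two candidate gesture sequences
# (gesture labels are illustrative only).
nbest = [("OK", "BASTA"), ("OK", "CHEDUEPALLE")]
scores = {
    "skeleton":  {nbest[0]: -12.1, nbest[1]: -14.0},
    "handshape": {nbest[0]: -9.5,  nbest[1]: -8.7},
    "audio":     {nbest[0]: -3.2,  nbest[1]: -6.8},
}
weights = {"skeleton": 1.0, "handshape": 0.8, "audio": 1.2}
print(rescore_nbest(nbest, scores, weights)[0])  # best fused hypothesis
```

The design choice here is late fusion at the hypothesis level: each modality keeps its own recognizer, and disagreements are resolved only when rescoring the shared candidate list, which avoids forcing the heterogeneous visual and audio feature streams into a single joint model.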

Keywords

Sign language · Gesture recognition · Linguistic information · Shape recognition · Visual shape

Acknowledgements

We wish to thank Niki Efthymiou and Nancy Zlatintsi at the NTUA CVSP Lab for Fig. 15.8 and for discussions related to Sect. 15.3.2. This research work was supported by the project COGNIMUSE, which is implemented under the ARISTEIA Action of the Operational Program Education and Lifelong Learning and is co-funded by the European Social Fund and Greek National Resources. It was also partially supported by the European Union under the project MOBOT, grant FP7-ICT-2011-9 2.1 – 600796.

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Petros Maragos (1)
  • Vassilis Pitsikalis (1)
  • Athanasios Katsamanis (1)
  • George Pavlakos (2)
  • Stavros Theodorakis (1)

  1. School of Electrical and Computer Engineering, National Technical University of Athens, Athens, Greece
  2. Computer and Information Science, University of Pennsylvania, Philadelphia, USA