Multimodal Integration for Meeting Group Action Segmentation and Recognition

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3869)


We address the problem of segmenting and recognising sequences of multimodal human interactions in meetings. These interactions can be seen as a rough structure of a meeting, and can be used either as input for a meeting browser or as a first step towards a higher-level semantic analysis of the meeting. A common lexicon of multimodal group meeting actions, a shared meeting data set, and a common evaluation procedure enable us to compare the different approaches. We compare three multimodal feature sets and four modelling infrastructures: a higher semantic feature approach, multi-layer HMMs, a multi-stream DBN, and a multi-stream mixed-state DBN for disturbed data.
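The segmentation task described above can be illustrated with a minimal sketch: a Viterbi decoder over a toy lexicon of group actions, where "sticky" self-transitions encourage contiguous segments. The action names, observation symbols, and all probabilities below are hypothetical stand-ins for the paper's actual lexicon and audio-visual features; the real systems use richer continuous features and the HMM/DBN infrastructures named in the abstract.

```python
import math

# Hypothetical lexicon of group meeting actions (illustration only).
ACTIONS = ["monologue", "discussion", "presentation"]

# Toy emission model: P(observed audio-visual symbol | action).
EMIT = {
    "monologue":    {"single-speaker": 0.8, "multi-speaker": 0.1, "slide-activity": 0.1},
    "discussion":   {"single-speaker": 0.2, "multi-speaker": 0.7, "slide-activity": 0.1},
    "presentation": {"single-speaker": 0.3, "multi-speaker": 0.1, "slide-activity": 0.6},
}

def trans(prev, cur):
    # Sticky transitions favour staying in the same action,
    # so the decoded path forms contiguous action segments.
    return 0.9 if prev == cur else 0.05

def viterbi(observations):
    """Return the most likely action sequence for a symbol stream."""
    # Initialise with uniform action priors (in log space).
    prob = {a: math.log(1.0 / len(ACTIONS)) + math.log(EMIT[a][observations[0]])
            for a in ACTIONS}
    back = []
    for obs in observations[1:]:
        new_prob, pointers = {}, {}
        for cur in ACTIONS:
            best_prev = max(ACTIONS, key=lambda p: prob[p] + math.log(trans(p, cur)))
            pointers[cur] = best_prev
            new_prob[cur] = (prob[best_prev] + math.log(trans(best_prev, cur))
                             + math.log(EMIT[cur][obs]))
        prob = new_prob
        back.append(pointers)
    # Backtrack from the best final state to recover the segmentation.
    state = max(prob, key=prob.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return list(reversed(path))

stream = ["single-speaker"] * 3 + ["multi-speaker"] * 3 + ["slide-activity"] * 3
print(viterbi(stream))
```

On this toy stream the decoder recovers three contiguous segments (monologue, then discussion, then presentation); the segment boundaries fall where the emission evidence outweighs the self-transition bonus.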


Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  1. Institute for Human-Machine-Communication, Technische Universität München, Munich, Germany
  2. Centre for Speech Technology Research, University of Edinburgh, Edinburgh, UK
  3. IDIAP Research Institute and Ecole Polytechnique Federale de Lausanne (EPFL), Martigny, Switzerland
