Probabilistic Models and Informative Subspaces for Audiovisual Correspondence
We propose a probabilistic model of single-source multi-modal generation and show how algorithms that maximize mutual information can find correspondences between components of each signal. We show how non-parametric techniques for finding informative subspaces can capture the complex statistical relationships between signals in different modalities. We extend a previous technique for finding informative subspaces with new priors on the projection weights, yielding more robust results. Applied to human speakers, our model can find the relationship between audio speech and video of facial motion, and can partially segment out background events in both channels. We present new results on the problem of audio-visual verification and show how the audio and video of a speaker can be matched even when no prior model of the speaker's voice or appearance is available.
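The core idea above, finding projection weights for each modality that maximize a non-parametric estimate of the mutual information between the projected signals, can be sketched as follows. This is a minimal illustration, not the paper's method: it uses a simple histogram-based MI estimate and random search over unit-norm weights, and it omits the priors on the projection weights that the paper introduces; the function names and parameters are ours.

```python
import numpy as np

def mutual_information(x, y, bins=16):
    # Non-parametric MI estimate for two 1-D signals via a joint histogram.
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal over x bins
    py = pxy.sum(axis=0, keepdims=True)   # marginal over y bins
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

def informative_subspaces(A, V, iters=2000, seed=0):
    # Search for unit-norm projections w_a, w_v of the audio features A
    # (T x d_a) and video features V (T x d_v) that maximize the MI
    # between the two projected time series. Random search stands in for
    # the gradient-based optimization a real implementation would use.
    rng = np.random.default_rng(seed)
    best_score, best_wa, best_wv = -np.inf, None, None
    for _ in range(iters):
        wa = rng.standard_normal(A.shape[1]); wa /= np.linalg.norm(wa)
        wv = rng.standard_normal(V.shape[1]); wv /= np.linalg.norm(wv)
        score = mutual_information(A @ wa, V @ wv)
        if score > best_score:
            best_score, best_wa, best_wv = score, wa, wv
    return best_score, best_wa, best_wv
```

On synthetic data where both modalities share one latent component, the recovered projections concentrate on that shared component, which is the behavior the correspondence and verification experiments rely on.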
Keywords: Mutual Information, Video Sequence, Video Frame, Audio Signal, Information Theoretic Approach