Abstract
The interaction of image and speech processing is a crucial property of multimedia systems. Classical systems that draw inferences on purely qualitative high-level descriptions lose much information when faced with erroneous, vague, or incomplete data. We propose a new architecture that integrates various levels of processing by using multiple representations of the visually observed scene. These representations are vertically connected by Bayesian networks in order to find the most plausible interpretation of the scene.
The interpretation of a spoken utterance naming an object in the visually observed scene is modeled as another partial representation of the scene. Under this concept, the key problem is identifying the verbally specified object instances in the visually observed scene. To this end, a Bayesian network is generated dynamically from the spoken utterance and the visual scene representation. This network encodes spatial knowledge as well as knowledge extracted from psycholinguistic experiments. First results show the robustness of our approach.
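The core identification step described above can be illustrated with a minimal sketch: given the objects recognized by the vision components and the attributes extracted from the utterance, Bayesian inference ranks the candidates by how well their visual evidence explains what was said. All object names, attributes, and likelihood values below are illustrative assumptions, not the paper's actual model, which additionally encodes spatial relations and psycholinguistic data.

```python
# Hypothetical sketch: identify the verbally named object among visual
# candidates by Bayesian inference over attribute evidence.

def posterior_over_candidates(utterance, candidates, likelihoods):
    """Return P(candidate | utterance) under a naive factorization:
    each uttered attribute is scored independently against the
    visually observed attribute of each candidate object."""
    scores = []
    for obj in candidates:
        p = 1.0  # uniform prior over candidate objects
        for attr, said in utterance.items():
            # likelihood of the speaker saying `said` given what
            # the vision components observed for this object
            p *= likelihoods[attr](said, obj.get(attr))
        scores.append(p)
    total = sum(scores)
    return [s / total for s in scores] if total > 0 else scores

# Illustrative scene: three objects recognized visually.
scene = [
    {"type": "cube", "color": "red"},
    {"type": "cube", "color": "blue"},
    {"type": "bar",  "color": "red"},
]

# Soft likelihoods absorb recognition and naming uncertainty.
likelihoods = {
    "type":  lambda said, seen: 0.9 if said == seen else 0.05,
    "color": lambda said, seen: 0.8 if said == seen else 0.1,
}

# Spoken utterance: "the red cube"
post = posterior_over_candidates({"type": "cube", "color": "red"},
                                 scene, likelihoods)
best = max(range(len(scene)), key=lambda i: post[i])
print(scene[best])  # → {'type': 'cube', 'color': 'red'}
```

Because the match is probabilistic rather than a hard symbolic test, a mismatching attribute (e.g., a misrecognized color) lowers a candidate's posterior without eliminating it, which is what makes such an approach robust to erroneous or incomplete data.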
The work of G. Socher has been supported by the German Research Foundation (DFG).
© 1999 Springer-Verlag Berlin Heidelberg
Cite this paper
Wachsmuth, S., Brandt-Pook, H., Socher, G., Kummert, F., Sagerer, G. (1999). Multilevel Integration of Vision and Speech Understanding Using Bayesian Networks. In: Computer Vision Systems. ICVS 1999. Lecture Notes in Computer Science, vol 1542. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-49256-9_15
DOI: https://doi.org/10.1007/3-540-49256-9_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-65459-9
Online ISBN: 978-3-540-49256-6