Abstract
The integration of vision and natural language processing is attracting increasing attention in different areas of AI research. So far, however, there have been only a few attempts to connect vision systems with natural language access systems. Within SFB 314, a special collaborative program on AI and knowledge-based systems, the automatic natural language description of real-world image sequences has been a major research goal, pursued over the last ten years. The aim of our approach is an incremental evaluation and simultaneous description of the perceived time-varying scenes. In this contribution we report new results of our joint efforts to combine the natural language access system Vitra with a vision system. We have investigated the problem of describing the movements of articulated bodies in image sequences within an integrated natural language and computer vision system. The paper focuses on our model-based approach to the recognition of pedestrians and on the further evaluation and language production in Vitra.
Copyright information
© 1995 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Herzog, G., Rohr, K. (1995). Integrating vision and language: Towards automatic description of human movements. In: Wachsmuth, I., Rollinger, CR., Brauer, W. (eds) KI-95: Advances in Artificial Intelligence. KI 1995. Lecture Notes in Computer Science, vol 981. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-60343-3_42
DOI: https://doi.org/10.1007/3-540-60343-3_42
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-60343-6
Online ISBN: 978-3-540-44944-7
eBook Packages: Springer Book Archive