Combining linguistic and pictorial information: Using captions to interpret newspaper photographs

  • Conference paper
Current Trends in SNePS — Semantic Network Processing System (SNePS 1989)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 437)

Abstract

There are many situations where linguistic and pictorial data are jointly presented to communicate information. A computer model for synthesising information from the two sources requires an initial interpretation of both the text and the picture, followed by consolidation of the information. The difficulty of performing general-purpose vision (without a priori knowledge) would make this a nearly impossible task. However, in some situations, the text describes salient aspects of the picture. In such situations, it is possible to extract visual information from the text, resulting in a relational graph describing the structure of the accompanying picture. This graph can then be used by a computer vision system to guide the interpretation of the picture. This paper discusses an application in which information obtained from parsing the caption of a newspaper photograph is used to identify human faces in the photograph. Heuristics are described for extracting information from the caption that contributes to the hypothesised structure of the picture. The top-down processing of the image using this information is also discussed.
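
To make the idea of a caption-derived relational graph guiding face identification concrete, the following is a minimal illustrative sketch and not the authors' implementation. It assumes the caption has already been parsed into a list of names plus "left of" relations, and that a face locator has already produced candidate regions with x-coordinates. The function `label_faces`, the sample caption, and all names and coordinates are hypothetical. The sketch simply enumerates assignments of names to candidates and keeps those consistent with every spatial constraint, which is a brute-force form of consistent labeling.

```python
from itertools import permutations


def label_faces(names, left_of, faces):
    """Return every assignment of names to face candidates that satisfies
    all 'left of' constraints.

    names   : people named in the caption (hypothetical parser output)
    left_of : (a, b) pairs meaning "a appears to the left of b"
    faces   : dict mapping a face-candidate id to its x-coordinate
    """
    solutions = []
    # Try every ordered choice of len(names) distinct candidates.
    for chosen in permutations(faces, len(names)):
        assignment = dict(zip(names, chosen))
        # Keep the assignment only if every spatial relation holds.
        if all(faces[assignment[a]] < faces[assignment[b]] for a, b in left_of):
            solutions.append(assignment)
    return solutions


# Hypothetical caption: "From left: Alice Smith, Bob Jones and Carol Lee."
names = ["Alice Smith", "Bob Jones", "Carol Lee"]
left_of = [("Alice Smith", "Bob Jones"), ("Bob Jones", "Carol Lee")]

# Hypothetical face-locator output: candidate id -> x-coordinate in pixels.
faces = {"f1": 310, "f2": 42, "f3": 175}

result = label_faces(names, left_of, faces)
print(result)
```

With exactly as many candidates as names, the ordering constraints leave a single consistent labeling. If the locator returns extra candidates (for example, a false detection), several labelings survive, which suggests why additional caption-derived information would be needed to narrow the hypotheses.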




Editor information

D. Kumar

Copyright information

© 1990 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Srihari, R.K., Rapaport, W.J. (1990). Combining linguistic and pictorial information: Using captions to interpret newspaper photographs. In: Kumar, D. (eds) Current Trends in SNePS — Semantic Network Processing System. SNePS 1989. Lecture Notes in Computer Science, vol 437. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0022085

  • DOI: https://doi.org/10.1007/BFb0022085

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-52626-1

  • Online ISBN: 978-3-540-47081-6

  • eBook Packages: Springer Book Archive
