Abstract
Natural language descriptions of videos provide a potentially rich and vast source of supervision. However, the highly varied nature of language presents a major barrier to its effective use. What is needed are models that can reason about uncertainty in both videos and text. In this paper, we tackle the core task of person naming: assigning names of people in the cast to human tracks in TV videos. Screenplay scripts accompanying the video provide some crude supervision about who is in the video. However, even the basic problem of knowing who is mentioned in the script is often difficult, since language often refers to people using pronouns (e.g., "he") and nominals (e.g., "man") rather than actual names (e.g., "Susan"). Resolving the identity of these mentions is the task of coreference resolution, an active area of research in natural language processing. We develop a joint model for person naming and coreference resolution, and in the process infer a latent alignment between tracks and mentions. We evaluate our model on both vision and NLP tasks on a new dataset of 19 TV episodes. On both tasks, we significantly outperform the independent baselines.
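The core inference step described above, recovering a latent alignment between human tracks and script mentions, can be illustrated with a minimal toy sketch. This is not the paper's actual model (which jointly learns naming and coreference); it assumes coreference has already mapped each mention to a cast name, and all affinity scores, names, and mentions below are made up for illustration.

```python
from itertools import permutations

# Hypothetical affinities between 3 human tracks and 3 script mentions
# (higher = more likely the mention refers to the track).
affinity = [
    [0.9, 0.1, 0.2],  # track 0 vs mentions ("Susan", "he", "man")
    [0.2, 0.8, 0.3],  # track 1
    [0.1, 0.3, 0.7],  # track 2
]
# Assumed output of coreference resolution: each mention -> a cast name.
mention_names = ["Susan", "Henry", "Henry"]

def best_alignment(aff):
    """Brute-force the track-to-mention alignment with maximum total affinity."""
    n = len(aff)
    return max(permutations(range(n)),
               key=lambda perm: sum(aff[t][m] for t, m in enumerate(perm)))

alignment = best_alignment(affinity)
track_names = [mention_names[m] for m in alignment]
print(track_names)  # → ['Susan', 'Henry', 'Henry']
```

Brute force is only viable for a handful of tracks; a real system would use a polynomial-time assignment solver or, as in the joint model, marginalize over alignments during learning.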
© 2014 Springer International Publishing Switzerland
Ramanathan, V., Joulin, A., Liang, P., Fei-Fei, L. (2014). Linking People in Videos with “Their” Names Using Coreference Resolution. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds) Computer Vision – ECCV 2014. ECCV 2014. Lecture Notes in Computer Science, vol 8689. Springer, Cham. https://doi.org/10.1007/978-3-319-10590-1_7
Print ISBN: 978-3-319-10589-5
Online ISBN: 978-3-319-10590-1