Abstract
This paper presents general-purpose video analysis and annotation tools that combine high-level and low-level information and that learn through user interaction and feedback. The use of these tools is illustrated through the construction of two video browsers, which allow a user to fast forward (or rewind) to frames, shots, or scenes containing a particular character, characters, or other labeled content. The two browsers developed in this work are: (1) a basic video browser, which exploits relations between high-level scripting information and closed captions, and (2) an advanced video browser, which augments the basic browser with annotations gained from applying machine learning. The learner helps the system adapt to different people's labelings by accepting positive and negative examples of labeled content from a user and relating these to low-level color and texture features extracted from the digitized video. This learning happens interactively, and is used to infer labels on data the user has not yet seen. The labeled data may then be browsed or retrieved from the database in real time. An evaluation of the learning performance shows that a combination of low-level color signal features outperforms several other combinations of signal features in learning character labels in an episode of the TV situation comedy Seinfeld. We discuss several issues that arise in the combination of low-level and high-level information, and illustrate solutions to these issues within the context of browsing television sitcoms.
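The abstract describes a learner that accepts positive and negative examples of labeled content and relates them to low-level color features to infer labels on unseen frames. The paper's actual learner is more sophisticated than anything shown here; as a minimal sketch of the general idea, one might represent each frame by a quantized color histogram and assign each unlabeled frame the label of its nearest user-supplied example. Both the histogram feature and the nearest-neighbor rule are simplifying assumptions for illustration, not the paper's method.

```python
import numpy as np

def color_histogram(frame, bins=8):
    """Quantize an RGB frame (H x W x 3 array, values 0-255) into a
    joint color histogram with bins**3 cells, normalized to sum to 1."""
    q = (np.asarray(frame) // (256 // bins)).reshape(-1, 3)
    idx = q[:, 0] * bins * bins + q[:, 1] * bins + q[:, 2]
    hist = np.bincount(idx, minlength=bins ** 3).astype(float)
    return hist / hist.sum()

def label_frames(unlabeled, positives, negatives):
    """Nearest-neighbor labeling: each unlabeled histogram receives the
    label (True = positive, False = negative) of whichever user-supplied
    example histogram is closest under L1 distance."""
    examples = [(h, True) for h in positives] + [(h, False) for h in negatives]
    labels = []
    for h in unlabeled:
        dists = [np.abs(h - e).sum() for e, _ in examples]
        labels.append(examples[int(np.argmin(dists))][1])
    return labels
```

In an interactive loop of the kind the abstract describes, each round of user feedback would extend the example lists, and the inferred labels over the remaining frames would then drive the fast-forward/rewind browsing.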
Wachman, J.S., Picard, R.W. Tools for Browsing a TV Situation Comedy Based on Content Specific Attributes. Multimedia Tools and Applications 13, 255–284 (2001). https://doi.org/10.1023/A:1009681230513