Abstract
A similarity metric between videos is integral to several key tasks, including video retrieval, classification, and recommendation. Because there is no standard criterion for measuring similarity between videos other than manual judgment, it is difficult to collect a large training dataset for distance metric learning algorithms. Moreover, existing distance metric learning (DML) methods for multimedia data suffer from two critical limitations: (1) they typically attempt to learn a distance function in the single-label setting, in which each item carries only one label; (2) they are often designed to learn distance metrics on low-level features, which ignores the semantic similarity of the multimedia data. To address these problems, we propose a novel framework, Intermediate Semantics based Distance Learning (ISDL), for video clips, which aims to optimally integrate the semantics of multiple modalities for distance metric learning. In particular, the proposed framework: (1) generates the training pairs automatically; (2) defines multi-modal concepts for similarity measurement among videos; (3) learns the distance metric for video clips based on the intermediate semantics. We conduct an extensive set of experiments to evaluate the performance of the proposed algorithms, and the results validate the effectiveness of our approach.
Copyright information
© 2016 Springer International Publishing AG
About this paper
Cite this paper
Qu, W., Zhou, X., Wang, D., Feng, S., Zhang, Y., Yu, G. (2016). Intermediate Semantics Based Distance Metric Learning for Video Annotation and Similarity Measurements. In: Cellary, W., Mokbel, M., Wang, J., Wang, H., Zhou, R., Zhang, Y. (eds) Web Information Systems Engineering – WISE 2016. WISE 2016. Lecture Notes in Computer Science, vol 10041. Springer, Cham. https://doi.org/10.1007/978-3-319-48740-3_16
DOI: https://doi.org/10.1007/978-3-319-48740-3_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-48739-7
Online ISBN: 978-3-319-48740-3
eBook Packages: Computer Science, Computer Science (R0)