Semantic concept based video retrieval using convolutional neural network
Efficient and effective retrieval of videos has become a challenging problem, and handling multi-concept videos is at the center of attention. The aim of the work presented here is to propose an improved semantic concept-based video retrieval method using a novel ranked intersection filtering technique and a foreground driven concept co-occurrence matrix. In the proposed ranked intersection filtering technique, an intersection of ranked concept probability scores is taken over the key-frames of a query shot to identify the concepts to be used in retrieval. A convolutional neural network is used as a baseline. The proposed method is implemented using a classifier built from a fusion of asymmetrically trained deep CNNs to deal with the data imbalance problem, a novel foreground driven concept co-occurrence matrix to exploit concept co-occurrence information, and the ranked intersection filtering approach. Performance is evaluated with mean average precision on the TRECVID multi-label dataset. The results are compared with other state-of-the-art methods in this class and show the superiority of the proposed method.
Keywords: Video concept detection · Semantic concept-based video retrieval · Multi-label classification · Convolutional neural network · Deep learning · Concept co-occurrence matrix · Foreground driven concept co-occurrence matrix
Advances in video capture, storage and transmission technology, together with the affordability of these devices, have contributed to the explosive growth of video collections on the internet, which creates a need for efficient and effective access to and retrieval of video data. Concept-based video retrieval is one way to facilitate video access. Providing concept-level access to video data requires indexing techniques that index videos on semantic concepts; effective indexing and retrieval techniques are therefore necessary for better access to videos. The effectiveness of a video retrieval algorithm depends on the accuracy with which videos are retrieved from the dataset. The most important factors of any video retrieval system are (1) the effectiveness of the concept detector, (2) post-classification refinement of concept probabilities and (3) effective handling of the multiple key-frames of a shot for access to the dataset.
The work in this paper is an extension of our previous work. Here, we address the issue of semantic video retrieval and propose and implement a novel approach for retrieval with improved performance. In that work, we showed that an efficient concept classifier can be implemented using a deep convolutional neural network (CNN) by extracting semantic features from video; the same approach is used here to implement the classifier. The main contribution of this paper is a video retrieval framework based on concept co-occurrence information and the proposed Ranked Intersection Filtering (RIF) approach for efficient video retrieval.
The paper is organized as follows: related and existing work is summarized in Sect. 2. The proposed method, including the detailed framework, the implementation of the CNN classifier and its architecture, the concept of asymmetrically trained deep CNNs, the FDCCM and the ranked intersection filtering approach, is described in Sect. 3. Experimental results and the performance evaluation measure are given in Sect. 4, and Sect. 5 concludes the paper.
2 Related work
In a semantic concept-based video retrieval system, retrieval of videos is the final step, following the initial steps of shot boundary detection, key-frame extraction and concept detection.
Video search and retrieval can be carried out effectively on an indexed database. A good survey of concept-based video retrieval is presented by Snoek and Worring . Feng and Bhanu  discussed the concept and use of concept co-occurrence patterns for image annotation and retrieval. Kuo et al.  presented work on the use of deep convolutional neural networks for image retrieval. Podlesnaya and Podlesnyy  and  present work on deep learning based video indexing and retrieval. Kikuchi et al.  presented work on video semantic indexing using object detection-derived features. Awad et al.  discussed a 6-year retrospective on TRECVid semantic indexing of video. The use of co-occurrence information to classify videos is presented in . Multi-label image classification with a probabilistic label enhancement model is discussed in . Donahue et al.  presented work on the use of deep learning for visual recognition. A discussion of the effectiveness of semantic concepts for improved video retrieval is given in . Many robust concept-based video retrieval methods have been presented in [13, 14, 15, 16]. Effective recognition of objects in videos is important for effective object-based video concept detection and retrieval. Visser et al.  and [18, 19] presented work on concept-based video retrieval using detected objects. Concept-based image retrieval approaches can also be useful for video retrieval methods [20, 21]. Mezaris et al.  exploited automatically extracted video semantics for improved interactive video retrieval. Shirahama et al.  and [24, 25] implemented robust video retrieval systems. The work in  presents the selection of concepts and concept detectors for video search.
3 Proposed method
The proposed framework retrieves key-frames and their underlying shots using the following modules: (1) a multiclass classifier for video concept detection, (2) score refinement using the Foreground Driven Concept Co-occurrence Matrix (FDCCM) to improve the concept detection rate, and (3) the proposed novel RIF approach, which shortlists common concepts from the input key-frames for efficient retrieval of video shots.
The fusion of classifiers and its implementation, the idea of asymmetric training and the resulting performance gain are discussed in detail in the next subsections. The concepts detected for a key-frame serve as an index for the corresponding video shot and are used in the retrieval process. As this index is based on semantic concepts, it is called a semantic index. The entire dataset is maintained under this semantic index.
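As a concrete illustration, the semantic index described above can be realized as an inverted index mapping each detected concept to the set of shots in which it appears. The following is a minimal Python sketch; the function and variable names are illustrative, not taken from the paper.

```python
from collections import defaultdict

def build_semantic_index(shot_concepts):
    """Build an inverted index: concept -> set of shot ids.

    shot_concepts: dict mapping shot_id -> set of detected concepts.
    """
    index = defaultdict(set)
    for shot_id, concepts in shot_concepts.items():
        for concept in concepts:
            index[concept].add(shot_id)
    return index

# Toy example: three shots with their detected concepts.
index = build_semantic_index({
    "shot1": {"Car", "Road"},
    "shot2": {"Person", "Face"},
    "shot3": {"Car", "Person"},
})
# index["Car"] -> {"shot1", "shot3"}
```

Looking up a concept then returns all shots indexed under it, which is the basis of the retrieval step described next.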
In the testing module, the key-frames of a query shot are given as input, and the aim is to find the top-k most semantically relevant shots in the dataset. If the input is given by query-by-example, key-frames are first extracted from the query shot. After the concept detection step is applied to each individual key-frame, the concepts common to these key-frames are identified using the RIF method. Using this set of common concepts, the database indexed on semantic concepts is searched for the most relevant key-frames and their associated shots. Performance is evaluated by the precision measure.
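The RIF step above can be sketched as follows: the concept probability scores of each key-frame are ranked, the top-k concepts of each key-frame are kept, and their intersection forms the query concept set. This is a simplified Python sketch under the assumption that a plain set intersection of the top-ranked concepts is used; all names and the value of `top_k` are illustrative.

```python
def ranked_intersection_filter(keyframe_scores, top_k=5):
    """Identify concepts common to all key-frames of a query shot.

    keyframe_scores: list of dicts mapping concept -> probability,
    one dict per key-frame.
    """
    ranked_sets = []
    for scores in keyframe_scores:
        # Rank concepts by probability and keep the top-k.
        top = sorted(scores, key=scores.get, reverse=True)[:top_k]
        ranked_sets.append(set(top))
    # Intersect the top-ranked concept sets across key-frames.
    return set.intersection(*ranked_sets) if ranked_sets else set()

# Two key-frames of one query shot; with top_k=2 both rank
# Car and Road highest, so those become the query concepts.
common_concepts = ranked_intersection_filter([
    {"Car": 0.9, "Road": 0.8, "Person": 0.3},
    {"Car": 0.7, "Road": 0.6, "Tree": 0.4},
], top_k=2)
# common_concepts -> {"Car", "Road"}
```

The resulting concept set is then used to query the semantic index for matching shots.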
3.1 Building CNN classifier
3.2 Foreground driven concept co-occurrence matrix
The FDCCM, an approach proposed in our previous work , is discussed in this section. In the process of detecting semantic concepts, the visual co-existence of concepts provides semantic information: in a video shot, concepts co-exist with each other. Current methods use concept pairs to derive semantic concepts from a shot. If Road-Bus is a semantic pair, then the presence of the concept Road increases the confidence of Bus through their co-existence, as a bus runs on a road. A concept co-occurrence matrix (CCM) is built to maintain this concept co-occurrence data. The CCM we compute is a combination of two CCMs computed at the local and global levels, respectively. The local-level CCM is derived from a prelabelled dataset and the visual co-existence of its concepts. The global-level CCM is derived from images retrieved using Google Images. The final CCM is computed by averaging the CCMs at the local and global levels, as given by Eq. 1.
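Eq. 1 is not reproduced in this extract; assuming the simple averaging of the two matrices stated above, it plausibly takes the form

```latex
\mathrm{CCM} = \frac{1}{2}\left(\mathrm{CCM}_{\mathrm{local}} + \mathrm{CCM}_{\mathrm{global}}\right)
```

where $\mathrm{CCM}_{\mathrm{local}}$ and $\mathrm{CCM}_{\mathrm{global}}$ are the co-occurrence matrices computed at the local and global levels, respectively.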
In a concept list we distinguish two types of concepts: foreground or actor concepts (such as a moving Car or a Ball) and background or passive concepts (such as a Road, or a Crowd in a football ground). Once the CCM is prepared, the FDCCM is derived from it. The FDCCM is a matrix of foreground and background concepts, prepared in such a way that, given any foreground concept, it returns the list of background concepts that co-exist with it. In the proposed method it is used to refine concept scores and thereby improve the detection rate.
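The score-refinement role of the FDCCM can be illustrated with a small sketch: when a foreground concept is detected with high confidence, the scores of its co-occurring background concepts are boosted. The mapping, the `boost` weight and the confidence `threshold` below are assumptions for illustration, not values from the paper.

```python
# Illustrative FDCCM: foreground concept -> co-occurring background concepts.
FDCCM = {"Bus": ["Road", "Crowd"], "Car": ["Road"]}

def refine_scores(scores, fdccm, boost=0.1, threshold=0.5):
    """Boost background-concept scores that co-occur with a
    confidently detected foreground concept."""
    refined = dict(scores)
    for fg, prob in scores.items():
        if prob >= threshold:                # confident foreground concept
            for bg in fdccm.get(fg, []):     # its co-occurring backgrounds
                if bg in refined:
                    refined[bg] = min(1.0, refined[bg] + boost)
    return refined

out = refine_scores({"Bus": 0.8, "Road": 0.45, "Face": 0.2}, FDCCM)
# "Road" is boosted from 0.45 to 0.55 because "Bus" is detected
# confidently and Road co-occurs with Bus; "Face" is unchanged.
```

In this toy example the boost lifts "Road" above the 0.5 mark, which is exactly the kind of post-classification improvement the FDCCM is intended to provide.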
3.3 Proposed ranked intersection filtering approach
4 Experimental setup
The dataset and the measure used for performance evaluation of the proposed method are discussed in this section.
4.1 Video dataset
[Table: TRECVID dataset and details of its partitions, listing the number of videos and the number of key-frames in the TRECVID development dataset.]
4.2 Evaluation measure
The measure used to evaluate the performance of the proposed video retrieval method is precision (P). It is defined as the ratio of the relevant shots retrieved by the method (hits) to the total number of shots retrieved, both relevant (hits) and non-relevant (false hits).
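The definition above corresponds to the standard precision formula, which can be written directly as a small helper; the function name and arguments are illustrative.

```python
def precision(hits, false_hits):
    """Precision = relevant shots retrieved / total shots retrieved."""
    retrieved = hits + false_hits
    return hits / retrieved if retrieved else 0.0

print(precision(8, 2))  # 8 relevant out of 10 retrieved -> 0.8
```

For example, if 10 shots are retrieved and 8 of them are relevant, precision is 8 / (8 + 2) = 0.8.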
4.3 Experimental results
[Table: Sample test key-frame retrieval performance, e.g. for the query "Car and person", with detected concepts Police_Security, Person, Face, Crowd, Outdoor.]
This work has introduced a novel method for semantic concept-based video indexing and retrieval using a state-of-the-art classifier built from a fusion of asymmetrically trained deep CNNs to deal with imbalanced data, combined with the FDCCM and a novel RIF approach. Its evaluation on the TRECVID dataset using the precision measure shows that the proposed method substantially outperforms other contemporary methods in its class.
Compliance with ethical standards
Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
- 4. Kuo CH, Chou YH, Chang PC (2016) Using deep convolutional neural networks for image retrieval. Soc Imaging Sci Technol. https://doi.org/10.2352/issn.2470-1173.2016.2.vipc-231
- 5. Podlesnaya A, Podlesnyy S (2016) Deep learning based semantic video indexing and retrieval. arXiv:1601.07754 [cs.IR]
- 6. McCormac J, Handa A, Davison A, Leutenegger S (2016) SemanticFusion: dense 3D semantic mapping with convolutional neural networks. arXiv:1609.05130v2 [cs.CV]
- 7. Kikuchi K, Ueki K, Ogawa T, Kobayashi T (2016) Video semantic indexing using object detection-derived features. In: Proceedings of the 24th European signal processing conference (EUSIPCO), Budapest, Hungary, pp 1288–1292
- 9. Modiri S, Amir A, Zamir R, Shah M (2014) Video classification using semantic concept co-occurrences. https://doi.org/10.1109/cvpr.2014.324
- 10. Li X, Zhao F, Guo Y (2014) Multi-label image classification with a probabilistic label enhancement model. In: UAI'14 proceedings of the thirtieth conference on uncertainty in artificial intelligence, pp 430–439
- 11. Donahue J, Jia Y, Vinyals O, Hoffman J, Zhang N, Tzeng E, Darrell T (2014) DeCAF: a deep convolutional activation feature for generic visual recognition. In: Proceedings of the international conference on machine learning, ICML, Beijing, China, pp 647–655
- 15. Dalton J, Allan J, Mirajkar P (2013) Zero-shot video retrieval using content and concepts. In: 22nd international conference on information & knowledge management. ACM, New York, NY, USA, pp 1857–1860
- 17. Visser R, Sebe N, Bakker EM (2002) Object recognition for video retrieval. In: International conference on image and video retrieval, pp 262–270
- 18. Browne P, Smeaton AF (2005) Video retrieval using dialogue, keyframe similarity and video objects. In: IEEE international conference on image processing. IEEE, pp 1208–1211
- 19. Jiang Y, Ngo C, Yang J (2007) Towards optimal bag-of-features for object categorization and semantic video retrieval. In: 6th ACM international conference on image and video retrieval. ACM, Amsterdam, The Netherlands, pp 494–501
- 21. Siddiquie B, Feris RS, Davis LS (2011) Image ranking and retrieval based on multi-attribute queries. In: IEEE conference on computer vision and pattern recognition, pp 801–808
- 22. Mezaris V, Sidiropoulos P, Kompatsiaris I (2011) Improving interactive video retrieval by exploiting automatically-extracted video structural semantics. In: Fifth IEEE international conference on semantic computing. IEEE, pp 224–227
- 23. Shirahama K, Kumabuchi K, Uehara K (2012) Video retrieval by managing uncertainty in concept detection using Dempster–Shafer theory. In: Fourth international conference on advances in multimedia, pp 71–74
- 25. Dong X, Chang SF (2017) Visual event recognition in news video using kernel methods with multi-level temporal alignment. In: IEEE international conference on computer vision and pattern recognition, Minneapolis
- 28. NIST. http://www.nist.gov
- 29. Vedaldi A, Lenc K (2015) MatConvNet: convolutional neural networks for MATLAB. In: International conference on multimedia. ACM, pp 689–692
- 31. Chatterjee M, Leuski A (2015) CRMActive: an active learning based approach for effective video annotation and retrieval. In: ICMR'15. ACM