1 Introduction

During a social event, the audience typically uses personal devices to record video clips related to the most interesting moments of the event. As a result, several videos will be related to the same visual content, and this redundancy can be exploited to infer the most interesting moments of the event over time, according to people's interest in the observed scenes. The issue of crowd-popularity estimation through automatic video processing is not trivial due to the variability of the visual content observed by multiple devices: different points of view, pose and scale of the objects, lighting conditions and occlusions. The differences between device models should also be taken into account, since they imply different characteristics of the lens, color filter arrays, resolution and so on. For instance, even using two devices with similar (or identical) sensors, the recorded colors will not necessarily be the same, because the device responses are processed with different non-linear transformations due to differences in the Imaging Generation Pipelines (IGPs). These can vary from device to device and even on a per-image basis [1, 2].

We propose a system called RECfusion to estimate the popularity of the scenes related to multiple video streams. The streams are analyzed with the aim of creating a continuous video flow, obtained by mixing the several input channels and taking into account the most popular scenes over time, so as to reflect the interests of the crowd. Then, the clusters of the different scenes are tracked over time. This allows the system to provide not only the most popular scene at each time instant, but also the other scenes of interest, and makes it possible to introduce a scene story log that allows the user to select the scene of interest among all the detected ones.

The remainder of the paper is structured as follows: in Sect. 2 we discuss related studies about crowd-saliency inference from multi-device videos. In Sect. 3 an overview of the RECfusion framework is given, together with a description of its three main modules: intraflow analysis, interflow analysis and cluster tracking. In Sect. 4 the employed dataset is introduced, whereas in Sect. 5 we report the experimental settings and the results. We conclude the paper with a final discussion and hints for possible future work in Sect. 6.

2 Related Works

Several works about crowd-saliency inference from multi-device videos have been proposed in the literature. The works in [3, 4] exploit Structure from Motion (SfM) to estimate a 3D reconstruction of the scene and the pose of the employed devices. Hoshen et al. [5] use egocentric video streams, acquired by different participants with a single camera model, to create a single popular video of an event. However, in [3,4,5] the number of the different popular scenes and the number of the devices are known a priori. Saini et al. [6] developed the MoViMash framework with the purpose of replicating the behavior of a movie director: the system learns from a labeled set of video frames "how" and "when" to perform transitions between different views. However, this technique is hardly adaptable to a real-time context, since a proper learning phase should be tuned for each different recorded scene. ViComp [7] is another framework similar to MoViMash. In ViComp the final output video consists of a combination of several video streams from multiple sources. The combination is obtained by selecting high quality video segments according to their audio-visual ranking scores. It selects the best video stream among a pool of available ones based on the degradation and noise caused by video compression [8] and on the estimated camera pan and tilt [9].

The aforementioned approaches achieve significant results but, compared to them, our approach (RECfusion) does not need any prior knowledge or training stage and is able to combine videos from an unknown number and unknown types of recording devices. RECfusion is a framework with a popularity-based video selection approach: it clusters the video streams and selects the best video stream from each cluster by exploiting clustering metrics.

3 RECfusion System Overview

RECfusion is a framework designed for automatic video curation driven by the popularity of the scenes acquired by multiple devices. Given a set of video streams as input, the framework can group these video streams according to the visual similarity and popularity of the scenes over time; then it automatically suggests a video stream to be used as output, acting like a "virtual director". To mitigate the aforementioned differences in the color representation of the devices, due to their different IGPs, the video frames are pre-processed by an equalization algorithm. This step helps the further computations that compare frames captured by different devices [1, 10,11,12]. After this normalization, the system extracts an image representation from each frame. The algorithm takes a frame as input and returns a descriptor. The aim is to have a descriptor that maximizes the differences between semantically different frames and minimizes the differences between semantically similar ones. In [1] a representation that is independent of lighting conditions (and, to a large extent, of the device) is defined. The method is based on the observation that changes of lighting conditions or device directly change the RGB values of the frame, while the rank order of the sensor responses remains the same. Finally, an equalization of the RGB channels, as described in [1], is performed. After this normalization of the color domain, the video streams are analyzed in three phases (Fig. 1), detailed in the following subsections.
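As a rough illustration of this pre-processing step, the following sketch equalizes each color channel independently with OpenCV. It is a minimal approximation of the normalization described in [1], not the authors' exact implementation:

```python
import cv2

def equalize_channels(frame):
    """Equalize each color channel independently to reduce
    device-dependent color shifts (minimal approximation of the
    normalization described in [1], not the exact implementation)."""
    # frame: H x W x 3 uint8 image (BGR order if loaded with OpenCV)
    channels = cv2.split(frame)
    return cv2.merge([cv2.equalizeHist(c) for c in channels])
```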

Fig. 1.

RECfusion results applied on the Foosball dataset. The chronograms show the results of the three main steps of RECfusion (intraflow analysis, interflow analysis and cluster tracking). The Foosball dataset is composed of 4 video streams with a duration of \({\sim }2300\) frames (\({\sim }90\) s). Each video stream is represented as a row in the chronograms. Vertical red lines mark the end of the time-slots. (a) Intraflow analysis: red, blue and green frames are respectively the first, second and third scene of each video stream. Noisy frames are depicted in black. (b) Interflow analysis: yellow and green clusters are respectively the first and second cluster of each time-slot. (c) Cluster tracking: red, blue and green clusters are respectively the first, second and third cluster of the whole video set. Noisy clusters are depicted in black. (Color figure online)

3.1 Intraflow Analysis

The intraflow analysis segments the sequence of frames of a single video stream (Fig. 1(a)). During the intraflow analysis the frames of each video are processed by comparing their visual content. For each frame of the video flow, we extract keypoints using the SIFT detection algorithm [13]. The set of the extracted SIFT features represents a template for the acquired scene. In this way, the comparison between frames can be performed as a comparison between SIFT templates. When the comparison between the current frame and the reference template reveals a significant variation of the features (i.e., a low matching score), the algorithm refreshes the reference template and splits the video, producing a new segment. To make the matching more reliable, we reject the matches whose keypoints are too far apart in terms of spatial coordinates, assuming smooth transitions between consecutive frames [14]. For greater stability, a new template can be defined only if it has a duration greater than 2 s, otherwise it is considered noise. In other words, a template is considered stable if the number of matching SIFT features does not change too much over time. A backward check is required in order to understand whether a newly defined template regards a new scene or is related to a previously observed one. The algorithm compares the newly defined template with the past ones, starting from the last found template. Each reference template is labeled with a SceneCounter, and all video frames achieving a robust match are labeled with the same SceneCounter. Note that all the frames needed to decide whether a template should be considered a new or an updated one are labeled as a transition interval.
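As an illustration of this matching step, the sketch below compares a stored SIFT template against the current frame with OpenCV. The ratio test and the pixel threshold are our own illustrative choices (the paper does not specify them); a template refresh would be triggered when the returned score drops below a tuned threshold:

```python
import cv2
import numpy as np

sift = cv2.SIFT_create()
matcher = cv2.BFMatcher(cv2.NORM_L2)

def template_match_score(template, frame_gray, max_shift=40.0):
    """Fraction of template keypoints re-found in the current frame.
    Matches whose keypoints are too far apart in image coordinates are
    rejected, assuming smooth transitions between consecutive frames.
    `max_shift` (pixels) is an illustrative value, not from the paper."""
    kp_t, des_t = template                     # reference SIFT template
    kp_f, des_f = sift.detectAndCompute(frame_gray, None)
    if des_t is None or des_f is None or len(des_f) < 2:
        return 0.0
    good = 0
    for pair in matcher.knnMatch(des_t, des_f, k=2):
        if len(pair) < 2:
            continue
        m, n = pair
        if m.distance < 0.75 * n.distance:     # Lowe's ratio test
            shift = np.linalg.norm(np.array(kp_t[m.queryIdx].pt) -
                                   np.array(kp_f[m.trainIdx].pt))
            if shift < max_shift:              # spatial-coherence check
                good += 1
    return good / max(len(kp_t), 1)
```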

3.2 Interflow Analysis

The interflow analysis is computed for each time-slot. It segments the video frames labeled by the intraflow analysis and assigns a ClusterCounter with respect to all the video streams in that specific time-slot (Fig. 1(b)). We want to group together the devices that are looking at the same scene over time. The descriptor used in the interflow analysis is based on weighted color histograms [15]. In this context device invariance should be granted as well as possible. For this reason we first apply a histogram equalization, as suggested in [1]. The equalization is followed by a quantization of the color space (8 colors for each channel). The weights are obtained by using a gradient map, as suggested in [15]. The gradient map is useful to highlight the structures of the objects that appear in the scene, making the descriptor more robust.
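A minimal sketch of such a descriptor is given below, assuming a Sobel gradient magnitude as the weighting map; the exact weighting scheme in [15] may differ:

```python
import cv2
import numpy as np

def weighted_color_histogram(frame, bins=8):
    """Color histogram with 8 quantization levels per channel, where
    each pixel votes with a weight given by the local gradient
    magnitude, so object structures count more than flat regions
    (sketch inspired by [15], not the exact formulation)."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    weights = cv2.magnitude(gx, gy).ravel()            # gradient map
    q = (frame.reshape(-1, 3) // (256 // bins)).astype(np.int64)
    idx = (q[:, 0] * bins + q[:, 1]) * bins + q[:, 2]  # joint color bin
    hist = np.bincount(idx, weights=weights, minlength=bins ** 3)
    return hist / (hist.sum() + 1e-9)                  # normalized descriptor
```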

The different scenes obtained with the intraflow analysis can be considered as the nodes of a complete graph in which the arcs are weighted with the interflow distances between the scenes acquired by the devices. The clustering procedure selects a frame among the unclustered frames and assigns it to the most similar cluster. We use an average linkage approach to compare a frame with a cluster: the distance between a frame and a cluster is given by the average distance between the frame and all the elements within the cluster [14].
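A sketch of the average-linkage assignment, with a generic histogram distance passed in as a parameter (the specific distance is an illustrative choice; see [14] for the one actually used):

```python
import numpy as np

def assign_to_most_similar_cluster(descriptor, clusters, dist):
    """Average linkage: the distance between a frame and a cluster is
    the mean of the distances between the frame descriptor and every
    descriptor already in the cluster [14]. `dist` can be any
    histogram distance, e.g. L1 (illustrative choice)."""
    avg = [np.mean([dist(descriptor, member) for member in cluster])
           for cluster in clusters]
    return int(np.argmin(avg))  # index of the most similar cluster
```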

3.3 Cluster Tracking

To understand the role of the Cluster Tracking module we have to step back to the intraflow analysis. The intraflow analysis segments the sequence of frames of a single video stream and assigns a SceneCounter to each segmented scene. However, frames taken from two different video streams but labeled with the same SceneCounter can represent different scenes, since SceneCounters are discriminative only within a single video stream. The interflow analysis segments the video frames in a time-slot and assigns a ClusterCounter to the scenes of the video streams. The interflow analysis exploits the SceneCounters and the set of SIFT feature templates from the intraflow analysis. Similarly to the SceneCounters, the ClusterCounters are discriminative only within a single time-slot. Therefore, we developed a cluster tracking procedure in order to track the clusters representing the same scene across all video streams and time-slots (Fig. 1(c)). In [16] a Graphical User Interface implementing the cluster tracking, together with typical video player commands (Start, Pause, Stop, etc.), is described (Fig. 2).
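To make the scoping of these identifiers concrete, a hypothetical bookkeeping structure could look as follows (the field names are ours; the paper does not describe its internal data structures):

```python
from dataclasses import dataclass

@dataclass
class LabeledScene:
    """Hypothetical bookkeeping for one scene segment."""
    stream_id: int          # which video stream the segment belongs to
    scene_counter: int      # discriminative only within its video stream
    cluster_counter: int    # discriminative only within its time-slot
    logged_cluster_id: int  # globally discriminative, set by cluster tracking
```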

Fig. 2.

RECfusion Graphical User Interface showing the Cluster Tracking framework. On the left, the active clusters with the respective number of recording devices and the automatically suggested video stream (called RECfusion: most popular) are shown. The user can browse the Virtual Director panel to dynamically change the active video stream. On the right side, the active video stream with classic video player commands is shown.

We propose a cluster tracking procedure based on a voting routine that combines the results of the intraflow and interflow analyses. Once the interflow procedure has assigned a ClusterCounter to several SceneCounters, this set of scenes will characterize the same cluster also in further time-slots, so the cluster tracking procedure assigns a unique \(LoggedCluster_{ID}\) to this set of scenes. Differently from the ClusterCounters, the \(LoggedCluster_{IDs}\) are intended to be always discriminative. The cluster tracking procedure tracks the clusters in each time-slot by assigning them \(TrackedCluster_{IDs}\) equal to the most similar \(LoggedCluster_{ID}\). In order to define the most similar \(LoggedCluster_{ID}\), the cluster tracking procedure requires an initialization phase (at the first time-slot). In this phase, the assigned \(LoggedCluster_{IDs}\) are equal to the ClusterCounters. Then, from the second time-slot on, the clusters are associated with an existing \(LoggedCluster_{ID}\) or with a new one, depending on a voting routine. The same routine is also used to track the \(LoggedCluster_{IDs}\) with proper \(TrackedCluster_{IDs}\).

The voting routine can be divided into two phases: vote casting and voting decision. In the former phase, for each time-slot, each scene casts one of three possible votes: the \(TrackedCluster_{ID}\) at the previous time-slot if the scene is Noise, its \(LoggedCluster_{ID}\) if it is already logged, or an unlogged-scene vote (\(V_N\)) if it is unlogged. Once all the votes in a time-slot are cast, we look for a non-ambiguous voting decision (i.e., a majority). A majority of unlogged scenes is not admitted, so in this case we simply remove these votes from the voting decision. Depending on the reached decision, new \(LoggedCluster_{IDs}\) might be instantiated, while the \(TrackedCluster_{IDs}\) at the current time-slot are updated if needed. We will compare the newly proposed method with a cluster tracking method based on a threshold \(T_{CT}\) [16]. This threshold was used as a hyperparameter to decide whether or not to create a new \(LoggedCluster_{ID}\). The issue with the threshold employed in [16] is that its value should be fine-tuned for each video set in order to achieve the best results in the cluster tracking procedure.
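A minimal sketch of the decision phase, under the reading given above (the tie-handling details are not specified in the paper, and the function name is ours):

```python
from collections import Counter

V_N = "unlogged"  # vote cast by scenes that are not yet logged

def voting_decision(votes):
    """Look for a non-ambiguous majority among the votes cast in a
    time-slot. A majority of unlogged-scene votes is not admitted, so
    V_N votes are discarded first. Returns the winning ID, or None
    when the decision is ambiguous (simplified sketch, not the
    authors' exact logic)."""
    valid = [v for v in votes if v != V_N]
    if not valid:
        return None  # all scenes unlogged: a new LoggedCluster_ID is needed
    ranked = Counter(valid).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie between two IDs: ambiguous decision
    return ranked[0][0]
```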

4 Datasets

To perform the experiments we used the RECfusion dataset [14], which is publicly available at the following URL: http://iplab.dmi.unict.it/recfusionICIAP17. This dataset is made up of three video sets:

  1. Foosball: indoor context, some people appear in the scene. The number of contributing devices for this video set is 4, with an average number of frames per video stream of 2250 (44 time-slots). There are three main subjects in this video set: a foosball table, a couch and a bookcase.

  2. Meeting: indoor context, two people appear in the scene. The number of contributing devices for this video set is 5, with an average number of frames per video stream of 2895 (60 time-slots). There are two main subjects in this video set (the two recorded people).

  3. S. Agata: outdoor context, lots of people appear in the scene. The number of contributing devices for this video set is 7, with an average number of frames per video stream of 1258 (34 time-slots). There are two main subjects in this video set: the reliquary of S. Agata and the facade of a church.

In the experiments we also exploit a video set from the dataset used in Ballan et al. [17]. This video set, called Magician, is related to an indoor context where one person appears in the foreground. The number of contributing devices is 6, with a fixed number of 3800 frames per video stream (77 time-slots). There are two main points of view in this video set: one above and one in front of the magician. We chose the Magician video set because it is slightly different from the videos currently in the RECfusion dataset. In Magician all the video streams are focused on a single target and are acquired as a "casual multi-view video collection" [17]. This means that the backgrounds in the video streams are very different from each other and that severe camera motion can often appear. Casually filmed events represent a challenging scenario for a detector like SIFT (exploited in our intraflow analysis, see Sect. 3.1), so we added the Magician video set to our tests in order to stress and evaluate the scene analysis and cluster tracking performances. We have also compared the obtained results on the benchmark dataset proposed in Hoshen et al. [5]. This dataset has been acquired with wearable devices and, like the Magician video set, it is challenging since every video is strongly affected by motion.

5 Experimental Settings and Results

We select the last time instant of every time-slot as the representative of that interval. Validation is performed by exploiting the ground truth related to these representative frames. To evaluate the performance of the proposed method, we compute the two quality measures described in [14]. Specifically, for each clustering step we consider:

  • \(P_r\): ground truth popularity value (number of cameras looking at the most popular scene) obtained from manual labelling;

  • \(P_a\): popularity score computed by the system (number of the elements in the popular cluster);

  • \(P_g\): number of the correct videos in the popular cluster (i.e., the number of inliers in the popular cluster).

From the above defined scores, the weighted means of the ratios \(P_{a}/P_{r}\) and \(P_g/P_r\) over all the clustering steps are computed. The ratio \(P_{a}/P_{r}\) provides a score for the popularity estimation, whereas the ratio \(P_{g}/P_{r}\) verifies the visual content of the videos in the popular cluster and provides a measure of the quality of the popular cluster. Note that \(P_{a}/P_{r}\) is a score: when it is lower than 1 the system is under-estimating the popularity of the cluster, while, conversely, when it is higher than 1 the system is over-estimating it.
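As a sketch, the two scores could be computed as follows; weighting each clustering step by its ground-truth popularity \(P_r\) is our assumption, since the exact weighting is defined in [14]:

```python
import numpy as np

def popularity_ratios(P_r, P_a, P_g):
    """Weighted means of P_a/P_r and P_g/P_r over all clustering steps.
    Each step is weighted by its ground-truth popularity P_r (assumed
    weighting; see [14] for the exact measure)."""
    P_r, P_a, P_g = (np.asarray(x, dtype=float) for x in (P_r, P_a, P_g))
    w = P_r / P_r.sum()
    return float(np.sum(w * P_a / P_r)), float(np.sum(w * P_g / P_r))
```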

Table 1. Validation results of popularity estimation.

The results of the comparison between the tested video sets are shown in Table 1. The first five rows are related to the RECfusion dataset, whereas the last three rows are related to the dataset proposed in [5]. Despite the constant head motion of the wearable recording devices in the videos from [5], the framework reaches good results, which seem promising for the field of wearable devices. Conversely, we found a drop in performance when there is a severe difference of scale between the videos in a video set. Indeed, we exploited the Meeting video set to evaluate the drop in performance when there are large differences between the resolutions of the devices. We compared three cases, with 2, 4 and all the 5 devices of the Meeting video set, respectively. Other analysis outputs can be found at the following URL: http://iplab.dmi.unict.it/recfusionICIAP17.

Fig. 3.

A comparison of TPR (True Positive Rate, or Recall), TNR (True Negative Rate, or Specificity) and ACC (Accuracy) between the RECfusion_dataset_2015 and Magician video set cluster tracking validations, using the threshold-based procedure from [16]. As can be seen, Magician requires a fine-tuned threshold to increase the TPR, TNR and ACC values.

Table 2. Validation results of the threshold-based and the vote-based cluster tracking procedures.

In the newly proposed procedure we removed the threshold \(T_{CT}\), used as a hyperparameter to decide whether or not to create a new logged-cluster. In [16] the value of \(T_{CT}\) was empirically set to 0.15, which gave the best overall trade-off between True Positive Rate, True Negative Rate and Accuracy of the cluster tracking procedure on the RECfusion dataset. In Fig. 3 a comparison between the average values of TPR (True Positive Rate, or Recall), TNR (True Negative Rate, or Specificity) and ACC (Accuracy) on the RECfusion dataset and the Magician video set with several values of \(T_{CT}\) is shown. As can be seen, \(T_{CT}=0.15\) is not the best value to be used by the cluster tracking procedure: \(T_{CT}=0.5\) should be used instead. For this reason we proposed the new threshold-independent cluster tracking procedure described in Sect. 3.3. We computed the TPR, TNR and ACC values for each video set described in Sect. 4 and compared them with the results obtained in [16]. The comparative validation results are shown in Table 2.
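For reference, denoting by TP, TN, FP and FN the true/false positives and negatives of the cluster tracking decisions, the reported metrics follow the standard definitions:

\[ TPR = \frac{TP}{TP + FN}, \qquad TNR = \frac{TN}{TN + FP}, \qquad ACC = \frac{TP + TN}{TP + TN + FP + FN}. \]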

These results show that the proposed vote-based cluster tracking procedure reaches TPR values much higher than the threshold-based procedure, while the TNR and ACC results are comparable between the two procedures. Only in the Meeting video set is the proposed vote-based procedure slightly outperformed: this is a limitation of the procedure. Indeed, the cluster tracking procedure relies on the intraflow analysis, so if the latter defines N scenes, then the former is able to distinguish at most N scenes. Hence, differently from the threshold-based procedure used in [16], which can generate a bunch of small sparse clusters if \(T_{CT}\) is not fine-tuned, in this case only a limited number of clusters is tracked. In the Meeting video set two people are recorded and there are only two distinct clusters, one focused on each of them. Sometimes the interflow analysis generates a cluster containing both people. This cluster is treated by the vote-based cluster tracking procedure as Noise, since the intraflow analysis has never labeled a scene in which the two people are recorded together.

A final remark is about the Magician video set. We added it to our dataset in order to evaluate the scene analysis and cluster tracking performances in a video collection with a single scene, where all the users are focused on the same target and the videos are affected by severe camera motion. The cluster tracking results with the threshold-based procedure from [16] are really poor: indeed, we obtained the worst average performance on this video set (Table 2). On the other hand, the proposed vote-based procedure reached good TPR values, further assessing the soundness of the new cluster tracking approach. The output videos showing the results of the vote-based cluster tracking procedure can be found at http://iplab.dmi.unict.it/recfusionICIAP17.

6 Conclusion and Future Works

In this paper we described RECfusion, a framework designed for automatic video curation driven by the popularity of the scenes acquired by multiple devices. Given a set of video streams as input, the framework can group these video streams by means of similarity and popularity; then it automatically suggests a video stream to be used as output, acting like a "virtual director". We compared the RECfusion intraflow and interflow analysis validations with Hoshen et al. [5]. We added a video set from Ballan et al. [17] to our RECfusion dataset, showing that RECfusion is capable of recognizing and tracking the scenes of a video collection even when there is a single scene, all the users are focused on the same target and the videos are affected by severe camera motion. We proposed a novel and alternative vote-based cluster tracking procedure and compared it with the threshold-based one described in [16]. From this comparison we found that the vote-based procedure reaches very good results in a totally automatic way, independently of any hyperparameter fine-tuning phase, but with the trade-off of being unable to create and track an unlimited number of clusters. As future work and possible applications, we are planning to augment the framework with features specifically focused on Assistive Technology or Security issues (e.g., highlighting/tracking anomalous behaviors in daily life, logging the visited places, searching for something or someone that appears in the scene).