
1 Introduction

With the advent of the big data era [5, 12], the volume of information and data is exploding. According to statistics from IDC, China's storage capacity will reach about 18 EB in the near future. A joint report by IDC and EMC points out that global storage will reach 40,000 EB by around 2020.

Such enormous Internet data brings not only abundant information to Internet users, but also new problems and challenges to cyber-security. Under the cover of Internet big data, many lawbreakers disseminate violence and religious extremism through the Internet. Such videos or audios are usually embedded in seemingly ordinary data, which makes it difficult to distinguish normal content from illegal content. In recent years, many videos and audios involving violence and extreme religious beliefs have been uploaded to the Internet. These illegal data contribute greatly to the propaganda of violent events and extremist religious thought. How to find these hidden illegal videos or audios in massive data and remove them, so as to maintain the healthy development of cyberspace, has become a core problem that must be solved urgently.

There are two types of sensitive data to be detected in cyberspace: one requires visual object detection, the other audio content detection.

Object detection [2, 26] has been a hot topic in the fields of computer vision and image processing. Much work on specific target detection has been done worldwide, e.g., pedestrian detection [6, 8, 18, 20], vehicle detection [4, 19], face detection [1, 3, 25], etc. Early works focused on detection with hand-crafted visual features. Such features are closely tied to low-level visual cues, which makes it difficult to capture semantic features. For example, Dalal and Triggs [6] proposed histogram of oriented gradients (HOG) features, which were applied to pedestrian detection. Ahonen et al. [1] presented LBP features, which were used to detect human faces. Due to their lack of image semantics, these methods generalize poorly. Recently, deep neural networks have been widely applied to object detection. Not only can they learn feature descriptors automatically from object images, but they can also provide a full description from low-level visual features to high-level semantics. Hence, deep learning has become popular in object detection and achieved a series of successes: e.g., Tian et al. [4] transferred scene segmentation datasets to pedestrian detection by combining deep learning with transfer learning and obtained good results; Chen et al. [4] parallelized a deep convolutional network and applied it to vehicle detection in satellite images; in [1], a deep convolutional network was proposed to detect human faces with a 2.9% recall improvement on the FDDB dataset.

Audio retrieval has been a main direction of multimedia retrieval since the 1990s [9, 14]. Based on the features used, existing techniques can be divided into three major categories: physical waveform retrieval [14, 15], spectrum retrieval [10], and melody feature retrieval [9, 11, 17, 23, 24]. Physical waveform retrieval is based on time-domain signals. In [15], a prototype audio retrieval system was designed by splitting audio data into frames, extracting 13 physical-waveform-related measurements as a feature vector, and using the Mahalanobis distance as the similarity metric. Spectrum retrieval is based on frequency-domain signals. Foote [7] extracted MFCC features from audio data and built histogram features from them, which were applied to audio retrieval. In [10], a feature descriptor based on the global block spectrum was proposed, which can represent the whole spectrum information but lacks noise robustness. Melody feature retrieval is based on voice frequency. In 1995, Ghias et al. [9] first suggested using hummed melody clips as queries for music retrieval, laying the foundation of query by humming. McNab et al. [17] extended Ghias' idea of pitch contours and proposed tracking continuous pitch to split notes, with the help of related core technologies such as approximate melody matching and pitch tracking. Roger Jang and Gao [11], Wu et al. [24], and Wang et al. [23] successively contributed to voice-frequency-based melody feature retrieval.

In summary, with the prosperity of Internet big data, cyberspace security faces increasingly serious challenges. The rest of this paper is organized as follows. The two building blocks, sensitive visual object detection and sensitive audio information detection, are presented in Sects. 2 and 3, together with the proposed methods and experiments. Conclusions are drawn in Sect. 4.

2 Iterative Semi-supervised Deep Learning Based Sensitive Visual Object Detection

Usually, sensitive visual information contains particular illegal objects, e.g., designated icons. Hence, to some extent, sensitive visual detection can be transformed into specific object detection.

One big obstacle in specific object detection is obtaining labeled data, which is labor-intensive. Moreover, human-labeled data contains noise, which affects performance. In practice, we usually obtain data with only a few labeled samples and many unlabeled ones. To address the lack of labeled data, we propose an iterative semi-supervised deep learning based sensitive visual object detection algorithm. This algorithm makes full use of the supervised information, focusing on and reinforcing the specific objects more and more as iterations proceed.

2.1 Iterative Semi-supervised Deep Neural Network Model

Given a set of N labeled samples

$$\begin{aligned} D=\{(x_1,y_1),...,(x_i,y_i),...,(x_N,y_N)\}, \end{aligned}$$
(1)

where \(x_i\) is the ith data point and \(y_i\) its corresponding label, the learning process adjusts the set D at each iteration, after which the new set is used to update the neural network model.

First, M image blocks are extracted with a sliding window from each training sample in D. A total of \(N \times M\) blocks are obtained, denoted as R:

$$\begin{aligned} R=\{r_{11},\ldots ,r_{ij},\ldots ,r_{NM}\} \end{aligned}$$
(2)

Here, \(r_{ij}\) denotes the jth block extracted from the ith training sample in D.
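For concreteness, the block extraction step can be sketched as follows. This is a minimal illustration only; the window size and stride are assumptions, since the paper does not specify them.

import numpy as np

def extract_blocks(image, win=64, stride=32):
    """Slide a win x win window over the image and collect all blocks."""
    blocks = []
    h, w = image.shape[:2]
    for top in range(0, h - win + 1, stride):
        for left in range(0, w - win + 1, stride):
            blocks.append(image[top:top + win, left:left + win])
    return blocks

# R gathers the j-th block of the i-th training image as r_ij (Eq. 2):
# R = [extract_blocks(x_i) for (x_i, y_i) in D]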

Then, the blocks in R are classified by the neural network model learned from D, which yields a new set P whose elements are triplets:

$$\begin{aligned} P=\{(r_{11},t_{11},s_{11}),...,(r_{ij},t_{ij},s_{ij}),...,(r_{NM},t_{NM},s_{NM})\} \end{aligned}$$
(3)
Fig. 1. An example of the proposed iterative model. First, a sensitive training set is collected. Then, this training set is used to train a neural network model and region proposals are extracted. Third, the extracted region proposals are classified with the trained model. Finally, the training set is rebuilt.

Algorithm 1.

Here, \(r_{ij}\) is an element of R, namely the jth block from the ith training sample, and \(t_{ij}\), \(s_{ij}\) are its predicted class and score produced by the neural network learned from D. \(s_{ij}\) is the confidence that \(r_{ij}\) belongs to class \(t_{ij}\). Based on this, we construct a new set \(D'\), which is used to update the neural network model:

$$\begin{aligned} D'=\{(r_{ij},t_{ij})|(r_{ij},t_{ij},s_{ij}) \in P , t_{ij} = y_{i}, s_{ij} > \tau \} \end{aligned}$$
(4)

That is, the new set consists of blocks whose predicted class agrees with the label of their original training sample and whose confidence exceeds a particular threshold \(\tau \).

A single iteration of our iterative model is illustrated in Fig. 1, and the full procedure is given in Algorithm 1.
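The following Python sketch gives one plausible implementation of this loop following Eqs. (2)-(4). Here train_model, classify, and extract_blocks are hypothetical stand-ins for the CNN fine-tuning, inference, and block extraction steps, and the values of tau and the iteration count are assumptions, not figures from the paper.

def iterative_semi_supervised(D, train_model, classify, extract_blocks,
                              tau=0.9, n_iter=12):
    """D is a list of (image, label) pairs; tau is the threshold of Eq. (4)."""
    model = train_model(D)                       # initial model learned from D
    for _ in range(n_iter):
        D_new = []
        for x_i, y_i in D:
            for r_ij in extract_blocks(x_i):             # Eq. (2)
                t_ij, s_ij = classify(model, r_ij)       # Eq. (3)
                if t_ij == y_i and s_ij > tau:           # Eq. (4)
                    D_new.append((r_ij, t_ij))
        D = D_new                                # rebuild the training set
        model = train_model(D)                   # update the network
    return model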

2.2 Experiment and Analysis

To verify the effectiveness of the proposed algorithm, we compare it with R-CNN on the FlickrLogos-32 dataset. This dataset contains 32 different logo classes and is split into three groups: a training set, a validation set, and a test set. The training set consists of 320 images, 10 per class. The validation set and test set each contain 960 images, 30 per class. We use ILSVRC2012 to pre-train the CNN model \(M^0\), and the Selective Search algorithm [21] to generate region proposals. For fairness, we remove the last softmax layer and add a linear SVM. Note that while the proposed method resembles R-CNN, R-CNN is a supervised algorithm that needs the position labels of logos, whereas the proposed method does not.
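The head replacement can be sketched as follows. This is a hedged illustration: extract_features stands in for the penultimate layer of the pre-trained CNN, whose architecture the paper does not detail.

import numpy as np
from sklearn.svm import LinearSVC

def fit_svm_head(extract_features, images, labels):
    """Train a linear SVM on CNN features in place of the softmax layer."""
    X = np.stack([extract_features(img) for img in images])
    svm = LinearSVC()
    svm.fit(X, labels)
    return svm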

All experiments were implemented in Python and conducted on a Dell workstation with two Intel E5 processors, 64 GB of memory, a 4 GB NVIDIA Quadro GPU, and an 8 TB hard disk.

Figure 2 shows how the proposed algorithm updates the dataset. As we can see, the logo object becomes the focus as iterations proceed, with an increasingly strong confidence coefficient.

Fig. 2. An example of the proposed iterative algorithm. As iterations proceed (from (a) to (d) and from (e) to (h)), the object becomes clearer.

Table 1. Experimental results on different logo classes, comparing the proposed method with R-CNN

We conduct experiments on the FlickrLogos-32 dataset and compare the results with the state-of-the-art R-CNN algorithm, using mAP as the evaluation criterion. The results are shown in Table 1.

The first row shows the accuracy of R-CNN and the second shows ours. In the third and fourth rows, position regression is added to R-CNN and to the proposed algorithm, denoted R-CNN-BB and OUR-BB respectively. Note that the CNN in R-CNN uses 200 thousand training images for fine-tuning, while our model uses only 320 images for the first fine-tuning and accumulates up to 4 thousand images by the 12th iteration. Moreover, as the table shows, our proposed method improves over R-CNN, with a 0.14% improvement of OUR-BB over R-CNN-BB.

Compared with R-CNN, our proposed iterative semi-supervised deep neural network model shows several advances. Three general advantages are summarized as follows:

First, our method can find the most stable and important intra-class features. If an image is an outlier, only a few training blocks can be derived from it, so such images contribute little to the model.

Second, our method makes a low demand on training data: there is no need to know the position of the logo in the image. The training data for the next round are selected by the confidence coefficient, while the R-CNN model needs strongly supervised information, where positive samples are defined by an IoU value above 0.5.

Third, R-CNN uses 33 channels in its softmax output, while ours uses only 32, since we focus solely on classifying the different logo classes.

3 Sensitive Audio Information Detection on the Internet

Audio data is also a medium for illegal information, through which lawbreakers spread violence and religious extremism, such as religious music and oath slogans. Even identical audio content can have disparate voice properties for different individuals in various scenes. However, the melody information of a piece of music remains identical even though individuals have different voice properties.

3.1 Humming Based Sensitive Audio Information Detection

The essence of query by humming [10, 11, 13, 16, 17, 22, 23, 24] is to detect specific audio content by exploiting this invariant melody information. In this paper, we put forward a new audio detection method based on melody features. In the proposed method, the Earth Mover's Distance (EMD) is introduced and combined with Dynamic Time Warping (DTW). The whole framework is shown in Fig. 3.

Fig. 3. The overall framework of sensitive audio detection.

The whole system can be divided into three stages. The first stage focuses on dataset construction, in which various sensitive audio recordings are collected; a note feature and a pitch feature are then extracted for each recording. The second stage extracts the pitch feature of the query data, after which a feature transformation is applied to obtain the note feature. The third stage is matching: the top N nearest neighbors with the minimum EMD distance over note features are selected as candidates; DTW is then applied to these candidates to compute the distance over pitch features, and the candidates are re-ranked by a linear weighting of the two distances. A sketch of this matching pipeline is given below.
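The following sketch illustrates the three matching stages under simplifying assumptions: note features are treated as one-dimensional distributions (for which EMD reduces to the 1-D Wasserstein distance), pitch features as frame-level pitch sequences, and top_n and the weight alpha are illustrative parameters not taken from the paper.

import numpy as np
from scipy.stats import wasserstein_distance

def dtw_distance(a, b):
    """Classic dynamic time warping distance between two pitch sequences."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1],
                                 cost[i - 1, j - 1])
    return cost[n, m]

def search(query_note, query_pitch, database, top_n=10, alpha=0.5):
    """database: list of (song_id, note_feature, pitch_feature) triples."""
    # Stage 1: coarse filtering by EMD over note features.
    coarse = sorted(database,
                    key=lambda s: wasserstein_distance(query_note, s[1]))[:top_n]
    # Stages 2-3: DTW over pitch features, then linear re-ranking.
    scored = [(sid, alpha * wasserstein_distance(query_note, note)
                    + (1 - alpha) * dtw_distance(query_pitch, pitch))
              for sid, note, pitch in coarse]
    return sorted(scored, key=lambda t: t[1])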

3.2 Experiment and Analysis

To verify the feasibility of our framework, we conduct a simulation experiment on the MIREX competition dataset, which contains a total of 2,048 songs, of which 48 are target songs for humming and the rest are noise data. In addition, 4,431 humming recordings are used as queries. Partial search results are shown in Fig. 4. As we can see, the vast majority of humming queries find their corresponding source songs, with a retrieval rate of 93%.

Fig. 4. Partial search results of humming queries.

4 Conclusion

Abnormal sensitive information on the Internet resides in various kinds of multimedia, such as text, video, and audio. For text, existing algorithms can detect it efficiently and in real time. For video and audio, although much work has been devoted to them, the task is still unsolved and remains an open problem. In this paper, we propose two algorithms, an iterative semi-supervised deep learning model and a humming melody based search model, to detect abnormal visual and audio objects respectively. Experiments show the feasibility of the proposed methods.