1 Introduction

Music genre is important to many applications, such as music recommender systems and information retrieval. Automatic genre classification systems have been developed using machine learning techniques in recent years. Most of these systems can categorize music into genres directly from raw music content [1,2,3].

Mel-frequency cepstral coefficients (MFCCs) and mel-spectrograms are widely used in the genre classification task because they extract a variety of features from raw data for the learning process. However, genre classification benefits from features over long time scales (>500 ms), whereas MFCCs are efficient at time scales of around 25 ms, and enlarging the time scale of the mel-spectrogram leads to information loss [4, 5]. In contrast, the scattering transform recovers the lost information through wavelet decompositions while extracting long-time-scale features with low-pass filters [6, 7].

Deep learning has achieved great success in many areas, for instance computer vision [8,9,10], speech recognition [11, 12], and natural language processing [13, 14]. Deep models extract high-level features automatically, layer by layer, unlike traditional machine learning classifiers such as Support Vector Machines (SVM), Nearest Neighbors, and Decision Trees, which depend heavily on the result of feature extraction. Among the typical deep models, the Recurrent Neural Network (RNN) is widely used for sequential data because it is good at learning relationships through time [15]. However, deep neural networks need a large amount of data to achieve good performance. When the target dataset is too small, we can pre-train the deep neural network on a large dataset from the same or a similar domain as the target dataset, then replace the connections to the classifier according to the number of target classes and fine-tune the parameters of the pre-trained network. This process is called transfer learning [16]. In this paper, we use the Magnatagatune dataset [17] and the GTZAN dataset [18] as the large and the target dataset, respectively. A 5-layer RNN using Gated Recurrent Units (GRU) [19] and a softmax classifier are used. Additionally, to reduce the size of the input to the deep RNN, we use the scattering transform for preprocessing.

The results of the experiment show that the proposed 5-layer RNN reaches a high accuracy when using transfer learning, and the same architecture using random initialization converges more slowly to a lower accuracy.

Fig. 1. The architecture of the proposed transfer learning process

2 Transfer Learning Process

The architecture of the proposed method is shown in Fig. 1. The overall process consists of two parts. The first part trains a deep RNN on a large musical dataset (the Magnatagatune dataset in this paper). The second part performs genre classification after fine-tuning the previously trained deep RNN on the target dataset (the GTZAN dataset in this paper). In both parts, the scattering transform is applied first, in order to reduce the size of the raw music data and to extract preliminary features for the subsequent neural network training. The deep RNN is a 5-layer RNN with GRU and a softmax classifier, trained on tagged music clips. Finally, we use the target genre classification dataset (GTZAN) to fine-tune the trained parameters of the RNN.
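The hand-over between the two parts can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the parameter names (`W_rnn`, `W_out`, `b_out`), the dictionary layout, and the initialisation scale of the new output layer are all assumptions made for the example. It only shows the idea of copying the pre-trained recurrent parameters and re-initialising the output layer for the target number of classes.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def replace_classifier(pretrained, num_classes, rng):
    """Copy all pre-trained parameters except the output layer,
    then attach a freshly initialised output layer (hypothetical names)."""
    hidden_size = pretrained["W_out"].shape[1]
    target = {name: value.copy() for name, value in pretrained.items()
              if name not in ("W_out", "b_out")}
    # Assumption: small Gaussian initialisation for the new output layer.
    target["W_out"] = rng.normal(0.0, 0.01, (num_classes, hidden_size))
    target["b_out"] = np.zeros(num_classes)
    return target

rng = np.random.default_rng(0)
# Stand-in for a model pre-trained with 188 outputs and 512 hidden units.
pretrained = {
    "W_rnn": rng.normal(size=(512, 1024)),
    "W_out": rng.normal(size=(188, 512)),
    "b_out": np.zeros(188),
}
target = replace_classifier(pretrained, num_classes=10, rng=rng)

# Class probabilities for one (random) top-layer hidden state.
h_top = rng.normal(size=512)
probs = softmax(target["W_out"] @ h_top + target["b_out"])
```

After this step, all parameters in `target`, including the copied recurrent ones, would be updated during fine-tuning on the target dataset.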

2.1 Scattering Transform

In the genre classification task, signal representations that are invariant over large time scales (>500 ms) are important. Among the methods widely used in audio processing, the mel-spectrogram can enlarge the time scale but removes information that is crucial to genre classification, and MFCCs are efficient at time scales up to 25 ms. Unlike these methods, the scattering transform provides invariants over large time scales without too much information loss.

For an audio signal x, the scattering transform is defined as \(S_{n}x\), where n represents the order. The zeroth-order coefficient \(S_{0}x=x\star \phi (t)\), where \(\phi\) is a low-pass filter, is locally translation invariant because of the time averaging, but the averaging loses high-frequency information, which can be retrieved by the wavelet modulus coefficients \(|x\star \psi _{\lambda _{1}}(t)|\). To make the wavelet modulus coefficients invariant to translation, time averaging is applied again. The first layer of the scattering transform is defined as:

$$\begin{aligned} S_{1}x(t,\lambda _{1})=|x\star \psi _{\lambda _{1}}|\star \phi (t) \end{aligned}$$
(1)

Andén [7] indicates that if the wavelet filter bank \(\psi _{\lambda _{1}}\) has the same frequency resolution as the mel windows, then the \(S_{1}x\) coefficients approximate the mel-filter-bank coefficients. The difference is that applying a second bank of wavelet filters \(\psi _{\lambda _{2}}\), followed by a modulus, to the wavelet modulus coefficients can recover the high-frequency information lost by the averaging. As in the previous operation, applying the low-pass filter \(\phi (t)\) makes the coefficients translation invariant. The second layer of the scattering transform is then defined as:

$$\begin{aligned} S_{2}x(t,\lambda _{1},\lambda _{2})=||x\star \psi _{\lambda _{1}}|\star \psi _{\lambda _{2}}|\star \phi (t) \end{aligned}$$
(2)
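Equations (1) and (2) can be illustrated with a short NumPy sketch. It is a simplified illustration rather than the filter bank used in the paper: the Gaussian low-pass filter, the Gaussian band-pass filters standing in for the wavelets \(\psi _{\lambda }\), the quality factor `q`, the centre frequencies, and the use of circular (FFT-based) convolution are all assumptions chosen to keep the example small.

```python
import numpy as np

def gaussian_lowpass(n, sigma):
    """Frequency response of a Gaussian low-pass filter phi
    (sigma in cycles per sample)."""
    freqs = np.fft.fftfreq(n)
    return np.exp(-0.5 * (freqs / sigma) ** 2)

def bandpass_bank(n, centers, q=4.0):
    """Frequency responses of Gaussian band-pass filters psi_lambda,
    one per centre frequency, with bandwidth centre / q."""
    freqs = np.fft.fftfreq(n)
    return np.array([np.exp(-0.5 * ((freqs - fc) / (fc / q)) ** 2)
                     for fc in centers])

def conv(x, filt_hat):
    """Circular convolution of x with a filter given in the Fourier domain."""
    return np.fft.ifft(np.fft.fft(x) * filt_hat)

def scattering(x, centers1, centers2, sigma_phi):
    """Return (S0, S1, S2) following Eqs. (1) and (2), without subsampling."""
    n = len(x)
    phi = gaussian_lowpass(n, sigma_phi)
    psi1 = bandpass_bank(n, centers1)
    psi2 = bandpass_bank(n, centers2)
    s0 = conv(x, phi).real                                # x * phi
    u1 = np.abs(np.array([conv(x, p) for p in psi1]))     # |x * psi_l1|
    s1 = np.array([conv(u, phi).real for u in u1])        # Eq. (1)
    s2 = np.array([[conv(np.abs(conv(u, p)), phi).real    # Eq. (2)
                    for p in psi2] for u in u1])
    return s0, s1, s2

# Example on a random signal; frequencies are in cycles per sample.
rng = np.random.default_rng(0)
signal = rng.standard_normal(4096)
first_order = 0.4 * 2.0 ** (-np.arange(8) / 2.0)
second_order = 0.05 * 2.0 ** (-np.arange(4))
s0, s1, s2 = scattering(signal, first_order, second_order, sigma_phi=1.0 / 512)
```

A practical implementation would also subsample the averaged outputs and keep only the informative \((\lambda _{1},\lambda _{2})\) pairs; both are omitted here for brevity.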

2.2 Deep Recurrent Neural Network

RNNs are well suited to handling sequential data, for example in speech recognition and NLP. The RNN structure can be described as transitions from previous to current states. For a classical RNN with input \(x_{t}\), hidden state \(h_{t}\), weights \(W_{h}\), bias \(b_{h}\), and activation function f, this transition is formulated as:

$$\begin{aligned} h_{t}=f(W_{h}[x_{t},h_{t-1}]+b_{h}) \end{aligned}$$
(3)

To solve the vanishing gradient problem of RNNs, a gated structure named LSTM was introduced by Hochreiter [15]. The LSTM unit allows information from more timesteps to be memorized. The memories are stored in memory cells, and the LSTM can decide to forget, output, or change the saved memories. As a popular variant of LSTM, GRU is simpler and also effective. It uses an update gate \(z_{t}\) and a reset gate \(r_{t}\) to update the hidden state. These gates and the resulting state update are given by:

$$\begin{aligned} \begin{aligned} \left( \begin{array}{lcl} z_{t} \\ r_{t} \end{array} \right)&= \left( \begin{array}{lcl} \sigma (W_{z}[x_{t},h_{t-1}]+b_{z}) \\ \sigma (W_{r}[x_{t},h_{t-1}]+b_{r}) \end{array} \right) \\ g_{t}&=f(W_{g}[x_{t},r_{t}*h_{t-1}]+b_{g})\\ h_{t}&=(1-z_{t})*h_{t-1}+z_{t}*g_{t} \end{aligned} \end{aligned}$$
(4)
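One time step of Eq. (4) can be written out directly in NumPy. The sketch below is illustrative only: the choice of `tanh` for the activation f, the uniform weight initialisation, and the parameter dictionary layout are assumptions that the text does not specify; \(\sigma\) is taken to be the logistic sigmoid and \(*\) the element-wise product.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid, used for both gates."""
    return 1.0 / (1.0 + np.exp(-a))

def init_gru(input_size, hidden_size, rng):
    """Randomly initialised GRU parameters (assumed uniform initialisation)."""
    scale = 1.0 / np.sqrt(hidden_size)
    shape = (hidden_size, input_size + hidden_size)
    params = {}
    for gate in ("z", "r", "g"):
        params["W" + gate] = rng.uniform(-scale, scale, shape)
        params["b" + gate] = np.zeros(hidden_size)
    return params

def gru_step(params, x_t, h_prev):
    """One GRU transition following Eq. (4), with f assumed to be tanh."""
    xh = np.concatenate([x_t, h_prev])
    z_t = sigmoid(params["Wz"] @ xh + params["bz"])   # update gate
    r_t = sigmoid(params["Wr"] @ xh + params["br"])   # reset gate
    # Candidate state: the reset gate scales the previous state element-wise.
    g_t = np.tanh(params["Wg"] @ np.concatenate([x_t, r_t * h_prev])
                  + params["bg"])
    # Interpolate between the previous state and the candidate state.
    return (1.0 - z_t) * h_prev + z_t * g_t

# Run a short random sequence through a single GRU layer.
rng = np.random.default_rng(0)
params = init_gru(input_size=8, hidden_size=16, rng=rng)
h = np.zeros(16)
for x_t in rng.standard_normal((20, 8)):
    h = gru_step(params, x_t, h)
```

Because \(h_{t}\) is a convex combination of \(h_{t-1}\) and \(g_{t}\), the hidden state stays within the range of the activation when it starts from zero.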

In this paper, we use a 5-layer GRU neural network, constructed by stacking each hidden layer on top of the previous one, in order to improve the representational ability of our architecture. Additionally, the generalization of the proposed deep RNN is improved by applying dropout between layers [20].
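The stacking and the placement of dropout between layers can be sketched as below. To keep the block short, the recurrent cell is the classical transition of Eq. (3) with f assumed to be `tanh`; the GRU update of Eq. (4) could be substituted for `rnn_step`. The inverted-dropout scaling, the initialisation, and the `drop_rate` argument (the probability of zeroing a unit) are assumptions for illustration, not settings taken from the paper.

```python
import numpy as np

def rnn_step(W, b, x_t, h_prev):
    """Classical RNN transition of Eq. (3), with f assumed to be tanh."""
    return np.tanh(W @ np.concatenate([x_t, h_prev]) + b)

def init_stack(input_size, hidden_size, num_layers, rng):
    """One (W, b) pair per layer; layer 0 reads the input features,
    each later layer reads the hidden states of the layer below."""
    layers = []
    scale = 1.0 / np.sqrt(hidden_size)
    for layer in range(num_layers):
        in_size = input_size if layer == 0 else hidden_size
        W = rng.uniform(-scale, scale, (hidden_size, in_size + hidden_size))
        layers.append((W, np.zeros(hidden_size)))
    return layers

def run_stack(layers, xs, drop_rate=0.0, rng=None):
    """Feed a sequence xs of shape (T, input_size) through the stack and
    return the top-layer hidden states of shape (T, hidden_size).
    A random generator must be passed when drop_rate > 0."""
    seq = xs
    for idx, (W, b) in enumerate(layers):
        h = np.zeros(W.shape[0])
        outputs = []
        for x_t in seq:
            h = rnn_step(W, b, x_t, h)
            outputs.append(h)
        seq = np.array(outputs)
        # Inverted dropout on the connections between consecutive layers only.
        if drop_rate > 0.0 and idx < len(layers) - 1:
            mask = rng.random(seq.shape) >= drop_rate
            seq = seq * mask / (1.0 - drop_rate)
    return seq

rng = np.random.default_rng(0)
layers = init_stack(input_size=8, hidden_size=16, num_layers=5, rng=rng)
xs = rng.standard_normal((12, 8))
train_out = run_stack(layers, xs, drop_rate=0.3, rng=rng)  # training mode
eval_out = run_stack(layers, xs)                           # dropout disabled
```

Dropout is applied only during training; at evaluation time `drop_rate` is left at zero so the forward pass is deterministic.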

3 Datasets and Experiment Setup

The Magnatagatune and GTZAN datasets are used as the large and the target dataset, respectively. All clips are converted to mono and resampled to 16 kHz. Magnatagatune has 25863 clips, and each clip is annotated with respect to 188 different musical tags covering genre, mood, and instrument. We use the last 2105 clips (distributed in folder ‘f’) for validation and the others for training. We use 512 hidden states in each layer. Dropout is set to 0.7, and the learning rate is 0.00001. We use the AUC-ROC score [21] to evaluate the performance of our model, in order to reduce the effect of imbalance in the dataset. When the AUC-ROC score is stable, we stop the training and save the model. The GTZAN dataset has 1000 clips of 10 genres, with 100 clips per genre. As the target dataset, it is randomly shuffled, and the mean accuracy over 10 runs of 10-fold cross-validation is reported as the final test accuracy. Of the 10 folds, one is used for testing and the others for training. In each run of 10-fold cross-validation, we set the number of outputs of the softmax classifier to 10 (the number of genres in GTZAN), then fine-tune the parameters of the model pre-trained on Magnatagatune.
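The evaluation protocol on GTZAN (10 runs of 10-fold cross-validation over 1000 randomly shuffled clips) can be sketched as an index generator. This is an illustrative outline under stated assumptions: the function name `repeated_kfold`, the seed handling, and the unstratified random split are choices made for the example, and the fine-tuning and scoring of each fold are left out.

```python
import numpy as np

def repeated_kfold(num_samples, k=10, repeats=10, seed=0):
    """Yield (run, fold, train_idx, test_idx) for repeated k-fold
    cross-validation with a fresh random shuffle in every run."""
    rng = np.random.default_rng(seed)
    for run in range(repeats):
        perm = rng.permutation(num_samples)
        folds = np.array_split(perm, k)
        for fold in range(k):
            test_idx = folds[fold]
            train_idx = np.concatenate(
                [folds[j] for j in range(k) if j != fold])
            yield run, fold, train_idx, test_idx

# 1000 clips, as in GTZAN: every split has 900 training and 100 test clips.
splits = list(repeated_kfold(num_samples=1000))
```

For each split, the pre-trained model would be reloaded, its output layer resized to 10 classes, and the network fine-tuned on `train_idx`; the reported figure is then the mean test accuracy over all splits.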

4 Experiment Results and Analysis

As shown in Fig. 3, both the randomly initialized and the transfer learning models (the pre-training process is shown in Fig. 2) of the 5-layer RNN with GRU using scattering transform preprocessing converge to quite high training accuracy. The models using transfer learning need about 100 epochs to become stable, whereas the randomly initialized models need more. This phenomenon appears not only in the three randomly picked training processes shown, but also in those not shown. It indicates that transfer learning provides a better initialization of the model and speeds up convergence.

Compared with other recent works in Table 1, our approach shows a competitive accuracy (95.8%) in the genre classification task on the GTZAN dataset. Even the model using random initialization reaches a relatively high accuracy (93.5%). These results show that the combination of the scattering transform and a deep RNN performs well in music genre classification.

Fig. 2. Validation AUC-ROC score of 5-layer GRU neural network using scattering transformed input

Fig. 3. Three randomly picked training processes of 10-fold cross-validation. Blue lines represent the RNN using random initialization, and orange lines represent the RNN using transfer learning. The accuracy is measured on a random batch of training data (Color figure online)

Table 1. Average test accuracy of different models on GTZAN dataset

5 Conclusion

In this paper, we apply transfer learning to music genre classification using a 5-layer RNN with GRU that takes scattering coefficients as its input. When transfer learning from a large music dataset (Magnatagatune in this paper) is applied, our model converges faster and reaches a higher average accuracy on the target dataset (GTZAN in this paper) than the same model with random initialization. The accuracy of the transfer learning approach is also competitive with that of state-of-the-art models. These results verify the effectiveness of a deep RNN combined with the scattering transform and transfer learning in the music genre classification task.