
1 Introduction

Recently, human activity and climate change have had a negative effect on frog biodiversity, which makes frog monitoring ever more important. Compared with traditional monitoring methods such as field observation, acoustic sensors have greatly extended acoustic monitoring to larger spatio-temporal scales [1]. Correspondingly, large volumes of acoustic data are generated, which makes it essential to develop automatic analysis methods.

Several papers have already described automated methods for the classification of frog calls. Han et al. combined spectral centroid, Shannon entropy, and Renyi entropy with a k-nearest neighbour classifier for frog call recognition [2]. Gingras et al. proposed a method for anuran classification based on the mean dominant frequency, the coefficient of variation of root-mean-square energy, and spectral flux [3]. Bedoya et al. used Mel-frequency cepstral coefficients (MFCCs) with a fuzzy classifier for the recognition of anuran species [4]. Xie et al. proposed a method based on track duration, dominant frequency, oscillation rate, frequency modulation, and energy modulation to classify frog calls [5]. All these previous methods achieve high recognition and classification accuracy, but the recordings used in those papers are assumed to contain only a single frog species each.

Unfortunately, all the recordings used in this study have a low signal-to-noise ratio and contain many overlapping animal vocal activities, including frogs, birds, and crickets. To address this problem, we formulate the task as supervised multiple-instance multiple-label (MIML) classification [6]. In a previous study, Briggs et al. introduced MIML classifiers for the acoustic classification of multiple simultaneous bird species [7]. In their method, a supervised learning classifier was employed for segmenting acoustic events, which required a large amount of annotation.

In this study, we introduce the MIML algorithm for frog call classification. Rather than using a supervised learning method for syllable segmentation, acoustic event detection is first employed to separate frog syllables. Then, three feature sets, the mask descriptor, the profile statistics, and their combination, are calculated from each syllable. After applying a bag generator to the extracted feature sets, three classifiers, MIML-SVM [6], MIML-RBF [8], and MIML-kNN [9], are used for the recognition of multiple simultaneous frog species. Experimental results show that our proposed method can achieve high classification accuracy.

2 Materials and Methods

2.1 Materials

Digital recordings in this study were obtained with a battery-powered, weatherproof Song Meter (SM2) box. Recordings were two-channel, sampled at 22.05 kHz, and saved in WAC4 format. A representative sample of 342 10-s recordings was selected to train and evaluate our proposed algorithm for predicting which frog species are present in a recording. All recordings were collected between February and March 2014, the frog breeding season with high calling activity. The species present in each 10-s recording were manually labelled by an ecologist who studies frog calls. In total, eight frog species occur in the recordings: Cane toad (CAD) (\(F_{0}=560\) Hz), Cyclorana novaehollandiae (CNE) (\(F_{0}=610\) Hz), Limnodynastes terraereginae (LTE) (\(F_{0}=610\) Hz), Litoria fallax (LFX) (\(F_{0}=4000\) Hz), Litoria nasuta (LNA) (\(F_{0}=2800\) Hz), Litoria rothii (LRI) (\(F_{0}=1800\) Hz), Litoria rubella (LRA) (\(F_{0}=2300\) Hz), and Uperoleia mimula (UMA) (\(F_{0}=2400\) Hz). Here, \(F_{0}\) is the mean dominant frequency of each frog species. Each recording contains between 1 and 5 species. Following prior work [7], we assume that recordings without any frog calls can be detected by acoustic event detection.

2.2 Signal Processing

All the recordings were re-sampled at 16 kHz and mixed down to mono. A spectrogram was then generated by applying the short-time Fourier transform to each recording. Specifically, each recording was divided into frames of 512 samples with 50 % frame overlap. A fast Fourier transform with a Hamming window was then applied to each frame, which yielded amplitude values for 256 frequency bins, each spanning 31.25 Hz. The final decibel values (S) were computed as \(S_{tf} = 20\log_{10}(A_{tf})\), where \(A_{tf}\) is the amplitude value, \(t=0,\ldots,T-1\) and \(f=0,\ldots,F-1\) are the time and frequency indices, and T = 625 frames and F = 256 frequency bins.
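To make the processing concrete, the following Python sketch (not the paper's MATLAB implementation) computes such a decibel spectrogram with scipy; the `audio` array stands in for a 10-s recording.

```python
import numpy as np
from scipy.signal import stft

FS = 16000            # recordings re-sampled to 16 kHz
FRAME = 512           # 512-sample frames -> ~31.25 Hz per bin
HOP = FRAME // 2      # 50 % frame overlap

# `audio` is assumed to be a 16 kHz mono signal as a 1-D numpy array
# (re-sampling from 22.05 kHz and mixing to mono are omitted here).
audio = np.random.randn(10 * FS)   # stand-in for a 10-s recording

f, t, A = stft(audio, fs=FS, window="hamming",
               nperseg=FRAME, noverlap=HOP)
S = 20 * np.log10(np.abs(A) + 1e-10)   # decibel spectrogram S_tf
# Note: scipy returns 257 one-sided bins (DC..Nyquist); the paper counts 256.
```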

2.3 Acoustic Event Detection for Syllable Segmentation

Acoustic event detection (AED) aims to detect specified acoustic events in an audio stream. In this study, we use AED to segment frog syllables. Since all the recordings were collected in the field, there is much overlapping vocal activity. Traditional methods for audio segmentation are based on time-domain information [10, 11] and cannot handle such recordings. Here, we modified the AED method developed by Towsey et al. [12] to segment recordings with overlapping activities. The details of our AED method are described as follows:

Step 1: Wiener filter

To de-noise and smooth the spectrogram, a 2-D Wiener filter is applied to the spectrogram image over a \(5 \times 5\) time-frequency grid, where the filter size is selected after considering the trade-off between removing the background graininess and blurring the acoustic events.

$$\begin{aligned} \hat{S}_{tf} = \mu + \frac{\sigma ^{2}-\nu ^{2}}{\sigma ^{2}}(S_{tf}-\mu ) \end{aligned}$$
(1)

where \(\mu \) and \(\sigma ^{2}\) are the local mean and variance, respectively, and \(\nu ^{2}\) is the noise variance, estimated by averaging all the local variances.
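For reference, scipy's 2-D Wiener filter implements exactly this estimator, so Step 1 can be sketched in one call; `S` is the decibel spectrogram from Sect. 2.2.

```python
from scipy.signal import wiener

# 5x5 time-frequency smoothing grid; when `noise` is not given, scipy
# estimates the noise variance as the mean of the local variances,
# matching Eq. (1).
S_hat = wiener(S, mysize=(5, 5))
```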

Step 2: Spectral subtraction

After Wiener filtering, the graininess has been removed. However, noise sources such as wind, insects, and motor engines that span the whole recording cannot be removed this way. Here, a modified spectral subtraction is used to deal with this noise [13].
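The exact modified method follows [13]; as an illustration only, a generic per-bin spectral subtraction can be sketched as below, where the noise profile is estimated from the quieter frames (the percentile value is an assumption, not the paper's setting).

```python
import numpy as np

def spectral_subtraction(S_hat, percentile=20):
    # Estimate a stationary noise profile for each frequency bin from
    # the quieter frames, subtract it, and clip negative values to zero.
    noise_profile = np.percentile(S_hat, percentile, axis=1, keepdims=True)
    return np.clip(S_hat - noise_profile, 0.0, None)

S_dn = spectral_subtraction(S_hat)   # noise-reduced spectrogram S'_tf
```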

Fig. 1. Acoustic event detection results before (left) and after (right) event filtering based on dominant frequency. The blue rectangles mark the time and frequency boundaries of the detected events.

Step 3: Adaptive thresholding

After noise reduction, the next step is to convert the noise-reduced spectrogram \(\hat{S}^{'}_{tf}\) into a binary spectrogram \(S^{b}_{tf}\) for event detection. In contrast to the hard threshold in Towsey's work, an adaptive thresholding method, Otsu thresholding, is used to convert the smoothed spectrogram into a binary spectrogram. Otsu's method assumes that the spectrogram is composed of two classes, acoustic events and background noise, and computes the optimal threshold value separating them. After thresholding, each group of contiguous positive pixels is regarded as a candidate event (Fig. 1).
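A minimal sketch of this step, using scikit-image's Otsu threshold and scipy's connected-component labelling (the library choices are ours, not the paper's):

```python
from scipy import ndimage
from skimage.filters import threshold_otsu

thresh = threshold_otsu(S_dn)             # optimal events/noise split
S_bin = S_dn > thresh                     # binary spectrogram S^b_tf
labels, n_events = ndimage.label(S_bin)   # contiguous pixels -> candidate events
```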

Step 4: Events filtering using dominant frequency and event area

After the aforementioned processing, not all detected events correspond to frog vocalizations. To remove events that do not come from the frog species listed in Sect. 2.1, the dominant frequency (\(F_{0}\)) and the area within the event boundary (Ar) are used for filtering.
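A sketch of this filter is given below; the area and frequency bounds are placeholder values for illustration, since the tuned thresholds are discussed in Sect. 4.1.

```python
import numpy as np
from scipy import ndimage

BIN_HZ = 31.25                    # Hz per frequency bin (Sect. 2.2)
AREA_MIN, AREA_MAX = 50, 20000    # small/large area thresholds (pixels)
F_MIN, F_MAX = 400.0, 4500.0      # frequency boundaries (Hz)

kept = []
for region in ndimage.find_objects(labels):
    mask = S_bin[region]
    area = int(mask.sum())
    # dominant frequency: bin with maximal summed energy, converted to Hz
    energy = (S_dn[region] * mask).sum(axis=1)
    f0 = (region[0].start + int(np.argmax(energy))) * BIN_HZ
    if AREA_MIN <= area <= AREA_MAX and F_MIN <= f0 <= F_MAX:
        kept.append(region)
```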

Step 5: Region growing

A region growing algorithm is utilized to obtain the contour of each acoustic event [14]. To get an accurate boundary for each acoustic event and improve the discriminative power of the extracted features, a 2-D region growing algorithm is applied to obtain the accurate event shape within each segmented event. First, the pixel with maximal intensity within each segmented event is selected as the seed point. Then, if the difference between a neighbouring pixel and the seed(s) is smaller than the threshold, that pixel is added to the output image. Next, the newly added pixels are used as seeds for further processing, until all pixels that satisfy the criterion are added to the output image. The final results after region growing are shown in Fig. 2. Here, the threshold value is empirically set to 5 dB.
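The following sketch illustrates the procedure on a single segmented event patch (frequency by time, in dB); it is our reconstruction of the described steps, not the paper's code.

```python
import numpy as np
from collections import deque

def region_grow(patch, threshold_db=5.0):
    # Seed at the maximal-intensity pixel, then grow over 4-connected
    # neighbours whose difference to the seed is below the threshold.
    seed = np.unravel_index(np.argmax(patch), patch.shape)
    seed_val = patch[seed]
    grown = np.zeros(patch.shape, dtype=bool)
    grown[seed] = True
    queue = deque([seed])
    while queue:
        i, j = queue.popleft()
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ni, nj = i + di, j + dj
            if (0 <= ni < patch.shape[0] and 0 <= nj < patch.shape[1]
                    and not grown[ni, nj]
                    and abs(seed_val - patch[ni, nj]) < threshold_db):
                grown[ni, nj] = True   # newly added pixel becomes a seed
                queue.append((ni, nj))
    return grown
```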

Fig. 2. Acoustic event detection results after region growing. Left: binary segmentation results; right: segmented frog syllables.

2.4 Feature Extraction

Based on the acoustic event detection results, two feature sets are first calculated to describe each event (syllable): the mask descriptor and the profile statistics [7]. We exclude the histogram of orientations from our feature sets, because a previous study has already demonstrated its poor classification performance [7]. The mask descriptor describes the syllable shape and comprises minimum frequency, maximum frequency, bandwidth, duration, area, perimeter, non-compactness, and rectangularity. The profile statistics comprise time-Gini, frequency-Gini, frequency-mean, frequency-variance, frequency-skewness, frequency-kurtosis, frequency-max, time-max, mask-mean, and mask standard deviation. The third feature set consists of all features.
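As an illustration, a few of the mask-descriptor features can be computed from a grown binary mask as below; the perimeter, non-compactness, and rectangularity formulas follow common image-analysis definitions and may differ in detail from [7].

```python
import numpy as np

BIN_HZ = 31.25              # bin width in Hz (Sect. 2.2)
HOP_S = 256 / 16000.0       # frame hop in seconds

def mask_descriptor(mask):
    # `mask` is a binary event mask: frequency on axis 0, time on axis 1
    freqs, times = np.nonzero(mask)
    f_min, f_max = freqs.min() * BIN_HZ, freqs.max() * BIN_HZ
    area = int(mask.sum())
    # boundary-transition count as a simple perimeter estimate
    perimeter = int(np.count_nonzero(mask ^ np.roll(mask, 1, axis=0))
                    + np.count_nonzero(mask ^ np.roll(mask, 1, axis=1)))
    bbox_area = (np.ptp(freqs) + 1) * (np.ptp(times) + 1)
    return {
        "min_freq": f_min,
        "max_freq": f_max,
        "bandwidth": f_max - f_min,
        "duration": (np.ptp(times) + 1) * HOP_S,
        "area": area,
        "perimeter": perimeter,
        "non_compactness": perimeter ** 2 / (4 * np.pi * area),
        "rectangularity": area / bbox_area,
    }
```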

Table 1. Accuracy measures for MIML classifiers with different feature sets. Here, \(\downarrow \) indicates that smaller values are better, while \(\uparrow \) indicates that larger values are better.

3 Multiple-Instance Multiple-Label Classifiers

After feature extraction, three MIML algorithms are evaluated for the classification of multiple simultaneous frog calls: MIML-SVM, MIML-RBF, and MIML-kNN. Using a distance measure between bags of events, the MIML problem is reduced to a single-instance multiple-label problem by associating each recording with a single feature vector [7]. Here, the maximal and average Hausdorff distances between two bags of syllables are used by MIML-SVM and MIML-RBF, respectively. For MIML-kNN, the nearest neighbours are used to construct the recording-level features.
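For concreteness, the two bag-level distances can be sketched as follows (standard definitions; each bag is an array of per-syllable feature vectors):

```python
import numpy as np
from scipy.spatial.distance import cdist

def max_hausdorff(bag_a, bag_b):
    # Maximal Hausdorff distance between two bags of feature vectors
    d = cdist(bag_a, bag_b)            # pairwise Euclidean distances
    return max(d.min(axis=1).max(), d.min(axis=0).max())

def avg_hausdorff(bag_a, bag_b):
    # Average Hausdorff distance: mean of each instance's nearest-
    # neighbour distance in the other bag
    d = cdist(bag_a, bag_b)
    return ((d.min(axis=1).sum() + d.min(axis=0).sum())
            / (len(bag_a) + len(bag_b)))
```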

4 Experiment Results

4.1 Parameter Tuning

There are three modules whose parameters need to be discussed: signal processing, acoustic event detection, and classification. For signal processing, the window size and overlap are 512 samples and 50 %, respectively. During acoustic event detection, four thresholds for event filtering need to be determined: the small and large area thresholds and the lower and upper frequency boundaries. All these thresholds were determined empirically by applying various combinations of thresholds to a small number of randomly selected 10-s clips. For the MIML-SVM classifier, the parameters (\(C,\gamma ,r\)) are set to (0.1, 0.6, 0.2) experimentally. For MIML-RBF, the parameters (\(r, \mu \)) are set to (0.1, 0.6). For MIML-kNN, the numbers of references (k) and citers (\(k^{'}\)) are 10 and 20, respectively.

4.2 Classification

In this study, all the algorithms were programmed in MATLAB 2014b. Each MIML algorithm is evaluated with five-fold cross-validation on the collection of 342 species-labelled recordings. Five measures, Hamming loss, rank loss, one-error, coverage, and micro-AUC, are used to characterize the accuracy of each algorithm [15, 16]. The definition of each measure can be found in [7]; the true positive/negative rate is defined as \(1-\)Hamming loss, and it is 0.818 for MIML-RBF with MD. The mask descriptor (MD), the profile statistics (PS), and all features (AF) are fed into the three classifiers, respectively. The performance of each MIML classifier is shown in Table 1. The best classification accuracy is achieved by MIML-RBF using MD. For each classifier, the classification accuracy of MD is higher than that of PS and AF, which shows that the event shape has higher discriminative power than the event content. To give a concrete view of the predictions, the results for 5 randomly selected recordings using MIML-RBF are shown in Table 2. Recordings No. 1 and No. 3 are predicted correctly.
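For reference, the Hamming loss underlying the reported 0.818 true positive/negative rate can be sketched as follows (a standard definition; the label matrices are illustrative only):

```python
import numpy as np

def hamming_loss(y_true, y_pred):
    # Fraction of (recording, species) pairs whose presence/absence
    # prediction is wrong; 1 - Hamming loss is the reported accuracy.
    return float(np.mean(y_true != y_pred))

# toy example: 2 recordings x 3 species, one wrong entry out of six
y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0]])
print(1 - hamming_loss(y_true, y_pred))   # -> 0.833...
```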

Table 2. Example predictions with MIML-RBF.

5 Conclusion

In this study, we propose a novel method for the classification of multiple simultaneous frog species in environmental recordings. To the best of our knowledge, this is the first study that applies the MIML algorithm to frog calls. Since frogs tend to call simultaneously, the MIML framework is better suited to such recordings than single-instance single-label classification. After applying the acoustic event detection algorithm to each 10-s recording, frog syllables are segmented. Then, three feature sets are calculated from the segmented syllables. Finally, three MIML classifiers are used for the classification of frog calls, with the best accuracy reaching 81.8 % (true positive/negative rate). Future work will focus on novel features and MIML classifiers to further improve the classification performance.