1 Introduction

Recently, frog biodiversity has been threatened by human activity and climate change [1]. Therefore, frog monitoring is becoming ever more important. Compared with traditional monitoring methods such as field observation, acoustic sensors can extend monitoring to larger spatiotemporal scales [2]. However, the use of acoustic sensors generates large volumes of acoustic data, which makes it essential to develop automatic acoustic data processing techniques.

Several papers have described automated methods for the detection and classification of animal calls. Since the elementary unit for frog call classification is the syllable [3], the first step of a frog call classification system is often syllable segmentation. In prior work, different features have been explored for syllable segmentation, including energy [4, 5], zero-crossing rate (ZCR) [4, 5], amplitude [3], and the spectrogram [6, 7]. Compared with energy, ZCR, and amplitude, segmentation based on the spectrogram is more robust to background noise [8]. Once syllables are segmented, feature extraction is the next step that is crucial to the performance of the classification system. Lee et al. used Mel-frequency cepstrum coefficients (MFCCs) for classifying frog and cricket calls with linear discriminant analysis [9]. Chen et al. developed a method for frog call classification based on syllable duration and a multi-stage average spectrum [4]. Bedoya et al. used MFCCs as the feature for the recognition of anuran species with a fuzzy clustering technique [10]. Jie et al. explored image features for frog call classification with a k-nearest neighbour classifier [11]. All of these studies report high accuracy in the recognition and classification of frog calls. However, most of the features used are transplanted directly from speech processing and might not be suitable for studying frog calls.

In this paper, a novel feature based on an adaptive frequency scale bandpass filter bank is proposed for frog call classification. Following our prior work [7], the spectrogram is first used for segmentation. Then, a spectral peak track is extracted from each segmented syllable to calculate three syllable features: track duration, dominant frequency and oscillation rate. Next, a frequency scale is constructed by applying a k-means clustering technique to the dominant frequencies of the segmented syllables. A bandpass filter bank is then designed based on this frequency scale and applied to the spectrum of each frog call syllable to extract bandpass filter bank cepstral coefficients (BFCCs). Finally, a k-nearest neighbour (k-NN) classifier is used for frog call classification with the extracted features. The experimental results show that the proposed feature achieves the highest classification accuracy, outperforming MFCCs and syllable features (SFs).

2 Materials and Methods

Our frog call classification system consists of four steps: pre-processing, syllable segmentation, feature extraction and classification. Each step is described in detail in the following subsections.

2.1 Data Description and Pre-processing

In this study, eighteen frog species that are widespread in Queensland, Australia are selected for the experiment. All the recordings are obtained from David Stewart's CD [12]. Each recording includes one frog species, with a duration ranging from 8 to 55 s. For pre-processing, human voices are first excluded from the recordings. Then, all the recordings are re-sampled at 16 kHz and mixed to mono.

2.2 Syllable Segmentation

After pre-processing, each recording consists of continuous frog calls, which are made up of multiple syllables. Here, one syllable is an elementary unit of frog vocalizations for species detection [3]. For syllable segmentation, the iterative amplitude-frequency information is explored based on Härmä's method [6]. The amplitude-frequency information is generated by applying the STFT to the frog calls, using a Kaiser window with a size of 512 samples and an overlap of 25 %. A Gaussian filter (7\(\times \)7) is optionally applied before Härmä's method is used for segmentation. The filter size is chosen as a trade-off between connecting gaps within one syllable and separating adjacent syllables.

2.3 Spectral Peak Track Extraction

For frogs, related species often share more similar advertisement calls than distant species [13]. When the STFT is applied to those advertisement calls, each frog species is found to occupy one particular frequency band. Therefore, we explore the spectral peak track (SPT) to represent the dominant frequency trace of frog calls. There are two reasons for using the SPT: (1) it isolates the desired signal from background noise; (2) it allows corresponding features to be extracted from the track. The SPT extraction method is briefly summarized here, with further details provided in [7]. In this SPT extraction algorithm, seven parameters need to be pre-defined (Table 1). The process for selecting those parameters is explained in Sect. 3.

Table 1. Parameters used for spectral peak extraction

Before applying the SPT extraction algorithm, each syllable is transformed into a spectrogram by dividing it into frames of 128 samples with 85 % overlap. The first step is to select the maximum intensity (real peak) from each frame, subject to a minimum required value I. Then, the time and frequency domain intervals between two successive peaks are calculated and compared with \(T_{c}\) and \(f_{c}\). If both thresholds are satisfied, an initial track is generated, and linear regression is applied to this track to calculate the position of the next predicted peak. Next, the time and frequency domain intervals between the predicted peak and the real peak are calculated and again compared with \(T_{c}\) and \(f_{c}\). If they are satisfied, the real peak is added to the track. This iterative process continues until \(T_{s}\) is no longer satisfied. After a track stops growing, its duration and density are compared with \(d_{min}\), \(d_{max}\), and \(\beta \). If all conditions are satisfied, the track is saved to the track list. The SPT results for Neobatrachus sudelli are shown in Fig. 1. During track extraction, time domain gaps appear in the track wherever the intensity threshold I is not reached. These gaps can be filled by predicting the frequency bin using linear regression, as illustrated in Fig. 1.

Fig. 1. Spectral peak track extraction results for Neobatrachus sudelli (left: selected peaks, where a frequency of zero indicates a peak that does not satisfy the intensity threshold I and is therefore set to zero; right: spectral peak track with peaks predicted using linear regression).

Based on each spectral peak track, three syllable features are calculated: track duration, dominant frequency and oscillation rate [7]. Here, track duration is the length of the track; dominant frequency is calculated by averaging the frequencies of the peaks within the track; and oscillation rate represents the pulse rate within one track.
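The track-growing idea of this section can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the algorithm of [7]: it grows a single track, omits the density check and the gap filling, measures the time gap from the last accepted peak, and the names fit_line, grow_track and track_features as well as the toy peak list are hypothetical.

```python
def fit_line(ts, fs):
    """Least-squares line f = a*t + b through the points (ts, fs)."""
    n = len(ts)
    mt = sum(ts) / n
    mf = sum(fs) / n
    var = sum((t - mt) ** 2 for t in ts)
    a = sum((t - mt) * (f - mf) for t, f in zip(ts, fs)) / var if var else 0.0
    return a, mf - a * mt


def grow_track(peaks, min_intensity, t_c, f_c):
    """Grow one track from per-frame peaks (time_s, freq_hz, intensity_db).

    Simplification (assumption): the track starts at the first peak above
    the intensity threshold and stops at the first peak that cannot be
    connected to it.
    """
    strong = [(t, f) for t, f, i in peaks if i >= min_intensity]
    track = []
    for t, f in strong:
        if not track:
            track.append((t, f))
            continue
        last_t, last_f = track[-1]
        if len(track) == 1:
            predicted = last_f  # no regression possible with one point
        else:
            a, b = fit_line([p[0] for p in track], [p[1] for p in track])
            predicted = a * t + b  # frequency predicted by linear regression
        if t - last_t <= t_c and abs(f - predicted) <= f_c:
            track.append((t, f))
        else:
            break
    return track


def track_features(track):
    """Track duration (s) and dominant frequency (mean peak frequency, Hz)."""
    duration = track[-1][0] - track[0][0]
    dominant = sum(f for _, f in track) / len(track)
    return duration, dominant


# Toy peaks: four peaks on a rising line, then a jump of almost 2 kHz.
peaks = [(0.000, 2000, 10), (0.001, 2050, 12), (0.002, 2100, 11),
         (0.003, 2150, 9), (0.004, 4000, 10)]
track = grow_track(peaks, min_intensity=3, t_c=0.0015, f_c=500)
duration, dominant = track_features(track)
```

In this toy input the last peak lies far from the regression line, so it is rejected and the track keeps the first four peaks. A full implementation would also compare the duration and density against \(d_{min}\), \(d_{max}\) and \(\beta \) before saving the track.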

2.4 Bandpass Filtering for Feature Extraction

After calculating the dominant frequency for all frog species, some species are found to have similar dominant frequencies but different spectral distributions. In this study, we explore a bandpass filtering technique for capturing this spectral information. First, a frequency scale is generated by applying the k-means clustering algorithm to the dominant frequencies of all frog syllables. Here, k is set to 18, which is the same as the number of frog species to be classified, and the distance function is the city block distance. After applying k-means clustering, the 18 centroids (\(C_{i},(i=1,...,18)\)) are saved for designing the bandpass filter bank.
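As an illustration of this clustering step, the sketch below clusters scalar dominant frequencies with the city block (absolute difference) distance. For this distance the point that minimises the within-cluster cost is the cluster median, so the update step uses the median. The function name kmedians_1d, the deterministic initialisation and the toy frequencies are assumptions made for the example, not details taken from the experiments.

```python
import statistics


def kmedians_1d(values, k, n_iter=100):
    """Cluster scalars with absolute-difference distance; centroids are medians."""
    ordered = sorted(values)
    # Assumed initialisation: spread the centroids evenly over the sorted data.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centroids[c]))
            clusters[nearest].append(v)
        # Keep the old centroid if a cluster happens to be empty.
        new = [statistics.median(c) if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:
            break
        centroids = new
    return sorted(centroids)


# Toy dominant frequencies (Hz) from three well-separated groups.
freqs = [500, 520, 510, 2000, 2100, 2050, 4000, 4100, 3900]
scale = kmedians_1d(freqs, k=3)
```

The sorted centroids returned here play the role of the frequency scale; in the setting described above, k would be 18 and the input would be the dominant frequencies of all segmented syllables.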

Bandpass Filter Design for Feature Extraction. In this study, a cascade of a 20th-order equalizer and a bandpass filter (Butterworth filter) is used to design a filter bank for feature extraction.

The generated bandpass filter bank is applied to the spectrum X(k) of each windowed frog syllable signal x(n). The steps for calculating the bandpass filter bank cepstral coefficients (BFCCs) are as follows:

Step 1: Apply bandpass filter bank to X(k)

Filter X(k) with the generated filter bank, and save the output of each bandpass filter as \(B(i,j)\), \(j=1,...,J\), where X(k) is the fast Fourier transform of the windowed signal x(n), i is the index of the coefficients within each bandpass filter output, and j is the index of the filter.

Step 2: Calculate the energy of filtered result for each frequency band

$$\begin{aligned} E_{j} = \sum _{i=1}^{M_{j}}[B(i,j)]^{2} \end{aligned}$$
(1)

where \(M_{j}\) is the number of coefficients after bandpass filtering with the j-th filter.

Step 3: Perform discrete cosine transform on the logarithm energy and obtain the feature BFCCs for each windowed signal

$$\begin{aligned} BFCCs(d) = \sum _{j=1}^{J}\log (E_{j})\cos \left( \frac{d(j-0.5)}{J}\pi \right) \end{aligned}$$
(2)

where \(d = 1,2,...,D\), D is the dimension of the BFCCs and is set to 12, and j is the index of the bandpass filter whose energy is \(E_{j}\).

Step 4: Average BFCCs over the temporal direction

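$$\begin{aligned} \overline{BFCCs}(d) = \frac{1}{F}\sum _{f=1}^{F}BFCCs(d,f) \end{aligned}$$
(3)

where \(BFCCs(d,f)\) is the d-th coefficient of the f-th windowed signal, f is the index of the windowed signal, and F is the number of windowed signals obtained after windowing.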
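Steps 2 to 4 can be illustrated with a short Python sketch. It assumes that the filtered spectral coefficients of each bandpass filter are already available as plain lists (Step 1 is not reproduced), and it adds a small floor before the logarithm to avoid log(0); both the floor and the names bfcc_frame and bfcc_syllable are assumptions made for the example.

```python
import math


def bfcc_frame(filtered, n_coeffs=12):
    """BFCCs of one windowed signal.

    filtered[j] holds the coefficients passed by the j-th bandpass filter.
    """
    n_filters = len(filtered)
    # Step 2: band energy = sum of squared filtered coefficients.
    # The 1e-12 floor is an assumption to keep the logarithm finite.
    log_e = [math.log(sum(b * b for b in band) + 1e-12) for band in filtered]
    # Step 3: discrete cosine transform of the log energies.
    # With a 0-based filter index j, (j + 0.5) equals the 1-based (j - 0.5).
    return [
        sum(log_e[j] * math.cos(d * (j + 0.5) * math.pi / n_filters)
            for j in range(n_filters))
        for d in range(1, n_coeffs + 1)
    ]


def bfcc_syllable(frames, n_coeffs=12):
    """Step 4: average the per-frame BFCCs over all windowed signals."""
    per_frame = [bfcc_frame(f, n_coeffs) for f in frames]
    return [sum(col) / len(per_frame) for col in zip(*per_frame)]


# Toy input: two filters with equal energy, two identical frames.
frame = [[1.0, 1.0], [1.0, 1.0]]
coeffs = bfcc_frame(frame, n_coeffs=4)
averaged = bfcc_syllable([frame, frame], n_coeffs=4)
```

With two filters of equal energy the cosine terms cancel for the first three coefficients, which makes the toy case easy to check by hand. Each syllable ends up with one vector of n_coeffs values regardless of how many frames it contains.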

3 Experiments and Discussion

3.1 Parameter Tuning

In this study, the parameters of two parts need to be discussed: spectral peak track extraction and feature calculation. For spectral peak track extraction, seven parameters were determined empirically by applying various combinations of thresholds to a small set of randomly selected syllables. Here, the minimum and maximum durations are 60 ms and 1000 ms. The density value, which describes the integrity of one syllable, is 0.8. The minimum intensity value is 3 dB. The maximum time interval for connecting peaks is 1.5 ms, and the maximum frequency interval is 500 Hz. For feature extraction, MFCCs are used as the baseline [9], with a window size of 512 samples and an overlap of 50 %.

3.2 Classification

The k-NN classifier has been successfully employed for classifying bioacoustic signals [3, 7, 14]. In this experiment, the k-NN classifier is used to learn a model on the training examples with 10-fold cross-validation for frog classification. Since the k-NN classifier is sensitive to the local structure of the data as well as the initial cluster centroids, we run the k-NN classifier 10 times based on different initial points. The feature performance is evaluated by the classification accuracy, which is defined as

$$\begin{aligned} Accuracy(\%)=\frac{N_{c}}{N_{t}}*100\,\% \end{aligned}$$
(4)

where \(N_{c}\) is the number of syllables that are correctly classified, and \(N_{t}\) is the total number of syllables for one frog species. Three features are put into the classifier: syllable features (SFs), MFCCs, and BFCCs. The averaged classification accuracy is shown in Table 2.
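A minimal sketch of the classification and scoring step is given below. The value of k, the squared Euclidean distance and the toy feature vectors are assumptions made for illustration, since they are not specified here, and the function names are hypothetical.

```python
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Label a query vector by majority vote among its k nearest training vectors."""
    # Squared Euclidean distance (assumed) to every training vector.
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), label)
        for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


def accuracy_percent(true_labels, predicted_labels):
    """Accuracy as in Eq. (4): correctly classified syllables over all syllables."""
    n_correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * n_correct / len(true_labels)


# Toy two-class data standing in for syllable feature vectors.
train_x = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
train_y = ["A", "A", "A", "B", "B", "B"]
queries = [[0.5, 0.5], [10.5, 10.5], [9, 9]]
truth = ["A", "B", "A"]

predicted = [knn_predict(train_x, train_y, q) for q in queries]
acc = accuracy_percent(truth, predicted)
```

The third query is deliberately mislabelled in the toy ground truth, so two of the three syllables count as correct. In a cross-validated setting the same accuracy function would be applied to each held-out fold and the results averaged.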

In this experiment, the averaged classification accuracies for MFCCs and SFs are 81.4 % and 88.1 %, respectively. Our proposed feature achieves the highest classification accuracy (94.3 %). For MFCCs, the classification accuracy of Neobatrachus sudelli and Philoria kundagungan is 100 %, because their spectra differ from those of the other frog species. With BFCCs, the classification accuracy of every frog species except Mixophyes fleayi is higher than 90 %. Since the spectra of Mixophyes fleayi and Limnodynastes terraereginae are similar, their classification accuracy is relatively low. However, the duration and oscillation rate of Mixophyes fleayi and Limnodynastes terraereginae differ, which leads to a higher classification accuracy for SFs. For MFCCs, the spectrum is extracted with the Mel-scale filter bank, which is designed around the human auditory system rather than the characteristics of frog calls. Because it is derived from the dominant frequency, which has been shown to discriminate frog species [7, 13], the designed bandpass filter bank is better suited to the frequency scale of the frog species to be classified. Compared with SFs, the bandpass filter bank captures not only the dominant frequency but also the distribution of the frog calls across all frequency bands.

Table 2. Classification accuracy (mean and standard deviation) for 18 frog species with three different features.

To test the robustness of the features, Gaussian white noise with a signal-to-noise ratio (SNR) of 40 dB, 30 dB, 20 dB, and 10 dB was added to the original audio data. The results of running the classifier on audio data with artificially added background noise are shown in Fig. 2, which demonstrates the ability of our feature extraction method to deal with background noise.

Fig. 2. Classification accuracy with MFCCs, SFs, and BFCCs under different levels of noise contamination.
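The noise contamination used in the robustness test can be reproduced in spirit with the sketch below, which adds white Gaussian noise to a signal at a chosen SNR. The rescaling of the noise to hit the target power exactly, the fixed random seed and the toy sine wave are assumptions made for the example; the function name add_white_noise is hypothetical.

```python
import math
import random


def add_white_noise(signal, snr_db, seed=0):
    """Return the signal plus white Gaussian noise at the requested SNR (dB)."""
    rng = random.Random(seed)
    p_signal = sum(s * s for s in signal) / len(signal)
    # Target noise power from SNR(dB) = 10 * log10(P_signal / P_noise).
    p_noise = p_signal / (10 ** (snr_db / 10))
    raw = [rng.gauss(0.0, 1.0) for _ in signal]
    # Assumption: rescale so the realised noise power matches the target exactly.
    p_raw = sum(n * n for n in raw) / len(raw)
    scale = math.sqrt(p_noise / p_raw)
    return [s + scale * n for s, n in zip(signal, raw)]


# Toy signal: 0.1 s of a 440 Hz tone sampled at 16 kHz.
clean = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
noisy = add_white_noise(clean, snr_db=20)

# Measure the SNR that was actually achieved.
noise = [y - x for x, y in zip(clean, noisy)]
p_clean = sum(x * x for x in clean) / len(clean)
p_added = sum(n * n for n in noise) / len(noise)
measured_snr = 10 * math.log10(p_clean / p_added)
```

Repeating this for 40, 30, 20 and 10 dB before segmentation and feature extraction gives one noisy copy of the data set per noise level.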

4 Conclusion

This study presents a novel feature extraction method for frog call classification. After segmenting the audio data into syllables, the SPT algorithm is applied to each syllable. Then, three syllable features are calculated: track duration, dominant frequency and oscillation rate. Based on the dominant frequency, a frequency scale is constructed with a k-means clustering algorithm for generating the bandpass filter bank. Finally, a feature set is extracted with the generated filter bank for classifying frog calls using a k-NN classifier. The experimental results are promising, with an average classification accuracy of 94.3 % for BFCCs. Future work will include additional experiments that test a wider variety of frog calls from different geographical and environmental conditions.