1 Introduction

Interest in music information retrieval has been increasing, since many users of vast audio repositories want to find audio files containing their favorite tune, or to obtain scores for given audio data automatically. Research has been performed on query-by-humming, which aims at finding pieces of music containing a tune sung by the user. Automatic transcription (i.e., automatic extraction of musical scores) is much more difficult, as it requires working on polyphonic and polytimbral data. Identifying the instrument playing each sound would be useful for this purpose, and it would also help users searching for pieces featuring their favorite instrument.

Automatic recognition of musical instrument sounds has been investigated by various research centers, starting from recognition of isolated sounds and progressing to sounds in mixes. The investigations were performed for various set-ups, including different sets of sound data (different sources, numbers of instruments, sounds, etc.), different parameterizations of the audio data (feature vectors of various lengths, based on various methods of sound analysis), and different classifiers. Audio data for research on musical instrument sound identification are often excerpted from large repositories of single instrument sound recordings, e.g. MUMS CDs, the RWC Music Database (with its Musical Instrument Sound Database), or the IOWA Musical Instrument Samples (Goto et al. 2002; Opolko and Wapnick 1987; The University of IOWA Electronic Music Studios 2009). Although some of the sounds recorded in these repositories have imperfections, these problems have already been identified. Therefore, these three repositories can be considered standard in research on musical instrument sounds.

Parameterization used in research on musical instrument sound recognition includes features describing properties of the DFT spectrum, wavelet analysis coefficients, MFCC (Mel-Frequency Cepstral Coefficients), multidimensional scaling analysis trajectories, etc.; more information can be found in Herrera et al. (2000). Although there is no standard set of parameters in audio research, the MPEG-7 features reflect the parameterizations used in experiments, and MPEG-7 parameters have been applied in research on musical instrument sound identification (ISO/IEC JTC1/SC29/WG11 2004; Manjunath et al. 2002; Wieczorkowska and Kolczynska 2008). The classifiers applied in this research include k-nearest neighbors, artificial neural networks, rough set based algorithms, support vector machines, etc.; see e.g. Cosi et al. (1994), Fujinaga and McMillan (2000), Kaminskyj (2002), Martin and Kim (1998), Wieczorkowska (1999). The obtained results are hard to compare because of differences in the number of classes, the number of objects per class, and the audio data itself. For instance, the recognition of instruments for isolated sounds can reach 100% for a small number of classes, more than 90% if the instrument or articulation family is identified, or about 70% or less when more instrument classes are to be recognized.

Identification of musical instrument sound is even more difficult when mixes are considered, but research on such data has already been performed as well (Dziubinski et al. 2005; Wieczorkowska and Kolczynska 2008; Zhang 2007). The possibility to recognize sounds in mixes is especially desirable for aiding automatic music transcription, see e.g. Klapuri (2004), as this is one of the most important tasks in Music Information Retrieval (MIR). Many users of digital audio files may be interested in extracting semantic information from these files, including melodic, timbral and harmonic information; practically speaking, this can be of interest to anybody who listens to digital audio recordings, which makes research in MIR broadly relevant.

The research presented in this paper is a follow-up to an article presented at the ISMIS’08 conference (Wieczorkowska and Kolczynska 2008), addressing the problem of instrument identification in sound mixes. In Wieczorkowska and Kolczynska (2008), we proposed a methodology for identifying dominating instruments in mixes of sounds of the same pitch for instruments representing chordophones and aerophones, choosing sustained harmonic sounds, and we performed validating experiments that yielded promising results. In this paper, we significantly extend this research by adding idiophones to the investigated sounds, and thus also investigating non-sustained and non-harmonic sounds. Additionally, we present classification tests performed on sound samples representing recordings different from those used in the training phase.

Automatic recognition of the dominating musical instrument in sound mixes with overlapping spectra is a difficult task; it is most difficult when the interfering sounds are of the same pitch, so that the harmonic partials in their spectra overlap (see Fig. 1). The number of possible combinations of sounds in mixes is very high even if only same-pitch mixes are considered, because of the number of existing instruments and of sounds within their scale ranges. Therefore, it would be desirable to obtain a classifier performing such recognition, and to train it on a limited data set. The motivation for this paper is to perform experiments on a limited range of sounds for a selection of instruments, and to use mixed artificial sounds with broadband spectra, overlapping with the sound under consideration, in order to check whether classifiers trained this way would work for sound mixes of real instruments. In other words, our goal was to check whether a limited number of artificial sounds mixed with the original ones can be sufficient to train a classifier to recognize the dominating musical instrument in a polytimbral mix of one pitch. The main focus of the paper is the construction of training and testing data which, if successful, will lead to further experiments with other mixes of musical instrument sounds.

Fig. 1

Sounds of the same pitch and their mixes. The time domain representation of the sound waves (left) and the spectra of these sounds (right) are shown. After mixing, it is very difficult to see the contribution of the flute sound to the mix, both in the time and frequency (spectrum) domains. The diagrams were prepared using Adobe Audition (Adobe Systems Incorporated 2003)

The research presented here is a pilot study; therefore, the number of instruments and sounds investigated was limited, as our goal was to test the methodology rather than to perform full-range experiments on all possible instruments over their full musical scales. If the methodology yields satisfactory results, it can be applied in broad experiments on identification of instruments in mixes, and then aid extraction of further semantic information from audio data, performed in MIR tasks.

In this research, we decided to choose 12 instruments producing sounds recognized by humans as being of definite pitch, choosing sustained sounds where possible, and to limit the range to octave no. 4 in MIDI notation. This was an arbitrary choice; still, octave no. 4 is available for many instruments, so data for the experiments were easy to find, and the reference pitch A4 belongs to this octave, which further justified the choice.

We are aware that the timbre of an instrument differs between sounds of different pitch, but since our investigations were focused on a same-pitch environment (rather than identification of all single sounds of each instrument), we decided to limit the frequency range to one octave available for all the investigated instruments. The sounds mixed with the original sounds in the training set include noises and artificial sound waves with harmonic spectra. The test set contains the original sounds mixed with sounds of the other 11 instruments, always of the same pitch. The level of the added sounds was adjusted to ensure that the sound of the main instrument is louder than the added sound at all times.

We have already performed experiments on training classifiers to recognize the dominating musical instrument in a polytimbral environment, i.e., when the sounds of other instruments accompany the main sound. However, the pitch of the accompanying sound or sounds was usually different from the pitch of the main sound. Additionally, the added sounds were diminished in amplitude in a very simple way, i.e., by re-scaling their amplitude through multiplying the sample values by a constant factor for all sounds. Since the main and the added sounds were not edited in any other way, in many cases the added sounds started earlier or ended later than the main ones, and were actually louder in some parts. Therefore, we decided to change the experimental setup to make sure that the main sound is actually louder all the time, and thus that the classifiers are trained properly. Additionally, previous experiments were performed using a Support Vector Machine classifier with a linear kernel, but the results were not satisfactory, since the recognition accuracy did not show any clear dependence on the level of the sounds added in mixes. Therefore, a non-linear kernel was used in the experiments described here.

Experiments have also been performed with testing on instrumental sounds that were not used in training in any form; in this case, we used audio data from different recordings.

The paper is organized as follows. In Section 2, the construction of the training and testing data sets is described, including description of features used for sound parameterization. Section 3 briefly describes Support Vector Machine classifiers used in our research. Experiments and their results are presented in Section 4. We conclude our paper in Section 5, also presenting directions for future experiments.

2 Experimental settings

The goal of our investigation was to prepare a classification methodology for automatic identification of a dominating instrument in mixes of sounds of the same pitch. In this section, we provide details on the audio data we chose, both for training and testing of the classifiers, as well as on the parameters constituting the feature vector used for classification purposes.

The main audio data in our experiment consist of sounds (of no more than ten seconds) from 12 musical instruments, limited to octave no. 4 (MIDI notation), i.e., 12 pitches. From these data, multiple training and testing data sets were constructed by mixing with noise (white, pink) and harmonic waves (saw-tooth and triangular), or by mixing instrument sounds, at various volume levels.

We have already worked on sound mixes in previous research (Wieczorkowska 2008). However, the choice of the accompanying instrument sounds in the training and testing sets was arbitrary, and the spectra did not overlap in most cases. Also, the length and the level of the accompanying sound were not normalized. As a result, no clear dependency between the quality of the classifiers and the level of the accompanying sound was observed. This is why we decided to normalize the length and level of the added sounds with respect to the main sounds, and to make sure that the level of the added sounds does not exceed the level of the main sounds at any time.

Musical instrument sounds were added as background sounds in the testing data because the final goal is to recognize dominating instruments in musical recordings. In the training data, however, artificial sounds with rich spectra were added instead, which reduced the training data size because not all possible combinations of added musical instrument sounds had to be considered. At the same time, the complexity hopefully needed for the classifiers to be useful on the test data was retained, because the added sounds had spectra overlapping with the main sound’s spectrum. This procedure also ensured that different data sets were constructed for training and testing, which is important in the case of sound data (Livshin and Rodet 2003).

2.1 Parameterization

The audio data we deal with are digital sounds in .snd format, recorded in stereo with a 44.1 kHz sampling rate and 16-bit resolution. Such a representation of sound waves requires parameterization for successful classification.

Features used to parameterize musical audio data may describe temporal, spectral, and spectral-temporal properties of sounds. The research on musical instrument sound recognition conducted worldwide is based on various parameters, including features describing properties of DFT spectrum, wavelet analysis coefficients, MFCC (Mel-Frequency Cepstral Coefficients), multidimensional scaling analysis trajectories, and so on (Aniola and Lukasik 2007; Brown 1999; Herrera et al. 2000; Kaminskyj 2002; Kitahara et al. 2005; Martin and Kim 1998; Wieczorkowska 2000). MPEG-7 sound descriptors can also be applied for musical sound parameterization (ISO/IEC JTC1/SC29/WG11 2004), but these parameters are not dedicated to recognition of particular instruments in recordings.

The construction of a feature set is an important part of creating the database for sound classification purposes, and the results may vary depending on the applied parameterization. In our research, we decided to use the feature vector already applied for the recognition of musical instruments in polyphonic (polytimbral) environments (Zhang 2007). The feature set we have chosen consists of 219 parameters, based mainly on MPEG-7 audio descriptors and on other parameters used in similar research. Most of the parameters represent average values of frame-based attributes, calculated for consecutive frames of a single sound using a sliding analysis window moved through the entire sound. The calculations were performed for the left channel of the stereo data, using a 120 ms analyzing frame with a Hamming window and a hop size of 40 ms; such a long analyzing frame allows analysis of low-pitched sounds even for the lowest audible fundamental frequencies. We also decided to perform parameterization for a shorter analyzing window of 24 ms with an 8 ms hop size. However, since the parameterization includes features based on spectral analysis, a shorter frame yields lower spectral resolution, and thus lower quality of these features. For spectrum calculation, zero-padding was applied so that the FFT could be used: the analysis length was set to the next power of two greater than the basic frame length. The experiments described here were performed on a limited data set (12 instruments), but we plan to extend our research to a much bigger set of data, and the feature set presented below can be used for this purpose.
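As an illustration, the sketch below (a Python/NumPy approximation based on our own assumptions, not the code actually used in the experiments) shows how such frame-based spectral analysis could be organized: a 120 ms Hamming-windowed frame with a 40 ms hop, zero-padded to the next power of two before computing the FFT.

```python
import numpy as np

SR = 44100                             # sampling rate [Hz]
FRAME = int(0.120 * SR)                # 120 ms analyzing frame (5292 samples)
HOP = int(0.040 * SR)                  # 40 ms hop size (1764 samples)
NFFT = 1 << (FRAME - 1).bit_length()   # next power of two (8192) for the zero-padded FFT

def frame_spectra(signal):
    """Yield the magnitude spectrum of each Hamming-windowed, zero-padded frame."""
    window = np.hamming(FRAME)
    for start in range(0, len(signal) - FRAME + 1, HOP):
        frame = signal[start:start + FRAME] * window
        # zero-padding to NFFT is done implicitly via the FFT length argument
        yield np.abs(np.fft.rfft(frame, n=NFFT))

# frame-based descriptors are then averaged over all frames of a sound, e.g.:
# value = np.mean([descriptor(s) for s in frame_spectra(left_channel)])
```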

The parameters included in our feature vector are listed below; detailed descriptions of the popular features can be found in the literature, and equations are given for the less commonly used ones (Zhang 2007):

  • MPEG-7 audio descriptors (ISO/IEC JTC1/SC29/WG11 2004; Peeters et al. 2000; Wieczorkowska and Kolczynska 2008):

    • Audio Spectrum Spread—an RMS value of the deviation of the log frequency power spectrum with respect to the gravity center in a frame; the value was averaged through frames for the entire sound;

    • Audio Spectrum Flatness, flat1, ..., flat25—multidimensional parameter describing the flatness property of the power spectrum within a frequency bin for selected bins; 25 out of 32 frequency bands were used for a given frame, and next these values were averaged for the entire sound;

    • Audio Spectrum Centroid—power weighted average of the frequency bins in the power spectrum of all the frames in a sound segment, calculated with a Welch method (Ifeachor and Jervis 2002);

    • Audio Spectrum Basis: basis1, ..., basis165; spectral basis parameters are calculated for the spectrum basis functions. In our case, the total number of sub-spaces in the basis function is 33, and for each sub-space, minimum/maximum/mean/distance/standard deviation are extracted to decrease the size of these data. Distance is calculated as the summation of dissimilarity (absolute difference of values) of every pair of coordinates in the vector (Zhang 2007). Spectrum basis function is used to reduce the dimensionality by projecting the spectrum (for each frame) from high dimensional space to low dimensional space with compact salient statistical information. The calculated values were averaged over all analyzed frames of the sound;

    • Harmonic Spectral Centroid—the average (over the entire sound) of the instantaneous Harmonic Centroid, calculated for each analyzing frame. The instantaneous Harmonic Spectral Centroid is the mean of the harmonic peaks of the spectrum, weighted by the amplitude in linear scale;

    • Harmonic Spectral Spread—the average over the entire sound of the instantaneous harmonic spectral spread, calculated for each frame. Instantaneous harmonic spectral spread represents the standard deviation of the harmonic peaks of the spectrum with respect to the instantaneous harmonic spectral centroid, weighted by the amplitude;

    • Harmonic Spectral Variation—mean value over the entire sound of the instantaneous harmonic spectral variation, i.e., of the normalized correlation between amplitudes of harmonic peaks of each 2 adjacent frames;

    • Harmonic Spectral Deviation—average over the entire sound of the instantaneous harmonic spectral deviation, calculated for each frame, where the instantaneous harmonic spectral deviation represents the spectral deviation of the log amplitude components from a global spectral envelope;

    • Log Attack Time—the decimal logarithm of the duration from the time when the signal starts to the time when it reaches its maximum value, or when it reaches its sustained part, whichever comes first; in our experiments, the sustained part is considered to begin when the pitch stays constant or changes no more than 50 cents (half of semitone) up or down;

    • Temporal Centroid—energy weighted mean of the sound duration; this parameter shows where in time the energy of the sound is focused;

  • other audio descriptors:

    • Energy—average energy of spectrum in the parameterized sound;

    • MFCC—min, max, mean, distance, and standard deviation of the MFCC vector, through the entire sound; again, distance was calculated the same way as in the case of Audio Spectrum Basis;

    • Zero Crossing Density—zero-crossing rate averaged through all frames for a given sound, where zero-crossing is a point where the sign of time-domain representation of sound wave changes;

    • Roll Off—the frequency below which an experimentally chosen percentage equal to 85% of the accumulated magnitudes of the spectrum is concentrated (averaged over all frames). It is a measure of spectral shape, used in speech recognition to distinguish between voiced and unvoiced speech;

    • Flux—the difference between the magnitude of the DFT points in a given frame and its successive frame, averaged through the entire sound. This value was multiplied by \(10^{7}\) to comply with the requirements of the classifier applied in our research;

    • Average Fundamental Frequency—fundamental frequency of the steady state of a given sound, averaged over all consecutive analyzed frames (a maximum likelihood algorithm was applied for pitch estimation; Zhang et al. 2007). The steady state is considered to begin when the pitch stays constant, or changes by no more than 50 cents (half of a semitone) up or down; the end of the steady state is defined similarly;

    • Ratio r1, ..., r11—parameters describing various ratios of harmonic partials in the spectrum;

      • r1: energy of the fundamental to the total energy of all harmonic partials,

      • r2: amplitude difference [dB] between 1st partial (i.e., the fundamental) and 2nd partial,

      • r3: ratio of the sum of energy of 3rd and 4th partial to the total energy of harmonic partials,

      • r4: ratio of the sum of partials no. 5–7 to all harmonic partials,

      • r5: ratio of the sum of partials no. 8–10 to all harmonic partials,

      • r6: ratio of the remaining partials to all harmonic partials,

      • r7: brightness—gravity center of spectrum,

      • r8: contents of even partials in spectrum,

        $$r_{8}=\frac{\sqrt{\sum_{k=1}^{M}A_{2k}^{2}}}{\sqrt{\sum_{n=1}^{N}A_{n}^{2}}}$$

        where \(A_{n}\)—amplitude of the nth harmonic partial,

        N—number of harmonic partials in the spectrum,

        M—number of even harmonic partials in the spectrum,

      • r9: contents of odd partials (without fundamental) in spectrum,

        $$r_{9}=\frac{\sqrt{\sum_{k=1}^{L}A_{2k-1}^{2}}}{\sqrt{\sum_{n=1}^{N}A_{n}^{2}}}$$

        where L = number of odd harmonic partials in the spectrum,

      • r10: mean frequency deviation for partials 1–5,

        $$r_{10}=\frac{\sum_{k=1}^{5}A_{k}\cdot \triangle f_{k}/kf_{1}}{\sum_{k=1}^{5}A_{k}}$$

        where \(\triangle f_{n}=((f_{n}-nf_{1})/nf_{1})\cdot 100\)% for a given frame,

      • r11: the index i (i = 1, ..., 5) of the partial with the highest frequency deviation.

These parameters describe basic spectral, timbral spectral and temporal audio properties, as well as the spectral basis descriptor, as defined in the MPEG-7 standard. The spectral basis descriptor is a series of basis functions derived from the Singular Value Decomposition (SVD) of a normalized power spectrum. In order to avoid too high a dimensionality of the feature vector, only a few summary features were derived from the spectral basis attribute. The other audio descriptors used in our feature vector include time-domain and spectrum-domain properties of sound used in research on audio data classification (Zhang 2007).
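For illustration, the short sketch below (a Python/NumPy approximation under our own assumptions, not the authors’ implementation) shows how a few of the listed features could be computed: the even-partial content r8 from the amplitudes of the harmonic partials, the Roll Off at 85% from a magnitude spectrum, and the zero-crossing density from a time-domain frame; as with the other frame-based descriptors, the frame values would then be averaged over the entire sound.

```python
import numpy as np

def r8_even_content(A):
    """r8: sqrt of the summed squared even-partial amplitudes divided by
    the sqrt of the summed squared amplitudes of all harmonic partials."""
    A = np.asarray(A, dtype=float)      # A[0] = fundamental (1st partial), A[1] = 2nd partial, ...
    even = A[1::2]                      # partials no. 2, 4, 6, ...
    return np.sqrt(np.sum(even ** 2)) / np.sqrt(np.sum(A ** 2))

def roll_off(mag_spectrum, sample_rate, nfft, fraction=0.85):
    """Frequency below which `fraction` of the accumulated spectrum magnitude lies."""
    cumulative = np.cumsum(mag_spectrum)
    k = np.searchsorted(cumulative, fraction * cumulative[-1])
    return k * sample_rate / nfft

def zero_crossing_density(frame):
    """Fraction of adjacent sample pairs where the time-domain waveform changes sign."""
    signs = np.sign(frame)
    return np.mean(signs[:-1] != signs[1:])
```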

We are aware that the feature vector used in these experiments is rather long and could perhaps be reduced. We have already performed research analyzing the importance of the attributes of our feature vector (Kursa et al. 2009). This analysis has shown that in most cases all descriptors used in the MPEG-7 set may be useful for the classification tasks; there is no clear cut-off between important and unimportant attributes, and the number of important attributes grows with the level of the added sound, although a set of less significant attributes can still work well in the recognition task. Therefore, we decided to use the feature vector as presented above.

2.2 Training data

The training data contain isolated sounds of single musical instruments, and also the same sounds mixed with additional sounds. The audio recordings from McGill University Master Samples (MUMS) CDs have been used as a source of these sounds (Opolko and Wapnick 1987). These CDs are commonly used worldwide for research on musical instrument sounds. The following 12 instruments have been chosen:

  1. B-flat clarinet,
  2. flute played vibrato,
  3. oboe,
  4. trumpet,
  5. French horn,
  6. tenor trombone,
  7. violin—bowed, played vibrato,
  8. viola—bowed, played vibrato,
  9. cello—bowed, played vibrato,
  10. piano (Steinway, played forte),
  11. marimba, single stroke,
  12. vibraphone, played with hard mallet.

The instruments chosen represent various families, including aerophones (wind instruments), chordophones (stringed instruments) and idiophones (instruments made of solid sonorous material; part of the percussion group). All these instruments produce sounds that can be recognized as being of definite pitch, and their spectra show harmonic content. The MUMS CDs offer various versions of sound samples for many instruments, played with different articulation, i.e., playing technique. In this research we decided to experiment with only one version for each of these 12 instruments. Sustained sounds were chosen where possible, so we discarded sounds played pizzicato (plucking the string—the resulting sound is very short and has no steady state) and tremolo (a series of very quick repetitions of the same, short sound). For some instruments vibrato articulation was selected, if it is typical for that instrument. We decided to use sounds covering octave no. 4 in MIDI notation, i.e., 12 sounds for each instrument. Therefore, each class (representing one instrument) consists of the same number of elements.

The ultimate goal of our research is to prepare classifiers for recognition of the main instruments in real recordings, where mixes of sounds constitute the prevailing part of the audio recording. In order to train classifiers for this purpose, we also prepared mixes of pairs of sounds, made of the instrumental sounds mentioned above and the following generated sounds:

  • white noise,

  • pink noise,

  • triangular wave,

  • saw-tooth wave.

As we can see in Figs. 2 and 3, such mixes represent well the situation when the spectra of the contributing sounds overlap and the instrumental sound of interest is hardly visible in the diagrams.

Fig. 2

Time domain and spectra of an instrumental sound (flute), artificial harmonic sound (triangular wave), and their mix. The diagrams were prepared using Adobe Audition (Adobe Systems Incorporated 2003)

Fig. 3

Time domain and spectra of an instrumental sound (flute), noise (pink noise), and their mix. Spectrum of noise has no harmonic partials. The diagrams were prepared using Adobe Audition (Adobe Systems Incorporated 2003)

For each mix, the added sound was of the same pitch as the main sound, but of a lower level. All accompanying sounds have broadband spectra, continuous in the case of the noises and harmonic in the case of the triangular and saw-tooth waves. The harmonic sounds were prepared for the frequencies of octave no. 4. These sounds were produced using Adobe Audition (Adobe Systems Incorporated 2003), where only integer frequency values were allowed. Therefore, the frequencies of the generated harmonic waves were rounded to the nearest integers, as listed below (standard values for A4 = 440 Hz shown in parentheses); a sketch of generating such signals is given after the list:

  • C4 - 262 Hz (261.6),

  • C#4 - 277 Hz (277.2),

  • D4 - 294 Hz (293.7),

  • D#4 - 311 Hz (311.1),

  • E4 - 330 Hz (329.6),

  • F4 - 349 Hz (349.2),

  • F#4 - 370 Hz (370.0),

  • G4 - 392 Hz (392.0),

  • G#4 - 415 Hz (415.3),

  • A4 - 440 Hz (440),

  • A#4 - 466 Hz (466.2),

  • B4 - 494 Hz (493.9).
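The following sketch (an assumption on our part; the actual sounds were generated with Adobe Audition) illustrates how comparable artificial signals could be synthesized: saw-tooth and triangular waves at the rounded frequencies listed above, plus white and pink noise.

```python
import numpy as np
from scipy.signal import sawtooth

SR = 44100                                 # sampling rate [Hz]
DURATION = 8.0                             # the generated sounds were 8 s long
t = np.arange(int(SR * DURATION)) / SR

def saw_wave(freq_hz):                     # e.g. 262 Hz for C4
    return sawtooth(2 * np.pi * freq_hz * t)

def triangle_wave(freq_hz):
    return sawtooth(2 * np.pi * freq_hz * t, width=0.5)

def white_noise():
    return np.random.uniform(-1.0, 1.0, len(t))

def pink_noise():
    """Approximate 1/f (pink) noise by spectrally shaping white Gaussian noise."""
    spectrum = np.fft.rfft(np.random.randn(len(t)))
    freqs = np.fft.rfftfreq(len(t), d=1.0 / SR)
    spectrum[1:] /= np.sqrt(freqs[1:])     # 1/f power corresponds to 1/sqrt(f) amplitude
    pink = np.fft.irfft(spectrum, n=len(t))
    return pink / np.max(np.abs(pink))     # normalize to [-1, 1]
```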

The generated sounds were prepared with a duration of eight seconds, since that was the length of the longest musical instrument sound in the initial training data set. The mixes were prepared in such a way that, for each pair, the added sound was truncated to the length of the main sound, and 0.1 s of silence replaced the beginning and the end of the added sound. Next, a fade-in was applied from the end of the initial silence up to 1/3 of the sound length; similarly, a fade-out was applied from 2/3 of the sound length. In this way we made sure that the added sounds (initially generated at constant loudness) are not louder than the main sound during the transients of the instrumental sounds, since our research assumes that the main sound dominates. Before mixing, for each instrumental sound chosen to dominate in a pair, the level of the louder sound (i.e., the added sound in the case of the training data) was first re-scaled to match the RMS of the softer sound. After this pre-processing, eight versions of mixes were prepared. In order to ensure that the instrumental sound is louder, and thus actually dominating in the mix, the level of the added sounds was diminished to

  1. \(100\mbox{\%}/\sqrt{2}\approx 70.71\mbox{\%}\),
  2. 50%,
  3. \(50\mbox{\%}/\sqrt{2}\approx 35.36\mbox{\%}\),
  4. 25%,
  5. \(25\mbox{\%}/\sqrt{2}\approx 17.68\mbox{\%}\),
  6. 12.5%,
  7. \(12.5\mbox{\%}/\sqrt{2}\approx 8.84\mbox{\%}\),
  8. 6.25%

of the volume level of the main sound. Therefore, we fulfill the assumption of our research as presented in the title of our paper, i.e., we prepare data for identification of the dominating instrument in the sound mix.

Since human perception is logarithmic with respect to changes of a stimulus of any type, the levels of the added sounds represent a gradual logarithmic diminishing of amplitude (with scaling factor \(\sqrt{2}\)), so the consecutive level steps are perceived by the hearing system as approximately uniform. Each mix was prepared as the average of the input sounds.
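A minimal sketch of this mixing procedure is given below; the exact implementation details (the original mixes were prepared with audio-editing software) are our assumptions. The added sound is truncated to the main sound’s length, its first and last 0.1 s are silenced, fades are applied up to 1/3 and from 2/3 of the length, its RMS is matched to that of the main sound, it is attenuated to one of the eight levels, and the two signals are averaged.

```python
import numpy as np

SR = 44100
LEVELS = [1 / np.sqrt(2) ** k for k in range(1, 9)]   # 70.71%, 50%, ..., 6.25%

def rms(x):
    return np.sqrt(np.mean(x ** 2))

def prepare_mix(main, added, level):
    n = len(main)
    added = added[:n].copy()                 # truncate to the main sound's length
    sil = int(0.1 * SR)
    added[:sil] = 0.0                        # 0.1 s of silence at the beginning
    added[-sil:] = 0.0                       # and at the end
    third = n // 3
    added[sil:third] *= np.linspace(0.0, 1.0, third - sil)       # fade in up to 1/3
    added[2 * third:] *= np.linspace(1.0, 0.0, n - 2 * third)    # fade out from 2/3
    added *= rms(main) / rms(added)          # match the RMS of the main sound
    added *= level                           # diminish to the chosen level
    return (main + added) / 2.0              # each mix is the average of the inputs
```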

Altogether, the training data represent isolated sounds of 12 musical instruments (12 sounds per instrument), and also the same sounds mixed with triangular wave, saw-tooth wave, white noise, and pink noise, at 8 volume levels. The obtained training set consists of 144 (12 instruments * 12 sounds) isolated sounds of single musical instruments, and also the same sounds with added noises and harmonic waves in 8 level versions as described above, altogether 4752 sounds (12 instruments * 12 sounds * 4 mixes * 8 levels + 12 instruments * 12 sounds).

2.3 Testing data

The data for testing consisted of mixes of instrument sounds only. As in the case of the training data, the testing data were prepared in 8 versions, i.e., for the same 8 volume levels of the added sounds and following the same preprocessing procedure (adjusting RMS, truncating the length, adding a fade-in from silence and a fade-out to silence). For each mix, the added sound was of the same pitch as the main sound, and was created as the average of the sounds of the 11 remaining instruments, modified in amplitude during preprocessing in the same way as the sounds added in the training set, and diminished to the desired level. Therefore, we had 8 subsets of the test set, each consisting of 144 sounds. Altogether the test data consist of 8 * 144 = 1152 sounds—same-pitch mixes of isolated single instrument sounds, representing the main sound to recognize, with the sounds of all 11 other instruments.

Additionally, the testing was performed on audio data not used in training. We used sound data available from IOWA Musical Instrument Samples (The University of IOWA Electronic Music Studios 2009) for 10 instruments and MUMS recordings (Opolko and Wapnick 1987) for 2 instruments (vibraphone and marimba), in this case choosing different recordings than those used for training. Our goal was to check if the classifiers trained on sound data representing single recordings for each sound can be used for classification of different recordings of such musical instruments. The test set was based on 144 sounds of musical instruments, and contained 1152 sounds representing mixes at 8 volume level versions. Classifiers obtained for training as described in Section 2.2 were applied in these experiments.

Again, the test data represented mixes of the dominating instrument with same-pitch sounds of the remaining 11 instruments. As before, 8 level versions were prepared, altogether 8 * 144 = 1152 sound mixes.

3 Classification with support vector machines

Our experiments on predominant instrument identification in same-pitch mixes were performed using Support Vector Machine (SVM) classifiers. In this section we provide background information on SVM, including details of the construction of the classifiers used in our research.

The classification experiments were performed using SVM classifiers, since we have multi-dimensional data for which SVM is suitable: it aims at finding the hyperplane that best separates observations belonging to different classes in a multi-dimensional feature space. Also, SVM classifiers have already been reported as successful for musical instrument sound identification (Herrera et al. 2000) and are a standard tool for multi-dimensional data.

In the SVM algorithm, the training vectors \(x_{i}\) are mapped into a higher-dimensional space by a function F, and SVM then finds a linear separating hyperplane with the maximal margin in this higher-dimensional space, with C > 0 being the penalty parameter of the error term (Hsu et al. 2008). The so-called kernel function K, defined as \(K(x_{i}, x_{j}) = F(x_{i})^{T} F(x_{j})\), was selected as a non-linear one, in the radial basis function (RBF) form (Hsu et al. 2008):

$$K(x_{i}, x_{j}) = \exp\big( - \gamma \left\| x_{i} - x_{j} \right\|^{2} \big),\quad \gamma > 0 $$

There are SVM implementations available on the Internet, including the WEKA software (The University of Waikato 2009) and the LIBSVM library (Chang and Lin 2001). These applications implement versions of the Sequential Minimal Optimization (SMO) algorithm for fast training of SVM (Fan et al. 2005; Platt 1998; Class SMO 2008). In this research, we used the LIBSVM library, which not only performs SVM training and testing for the investigated data, but also provides a grid-search heuristic for finding pseudo-optimal values of C and γ for the SVM with RBF kernel. The grid search performs CV-5 cross-validation on the training data for pairs drawn from exponentially growing sequences of C and γ, and the pair with the best cross-validation accuracy is selected (Hsu et al. 2008). Next, training of the SVM on the training set and testing on new data (the test set) can be performed for these C, γ. In our research, different training and testing sets were used, so the CV-5 procedure mentioned above was used solely to find pseudo-optimal C and γ. Before applying the SVM training algorithm, each attribute of the training data was normalized to the [−1, 1] interval. The obtained scaling coefficients were then used to re-scale the testing data accordingly.
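The sketch below illustrates this procedure using scikit-learn (which wraps LIBSVM) rather than the LIBSVM tools themselves; the parameter ranges follow the usual recommendation of Hsu et al. (2008) and are our assumption. Attributes are scaled to [−1, 1] with coefficients derived from the training data, a CV-5 grid search over exponentially growing C and γ selects pseudo-optimal values, and the final RBF-kernel SVM is trained on the full training set and applied to the separate test set.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import GridSearchCV

def train_and_test(X_train, y_train, X_test, y_test):
    # normalize each attribute to [-1, 1] on the training data; re-use the coefficients on the test data
    scaler = MinMaxScaler(feature_range=(-1, 1)).fit(X_train)
    X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

    param_grid = {
        "C": 2.0 ** np.arange(-5, 16, 2),      # exponentially growing sequences of C and gamma
        "gamma": 2.0 ** np.arange(-15, 4, 2),
    }
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)   # CV-5 grid search
    search.fit(X_train, y_train)

    best = search.best_estimator_              # SVM refit with the pseudo-optimal C, gamma
    return best.score(X_test, y_test), search.best_params_
```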

In the presented research, we first performed investigations aiming at finding pseudo-optimal settings of the SVM classifiers using the heuristic technique presented above, and then performed training and testing for the obtained settings. Although classifiers other than SVM could also be investigated (see, e.g., Kursa et al. 2009 for a comparison), we decided to focus on the methodology of using the proposed sound mixes in such experiments, rather than on comparing classifiers.

4 Experiments and results

The experiments performed in our investigations are presented below. We started with experiments with the data represented by a feature vector obtained for a 120 ms analysis window. Also, experiments for parameterization based on 24 ms analysis window were performed, and the results are presented in this section. Firstly, the test set was chosen as data representing mixes of instrumental sounds used in the training phase—the dominating instrument mixed with other instruments (all sounds used in training). Secondly, we conducted experiments with testing of the obtained classifiers on sound data representing instrumental mixes based on recordings that were not used in training. The experiments and results are shown in this section, including problems we encountered, and our ideas to address them.

Our experiments aimed at configuring SVM classifiers for our data: first for isolated single instrument sounds of octave no. 4 for the 12 musical instruments listed in Section 2.2, then for the data set expanded by adding the same sounds mixed with additional artificial sounds (triangular wave, saw-tooth wave, white and pink noise) at 8 loudness levels, for each level separately, and finally for all training sounds together. Pseudo-optimal values of C and γ were searched for these SVM classifiers. The obtained classifiers were then tested on the test data, i.e., single instrument sounds mixed with the sum of the other 11 instrument sounds of the same pitch, at the same level of added sounds as was used for training, for each of these 8 levels separately, and then on the entire test set. Therefore, we always had different types of sounds for training (single instrument sounds or mixes with artificial sounds) and for testing (instrumental sound mixes), and the sounds from the test set were never used for training, to reflect the real-world situation in which the test data usually differ from the training data. One of the main purposes of our experiments was to check whether classifiers trained on a limited set of specially prepared data can then be used on instrumental data. In each experiment, we had 12 classes representing the 12 instruments to be identified as dominating in the mix.

The main experiments were performed for the 120 ms analyzing frame, as this frame length yielded good results in other experiments, and it yields the best spectral resolution, thus allowing better parameterization. Additionally, we performed experiments with a shorter, 24 ms analyzing frame, which is a more typical length, in case the 120 ms window turned out to be too long for proper classification.

The settings of c, γ used in these experiments (as found using the heuristic procedure presented in Section 3) are shown in Table 1.

Table 1 Pseudo-optimal c, γ applied in our experiments, as found using the heuristic procedure presented in Section 3

General results of experiments described in Section 2 for MUMS data and 120 ms analyzing frame are shown in Table 2 for settings c, γ of the RBF support vector machine classifiers that yielded the best results in CV-5 based grid-search.

Table 2 Results of experiment with SVM classifiers using pseudo-optimal c, γ in each case for 120 ms analyzing frame, for training and testing on instrument sounds from MUMS CDs

As expected, the higher the level of the accompanying sounds in the mixes, the lower the recognition rate (see Fig. 4). In experiments with a linear SVM classifier such a tendency was not observed, so we consider the RBF-kernel SVM classifier to be the appropriate tool for such data. Also, adding mixes to the training set improves the quality of the recognition of the dominating instrument. This can be observed in detail when comparing confusion matrices—as an example, we present here the confusion matrices for testing at the 6.25% level of sound added in mixes, with training on single instrument sounds (Table 3) and with training on the data representing both single instrument sounds and mixes at the 6.25% level (Table 4).

Fig. 4

Accuracy of identification of the dominating instrument for training and testing at various levels of sounds added in mixes, for the 120 ms analyzing frame and instrumental data from MUMS CDs

Table 3 Confusion matrix obtained using SVM classifier using pseudo-optimal c, γ for 120 ms analyzing frame, training on single instrument sounds, and testing on instrumental mixes at 6.25% level, for instrument sounds from MUMS CDs
Table 4 Confusion matrix obtained using SVM classifier using pseudo-optimal c, γ for 120 ms analyzing frame, training on single instrument sounds and mixes with artificial sounds at 6.25% level, and testing on instrumental mixes at 6.25% level, for instrument sounds from MUMS CDs

These results are a bit lower than those obtained in our earlier experiments for mixes of sounds of different pitch (Wieczorkowska 2008)—for 4 instruments, the obtained accuracy ranged from 54% (for 50% level of added sounds) to 92%, using the following classifiers available in WEKA (The University of Waikato 2009): Bayesian networks, logistic regression model, decision trees, and locally weighted learning. Since no clear trend was observed in those experiments, we decided to adjust the set-up of experiments, using SVM as recommended by many researchers, and focusing on the most difficult case of same-pitch mixes, with training on both isolated sounds and mixes, as presented here.

Although the increase in recognition rate after adding mixes could in general be attributed to the larger training set, it could not have been achieved if the additional training examples had not been properly chosen. We conclude that a classifier for recognition of the dominating instrument in sound mixes should be trained not only on pure sounds of each instrument, but on some mixes as well. When training was performed on instrumental sounds only and testing on the combined test data, the test set was significantly bigger than the training set. Still, about 65% accuracy was obtained, which is quite a satisfactory result for 12 classes and a test performed on new data. However, adding mixes to the training set increased the recognition accuracy to 85%.

More insight into the details of instrument recognition can be gained from the confusion matrices of particular classifiers. Probably the most interesting is the confusion matrix for training on the combined training set and testing on the combined test set, shown in Table 5. As we can see, the most difficult instruments to identify are marimba, vibraphone, and violin. Still, the performance of SVM is improved after adding mixes, since for training on single instrumental sounds only, the recognition is even lower, as we can see in Table 6.

Table 5 Confusion matrix obtained using SVM classifier using pseudo-optimal c, γ for 120 ms analyzing frame, training on combined training data and testing combined test data, for instrument sounds from MUMS CDs
Table 6 Confusion matrix obtained using SVM classifier using pseudo-optimal c, γ for 120 ms analyzing frame, training on single instrument sounds and testing on mixes of various levels, for all instrument sounds originating from MUMS CDs

Since a 120 ms analyzing frame seems to be rather long for sounds from the 4th octave (in MIDI notation), we also performed experiments for a 24 ms analyzing frame. General results of these experiments are shown in Table 7. As we can see, results for 120 ms are quite similar to results obtained for 24 ms analyzing frame, but the former are slightly better in most cases.

Table 7 Results of experiment with SVM classifiers using pseudo-optimal c, γ in each case for 24 ms analyzing frame, for training and testing on instrument sounds from MUMS CDs

Again, we can observe a clear general tendency: the higher the level of the additional sounds mixed with the main instrument sound, the lower the achieved accuracy (see Fig. 5), as expected. Since we investigated the quite difficult task of recognizing the dominant instrument in a same-pitch environment, the obtained accuracy for 12 classes is quite satisfactory (when the training set contains mixes).

Fig. 5

Accuracy of identification of the dominating instrument for training and testing at various levels of sounds added in mixes, for the 24 ms analyzing frame and instrumental data from MUMS CDs

When comparing the results for the 120 ms and 24 ms frames, we can observe that even though adding mixes to the training sets improves the accuracy in all cases, the improvement is more visible for the 24 ms frame.

More details on classification accuracy for particular instruments can be observed in confusion matrices. The confusion matrix for experiments on audio data based on MUMS sounds for 24 ms analyzing frame is shown in Table 8, illustrating results for training on the combined training data and combined testing data for MUMS instrument sounds.

Table 8 Confusion matrix obtained using SVM classifier using pseudo-optimal c, γ for 24 ms analyzing frame, training on combined training data and testing combined test data, for instrument sounds from MUMS CDs

Comparing the results for the 24 ms and 120 ms analyzing frames, we can observe that again marimba, vibraphone and violin were the most problematic instruments, but marimba and vibraphone are now more difficult to identify. Still, the overall results are similar, so we conclude that both 120 ms and 24 ms windows can be used for parameterization; and since the results for the long, 120 ms window outperform the results for the typical, 24 ms window, we infer that the use of a longer window should be recommended in such experiments.

Generally, the most difficult instruments to recognize in our research are marimba and vibraphone, for both the 24 ms and 120 ms analyzing frames. This is not surprising, since they belong to the percussive group and their sounds are not sustained. Changes in the time domain may also explain why various instruments were mistaken for string instruments (cello, viola, violin)—the vibrato present in the sounds of these instruments causes difficulties at the parameterization level. We presume that a better parameterization, carefully capturing changes of sound features over time, may improve classification accuracy. Also, violin, viola, and cello belong to the same family of instruments, and violin vs. viola or viola vs. cello are not so easy to distinguish even for the human ear, because of the similarity of instrument construction and timbre. Therefore, it is not surprising that the classifiers often confused these instruments.

On the other hand, our classifiers also had some difficulty identifying sustained harmonic sounds played without vibrato. However, these mistakes were not so frequent, and can be quite easily explained. For instance, French horn and tenor trombone represent the same family of brass instruments and their timbres are similar, so it is not surprising that French horn was often mistaken by the classifier for tenor trombone.

When comparing detailed results of all our experiments (including all confusion matrices, not presented here), for various set-ups of our training and testing data, we can observe the following:

  • the lower the level of sounds added in mixes, the less often various instruments were mistaken for cello,

  • marimba is more often correctly identified for lower levels of sounds added in mixes, and is most often correctly recognized at the 6.25% level,

  • vibraphone was easier to recognize correctly for lower levels of sounds added in mixes, with best results for 6.25% and 8.84% levels,

  • the lower the level of added sounds, the higher the accuracy of recognition of French horn,

  • violin was most often mistakenly identified as viola for 25% and 17.68% levels,

  • violin was often mistaken for viola, whereas viola was almost never mistaken for violin.

The obtained results show quite high capabilities of classifiers trained as described in this paper, because the recognition of the main sound in a mix of sounds of the same pitch is one of the most difficult tasks in a multi-timbral environment. We can conclude that this training prepares the classifier quite well to recognize the training instrumental sounds in any mix in which the main sound dominates. Still, it is interesting to investigate the generalization property of such classifiers, i.e., whether they can be used to recognize sounds representing different recordings of these instruments.

As we mentioned before, the experiments were also performed for testing sounds not used in any way in training. These experiments were performed for both 120 ms and 24 ms analyzing frames. As before, training was performed for recordings of the selected 12 instruments from MUMS CDs, but testing was performed on the recordings from the IOWA database (10 out of 12 instruments) and other recordings from MUMS CDs (vibraphone and marimba), not used in training (Opolko and Wapnick 1987; The University of IOWA Electronic Music Studios 2009). Training was performed on pure instrumental sounds and on mixes with artificial sounds at 8 levels, as previously. Testing was performed on mixes of the main instrument sound with other instrument sounds (from the test set) of the same pitch. Experiments were performed on all level versions separately (including training on pure instrumental sounds), and also for the combined training data and combined testing data.

The results obtained for new data show significantly lower accuracy, up to 34.72% for 120 ms frame and up to 45.14% for 24 ms frame, for 6.25% level. Those results are not satisfactory, but still better than random choice (i.e., 8.33%). For training on combined training data (based on MUMS) and testing on combined test data (based on 10 IOWA and 2 other MUMS instrument recordings), 29.34% accuracy was obtained for 120 ms frame and 29.69% for 24 ms frame.

The confusion matrices for the combined data (not shown here to avoid too many tables) indicate that clarinet, flute, violin and viola are almost never correctly identified, for both the 24 ms and 120 ms frames (only violin was identified correctly twice for the 120 ms frame). Therefore, we conclude that these sounds differ too much between the MUMS and IOWA recordings. If the training data contained more samples representing each sound (each fundamental frequency) for each instrument, then new data could probably be identified more accurately.

In the tests on new data we again observed difficulties in correctly identifying idiophones, i.e., marimba and vibraphone. We are aware that some parameters are dedicated to sustained and harmonic sounds, and can obscure the results. In Table 9 we present a confusion matrix for training on MUMS data representing single instrument sounds and mixes at the 6.25% level, and testing on IOWA (10 instruments) plus MUMS (2 instruments) instrumental mixes at the same level, for the 120 ms frame, for the feature vector without the last 17 parameters, which were designed rather for sustained harmonic sounds, i.e., without HarmonicSpectralCentroid, HarmonicSpectralSpread, HarmonicSpectralVariation, HarmonicSpectralDeviation, r1, ..., r11, and without LogAttackTime and TemporalCentroid. This table also illustrates the general problems our SVM classifiers had with recognizing new sounds. Since the sounds used for testing originate from a different recording than the single recording applied for training, the classifier performed poorly even when identifying single (not mixed) sounds. Therefore, we believe that the classifier should be trained on more sounds for each instrument and for each fundamental frequency; otherwise it may not generalize properly.

Table 9 Confusion matrix obtained using SVM classifier using pseudo-optimal c, γ for 120 ms analyzing frame, training on single instrument sounds and mixes with artificial sounds at 6.25% level for MUMS data, and testing on instrumental mixes at 6.25% level for instrumental sounds not used in training: from IOWA (10 instruments) and MUMS (2 instruments), for shorter feature vector—without the last 17 parameters

We infer from these results that for idiophones some other features should probably be included in the feature set, to capture the properties of their sounds.

To conclude, we are aware that the classifiers used in our research can overfit the data. This is because we were using single sound data for each fundamental frequency for each instrument. We would expect that using more samples (more recordings for each instrument) as training data would improve the classification results.

5 Summary and conclusions

The purpose of this research was to perform experiments on recognizing the dominating instrument in mixes of sounds of the same pitch. This case is the most difficult for classification, since the spectra of the mixed sounds overlap to a high extent. The set-up of the experiments for training and testing of classifiers was designed in such a way that a relatively small training data set can be used for learning: instead of using all possible pairs for training, we used mixes with artificial sounds whose spectra overlap with the main sounds, i.e., of the same pitch in the case of added harmonic sounds, or noises. The results show that adding mixes to the training set generally yields a significant improvement in classification accuracy. Also, the lower the level of sounds added in mixes, the higher the accuracy of identification of the main instrument, as we expected and as is the case for human listeners.

Since we were investigating the very difficult task of identifying the dominating instrument when instruments play in unison (i.e., for mixes of sounds of the same pitch), the obtained accuracy is quite satisfactory when mixes based on sounds from the training set are tested. This indicates that the classifiers we used were properly chosen and trained, and that training on mixes with artificial sounds works well when mixes of the original sounds with other instrumental sounds are the subject of the recognition process. Still, the identification of the dominating instrument poses difficulties when performed on mixes based on new sounds, different from those used in training. Apparently, the training set should be based on more sounds representing these instruments in order to improve the generalization of the classifiers, no matter what kind of classifier is applied. We have already conducted experiments with random forests (Kursa et al. 2009) applied to identification of the dominating instrument in same-pitch mixes, and the results show that such classifiers outperform SVM—but the experiments were performed with a test set based on the training data, and these results should be validated on new sounds, i.e., sounds not used in training in any form.

Generally, our goal is to perform experiments on a bigger set of data, including more instruments and octaves. We also plan to continue our research using percussive instruments (including membranophones); this is one of the reasons why we chose noises for mixes. Experiments with training on mixes with artificial sounds are also planned for sounds of different pitch. We are considering investigating classification for various articulations for each instrument, or for variants of the same instrument, although this would greatly increase the research effort. Still, we plan to add more sound samples to the training data, to avoid overfitting the classifiers to particular recordings, and we hope that adding different recordings to the training data will improve recognition accuracy on new audio data (not used in any form in training).

Another issue to investigate is the sound parameterization. Our feature vector is relatively high-dimensional. In Wieczorkowska and Kubik-Komar (2009), we applied PCA (principal component analysis) to reduce the original feature set, and the most significant parameters found using this method coincided with the features indicated by the algorithm based on random forests (RF) applied in Kursa et al. (2009). Also, the importance of most of the features turned out to be high when investigated with RF (Kursa et al. 2009). Still, since some classifiers have limitations regarding the size of the data, we could perform feature selection before the classification experiments. One possibility is to apply rough-set based algorithms and extract reducts, i.e., reduced sets of features. Genetic algorithms can also be applied, or rough sets and genetic algorithms combined.

On the other hand, sounds can be grouped according to articulation or instrument family, so hierarchical classification can also yield good results, as we have already observed in some experiments (Wieczorkowska et al. 2008). Careful statistical analysis of the results may indicate when an improvement, if any, is statistically significant. It could also be interesting to check how classifiers trained on one level of sounds added in mixes perform in tests on sounds with different levels.

Additionally, we are aware that our feature vector may not be well suited to recognizing all the investigated instruments. This is a good direction for future research, focusing on parameterization for instrument identification purposes, especially for the recognition of non-harmonic and non-sustained sounds.