Research on vocal sounding based on spectrum image analysis
The improvement of vocal singing technique involves many factors, and it is difficult to achieve the desired effect through human analysis alone. Motivated by this, this study applies spectrum image analysis, using the radix-2 decimation-in-time FFT as the core research algorithm and the wavelet transform for denoising, and combines comparative analysis to discuss the vocal mechanism, the state of vocalization, and the voice quality of singers in vocal music teaching. The study also compares singers' frequency, pitch, overtones, harmonics, and singer's formants, derives the characteristics of vocal production under different conditions, and the approach can be extended to other vocal music studies. The results show that the method is practical and can provide a theoretical reference for related research.
Keywords: Vocal music · Image analysis · Music spectrum · Singing · Image processing
1 Introduction

Research on vocal singing technique has a long history. Many experts and scholars have studied and summarized it, and after repeated exploration and continuous practice, some relatively mature technical methods have been formed. However, a review of the literature shows that domestic summaries of these singing methods exist only as verbal teaching; an intuitive, unified description is essentially absent, which makes it inconvenient to learn, recognize, and evaluate vocal music correctly. Although current research reports have achieved certain results, differences in vocal cord conditions and cultural background introduce deviations into experimental results, so it is doubtful whether those data can represent the level of subjects in our country. Whether such tests fit the local cultural background is difficult to verify, and carrying out this kind of research domestically is therefore urgent. In addition, different kinds of music differ from one another, and few related studies address vocal music specifically. This study therefore combines image processing to study vocal production.
Wang et al. proposed a normalized model of pitch perception based on the human ear's distinct resolution of different frequency sources; the algorithm is of great significance for multi-frequency estimation techniques. Wu et al. simplified the pitch model by using the Bark scale to simulate the human ear's frequency response, which greatly reduced the amount of computation. Their method calculates the autocorrelation coefficient of the signal through two channels, together with the multi-channel autocorrelation coefficients on the non-ERB scale, and finally obtains the individual fundamental frequency values one by one through scale stretching and linear interpolation. A similar cycle-based algorithm was used by Staley and Sethares. Also building on a pitch model, Cheveigne et al. recursively screened the estimated fundamental frequencies during candidate selection.
Guan et al. used new technology to develop one of the most advanced automatic music annotation systems of its time, whose multi-frequency estimation method was also in a leading position. They applied computational scene analysis in the system, assuming that a sinusoidal trajectory is a musical note, and drew on the principle of perceptual grouping, including harmonicity, frequency, timbre, and melody. Guan et al. designed instrument models for the instruments corresponding to each note; these models can resolve the spectral energy distribution when notes of different frequencies overlap. In addition, they analyzed the probability of occurrence of notes given a chord and used Markov chains to encode chord transition probabilities. The method uses a Bayesian network with linear prediction for bottom-up analysis, while top-down processing includes prediction of note spectral components and chords.
Meng modeled the short-time spectrum of the music signal. In the model, each tone model consists of a certain number of harmonics, and each harmonic component is modeled by a Gaussian distribution centered on an integer multiple of the fundamental frequency. Meng designed a multi-agent structure to track the fundamental frequency weights of consecutive frames and iteratively updated each pitch model and its weight with the expectation-maximization method. The algorithm can successfully track the pitch lines of the music signal and its melody. Although the system is rather complicated, the core EM algorithm is relatively easy to implement. The algorithm can correctly estimate the weights of all fundamental frequencies, but its disadvantage is that usually only the predominant fundamental frequency can be estimated.
Qiu et al. improved on the idea of the weighted mixed-tone model; however, they modeled each set of harmonics in the tone model instead of modeling each harmonic individually. They argue that although the amplitudes of the notes in a signal change, the harmonic structure of each note, that is, the relative relationship between its harmonics, is unique and stable. Based on the relationship between harmonic frequencies, Qiu et al. chose to model a set of harmonics with a constrained Gaussian mixture model. Experiments show that this constraint reduces the number of parameters to be estimated in the system. In this modeling mode, the signal model is established and the parameter estimates are finally obtained by expectation maximization.
A spectrum iterative deletion method was adopted by Claudio. To reduce the influence of instruments of different timbres as much as possible, the algorithm first pre-whitens the signal and then, for each candidate fundamental frequency, sums the energy at the corresponding harmonic frequencies in the frequency domain. The candidate with the greatest energy is taken as a fundamental frequency. The algorithm then subtracts that fundamental frequency and its harmonic energy from the spectrum according to a certain ratio, estimates the next fundamental frequency, and iterates in this way. The algorithm has several technical points that need special attention, such as parameter estimation during pre-whitening, the proportional deletion coefficient, and the judgment of the dominant fundamental frequency estimate. It was tested on chords randomly synthesized from a variety of instrumental notes and yielded the desired results.
Zabalza et al. treat fundamental frequency estimation as a multi-label classification problem, which is completely different from the other methods. Each signal frame is considered a sample, each fundamental frequency is treated as a label, and the algorithm extracts features from the signal spectrum. For each fundamental frequency, they collect all samples containing it and train a support vector machine classifier, and the classification result is taken as the fundamental frequency estimate. Zabalza et al. tested on piano signals and the results were as expected. However, although the algorithm is novel, it faces two difficulties. First, although the one-versus-rest classifiers greatly reduce the parameter space, splitting the fundamental frequencies across separate classifiers ignores the close relationships between notes. Second, the algorithm needs to train up to dozens of classifiers, which greatly increases the amount of computation.
In view of the current role of image processing in vocal research, the purpose of this study is as follows. First, it aims to give a clear picture of the current status and trends of research on this topic worldwide, as well as unresolved issues and limitations, so as to better fill the gap. Second, the author's collation of descriptions of singing technique can serve as basic material for other researchers, providing an innovative basis for future research.
2 Research methods
In practical applications, it is clear that the FFT algorithm is a great improvement over direct computation of the DFT, and the larger N is, the more obvious the effect. Since the computational complexity of the FFT is significantly lower than that of the direct DFT, the FFT is more suitable for this study.
Comparison of the computational cost of the radix-2 FFT and direct DFT computation

Direct computation (DFT): N^2 complex multiplications, N(N − 1) complex additions
Radix-2 FFT: (N/2)log2 N complex multiplications, N log2 N complex additions
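To make the algorithm concrete, the radix-2 decimation-in-time FFT can be sketched as follows. This is a minimal pedagogical sketch, not the implementation used in the study, and it assumes the input length is a power of two:

```python
import cmath

def fft_radix2(x):
    """Radix-2 decimation-in-time FFT; len(x) must be a power of two.

    Splits the input into even- and odd-indexed halves, transforms each
    half recursively, and combines the results with twiddle factors."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle factor W_N^k
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out
```

For N = 1024, this butterfly structure costs (N/2)·log2 N = 5120 complex multiplications, against N^2 = 1,048,576 for the direct DFT, which is the gap the comparison above refers to.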
The frequency comparison method determines which music chord's spectrum corresponds to, or is similar to, the spectrum of the audio signal in the vocal utterance. There are many ways to compare; the simplest is to use the maximum intensity of the spectrum, i.e., the maximum-intensity method of spectrum comparison. We use the Bayes formula to calculate the probability of identity or similarity; this formula, with its profound ideas, is widely used in scientific research.
When using the Bayes formula, a probability model must be constructed from the spectrum of the audio signal. Statistical calculation is used to compare the amplitudes of the sampled audio spectrum with the peaks of standard music chord spectra to determine the chord corresponding to the played audio signal, after which the timbre can be determined.
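The Bayes-formula comparison can be sketched as follows. The Gaussian peak-matching likelihood and the chord templates here are illustrative assumptions, not the probability model specified in the paper:

```python
import numpy as np

def chord_posterior(spectrum_peaks, templates, prior=None):
    """Posterior P(chord | observed spectral peaks) via Bayes' rule.

    Likelihood (assumed model): each observed peak is matched to the
    nearest expected peak of a chord template, and the distances are
    scored with a Gaussian of 15 Hz width."""
    chords = list(templates)
    if prior is None:                       # uniform prior over chords
        prior = {c: 1.0 / len(chords) for c in chords}
    likelihood = {}
    for c in chords:
        expected = np.asarray(templates[c], dtype=float)
        obs = np.asarray(spectrum_peaks, dtype=float)
        # distance from each observed peak to its nearest template peak (Hz)
        d = np.min(np.abs(obs[:, None] - expected[None, :]), axis=1)
        likelihood[c] = float(np.exp(-0.5 * np.sum((d / 15.0) ** 2)))
    evidence = sum(likelihood[c] * prior[c] for c in chords)
    return {c: likelihood[c] * prior[c] / evidence for c in chords}

# hypothetical templates: fundamental frequencies of the triad tones (Hz)
templates = {"C major": [261.6, 329.6, 392.0],
             "A minor": [220.0, 261.6, 329.6]}
post = chord_posterior([262.0, 330.0, 392.5], templates)
```

Here the observed peaks lie close to the C-major template, so its posterior dominates; the same machinery extends to any set of chord templates.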
The sound of vocal singing is a compound tone, composed of the pitch of the singing together with other partials. Vocal training is an extremely important part of vocal teaching: both teachers and students rely on the auditory system to judge and identify the sounds produced during singing, including distinguishing pitch, strength, and rhythm, as well as a series of phonetic factors that reflect the physiological state of singing, pronunciation, and language feeling. It can be said that learning vocal music is really a process of establishing the correct concept of sound, and establishing this concept depends to a large extent on the sensitivity of the learner's auditory system. The quality of hearing is directly related to the listener's physiological condition and experience. The most obvious advantage of describing and identifying the sound spectrum of pronunciation with multimedia computer technology is its accuracy and intuitiveness. During vocal training, we recorded the singer's voice with a computer and observed the subtle parameters of the emitted sound through the function charts and sound wave tables displayed on screen. This objectively complements the human auditory system. Using the computer in vocal teaching has a positive effect on the quality of the vocalist's pronunciation and on the analysis of singing and speech intelligibility. The following is a spectrum image analysis test for vocal sounds.
By analyzing the vocalist's pronunciation through spectral image analysis, one can observe the pitch of the singing voice and the various overtones, partials, and sounds in different frequency bands. The result is the average of the frequency analysis over a certain time window, which we call the "frequency spectrum." It is in fact a distribution map of the frequency and sound intensity of each harmonic of the singing. The spectrum shows the sound intensity profile of the singing voice at different frequencies, with intensity expressed in decibels. It gives a more concrete "sound form" frame of reference from the perspective of musical acoustics: it allows us not only to understand the sound through the abstract auditory concepts of "nasopharyngeal cavity, oral cavity, chest cavity, head cavity, high-position resonance," but also to see its visual form and specific numerical values. This method of analysis is specific, intuitive, and accurate. In addition, the graphics and data provided by the computer not only objectively reflect characteristics such as pitch and sound intensity, but also assist the ear in further analyzing the timbre, resonance, and other factors of the singing voice.
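The "frequency spectrum" described above, an average of short-time spectra with intensity in decibels, can be computed as in the following sketch. The frame length, hop size, and synthetic test tone are illustrative assumptions:

```python
import numpy as np

def spectrum_db(signal, sr, n_fft=4096):
    """Average magnitude spectrum in decibels over short Hann-windowed
    frames: the 'frequency spectrum' used for the analysis charts."""
    signal = np.asarray(signal, dtype=float)
    hop = n_fft // 2
    frames = [signal[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(signal) - n_fft + 1, hop)]
    mags = np.abs(np.fft.rfft(frames, axis=1)).mean(axis=0)  # frame average
    freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    db = 20.0 * np.log10(mags + 1e-12)   # relative intensity in dB
    return freqs, db

# synthetic 'sung' tone: 220 Hz pitch with two weaker overtones
sr = 44100
t = np.arange(sr) / sr
tone = (np.sin(2 * np.pi * 220 * t)
        + 0.5 * np.sin(2 * np.pi * 440 * t)
        + 0.25 * np.sin(2 * np.pi * 660 * t))
freqs, db = spectrum_db(tone, sr)
```

Plotting `db` against `freqs` yields exactly the kind of intensity-versus-frequency chart the analysis figures show: the strongest peak at the pitch, with overtone peaks above it.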
The vocal cords of the mezzo-soprano are slightly longer than those of the soprano, between 11 and 14 mm. Compared with the relatively bright soprano, the mezzo-soprano is softer, fuller, deeper, and richer. An excellent mezzo-soprano has a wide range: the middle and low registers are broad, sweet, and smooth, while the high register is not bright but the sound is strong. Sound samples for testing: the experimental samples for the acoustic spectrum image analysis in this paper are the famous mezzo-soprano singers Honre and Bortoli.
From the spectrum image analysis chart of Fig. 4, we can see that the untrained person's spectrum decays gradually after the peak at the pitch frequency, with almost no peak at high frequencies in Fig. 4 (A). Professional singers, however, show a clear formant in the frequency range of 2170 to 2756 Hz.
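Locating the singer's-formant peak in a spectrum like Fig. 4 amounts to finding the maximum within a fixed band. The following sketch runs on a synthetic spectral envelope; the band limits and the envelope shape are illustrative assumptions, not measured data:

```python
import numpy as np

def singer_formant_peak(freqs, db, band=(2000.0, 3500.0)):
    """Locate the strongest spectral peak inside the singer's-formant band.

    Returns (frequency in Hz, level in dB) of the maximum within the band."""
    mask = (freqs >= band[0]) & (freqs <= band[1])
    idx = np.argmax(db[mask])
    return float(freqs[mask][idx]), float(db[mask][idx])

# synthetic envelope: pitch peak at 440 Hz plus a formant bump near 2800 Hz
freqs = np.linspace(0.0, 5000.0, 5001)
db = (-60.0
      + 40.0 * np.exp(-((freqs - 440.0) / 50.0) ** 2)
      + 30.0 * np.exp(-((freqs - 2800.0) / 150.0) ** 2))
f_peak, level = singer_formant_peak(freqs, db)
```

On this envelope the detected peak sits in the 2170 to 2756 Hz neighborhood discussed in the text; a flat, untrained spectrum would yield no pronounced maximum in the band.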
The acoustic spectrum testing method breaks with the traditionally subjective approach to vocal teaching, injecting objective, rational, scientific factors into it, and raises the understanding of singing acoustics and vocal physiology beyond the simple physiological experience of the past. At the same time, it counters the bad habit of students who focus only on practice and neglect theory. Vocal teaching should not only cultivate singers but also give equal attention to vocal theory research, practical singing, and teaching methods, and this method can fully reflect the comprehensive ability of vocal talents. Using computer multimedia for spectrum image analysis not only provides a way to identify and analyze voice timbre for vocal training and teaching, but also stores and transmits the measured sound spectra as data files, offering a more convenient way to manage research and teaching records. The method can be applied to singers, vocal learners at different levels, or singers of different styles, and supports detailed, long-term data analysis and comparison. The analysis results provide a more objective reference for evaluating a vocalist's voice, adjusting singing during the learning process, and selecting talent.
In the spectrograms of Figs. 5 and 6, both soprano singers show a high degree of spectral consistency. First, neither the pitch nor the overtones appear as sharp spikes. This relatively broad pitch peak shows that vibrato was applied reasonably during singing: its extent is within a minor second, and its rate is about six oscillations per second. The corresponding auditory impression is a sound that is even, thick, and round, without thin sharpness. Both singers' formants are around 3000 Hz, the formant intensity is high, and it is almost equal to the intensity of the pitch. The figures show that the formant's resonance energy has a wide "base" at 3000 Hz, which gives the voice transparency to the ear without losing a weighty hearing experience. At the same time, the first and second overtones are higher in intensity than the other overtones. The first overtone forms a pure octave with the pitch, and the second overtone a pure fifth above that; raising these two overtones improves the harmony of the tone, reduces its hollowness, and increases its intensity, making the sound more textured.
It can be seen from Figs. 7 and 8 that the pitch peak is not sharply spiked, indicating that a reasonable vibrato technique was used during singing. At the same time, clear second, third, and even fourth overtones appear above the pitch. Rich and distinct overtones increase the texture and harmony of the sound while enhancing its brightness. Neither the pitch nor the overtones appear spiked; both show a "base" of a certain width, again indicating a reasonable vibrato. Around 3200 Hz, the singer's formant shows a strong peak whose amplitude energy is even with that at the fundamental frequency. This shows that the singing voice has strong penetrating power and is not easily masked by the accompaniment of a symphony orchestra. The singer's middle and low registers are thick and sturdy. At the same time, the energies of the first, second, and third overtones are significantly raised, adding more harmonious components to the timbre.
As can be seen from Figs. 9 and 10, in contrast to the bright tone of the soprano, the tone of the mezzo-soprano is relatively rich and heavy, which is clearly reflected in the spectrum image analysis. As with the soprano, the pitch and overtones of the mezzo-soprano are not spiked but have a "base" of a certain width, showing that vibrato was applied reasonably: its extent is within a minor second and its rate is about six oscillations per second. Likewise, the singer's formant near 3200 Hz is evident in the spectral envelope, and its intensity is at the same level as the pitch energy. To the ear, the mezzo-soprano sound is thick; although it lacks a bright quality, it still has very strong penetrating power, which greatly reduces the acoustic masking effect of the band accompaniment. As with the soprano, the first, second, third, and fourth overtones of the mezzo-soprano are strongly amplified. However, unlike the soprano, the mezzo-soprano's vocal cords are longer, which favors the third and fourth overtones. As noted above, from the perspective of musical acoustics, the chord formed by the pitch and the first four overtones is exactly a major chord; while increasing the degree of harmony, this also strengthens the distinction from the soprano tone.
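The vibrato parameters cited throughout (extent within a minor second, about six oscillations per second) can be estimated from a pitch track as in the following sketch. The frame rate and the synthetic pitch contour are illustrative assumptions:

```python
import numpy as np

def vibrato_parameters(f0_track, frame_rate):
    """Estimate vibrato rate (Hz) and extent (semitones) from a pitch track.

    Rate: dominant frequency of the pitch contour's oscillation.
    Extent: half the peak-to-peak excursion, in semitones."""
    # express the contour as semitone deviation from its mean pitch
    cents = 12.0 * np.log2(f0_track / np.mean(f0_track))
    spec = np.abs(np.fft.rfft(cents - np.mean(cents)))
    freqs = np.fft.rfftfreq(len(cents), 1.0 / frame_rate)
    rate = float(freqs[np.argmax(spec[1:]) + 1])   # skip the DC bin
    extent = float((cents.max() - cents.min()) / 2.0)
    return rate, extent

# synthetic track: 440 Hz pitch with 6 Hz vibrato, +/- 0.5 semitone extent
frame_rate = 100.0                       # pitch frames per second
t = np.arange(200) / frame_rate         # two seconds of pitch frames
f0 = 440.0 * 2.0 ** (0.5 * np.sin(2 * np.pi * 6.0 * t) / 12.0)
rate, extent = vibrato_parameters(f0, frame_rate)
```

An extent below one semitone keeps the excursion within a minor second, and a rate near 6 Hz matches the "about six times per second" modulation described for both voice types.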
To study the factors influencing vocal production, this study explores new ideas and observation angles for traditional singing and vocal research through the analysis of sound spectrum data. Through spectrum image analysis, the singing voice can be digitized and quantified, and abstract sounds that are difficult to grasp are presented as data charts. At the same time, using the good singing voices of typical singers as a relative reference standard, a vocal learner's deficiencies in a particular aspect can be clearly identified by comparison. These shortcomings can be refined to individual parameters, such as the pitch, tone, duration, and intensity of concern to the learner, as well as the pitch and intensity of each overtone; the envelope of the singer's formant can also be fed back clearly. The advantages of spectrum image analysis of singing are accuracy and visibility, but it also has disadvantages: for the art of singing, over-emphasizing data will inevitably lead to the loss of artistic personality. Spectrum data alone should therefore not be the main basis of an evaluation mechanism; spectral image analysis should be used as an auxiliary reference under a "people-oriented" premise, otherwise it will inevitably have a negative influence. The research shows that this method is practical and can provide a theoretical reference for related research.
The authors thank the editor and anonymous reviewers for their helpful comments and valuable suggestions.
Availability of data and materials
Please contact author for data requests.
The author took part in the discussion of the work described in this paper, and read and approved the final manuscript.
The author declares that he has no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
- 1. Xu G, Fowlkes JB, Tao C, et al. Photoacoustic spectrum analysis for microstructure characterization in biological tissue: analytical model [J]. Appl. Phys. Lett., 2015, 41(5):1473–1480
- 2. Wang X, Xu G, Carson P. Quantification of tissue texture with photoacoustic spectrum analysis [J]. Proc. SPIE - Int. Soc. Optical Eng., 2014, 9129:91291L
- 3. Hassani H, Heravi S, Zhigljavsky A. Forecasting UK industrial production with multivariate singular spectrum analysis [J]. J. Forecast., 2013, 32(5):395–408
- 4. Guan H, Xiao B, Zhou J, et al. Fast dimension reduction for document classification based on imprecise spectrum analysis [J]. Inf. Sci., 2013, 222(3):147–162
- 5. Xu G, Meng ZX, Lin JD, et al. The functional pitch of an organ: quantification of tissue texture with photoacoustic spectrum analysis [J]. Radiology, 2014, 271(1):248
- 7. Claudio MRS. Singular spectrum analysis and forecasting of failure time series [J]. Reliab. Eng. Syst. Saf., 2013, 114(6):126–136
- 8. Zabalza J, Ren J, Wang Z, et al. Singular spectrum analysis for effective feature extraction in hyperspectral imaging [J]. IEEE Geosci. Remote Sensing Lett., 2014, 11(11):1886–1890
- 9. Fu K, Qu J, Chai Y, et al. Hilbert marginal spectrum analysis for automatic seizure detection in EEG signals [J]. Biomed. Signal Proc. Control, 2015, 18:179–185
- 10. Chen Y. A wave-spectrum analysis of urban population density: entropy, fractal, and spatial localization [J]. Disc. Dynamics Nat. Soc., 2014, 2008(4):47–58
- 11. Hooper G. Nicholas Cook, Beyond the score: music as performance (Oxford: Oxford University Press, 2013). xiv + 458 pp. £32.99 (hb). ISBN 978-0-19-935740-6 [J]. Music Anal., 2016, 35(3):407–416
- 12. Kora S, Lim BBL, Wolf J. A hands-free music score turner using Google glass [J]. J. Comput. Inf. Syst., 2017(5):1–11
- 13. Fang Y, Teng GF. Visual music score detection with unsupervised feature learning method based on K-means [J]. Int. J. Mach. Learn. Cybernet., 2015, 6(2):277–287
- 14. Byo JL. Applying score analysis to a rehearsal pedagogy of expressive performance [J]. Music Educ. J., 2014, 101(2):76–82
- 15. Fine PA, Wise KJ, Goldemberg R, et al. Performing musicians’ understanding of the terms “mental practice” and “score analysis” [J]. Psychomusicology, 2015, 25:69–82
- 16. Louboutin C, Meredith D. Using general-purpose compression algorithms for music analysis [J]. J. New Music Res., 2016, 45(1):1–16
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.