Sound Localization in Mammals, Models
Keywords: Sound source; Auditory system; Delay line; Interaural time difference; Coincidence detector
Models of sound localization for mammals describe or simulate the process of how the mammalian auditory system determines the position and/or spatial extent of one or multiple sound sources from cues it extracts from signals captured at the eardrums.
The first theories describing how the human auditory system can determine the position of a sound source appeared in the late nineteenth century, after W. Thompson (1877), S. P. Thompson (1882), and Steinhauser (1877) discovered that interaural time and level differences occur between the two ear signals when sound arrives from the side. The focus continued to be on so-called lateralization models, which explain how a sound source is perceived to the left or right based on interaural cues. The first model to describe an actual physiological mechanism by which the central nervous system localizes sound is the Jeffress model, which proposes a combination of delay lines and coincidence detectors to estimate the lateral position of a sound source from interaural time differences. During the same period, improved electronic devices enabled psychophysicists to examine in greater detail how the auditory system extracts localization cues.
Devices like precise sine generators and circuits to produce interaural time and level differences made it possible to measure the performance of the auditory system in greater detail. One milestone was an experiment by Mills (1958) demonstrating that the auditory system cannot lateralize a signal based on the interaural time differences of the left and right signal carriers above 1.5 kHz; he correctly assumed that the underlying physiological mechanism can no longer utilize phase locking above this threshold. Later it was shown that interaural time differences in the signals’ envelopes can be extracted and used by the auditory system.
Another important factor in the development of localization models was advances in signal processing and communication theory. A milestone was reached when Cherry and Sayers (1956) described a functional localization model using a cross-correlator. In their model, cross-correlation is applied between the left and right ear signals to measure the delay, and thus the ITD, between both signals.
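Cherry and Sayers’ cross-correlator reduces to a few lines of code. The sketch below (function and variable names are ours, not from the original paper) estimates the ITD of a delayed broadband noise as the lag of the cross-correlation peak, searched only within the physiological ITD range; a circular shift stands in for a pure delay:

```python
import numpy as np

def estimate_itd(left, right, fs, max_itd=800e-6):
    """Estimate the ITD as the lag of the interaural cross-correlation
    peak, searched only within the physiological range of ~±800 µs."""
    max_lag = int(round(max_itd * fs))
    lags = np.arange(-max_lag, max_lag + 1)
    xcorr = [np.sum(np.roll(left, lag) * right) for lag in lags]
    return lags[int(np.argmax(xcorr))] / fs  # positive: right-ear signal lags

# broadband noise source to the left: the right-ear signal lags
rng = np.random.default_rng(0)
fs = 44100
sig = rng.standard_normal(fs // 10)
delay = 10                     # samples, ~227 µs
left = sig
right = np.roll(sig, delay)
itd = estimate_itd(left, right, fs)
```

Restricting the lag search to the physiological range mirrors the finite extent of the postulated internal delay lines.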
By the late 1970s, computers were powerful enough to support computational models that precisely predict the lateral position of an auditory event for a given pair of ear signals. These models simulate the complete pathway from the eardrums to higher stages, including band-pass filter banks that mimic the basilar membrane’s separation of the signals into bands approximately a third of an octave wide (auditory bands), as well as the stochastic processes of the hair cells.
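As a deliberately crude illustration of such a peripheral front end – not any particular published model – one channel can be sketched as a band-pass stage followed by half-wave rectification:

```python
import numpy as np

def auditory_band(signal, fs, f_lo, f_hi):
    """One crude peripheral channel: brick-wall band-pass via the FFT
    (a stand-in for one basilar-membrane auditory band), followed by
    half-wave rectification as a minimal hair-cell nonlinearity."""
    spec = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), 1.0 / fs)
    spec[(freqs < f_lo) | (freqs > f_hi)] = 0.0
    band = np.fft.irfft(spec, n=len(signal))
    return np.maximum(band, 0.0)   # keep only positive deflections

fs = 16000
rng = np.random.default_rng(1)
out = auditory_band(rng.standard_normal(fs), fs, 400.0, 520.0)
```

Real models use gammatone or similar filter banks and a stochastic hair-cell stage; the brick-wall filter here only conveys the idea of band-wise processing.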
Aside from models simulating the general functionality of the localization process, models exist that simulate the underlying physiological process in greater detail, often reproducing the response of a certain cell type measured in electrophysiological animal experiments. The latter are frequently termed pink box models to separate them from the functional black box models. Cai et al. (1998a, b) and Stecker et al. (2005) are good examples of this type of approach.
Understanding the mechanisms of how the auditory system determines the elevation and front/back direction of sound sources was much more difficult than understanding how the auditory system estimates lateral positions, due to the complexity of the involved monaural cues. Blauert (1969/1970) was able to demonstrate that the auditory system utilizes characteristic, direction-dependent frequency boosts or reductions, which occur when the pinnae alter the incoming sound waves. Zakarauskas and Cynader (1993) presented an early localization model based on spectral monaural cues using the second derivative of the frequency spectrum. Later, statistical methods such as a Bayes classifier were used to predict the three-dimensional location of auditory events by analyzing both interaural and monaural cues (e.g., see Hartung (1998) and Nix and Hohmann (2006)).
Interaural Cross-Correlation Models
The cross-correlation coefficient correlates strongly with “spaciousness” – or auditory source width – a psychoacoustical measure for the spatial extent of auditory events and an important parameter for room acousticians (Blauert and Lindemann 1986; Okano et al. 1998). Spaciousness decreases with an increasing correlation coefficient and, therefore, with the height of the cross-correlation peak (Fig. 5).
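The coefficient itself is simply the height of the normalized interaural cross-correlation peak; a minimal sketch (the ±1 ms lag range is a common but not universal choice):

```python
import numpy as np

def iacc(left, right, fs, max_lag_s=1e-3):
    """Interaural cross-correlation coefficient: the maximum of the
    normalized interaural cross-correlation within ±1 ms."""
    max_lag = int(round(max_lag_s * fs))
    norm = np.sqrt(np.sum(left ** 2) * np.sum(right ** 2))
    lags = range(-max_lag, max_lag + 1)
    return max(np.sum(np.roll(left, lag) * right) / norm for lag in lags)

fs = 44100
rng = np.random.default_rng(2)
diotic = rng.standard_normal(fs // 10)
coeff = iacc(diotic, diotic, fs)   # identical ear signals -> coefficient 1
```

Identical (diotic) ear signals yield a coefficient of 1 and minimal spaciousness; decorrelated signals yield lower values and a wider auditory event.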
Two-Channel ITD Models
A few years ago, the Jeffress model, and with it the cross-correlation approach, was challenged by physiological studies on gerbils and guinea pigs. McAlpine and Grothe (2003) and others (Grothe et al. 2010; McAlpine 2005; McAlpine et al. 2001; Pecka et al. 2008) have shown that the ITD cells of these species are not tuned evenly across the whole physiologically relevant range but concentrate heavily around best interaural phases of ±45°.
Consequently, their absolute best-ITD values vary with the center frequency that the cells are tuned to. Dietz et al. (2011, 2008) and Pulkki and Hirvonen (2009) developed lateralization models that draw on McAlpine and Grothe’s (2003) findings. It is still under dispute whether the Jeffress delay-line model or the two-channel model correctly represents the human auditory system, since the human ITD mechanism cannot be studied directly on a neural basis. For other species, such as owls, a mechanism similar to the one proposed by Jeffress has been confirmed by Carr and Konishi (1990). Opponents of the two-channel theory point out, for instance, that the cross-correlation model has been tested much more rigorously than other models and is able to predict human performance in great detail (Bernstein and Trahiotis 2002). From a practical standpoint, the results of both approaches are not as different as one might think. For the lower frequency bands, the cross-correlation functions always have a sinusoidal shape, due to the narrow width of the auditory bands – see the right panel of Fig. 5. Consequently, the whole cross-correlation function is more or less defined by two phase values 90° apart.
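This practical similarity can be illustrated with a toy two-channel readout. The cosine-shaped rate-vs-IPD tuning assumed below is a schematic simplification, not a published tuning curve; for two channels with best phases of ±45°, the normalized activity difference reduces analytically to tan(IPD), a monotonic function of the interaural phase within one period:

```python
import numpy as np

def two_channel_laterality(ipd_rad):
    """Relative activity of two hemispheric channels with best interaural
    phase differences of +45° and -45° (cosine-shaped tuning assumed).
    Algebraically this equals tan(ipd_rad) for |ipd| < 90°."""
    left_chan = np.cos(ipd_rad - np.pi / 4)    # channel tuned to +45°
    right_chan = np.cos(ipd_rad + np.pi / 4)   # channel tuned to -45°
    return (left_chan - right_chan) / (left_chan + right_chan)

centered = two_channel_laterality(0.0)        # balanced activity -> 0
far_left = two_channel_laterality(np.pi / 4)  # IPD at one best phase -> 1
```

No labelled delay line is needed here: laterality is read out from the ratio of just two channel activities per band.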
Models Utilizing Interaural Level Differences
Interaural level differences are the second major localization cue. They occur because of the shadowing effect of the head, especially when sound arrives from the side. For humans, ILDs typically reach values of up to ±30 dB at frequencies around 5 kHz and azimuth angles of ±60°. At low frequencies, the shadowing effect of the head is weak and ILDs hardly occur, unless the sound source comes very close to the ear canal entrance (Blauert 1997; Brungart and Rabinowitz 1999). This led Lord Rayleigh (1907) to postulate his duplex theory, which states that ILDs are the primary localization cue at high frequencies and ITDs at low frequencies. For high frequencies, Lord Rayleigh assumed that unequivocal solutions for the ITDs can no longer exist, because the wavelength of the incoming sound is then much shorter than the width of the head, which determines the physiological ITD range of approximately ±800 μs.
Mills (1958) later supported the duplex theory by demonstrating that the auditory system can no longer detect ITDs from the fine structure of signals above 1,500 Hz. This effect results from the inability of the human auditory system to phase lock the firing patterns of auditory cells with the waveform of the signal at these frequencies. Meanwhile, however, it has been shown that the auditory system can extract ITDs at high frequencies from the signals’ envelopes (Joris 1996; Joris and Yin 1995), and the original duplex theory had to be revised accordingly.
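A broadband ILD estimate is simply the level ratio of the two ear signals expressed in dB. Actual models compute it per auditory band; this sketch uses the whole signal for brevity:

```python
import numpy as np

def ild_db(left, right):
    """Broadband interaural level difference in dB
    (positive values: right-ear signal is more intense)."""
    rms_left = np.sqrt(np.mean(left ** 2))
    rms_right = np.sqrt(np.mean(right ** 2))
    return 20.0 * np.log10(rms_right / rms_left)

rng = np.random.default_rng(3)
sig = rng.standard_normal(4410)
value = ild_db(sig, 2.0 * sig)   # right ear at twice the amplitude
```

Doubling the amplitude at one ear yields an ILD of about 6 dB, well within the ±30 dB range quoted above.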
Many models have been established to predict the perceived left/right lateralization of a sound presented to a listener through headphones with an ITD, an ILD, or both. Usually, such sounds are perceived inside the head, on the interaural axis, at some distance from the center of the head. This distance, the so-called lateralization, is usually measured on an interval or ratio scale. The simplest implementation of a decision device correlates the perceived lateralization with the estimated value of a single cue, e.g., the position of the cross-correlation peak in one frequency band. This is possible because the laterality, the perceived lateral position, is often found to be nearly proportional to the value of the analyzed cue.
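Such a decision device can be as simple as a proportional mapping with saturation; the scale and saturation point below are illustrative assumptions:

```python
def laterality(cue_value, cue_max):
    """Simplest decision device: perceived laterality on a [-1, 1] scale,
    assumed proportional to the cue value up to saturation."""
    return max(-1.0, min(1.0, cue_value / cue_max))

# e.g., an ITD of 400 µs against a physiological maximum of 800 µs
halfway = laterality(400e-6, 800e-6)    # halfway toward one side
pinned = laterality(2e-3, 800e-6)       # beyond the range: fully lateralized
```

More elaborate decision stages replace the fixed divisor with a psychometrically fitted mapping per frequency band.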
Localization in the Horizontal Plane
An ongoing challenge has been to figure out how the auditory system combines the individual cues to determine the location of auditory events – in particular to answer how the auditory system performs tasks to (i) combine different cue types, such as ILDs and ITDs, (ii) integrate information over time, (iii) integrate information over frequency, (iv) discriminate between concurrent sources, and (v) deal with room reflections.
A number of detailed overviews of cue weighting have been written (Stern and Trahiotis 1995; Stern et al. 2006; Braasch 2005), and only a brief introduction is given here. One big question is how the auditory system weights cues temporally. The two opposing views are that the auditory system primarily focuses on the onset part of the signal and that it integrates information over a longer signal duration. Researchers have worked with conflicting cues – such as trade-offs between early onset and later ongoing cues – and it is generally agreed that the early cues carry a heavier weight (Freyman et al. 1997; Hafter 1997; Zurek 1993). This phenomenon can be simulated with a temporal weighting function. More recently it was suggested that the auditory system does not simply combine these cues blindly but also evaluates their robustness and discounts unreliable cues. A good example of this approach is the model by Faller and Merimaa (2004). In their model, not only the positions of the cross-correlation peaks are calculated to determine the ITDs but also the coherence – as determined, for example, by the maximum value of the interaural cross-correlation function. Coherent time-frequency segments are considered more salient and weighted more heavily, on the assumption that concurrent sound sources and wall reflections, which can produce unreliable cues, decorrelate the signal and thus show low coherence.
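The cue-selection idea can be sketched as follows; the frame length, lag range, and coherence threshold are illustrative values, not those of Faller and Merimaa (2004):

```python
import numpy as np

def coherent_itds(left, right, fs, frame=1024, c0=0.95, max_lag=40):
    """Per frame, estimate the ITD (peak lag) and the coherence (peak
    height) of the normalized cross-correlation; keep only frames whose
    coherence reaches the threshold c0, as in cue-selection models."""
    selected = []
    for start in range(0, len(left) - frame + 1, frame):
        l = left[start:start + frame]
        r = right[start:start + frame]
        norm = np.sqrt(np.sum(l ** 2) * np.sum(r ** 2)) + 1e-12
        lags = np.arange(-max_lag, max_lag + 1)
        xc = np.array([np.sum(np.roll(l, k) * r) for k in lags]) / norm
        if xc.max() >= c0:   # coherent frame -> trustworthy ITD
            selected.append(lags[int(np.argmax(xc))] / fs)
    return selected

fs = 44100
rng = np.random.default_rng(4)
sig = rng.standard_normal(8192)
itds = coherent_itds(sig, np.roll(sig, 10), fs)   # coherent, delayed noise
```

For a reverberant or multi-source mixture, frames with low coherence would simply be discarded rather than contributing unreliable ITD estimates.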
Frequency weighting also applies; in fact, the duplex theory can be seen as an early model in which ITD cues are weighted heavily at low frequencies and ILD cues dominate at higher frequencies. Newer models have provided a more detailed view of how ITDs are weighted over frequency. Different curves have been obtained for different sound stimuli (Akeroyd and Summerfield 1999; Raatgever 1980), including straightness and centrality weighting (Stern et al. 1988; see Fig. 9).
Localization Models Using Monaural Cues
A model proposed in 1969/1970 (Blauert 1969/1970) analyzes monaural cues in the median plane as follows. The powers in different frequency bands, the directional bands, are analyzed and compared to each other. Based on the signal’s angle of incidence (front, above, or back), the pinnae enhance or deemphasize the power in certain frequency regions, which are the primary localization cues for sound sources within the median plane. Building on this knowledge, Blauert’s model uses a comparator to correctly predict the direction of the auditory event for narrowband signals.
Zakarauskas and Cynader (1993) developed an extended model for monaural localization, which is based on the assumption that the slope of a typical sound source’s own frequency spectrum changes only gradually with frequency, while the pinnae-induced spectral changes vary more strongly with frequency. The model primarily uses the second-order derivative of the spectrum in frequency to determine the elevation of the sound source – assuming that the sound source itself has a locally constant spectral slope. In this case, an internal, memorized representation of a sound source’s characteristic spectrum is not needed.
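The key property of the second-order derivative – that any locally linear source spectrum cancels out, leaving mostly the pinna-induced structure – is easy to verify in code:

```python
import numpy as np

def second_derivative(log_spectrum_db):
    """Second-order difference of a log-magnitude spectrum across
    frequency bins; a locally linear source-spectrum slope cancels,
    leaving mostly the direction-dependent pinna filtering."""
    return np.diff(log_spectrum_db, n=2)

# a source with a constant spectral slope contributes nothing:
ramp = 1.5 * np.arange(32) - 4.0
flat = second_derivative(ramp)
```

Any tilt or overall level of the source spectrum disappears under the second difference, which is exactly why no memorized source spectrum is required.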
Baumgartner et al. (2013) created a probabilistic localization model that analyzes inter-spectral differences (ISDs) between the internal representations of a perceived sound and templates calculated for various angles. The model also includes listener-specific calibrations for 17 individual listeners. It had been shown earlier that, in some cases, ISDs can be a better predictor of human localization performance than the second-order derivative of the spectrum (Langendijk and Bronkhorst 2002). By finding the best ISD match between the analyzed sound and the templates, Baumgartner et al.’s model achieves localization performance similar to that of human listeners.
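A much-simplified, deterministic stand-in for such template matching (the actual model’s decision stage is probabilistic; the variance-based distance below is our assumption):

```python
import numpy as np

def best_matching_angle(target_db, templates_db, angles):
    """Pick the template whose inter-spectral difference (ISD) to the
    target is flattest (minimum variance across frequency), so that a
    constant overall level offset does not bias the match."""
    isd = templates_db - target_db                  # one ISD per angle
    return angles[int(np.argmin(np.var(isd, axis=1)))]

rng = np.random.default_rng(5)
angles = np.array([-30.0, 0.0, 30.0, 60.0])
templates = rng.standard_normal((4, 64))   # hypothetical template spectra
target = templates[2] + 6.0                # same shape, 6 dB louder
match = best_matching_angle(target, templates, angles)
```

Using the variance rather than the raw difference makes the match invariant to overall presentation level, which a listener cannot know in advance.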
In contrast to models that do not require a reference spectrum of the sound source before it is altered on the pathway from source to ear, it is sometimes assumed that listeners estimate the monaural cues by comparing the ear signals to internal representations of a variety of everyday sounds. A database with the internal representations of a large number of common sounds has not yet been implemented in monaural model algorithms. Some models exist, however, that use an internal representation of a single reference sound, e.g., broadband noise (Hartung 1998) or click trains (Janko et al. 1997).
Three-Dimensional Localization Models
In the free field, the signals are filtered by the outer ears, and the auditory events are thus usually perceived as externalized in three-dimensional space. One approach to estimating the position of a sound source is to train a neural network to estimate the auditory event from the interaural cues rather than to combine the cues analytically, e.g., Hartung (1998) and Janko et al. (1997). With such a method, the neural network has to be trained on reference material. The advantage of this procedure is that very good results are often achieved for stimuli that are similar to the training material. The disadvantages are the long time necessary to train the neural network and the fact that the involved processing cannot easily be described analytically.
The frequency-dependent relationship between binaural cues and the azimuth and elevation angles can be determined from a catalog of HRTFs, which contains the HRTFs for several angles. Spatial maps of these relationships can be set up, using one map per analyzed frequency band. It should be noted that, although spatially tuned neurons were found in the inferior colliculus of guinea pigs (Hartung 1998) and in the primary auditory field of the cat cortex (Middlebrooks and Pettigrew 1981; Imig et al. 1990), a topographical organization of these types of neurons has not yet been demonstrated.
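Such a map can be sketched as a nearest-neighbour lookup from a cue pair to azimuth for one frequency band; the cue values and the relative weighting of ITD against ILD below are ad hoc assumptions for illustration:

```python
import numpy as np

def make_cue_map(cue_table):
    """cue_table: {azimuth_deg: (itd_us, ild_db)} for one frequency band,
    e.g. derived from an HRTF catalog. Returns a nearest-neighbour
    lookup from an observed cue pair to the azimuth angle."""
    angles = np.array(sorted(cue_table))
    cues = np.array([cue_table[a] for a in angles], dtype=float)

    def lookup(itd_us, ild_db):
        # ad hoc scaling: 100 µs of ITD traded against 1 dB of ILD
        cost = ((cues[:, 0] - itd_us) / 100.0) ** 2 + (cues[:, 1] - ild_db) ** 2
        return float(angles[int(np.argmin(cost))])

    return lookup

# hypothetical per-band cue values for three azimuths
lookup = make_cue_map({-60: (-550, -12.0), 0: (0, 0.0), 60: (550, 12.0)})
angle = lookup(520, 10.5)   # noisy cues still map to the nearest entry
```

A full model would hold one such map per frequency band and combine their outputs, weighted by cue reliability.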
Those types of localization models that analyze both ITDs and ILDs either process both cues in a combined algorithm (e.g., Stern and Colburn 1978) or evaluate both cues separately and combine the results afterward in order to estimate the position of the sound source (e.g., Janko et al. 1997; Nix and Hohmann 2000, 2006; Hartung 1998). In Janko et al. (1997) it is demonstrated that, for filtered clicks, both the ITDs and ILDs contribute very reliable cues in the left–right dimension, while in the front–back and up–down dimensions, ILDs are more reliable than ITDs. These findings are based on simulations with a model that includes a neural network. The network was trained with a back-propagation algorithm on 144 different sound-source positions across the whole sphere, the positions being simulated using HRTFs. The authors could feed the neural network with either ITD cues, ILD cues, or both. Monaural cues could also be processed.
Precedence Effect Models
Dealing with room reflections remains one of the biggest challenges in communication acoustics across a large variety of tasks, including sound localization, sound-source separation, and speech and other sound-feature recognition. Typically, models use a simplified room impulse response, often consisting only of a direct sound and a single discrete reflection, to simulate the precedence effect. Lindemann (1986a, b) took the following approach to inhibiting location cues stemming from reverberant information: whenever his contralateral-inhibition algorithm detects a signal at a specific interaural time difference, the mechanism starts to suppress information at all other internal delays or ITDs and thus focuses solely on the direct sound component. The Lindemann model relies on onset cues to inhibit reflections, but Dizon and Colburn (2006) have since shown that the onset of a mixture of an ongoing direct sound and its reflection can be truncated without affecting the precedence effect.
Based on the observation that human test participants can localize the on- and offset-truncated direct sound correctly in the presence of a reflection, Braasch and Blauert proposed an autocorrelation-based approach (Braasch and Blauert 2011). The model reduces the influence of early specular reflections by autocorrelating the left and right ear signals. Separate autocorrelation functions for the left and right channels determine the delay times between the direct sound and the reflection as well as their amplitude ratios. These parameters are then used to steer adaptive deconvolution filters that eliminate each reflection separately. It is known from research on the apparent source width of auditory objects that the central nervous system is able to extract information about early reflections (Barron and Marshall 1981), which supports this approach. The model is able to simulate the experiments of Dizon and Colburn’s (2006) study.
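For a single discrete reflection, the deconvolution step itself is straightforward; the sketch below assumes the reflection’s delay and gain have already been estimated (e.g., from the autocorrelation peak) and only shows the removal:

```python
import numpy as np

def remove_reflection(y, gain, delay):
    """Undo a single specular reflection y[n] = x[n] + gain * x[n - delay]
    with the recursive inverse filter x[n] = y[n] - gain * x[n - delay].
    gain and delay are assumed to be known (e.g., estimated from the
    autocorrelation peak of the ear signal)."""
    x = y.astype(float)
    for n in range(delay, len(x)):
        x[n] -= gain * x[n - delay]
    return x

rng = np.random.default_rng(6)
direct = rng.standard_normal(2000)
wet = direct.copy()
wet[30:] += 0.5 * direct[:-30]     # one reflection: 30 samples, -6 dB
dry = remove_reflection(wet, 0.5, 30)
```

After deconvolution, a conventional cross-correlation stage can localize the recovered direct sound without needing an explicit inhibition mechanism for this reflection.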
Furthermore, it has been shown that, for short test impulses such as clicks, localization dominance can be simulated using a simple cross-correlation algorithm without inhibition stages when a hair-cell model is included in the preprocessing stage (Hartung and Trahiotis 2001). To this end, an adaptive hair-cell model (Meddis et al. 1990) was employed. Parts of the precedence effect are thus understood as the result of sluggish processing in the auditory periphery. Further models simulating the adaptive response of the auditory system to click trains, the “buildup of the precedence effect,” can be found in Zurek (1987) and Djelani (2001).
Localization in Multiple-Sound-Source Scenarios
In general, physiologically motivated computational localization models have difficulty localizing two or more independent sound sources. A number of binaural localization models exist that are specialized to localize a test sound in the presence of distracting sound sources, but these models typically follow more traditional signal-processing methods and do not represent the neural mechanisms in the same detail as some of the previously mentioned models. In one class of models, termed “cocktail-party processors,” the information on the location of the sound sources is used to segregate them from each other (e.g., Bodden 1992; Lehn 2000). These algorithms can be used to improve the performance of speech recognition systems (Rateitschek 2000). For further improvement of binaural models, it has been proposed to implement an expert system that complements the common signal-driven, bottom-up approach (Blauert 1999). The expert system should include explicit knowledge of the auditory scene and of the signals – and their history. This knowledge is used to set up hypotheses and to decide whether they prove true. Front–back differentiation can serve as an example: the expert system could actively test the hypothesis that the sound source is located in the front by employing “auditory scene analysis” (ASA) cues. It could further analyze the monaural spectrum of the sound source to estimate the influence of the room in which the sound source is presented, determine the interaural cues, and even evaluate cues from other modalities, for example, visual cues. The expert system would evaluate the reliability of the cues and weight them according to the outcome of the evaluation.
In the future, as computational power increases further and more knowledge of the auditory system is gained, one can expect binaural models to become more complex and to simulate several binaural phenomena rather than only a few specific effects.
Matters become more complicated if it is not clear how many sound sources are present. Then the cues not only have to be weighted properly but also assigned to the corresponding source. Here one can either take a target-plus-background approach (Nix and Hohmann 2006), where only the target sound parameters are quantified and everything else is treated as noise, or attempt to determine the positions of all sound sources involved (Braasch 2002, 2003; Roman and Wang 2008). In models that segregate the individual sounds from a mixture, the positions of the sources are often known a priori, as in Bodden (1993) and Roman et al. (2006).
- Akeroyd MA, Summerfield Q (1999) A fully temporal account of the perception of dichotic pitches. Br J Audiol 33:106–107
- Blauert J (1969/1970) Sound localization in the median plane. Acustica 22:205–213
- Blauert J (1997) Spatial hearing: the psychophysics of human sound localization (2nd revised edn). MIT Press, Cambridge, MA
- Blauert J (1999) Spatial hearing (revised edn). MIT Press, Cambridge, MA
- Blauert J, Cobben W (1978) Some consideration of binaural cross correlation analysis. Acustica 39:96–104
- Bodden M (1992) Binaurale Signalverarbeitung: Modellierung der Richtungserkennung und des Cocktail-Party-Effektes [Binaural signal processing: modeling the recognition of direction and the cocktail-party effect]. Ph.D. thesis, Ruhr-University Bochum, Bochum
- Bodden M (1993) Modeling human sound-source localization and the cocktail-party effect. Act Acust/Acustica 1:43–55
- Braasch J (2002) Localization in the presence of a distracter and reverberation in the frontal horizontal plane: II. Model algorithms. Act Acust/Acustica 88(6):956–969
- Braasch J (2003) Localization in the presence of a distracter and reverberation in the frontal horizontal plane: III. The role of interaural level differences. Act Acust/Acustica 89(4):674–692
- Braasch J, Blauert J (2011) Stimulus-dependent adaptation of inhibitory elements in precedence-effect models. In: Proceedings of Forum Acusticum 2011, Aalborg, pp 2115–2120
- Braasch J, Clapp S, Parks A, Pastore T, Xiang N (2013) A binaural model that analyses acoustic spaces and stereophonic reproduction systems by utilizing head rotations. In: Blauert J (ed) The technology of binaural listening. Springer, Berlin/Heidelberg, pp 201–223
- Djelani T (2001) Psychoakustische Untersuchungen und Modellierungsansätze zur Aufbauphase des auditiven Präzedenzeffektes [Psychoacoustic investigations and modeling approaches regarding the buildup phase of the auditory precedence effect]. Ph.D. thesis, Ruhr-University Bochum, Bochum
- Duifhuis H (1972) Perceptual analysis of sound. Ph.D. thesis, Techn Hogeschool Eindhoven, Eindhoven
- Hafter E (1997) Binaural adaptation and the effectiveness of a stimulus beyond its onset. In: Gilkey RH, Anderson TR (eds) Binaural and spatial hearing in real and virtual environments. Lawrence Erlbaum, Mahwah, pp 211–232
- Hartung K (1998) Modellalgorithmen zum Richtungshören, basierend auf Ergebnissen psychoakustischer und neurophysiologischer Experimente mit virtuellen Schallquellen [Model algorithms regarding directional hearing, based on psychoacoustic and neurophysiological experiments with virtual sound sources]. Ph.D. thesis, Ruhr-University Bochum, Bochum
- Janko J, Anderson T, Gilkey R (1997) Using neural networks to evaluate the viability of monaural and inter-aural cues for sound localization. In: Gilkey RH, Anderson TR (eds) Binaural and spatial hearing in real and virtual environments. Lawrence Erlbaum, Mahwah, pp 557–570
- Lehn K (2000) Unscharfe zeitliche Clusteranalyse von monauralen und interauralen Merkmalen als Modell der auditiven Szenenanalyse [Fuzzy time-based cluster analysis of monaural and inter-aural cues as a model of auditory scene analysis]. Ph.D. thesis, Ruhr-University Bochum, Bochum
- Middlebrooks JC, Pettigrew JD (1981) Functional classes of neurons in primary auditory cortex of the cat distinguished by sensitivity to sound localization. J Neurosci 1:107–120
- Nix J, Hohmann V (2000) Robuste Lokalisation im Störgeräusch auf der Basis statistischer Referenzen [Robust localization in noise based on statistical references]. In: Fortschr Akust DAGA 2000, pp 474–475
- Raatgever J (1980) On the binaural processing of stimuli with different interaural phase relations. Ph.D. thesis, Delft University of Technology, Delft
- Rateitschek K (2000) Ein binauraler Signalverarbeitungsansatz zur robusten maschinellen Spracherkennung in lärmerfüllter Umgebung [A binaural signal processing approach to robust machine speech recognition in noisy environments]. Ph.D. thesis, Ruhr-University Bochum, Bochum
- Steinhauser A (1877) The theory of binaural audition. Phil Mag 7:181–197, 261–274
- Stern R, Wang D, Brown G (2006) Binaural sound localization. In: Wang D, Brown G (eds) Computational auditory scene analysis. Wiley Interscience, Hoboken, pp 147–186
- Wolf S (1991) Untersuchungen zur Lokalisation von Schallquellen in geschlossenen Räumen [Investigations on the localization of sound sources in enclosed rooms]. Ph.D. thesis, Ruhr-University Bochum, Bochum