Encyclopedia of Computational Neuroscience

Living Edition
| Editors: Dieter Jaeger, Ranu Jung

Sound Localization in Mammals, Models

  • J. Braasch
Living reference work entry
DOI: https://doi.org/10.1007/978-1-4614-7320-6_436-1




Models of sound localization for mammals describe or simulate the process of how the mammalian auditory system determines the position and/or spatial extent of one or multiple sound sources from cues it extracts from signals captured at the eardrums.

Detailed Description

Historical Overview

The first theories describing how the human auditory system can determine the position of a sound source appeared in the late nineteenth century, after W. Thompson (1877), S. P. Thompson (1882), and Steinhauser (1877) discovered that interaural time and level differences occur between the two ear signals when a sound arrives from the side. The focus initially remained on so-called lateralization models, which explain how a sound source is perceived to the left or right based on interaural cues. The first model to describe an actual physiological mechanism by which the central nervous system localizes sound is the Jeffress model, which proposes a combination of delay lines and coincidence detectors to estimate the lateral position of a sound source from interaural time differences. Within the same period, improved electronic devices enabled psychophysicists to understand in greater detail how the auditory system extracts localization cues.

Devices like precise sine generators and circuits to produce interaural time and level differences made it possible to measure the performance of the auditory system in greater detail. One milestone was an experiment by Mills (1958) demonstrating that the auditory system cannot lateralize a signal based on the interaural time differences of the left and right signal carriers above 1.5 kHz; Mills correctly assumed that the underlying physiological mechanism can no longer utilize phase locking above this threshold. Later it was shown that interaural time differences in the signals' envelopes can be extracted and used by the auditory system.

Another important factor in the development of localization models was advances in signal processing and communication theory. A milestone was reached when Cherry and Sayers (1956) described a functional localization model using a cross-correlator. In their model, cross-correlation is applied between the left and right ear signals to measure the delay and thus the ITD between both signals.

In the late 1970s, computers became powerful enough to develop computational models that precisely predict the lateral position of an auditory event for a given pair of ear signals by simulating the complete pathway from the eardrums to higher neural stages. These models include band-pass filter banks that mimic the basilar membrane's separation of the signals into bands approximately a third of an octave wide (auditory bands), and they simulate the stochastic processes of the hair cells.

Aside from models simulating the general functionality of the localization process, models exist that simulate the underlying physiological processes in greater detail, often reproducing the response of a certain cell type measured in electrophysiological animal experiments. The latter are frequently termed pink-box models to separate them from the functional black-box models. Cai et al. (1998a, b) and Stecker et al. (2005) are good examples of this type of approach.

Understanding how the auditory system determines the elevation and front/back direction of sound sources was much more difficult than understanding how it estimates lateral positions, owing to the complexity of the involved monaural cues. Blauert (1969/1970) was able to demonstrate that the auditory system utilizes characteristic, direction-dependent frequency boosts or attenuations that occur when the pinnae alter the incoming sound waves. Zakarauskas and Cynader (1993) presented an early localization model based on spectral monaural cues using the second derivative of the frequency spectrum. Later, statistical methods such as Bayes classifiers were used to predict the three-dimensional location of auditory events by analyzing both interaural and monaural cues (e.g., see Hartung (1998) and Nix and Hohmann (2006)).

Jeffress Model

The Jeffress model (1948) was the first model to propose a physiological mechanism for mammalian sound localization. Its core idea is the combination of delay lines and coincidence cells. In this model, a separate delay line exists for each ear, and the two lines run in parallel. The signals propagate on the two lines in opposite directions, as shown in Fig. 1. A signal arriving at the left ear, y_l(m), with m being the time index, has to pass the first delay line, l(m, n), from left to right. The variable n is the index of the coincidence detectors at different internal delays. A signal arriving at the right ear, y_r(m), travels on the other delay line, r(m, n), in the opposite direction. The discrete implementation of the delay lines can be described as follows:
Fig. 1

Coincidence mechanism as first proposed by Jeffress (1948)

$$ l\left(m+1,n+1\right)=l\left(m,n\right);\kern1.25em 1\le n<N\wedge l\left(m,1\right)={y}_l(m), $$
$$ r\left(m+1,n-1\right)=r\left(m,n\right);\kern1.25em 1<n\le N\wedge r\left(m,N\right)={y}_r(m), $$
with N being the number of implemented coincidence cells. The time, t, and the internal delay, τ, can easily be computed from the indices, m and n, and the sampling frequency, f_s, as t = (m − 1)/f_s and τ = 2(n − (N + 1)/2)/f_s; adjacent cells differ by two samples of relative delay because both lines advance by one tap per sampling period. A coincidence detector, c(m,n), is activated when it receives simultaneous inputs from both delay lines at the positions it is connected to. Because of the limited propagation velocity of the signals on the delay lines, each coincidence detector is tuned to a different ITD. For example, the signal of a sound source located in the left hemisphere arrives at the left ear first and therefore travels a greater distance on its delay line than the right-ear signal before both signals activate the coincidence detector for the corresponding ITD.
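The delay-line mechanism of Eqs. 1 and 2 can be sketched in a few lines of Python. The stimulus parameters below (500-Hz tone, 48-kHz sampling, 0.25-ms ITD, 31 cells) are chosen purely for illustration, and the half-wave rectification is a crude stand-in for hair-cell processing:

```python
import numpy as np

def jeffress_coincidence(y_l, y_r, n_cells):
    """Propagate the ear signals along two antiparallel delay lines
    (Eqs. 1 and 2) and accumulate coincidence activity c(n) per cell."""
    l = np.zeros(n_cells)   # left-ear line, content moves toward higher n
    r = np.zeros(n_cells)   # right-ear line, content moves toward lower n
    c = np.zeros(n_cells)
    for m in range(len(y_l)):
        l[1:] = l[:-1].copy()   # l(m+1, n+1) = l(m, n)
        r[:-1] = r[1:].copy()   # r(m+1, n-1) = r(m, n)
        l[0] = y_l[m]           # l(m, 1) = y_l(m)
        r[-1] = y_r[m]          # r(m, N) = y_r(m)
        c += l * r              # multiplicative coincidence (cross-correlation view)
    return c

# Illustrative stimulus: 500-Hz tone, fs = 48 kHz, left ear leads by 0.25 ms.
fs, n_cells, d = 48000, 31, 12                      # d = 12 samples = 0.25 ms
t = np.arange(int(0.05 * fs)) / fs
s = np.maximum(np.sin(2 * np.pi * 500 * t), 0.0)    # crude half-wave rectification
y_l, y_r = s, np.concatenate([np.zeros(d), s[:-d]])
c = jeffress_coincidence(y_l, y_r, n_cells)
best = int(np.argmax(c))
# Adjacent cells differ by two samples of relative delay, because both
# lines advance by one tap per sample.
itd_est = 2 * (best - (n_cells - 1) / 2) / fs
```

The cell with the strongest accumulated activity recovers the imposed 0.25-ms ITD; a left-leading source activates a cell on the far side of the left delay line, as described in the text.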

Interaural Cross-Correlation Models

Jeffress himself never specified explicitly how two spikes would coincide. Stern and Colburn (1978) later pointed out that the coincidence model can be considered an estimator of the interaural cross-correlation (IACC) function, which had been proposed earlier by Cherry and Sayers (1956) to estimate ITDs. To this end, they assumed that many parallel coincidence detector cells exist which are tuned to the same ITD. The probability that two spikes from opposite channels activate a specific coincidence cell is then given by the product of the numbers of spikes in the left and right channels whose interaural delay matches the internal delay of the cell. This product also appears in the running cross-correlation function. Its classical continuous form is
$$ {\varPsi}_{y_l,y_r}\left(t,\tau \right)={\int}_{t^{\prime }=t}^{t+\Delta t}{y}_l\left({t}^{\prime }-\tau /2\right)\cdot {y}_r\left({t}^{\prime }+\tau /2\right)\;d{t}^{\prime }. $$
For a discrete system, with l(m′,n) and r(m′,n) determined recursively from Eqs. 1 and 2, it becomes
$$ {\varPsi}_{y_l,y_r}\left(m,n\right)=\frac{1}{\Delta m}{\sum}_{m^{\prime }=m}^{m+\Delta m}c\left({m}^{\prime },n\right)=\frac{1}{\Delta m}{\sum}_{m^{\prime }=m}^{m+\Delta m}l\left({m}^{\prime },n\right)\;r\left({m}^{\prime },n\right), $$
with c(m, n) = l(m, n) r(m, n) and the assumption that the amplitudes in the left and right channels are proportional to the numbers of spikes. In Eq. 3, a rectangular window of length Δm is used, within which the cross-correlation function is calculated for each time interval. Often other window shapes are used, e.g., Hanning, triangular, or exponential windows. The duration of the time window, Δt = Δm/f_s, can be determined in psychoacoustical experiments measuring binaural sluggishness. The values come out to be on the order of tens to hundreds of milliseconds, depending on the listener and the measurement method (Grantham and Wightman 1978, 1979; Grantham 1979, 1982; Kollmeier and Gilkey 1990). Sayers and Cherry (1957) used the interaural cross-correlation (IACC) to determine ITDs, and in 1978 a computational lateralization model based on the IACC and a simulation of the auditory periphery was introduced independently in Blauert and Cobben (1978) and Stern and Colburn (1978). The general structure of a common cross-correlation model is shown in Fig. 2. Besides the common cross-correlation function, there are alternative ways to implement the coincidence detectors. Figure 3 compares different implementations, showing the interaural cross-correlation functions of two sinusoidal tones with ITDs of 0 and 0.5 ms. In contrast to the models reported in Blauert and Cobben (1978) and Stern and Colburn (1978), Wolf (1991) assumed that two spikes from opposite channels always coincide when they pass each other on the delay lines. In this case, the output of the coincidence function is not the product of the amplitudes in the left and right channels for each delay time but rather the minimum of the two amplitudes. The signal amplitude, then, correlates with the number of spikes within the time interval Δt as follows:
Fig. 2

General model structure of a binaural localization model utilizing head rotations according to Braasch et al. (2013). HRTF outer-ear simulation/HRTF filtering, BM basilar membrane/band-pass filtering, HC hair-cell/half-wave rectification, ITD and ILD analysis interaural time difference (ITD) cue extraction/interaural cross-correlation and interaural level difference (ILD) cue analysis with EI cells; remapping to azimuth angles with head-rotation compensation; binaural activity pattern analysis to estimate the sound-source positions

Fig. 3

Examples for the outputs of different coincidence detectors for a 500-Hz sinusoidal signal. (a) Cross-correlation algorithm; (b) direct coincidence algorithm, spikes always interact when passing each other; (c) Wolf’s algorithm; (d) Lindemann’s algorithm; (e) coincidence algorithm with partly eliminated spikes after coincidence; and (f) cross-correlation algorithm taken to the power of 10 (solid lines, 0-ms ITD; dashed lines, 0.5-ms ITD)

$$ {c}_d\left(m,n\right)= \min \left[l\left(m,n\right),r\left(m,n\right)\right]. $$
The output characteristics of this algorithm, shown in Fig. 3b, are quite similar to those of the cross-correlation algorithm, except that the peaks are slightly narrower at the top. In his original work, Wolf further assumed that two spikes cancel each other out after they coincide. This approach, however, is very sensitive to interaural level differences, and thus the signals in the left and right channels have to be compressed in amplitude beforehand. For this reason, Wolf used a hair-cell model (Duifhuis 1972) to transform the signal-level code into a rate code. The output of Wolf's algorithm is shown in Fig. 3c. In contrast to the cross-correlation algorithm, the peaks are very narrow and the side peaks have vanished. Lindemann (1986a, b) had achieved a similar effect a few years earlier by introducing contralateral inhibition elements into his model. The implementation of the inhibition elements is achieved by modifying the computation of the delay lines from Eqs. 1 and 2 to
$$ l\left(m+1,n+1\right)=l\left(m,n\right)\left[1-{c}_{\mathrm{s}}\cdot r\left(m,n\right)\right];\kern1.25em 0\le l\left(m,n\right)<1, $$
$$ r\left(m+1,n-1\right)=r\left(m,n\right)\left[1-{c}_{\mathrm{s}}\cdot l\left(m,n\right)\right];\kern1.25em 0\le r\left(m,n\right)<1, $$
with c_s being the static inhibition constant, 0 ≤ c_s < 1. Now the signals in the two delay lines inhibit each other before they meet, reducing the amplitude of the signal in the opposite channel at the corresponding delay unit, as can be seen in Fig. 4. The side peaks found in the plain cross-correlation algorithm in Fig. 3a are eliminated in this way. Wolf's algorithm becomes more similar to Lindemann's algorithm if only a certain percentage of the spikes is canceled, as in Fig. 3e. In this case, it is not even necessary to use a probabilistic hair-cell model. If only a small fraction of the spikes is eliminated, there is, qualitatively speaking, little difference between whether the spikes are inhibited/canceled before or after they coincide. It should be noted that the outcome of these inhibitory algorithms depends on the ILDs.
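The static-inhibition update can be added to the delay lines with a one-line change per direction. The sketch below is illustrative only (Lindemann's full model also contains monaural detectors and dynamic inhibition, omitted here); uniform noise is used so that the inputs stay within the required range [0, 1), and the inhibition constant is an assumed value:

```python
import numpy as np

def inhibited_coincidence(y_l, y_r, n_cells, c_s=0.3):
    """Delay lines with static contralateral inhibition: each sample is
    attenuated by [1 - c_s * (opposite channel)] as it propagates.
    c_s = 0 reduces this to the plain coincidence network."""
    l = np.zeros(n_cells)
    r = np.zeros(n_cells)
    c = np.zeros(n_cells)
    for m in range(len(y_l)):
        l_new = np.zeros(n_cells)
        r_new = np.zeros(n_cells)
        l_new[1:] = l[:-1] * (1.0 - c_s * r[:-1])   # inhibited left line
        r_new[:-1] = r[1:] * (1.0 - c_s * l[1:])    # inhibited right line
        l_new[0], r_new[-1] = y_l[m], y_r[m]
        l, r = l_new, r_new
        c += l * r
    return c

# Uniform noise in [0, 1) with a 12-sample ITD (left leads); assumed values.
rng = np.random.default_rng(0)
x = 0.99 * rng.random(2400)
y_l, y_r = x, np.concatenate([np.zeros(12), x[:-12]])
c_plain = inhibited_coincidence(y_l, y_r, 31, c_s=0.0)
c_inh = inhibited_coincidence(y_l, y_r, 31, c_s=0.3)
```

With inhibition switched on, the overall activity drops while the peak stays at the cell corresponding to the imposed ITD, which is the sharpening effect described above.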
In the simulation of binaural hearing, it is sometimes advantageous to reduce the peak widths of the cross-correlation curves when determining the position of the peak. Besides employing an inhibition stage, a peak reduction can be achieved by taking the signal to a power greater than one. Figure 3f shows this procedure for the power of 10. However, by this approach, the side peaks are hardly reduced.
Fig. 4

Structure of the Lindemann algorithm

When using the cross-correlation algorithm, not only the position of the cross-correlation peak but also its normalized height – the so-called interaural cross-correlation (IACC) coefficient or interaural coherence (IC) – can be used to gain information about the spaciousness of the environment, e.g., a room. It can be determined by taking the maximum of the normalized cross-correlation function,
$$ {\varPsi}_{y_l,y_r}\left(\tau \right)=\frac{\int_{-\infty }^{+\infty }{y}_l(t)\cdot {y}_r\left(t+\tau \right)\; dt}{\sqrt{\int_{-\infty }^{+\infty }{y}_l^2(t)\; dt\cdot \int_{-\infty }^{+\infty }{y}_r^2(t)\; dt}}, $$
with the internal delay, τ, and the left and right sound-pressure signals, y_l(t) and y_r(t).
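For finite sampled signals, the integrals above reduce to sums over a window. The sketch below (signals and lag range are assumed, for illustration) returns both the coherence and the lag at which it occurs; a positive lag means the right-ear signal lags, i.e., the left ear leads:

```python
import numpy as np

def iacc(y_l, y_r, max_lag):
    """Maximum of the normalized cross-correlation over |lag| <= max_lag
    samples (interaural coherence) and the lag where it occurs."""
    norm = np.sqrt(np.sum(y_l**2) * np.sum(y_r**2))
    best_val, best_lag = -np.inf, 0
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            v = np.dot(y_l[:len(y_l) - lag], y_r[lag:])
        else:
            v = np.dot(y_l[-lag:], y_r[:len(y_r) + lag])
        if v / norm > best_val:
            best_val, best_lag = v / norm, lag
    return best_val, best_lag

rng = np.random.default_rng(2)
x = rng.standard_normal(10000)
delayed = np.concatenate([np.zeros(5), x[:-5]])
# Scaled, delayed copy: fully coherent despite the level difference.
coh_full, lag_full = iacc(x, 0.5 * delayed, 20)
# Adding uncorrelated noise lowers the coherence (partial decorrelation).
coh_part, _ = iacc(x, delayed + rng.standard_normal(10000), 20)
```

A level difference alone does not reduce the coherence (the normalization removes it), whereas an uncorrelated additive component lowers the peak height, mirroring the partly decorrelated case in Fig. 5.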
The left panel of Fig. 5 shows an example of IACC functions of a broadband noise signal for three different positions, for a frequency band centered at 434 Hz. The peak of the solid cross-correlation function is located at 0 μs, which corresponds to a position of 0° azimuth. The peak of the dashed IACC function is located at 400 μs, which indicates an azimuth of 30°. The height of the peak depicts the coherence, that is, the degree to which both signals are similar when shifted by the corresponding internal delay, τ. In both cases, the signal is fully correlated. In the third example, depicted by a dash-dotted line, the signal is partly decorrelated, as indicated by the lower peak height of 0.6. The peak location at −600 μs corresponds to an azimuth angle of 270°.
Fig. 5

Left: Interaural cross-correlation functions for a sound source at three different positions in the horizontal plane. The sound sources at 0° and 30° azimuth are fully correlated; the sound source at 270° is partly decorrelated. Right: The same stimuli as in the left panel, but this time a two-channel model was applied with delay lines for ±45°. The actually measured values, \( x_{-} \) for the −45° phase condition and \( x_{+} \) for the +45° phase condition, are shown by the "×" symbols. The simulated IACC curves were compensated for half-wave rectification. The gray curves show the actual IACC curves from the left panel. The normalized cross-correlation curves were estimated using standard trigonometric sine–cosine relationships for magnitude \( A=\sqrt{x_{-}^2+{x}_{+}^2} \) and phase \( \phi =\arctan \left({x}_{-}/{x}_{+}\right) \)

The cross-correlation coefficient correlates strongly with "spaciousness" – or auditory source width – a psychoacoustical measure for the spatial extent of auditory events and an important parameter for room acousticians (Blauert and Lindemann 1986; Okano et al. 1998). Spaciousness decreases with an increasing correlation coefficient and, therefore, with the height of the cross-correlation peak (Fig. 5).

Two-Channel ITD Models

The Jeffress model, and with it the cross-correlation approach, has been challenged by physiological studies on gerbils and guinea pigs. McAlpine and Grothe (2003) and others (Grothe et al. 2010; McAlpine 2005; McAlpine et al. 2001; Pecka et al. 2008) have shown that the ITD-sensitive cells of these species are not tuned evenly across the whole physiologically relevant range but are heavily concentrated at two best interaural phases of ±45°.

Consequently, their absolute best-ITD values vary with the center frequency that the cells are tuned to. Dietz et al. (2008, 2011) and Pulkki and Hirvonen (2009) developed lateralization models that draw on McAlpine and Grothe's (2003) findings. It is still under dispute whether the Jeffress delay-line model or the two-channel model correctly represents the human auditory system, since the human ITD mechanism cannot be studied directly on a neural basis. For other species, such as owls, a mechanism similar to the one proposed by Jeffress has been confirmed by Carr and Konishi (1990). Opponents of the two-channel theory point out, for instance, that the cross-correlation model has been tested much more rigorously than the alternatives and is able to predict human performance in great detail (Bernstein and Trahiotis 2002). From a practical standpoint, the results of the two approaches are not as different as one might think. For the lower frequency bands, the cross-correlation functions always have a sinusoidal shape due to the narrow width of the auditory bands – see the right panel of Fig. 5. Consequently, the whole cross-correlation function is more or less defined by two phase values 90° apart.

Models Utilizing Interaural Level Differences

Interaural level differences are the second major localization cue. They occur because of the shadowing effect of the head, especially when a sound arrives from the side. For humans, ILDs typically reach values of up to ±30 dB at frequencies around 5 kHz and azimuth angles of ±60°. At low frequencies, the shadowing effect of the head is weak and ILDs hardly occur, unless the sound source comes very close to the ear-canal entrance (Blauert 1997; Brungart and Rabinowitz 1999). This observation led Lord Rayleigh (1907) to postulate his duplex theory, which states that ILDs are the primary localization cue at high frequencies and ITDs at low frequencies. For high frequencies, Lord Rayleigh assumed that unequivocal solutions for the ITDs can no longer exist, because the wavelength of the incoming sound is then much shorter than the width of the head, which determines the physiological range of ITDs of approximately ±800 μs.

Mills (1958) later supported the duplex theory by demonstrating that the auditory system can no longer detect ITDs from the fine structure of signals above 1,500 Hz. This effect results from the inability of the human auditory system to phase lock the firing patterns of auditory cells with the waveform of the signal at these frequencies. Meanwhile, however, it has been shown that the auditory system can extract ITDs at high frequencies from the signals’ envelopes (Joris 1996; Joris and Yin 1995), and the original duplex theory had to be revised accordingly.

ILDs, denoted as α, can be computed directly from the left and right ear signals – which is typically done for individual frequency bands:
$$ \alpha =10{ \log}_{10}\left({P}_{\mathrm{l}}\right)-10{ \log}_{10}\left({P}_{\mathrm{r}}\right), $$
with P l the power of the left and P r the power of the right signal. Reed and Blum (1990) introduced a physiologically motivated algorithm to compute ILDs based on the activity, E(α), of an array of excitation/inhibition (EI) cells:
$$ E\left(\alpha \right)= \exp \left[-{\left({10}^{\alpha /{\mathrm{ILD}}_{\max }}\sqrt{P_{\mathrm{l}}}-{10}^{-\alpha /{\mathrm{ILD}}_{\max }}\sqrt{P_{\mathrm{r}}}\right)}^2\right], $$
with P_l and P_r being the power in the left and right channels, respectively, and ILD_max the maximal ILD magnitude that the cells are tuned to. Each cell is tuned to a different ILD. Figure 6 shows an example of a sound with an ILD of −12 dB. The curve depicts how the response of each cell decreases the further the applied ILD is from the value the cell is tuned to.
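An EI-cell array of this kind can be sketched as follows. Note that the exponent scaling below (α/40) is an assumed convention, chosen so that the best-responding cell's tuning parameter equals the signal's ILD in dB; Reed and Blum's original parameterization differs, and the minus sign inside the exponential is what produces the bell-shaped tuning curve:

```python
import numpy as np

def ei_response(P_l, P_r, alpha):
    """Response of an EI cell tuned to the ILD alpha (in dB) for a signal
    with powers P_l, P_r. The excitatory and inhibitory drives balance,
    and the response peaks, when the signal ILD equals alpha (assumed
    calibration, not Reed and Blum's exact one)."""
    drive = 10**(-alpha / 40.0) * np.sqrt(P_l) - 10**(alpha / 40.0) * np.sqrt(P_r)
    return np.exp(-drive**2)

# Array of cells tuned from -30 to +30 dB; signal with an ILD of -12 dB.
alphas = np.linspace(-30.0, 30.0, 601)
P_l, P_r = 10**(-12.0 / 10.0), 1.0          # 10*log10(P_l/P_r) = -12 dB
E = ei_response(P_l, P_r, alphas)
best_alpha = alphas[np.argmax(E)]
```

Reading out the location of the maximum across the array recovers the signal's ILD, analogous to reading out the peak of the activity pattern in Fig. 6.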
Fig. 6

Bottom: EI-cell structure. Top: Output of the EI cells for a signal with an ILD of −12 dB.

Lateralization Models

Many models have been established to predict the perceived left/right lateralization of a sound that is presented to a listener through headphones with an ITD, an ILD, or both. Usually, such sounds are perceived inside the head on the interaural axis at some distance from the center of the head. This distance, the so-called lateralization, is usually measured on an interval or ratio scale. The simplest implementation of a decision device is to correlate the perceived lateralization with the estimated value of a single cue, e.g., the position of the cross-correlation peak in one frequency band. This is possible because the laterality, the perceived lateral position, is often found to be nearly proportional to the value of the analyzed cue.

The decision device has to be more complex if different cues are to be analyzed or if one cue is observed in several frequency bands. A very early model that integrates information from different cues, ITDs and ILDs, is the position-variable model of Stern and Colburn (1978). The authors were able to predict the perceived lateralization of a 500-Hz sinusoidal tone for all combinations of ITDs and ILDs. The left panels of Fig. 7 show the results of the position-variable model for three different combinations of ITDs and ILDs. The ITDs are measured using the cross-correlation algorithm (Fig. 7a). Afterward, the cross-correlation curve is multiplied by a delay-weighting function in order to enhance the output for small ITDs (Fig. 7b). This accounts for the finding that more neural fibers are tuned to small ITDs than to large ITDs. The delay-weighting function (Colburn 1977) is shown in Fig. 8, left panel, dotted curve. The influence of the ILDs is represented by a second function of Gaussian shape and constant width of 1,778 μs, as depicted in Fig. 7c. The peak position of this second weighting function varies with the ILD of the signal. For this purpose, the signal's ILD, α, is calculated according to Eq. 8 and converted into a corresponding ITD, τ, using a function that can be derived from psychoacoustical data as follows:
Fig. 7

Results for the position-variable model (Stern and Colburn 1978) for a 500-Hz sinusoidal signal with different combinations of ITD and ILD, namely, 0 ms/0 dB (black solid line), 0 ms/15 dB (gray solid line), and 0.5 ms/15 dB (black dashed line): (a) interaural cross-correlation functions, (b) delay-line weighted function of (a) to emphasize small ITD values, (c) ILD functions, and (d) combined ITD and ILD analysis by multiplying (b) with (c) and calculating the centroid as represented by the vertical lines

Fig. 8

Delay weighting (left panel): Colburn (1977), dotted line; Shackleton et al. (1992), solid line; and Stern and Shear (1996), dashed line. Frequency weighting (right panel): Raatgever (1980), dotted line; Stern et al. (1988), solid line; and Akeroyd and Summerfield (1999), dashed line

$$ \tau =0.1\alpha -3.5\cdot {10}^{-5}{\alpha}^3\left[\mathrm{ms}\right]. $$

Finally, this function is multiplied with the weighted cross-correlation function from Fig. 7b, and the centroid of the resulting function correlates with the perceived lateralization (Fig. 7d).
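The cue-combination stage can be sketched as below. The exponential delay weighting is a simple stand-in (assumed) for Colburn's fitted fiber-density function, and the idealized cosine-shaped cross-correlation stands in for the output of a 500-Hz auditory band; the cubic ILD-to-delay conversion is the one given above:

```python
import numpy as np

def lateralization(taus_us, ccf, ild_db, sigma_us=1778.0):
    """Combine ITD and ILD information in the spirit of the
    position-variable model: weight the cross-correlation function,
    multiply by an ILD-controlled Gaussian, and take the centroid.
    The exponential delay weighting is an assumed stand-in, not
    Colburn's fitted function."""
    p = np.exp(-np.abs(taus_us) / 600.0)                   # centrality weighting
    tau_ild = (0.1 * ild_db - 3.5e-5 * ild_db**3) * 1e3    # ILD (dB) -> delay (us)
    g = np.exp(-0.5 * ((taus_us - tau_ild) / sigma_us)**2)
    w = ccf * p * g
    return np.sum(taus_us * w) / np.sum(w)                 # centroid in us

# Idealized cross-correlation of a 500-Hz tone with zero ITD.
taus = np.linspace(-2000.0, 2000.0, 401)                   # internal delays in us
ccf = 0.5 * (1.0 + np.cos(2 * np.pi * 500.0 * taus * 1e-6))
pos_diotic = lateralization(taus, ccf, ild_db=0.0)
pos_ild = lateralization(taus, ccf, ild_db=15.0)
```

With zero ITD and zero ILD the centroid stays in the middle; a 15-dB ILD shifts the Gaussian and pulls the centroid sideways even though the ITD is still zero, reproducing the time-intensity trading behavior described above.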

Localization in the Horizontal Plane

Several methods exist to calculate sound-source positions from the extracted binaural cues. One way of achieving this is to create a database that converts measured binaural cues, namely ITDs and ILDs, into spherical coordinates.
Fig. 9

Output of Stern et al.’s model (Stern et al. 1988) to a band-pass noise, 700-Hz center frequency, 1,000-Hz bandwidth, 1.5-ms ITD. The top-left panel shows the output without centrality and straightness weighting, the top-right panel with centrality weighting only, and the bottom-left panel with centrality and straightness weighting

Such a database or map can be derived from a measured catalog of head-related transfer functions (HRTFs) covering a large number of sound-source directions. Here, the binaural cues are calculated frequency-wise from the left- and right-ear HRTFs of each position. Using this database, the measured binaural cues of a sound source with unknown position can be mapped to spherical angles. The application of the remapping method to localize a signal in the horizontal plane is discussed in detail in Braasch et al. (2013). Figure 10 shows the results of remapped cross-correlation functions and ILD-based EI-cell-array functions for different frequency bands. The results were obtained using a click source signal convolved with HRTFs from a human catalog at 0° elevation.
Fig. 10

Left: Interaural time differences. Right: Interaural level differences. Plotted for different frequency bands: Band 5, f c = 234 Hz (solid line); Band 15, f c = 1,359 Hz (dashed line); and Band 25, f c = 5,238 Hz (dotted line)
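The remapping step can be illustrated with a toy lookup table. Instead of a measured HRTF catalog, the sketch uses a spherical-head (Woodworth-type) ITD approximation, which is an assumption purely for illustration; in the actual model, the table entries come from the HRTF catalog, frequency band by frequency band:

```python
import numpy as np

HEAD_RADIUS = 0.0875     # m, assumed average head radius
SPEED_OF_SOUND = 343.0   # m/s

def itd_spherical_head(azimuth_deg):
    """Woodworth-type ITD approximation for a rigid spherical head."""
    th = np.radians(azimuth_deg)
    return HEAD_RADIUS / SPEED_OF_SOUND * (th + np.sin(th))

def remap_itd(itd_s, az_grid_deg):
    """Nearest-neighbor lookup: map a measured ITD (in seconds) to the
    azimuth whose table entry is closest (the 'database' of the text)."""
    table = itd_spherical_head(az_grid_deg)
    return az_grid_deg[int(np.argmin(np.abs(table - itd_s)))]

az_grid = np.arange(-90.0, 90.5, 0.5)       # map resolution: 0.5 degrees
est = remap_itd(itd_spherical_head(30.0), az_grid)
```

Because the table is monotonic within the frontal half-plane, the lookup inverts the cue uniquely there; front/back ambiguities require additional cues or head rotations, as discussed in connection with Fig. 2.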

An ongoing challenge has been to figure out how the auditory system combines the individual cues to determine the location of auditory events – in particular, how the auditory system (i) combines different cue types, such as ILDs and ITDs, (ii) integrates information over time, (iii) integrates information over frequency, (iv) discriminates between concurrent sources, and (v) deals with room reflections.

A number of detailed overviews (Stern and Trahiotis 1995; Stern et al. 2006; Braasch 2005) have been written on cue weighting, and only a brief introduction will be given here. One major question is how the auditory system weights cues temporally. One view holds that the auditory system primarily focuses on the onset part of the signal; the opposing view holds that it integrates information over a longer signal duration. Researchers have worked with conflicting cues – such as trading off early onset cues against later ongoing cues – and it is generally agreed that the early cues carry a heavier weight (Freyman et al. 1997; Hafter 1997; Zurek 1993). This phenomenon can be simulated with a temporal weighting function. More recently, it was suggested that the auditory system does not simply combine these cues blindly but also evaluates their robustness and discounts unreliable cues. A good example of this approach is the model by Faller and Merimaa (2004). In their model, not only the positions of the cross-correlation peaks are calculated to determine the ITDs but also the coherence – as, for example, determined by the maximum value of the interaural cross-correlation function. Coherent time-frequency segments are considered more salient and are weighted more heavily, on the assumption that concurrent sound sources and wall reflections, which can produce unreliable cues, decorrelate the signal and thus show low coherence.
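The cue-selection idea can be sketched as follows: per signal frame, the ITD is taken as the lag of the cross-correlation maximum and the coherence as its normalized height, and only frames whose coherence exceeds a threshold contribute to the final estimate. The stimulus (a delayed copy corrupted by an uncorrelated "reflection" in its second half) and the threshold are assumptions for illustration, not Faller and Merimaa's exact parameters:

```python
import numpy as np

def frame_cues(y_l, y_r, frame_len, max_lag):
    """Per-frame ITD (lag of the cross-correlation peak, in samples) and
    interaural coherence (normalized peak height)."""
    lags, cohs = [], []
    for s0 in range(0, len(y_l) - frame_len + 1, frame_len):
        a = y_l[s0:s0 + frame_len]
        b = y_r[s0:s0 + frame_len]
        norm = np.sqrt(np.dot(a, a) * np.dot(b, b)) + 1e-12
        best, best_lag = -np.inf, 0
        for lag in range(-max_lag, max_lag + 1):
            v = np.dot(a[:frame_len - lag], b[lag:]) if lag >= 0 \
                else np.dot(a[-lag:], b[:frame_len + lag])
            if v / norm > best:
                best, best_lag = v / norm, lag
        lags.append(best_lag)
        cohs.append(best)
    return np.array(lags), np.array(cohs)

rng = np.random.default_rng(1)
x = rng.standard_normal(4000)
y_l = x
y_r = np.concatenate([np.zeros(3), x[:-3]])      # 3-sample ITD, left leads
y_r[2000:] += rng.standard_normal(2000)          # decorrelating "reflection"
lags, cohs = frame_cues(y_l, y_r, frame_len=500, max_lag=20)
reliable = lags[cohs > 0.9]                      # coherence-based selection
```

Only the clean first-half frames survive the coherence threshold, and all of them report the true 3-sample ITD; the corrupted frames are discounted, as the model intends.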

Frequency weighting also applies; in fact, the duplex theory can be seen as an early model in which ITD cues are weighted heavily at low frequencies and ILD cues dominate at higher frequencies. Newer models have provided a more detailed view of how ITDs are weighted over frequency. Different curves have been obtained for different sound stimuli (Akeroyd and Summerfield 1999; Raatgever 1980), including straightness and centrality weighting (Stern et al. 1988; see Fig. 9).

Localization Models Using Monaural Cues

A model proposed by Blauert (1969/1970) analyzes monaural cues in the median plane as follows. The powers in different frequency bands, the directional bands, are analyzed and compared to each other. Depending on the signal's angle of incidence (front, above, or back), the pinnae boost or attenuate the power in certain frequency regions, and these boosts and attenuations are the primary localization cues for sound sources within the median plane. Building on this knowledge, Blauert's model uses a comparator to correctly predict the direction of the auditory event for narrowband signals.

Zakarauskas and Cynader (1993) developed an extended model for monaural localization, based on the assumption that the slope of a typical sound source's own frequency spectrum changes only gradually with frequency, while the pinna-induced spectral changes vary more rapidly with frequency. The model primarily uses the second-order derivative of the spectrum across frequency to determine the elevation of the sound source – assuming that the sound source itself has a locally constant spectral slope. In this case, an internal, memorized representation of a sound source's characteristic spectrum becomes unnecessary.
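The core idea can be sketched with synthetic spectra. The Gaussian "pinna notch" whose center moves with elevation is an assumed toy stand-in for measured HRTFs; the point is that the second-order difference removes any locally linear source spectrum, so template matching works without knowing the source:

```python
import numpy as np

bins = np.arange(64, dtype=float)          # frequency bins (arbitrary units)

def second_diff(log_spec):
    """Second-order difference across frequency; a locally linear source
    spectrum contributes (essentially) nothing to it."""
    return np.diff(log_spec, n=2)

def pinna_notch(elev_index):
    """Toy direction-dependent filter: a spectral notch (in dB) whose
    center frequency moves with elevation (assumed shape)."""
    center = 20.0 + 2.0 * elev_index
    return -10.0 * np.exp(-0.5 * ((bins - center) / 2.0)**2)

# Templates of the pinna-induced spectral structure for ten elevations.
templates = {e: second_diff(pinna_notch(e)) for e in range(10)}

def estimate_elevation(received_log_spec):
    d2 = second_diff(received_log_spec)
    return min(templates, key=lambda e: np.sum((d2 - templates[e])**2))

source = 0.3 * bins + 5.0                  # tilted, locally linear source spectrum
received = source + pinna_notch(4)         # source heard through the elevation-4 filter
```

Because the tilted source spectrum vanishes under the second difference, the template match recovers the simulated elevation regardless of the source's overall spectral slope.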

Baumgartner et al. (2013) created a probabilistic localization model that analyzes inter-spectral differences (ISDs) between the internal representations of a perceived sound and templates calculated for various angles. The model also includes listener-specific calibrations to 17 individual listeners. It had been shown earlier that, for some cases, ISDs can be a better predictor of human localization performance than the second-order derivative of the spectrum (Langendijk and Bronkhorst 2002). By finding the best ISD match between the analyzed sound and the templates, Baumgartner et al.'s model demonstrates localization performance similar to that of human listeners.

In contrast to such models, which do not require knowledge of the source spectrum before it is altered on its path from the source to the ear, it is sometimes assumed that listeners compare the ear signals with internal representations of a variety of everyday sounds in order to estimate the monaural cues. A database with internal representations of a large number of common sounds has not been implemented in monaural model algorithms so far. Some models exist, however, that use an internal representation of a single reference sound, e.g., broadband noise (Hartung 1998) or click trains (Janko et al. 1997).

Three-Dimensional Localization Models

In the free field, the signals are filtered by the outer ears, and the auditory events are thus usually perceived as externalized in three-dimensional space. One approach to estimating the position of a sound source is to train a neural network to estimate the auditory event from the interaural cues rather than to combine the cues analytically, e.g., Hartung (1998) and Janko et al. (1997). When applying such a method, the neural network has to be trained on reference material. The advantage of this procedure is that very good results are often achieved for stimuli similar to the training material. The disadvantages are the long time necessary to train the neural network and the fact that the involved processing cannot easily be described analytically.

The frequency-dependent relationship between binaural cues and the azimuth and elevation angles can be determined from a catalog of HRTFs that contains the HRTFs for several angles. Spatial maps of these relationships can be set up, using one map per analyzed frequency band. It should be noted at this point that, although spatially tuned neurons were found in the inferior colliculus of guinea pigs (Hartung 1998) and in the primary field of the auditory cortex of cats (Middlebrooks and Pettigrew 1981; Imig et al. 1990), a topographical organization of those types of neurons has not yet been demonstrated.

Localization models that analyze both ITDs and ILDs either process the two cues in a combined algorithm (e.g., Stern and Colburn 1978) or evaluate them separately and combine the results afterward to estimate the position of the sound source (e.g., Janko et al. 1997; Nix and Hohmann 2000, 2006; Hartung 1998). Janko et al. (1997) demonstrated that, for filtered clicks, both ITDs and ILDs provide very reliable cues in the left–right dimension, whereas in the front–back and up–down dimensions ILDs are more reliable than ITDs. These findings are based on simulations with a neural-network model. The network was trained with a back-propagation algorithm on 144 sound-source positions covering the whole sphere, with the positions simulated using HRTFs. The authors could feed the network with ITD cues, ILD cues, or both; monaural cues could also be processed.
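The separate-then-combine strategy is often implemented as a reliability-weighted average. The short sketch below (Python; all numbers are made up for illustration) combines hypothetical ITD- and ILD-based azimuth estimates by inverse-variance weighting:

```python
# Hypothetical per-band azimuth estimates (degrees) derived separately from
# ITD and ILD cues; the variances expressing their reliability are made up.
az_itd, var_itd = 28.0, 4.0     # ITD cue: reliable in the left-right dimension
az_ild, var_ild = 34.0, 16.0    # ILD cue: assumed less reliable here

w_itd, w_ild = 1.0 / var_itd, 1.0 / var_ild
az_combined = (w_itd * az_itd + w_ild * az_ild) / (w_itd + w_ild)
# Inverse-variance weighting pulls the result toward the more reliable cue:
# here 29.2 degrees, closer to the ITD-based estimate.
```

In a full model such weights would vary per frequency band and per spatial dimension, e.g., favoring ILDs for front–back and up–down decisions as reported by Janko et al. (1997).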

Precedence Effect Models

Dealing with room reflections remains one of the biggest challenges in communication acoustics across a large variety of tasks, including sound localization, sound-source separation, and the recognition of speech and other sound features. Typically, models use a simplified room impulse response, often consisting only of the direct sound and a single discrete reflection, to simulate the precedence effect. Lindemann (1986a, b) took the following approach to inhibiting location cues that stem from reverberant energy: whenever his contralateral-inhibition algorithm detects a signal at a specific interaural time difference, the mechanism suppresses information at all other internal delays, or ITDs, and thus focuses solely on the direct sound component. The Lindemann model relies on onset cues to inhibit reflections, but Dizon and Colburn (2006) have since shown that the onset of a mixture of an ongoing direct sound and its reflection can be truncated without affecting the precedence effect.
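The role of onsets can be illustrated without implementing Lindemann's inhibition circuit itself. The Python/NumPy toy below (signal lengths, delays, and gains are arbitrary choices) renders a direct sound and a stronger, later reflection with opposite ITDs; a plain interaural cross-correlation follows the stronger reflection, while an onset-emphasizing exponential window recovers the ITD of the first wavefront:

```python
import numpy as np

rng = np.random.default_rng(2)
sig = rng.standard_normal(3000)                  # source waveform
N = 5000
left, right = np.zeros(N), np.zeros(N)

def add_source(left, right, sig, onset, itd, gain):
    """Mix a scaled copy into both ears; positive itd delays the right ear."""
    li, ri = onset + max(-itd, 0), onset + max(itd, 0)
    left[li:li + len(sig)] += gain * sig
    right[ri:ri + len(sig)] += gain * sig

add_source(left, right, sig, onset=100, itd=+10, gain=1.0)  # direct sound
add_source(left, right, sig, onset=800, itd=-10, gain=1.2)  # stronger reflection

def best_itd(left, right, max_lag=40):
    """ITD estimate: lag of the interaural cross-correlation maximum."""
    n = len(left)
    lags = np.arange(-max_lag, max_lag + 1)
    cc = [np.dot(left[max_lag:n - max_lag], right[max_lag + l:n - max_lag + l])
          for l in lags]
    return int(lags[int(np.argmax(cc))])

# Exponential window that emphasizes the onset of the ear signals.
w = np.exp(-np.arange(N) / 200.0)
itd_plain = best_itd(left, right)
itd_onset = best_itd(left * w, right * w)
```

With these parameters, the plain estimate follows the reflection's ITD of -10 samples, while the windowed estimate returns the direct sound's +10 samples. This onset weighting is only a crude stand-in for the dynamic inhibition in the Lindemann model, but it shows why suppressing post-onset energy stabilizes localization in reverberation.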

Based on this observation, namely that human test participants can correctly localize the on- and offset-truncated direct sound in the presence of a reflection, Braasch and Blauert proposed an autocorrelation-based approach (Braasch and Blauert 2011). The model reduces the influence of early specular reflections by autocorrelating the left and right ear signals. Separate autocorrelation functions for the left and right channels determine the delay times between the direct sound and the reflection as well as their amplitude ratios. These parameters are then used to steer adaptive deconvolution filters that eliminate each reflection separately. The approach is supported by research on the apparent source width of auditory objects, which shows that the central nervous system is able to extract information about early reflections (Barron and Marshall 1981). The model is able to simulate the experiments of Dizon and Colburn's (2006) study.
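The delay and amplitude-ratio estimation step can be sketched as follows (Python/NumPy). The example assumes a white-noise direct sound and a single ideal reflection, which is a strong simplification of the actual model; the closed-form relation ac[delay] ≈ g/(1 + g²) used to invert for the gain also holds only in this white-noise case:

```python
import numpy as np

rng = np.random.default_rng(1)
direct = rng.standard_normal(4096)           # stand-in for the direct sound
true_delay, true_gain = 120, 0.6             # reflection parameters to recover

ear = direct.copy()
ear[true_delay:] += true_gain * direct[:-true_delay]   # one ideal reflection

# Normalized autocorrelation over non-negative lags: the reflection appears
# as a secondary peak at its delay.
ac = np.correlate(ear, ear, mode="full")[len(ear) - 1:]
ac = ac / ac[0]
est_delay = int(np.argmax(ac[1:])) + 1       # skip the zero-lag main peak

# For a white direct sound, ac[delay] ~ g / (1 + g^2); invert for the gain.
r = ac[est_delay]
est_gain = (1.0 - np.sqrt(1.0 - 4.0 * r * r)) / (2.0 * r)
```

In the model, estimates of this kind would then steer a deconvolution filter that removes the reflection before the binaural cues are evaluated.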

It has further been shown that, for short test impulses such as clicks, localization dominance can be simulated with a simple cross-correlation algorithm without inhibition stages when a hair-cell model is included in the preprocessing stage (Hartung and Trahiotis 2001). To this end, an adaptive hair-cell model (Meddis et al. 1990) was employed. Parts of the precedence effect can thus be understood as the result of sluggish processing in the auditory periphery. Further models simulating the adaptive response of the auditory system to click trains, the "buildup of the precedence effect," can be found in Zurek (1987) and Djelani (2001).

Localization in Multiple-Sound-Source Scenarios

In general, physiologically motivated computational localization models have difficulties localizing two or more independent sound sources. A number of binaural localization models exist that are specialized to localize a test sound in the presence of distracting sound sources, but these models typically follow more traditional signal-processing methods and do not represent the neural mechanisms in the same detail as some of the previously mentioned models. In one class of models, termed "cocktail-party processors," information on the locations of the sound sources is used to segregate them from each other (e.g., Bodden 1992; Lehn 2000). Such algorithms can be used to improve the performance of speech recognition systems (Rateitschek 2000).

For further improvement of binaural models, it has been proposed to implement an expert system that complements the common signal-driven, bottom-up approach (Blauert 1999). The expert system should include explicit knowledge of the auditory scene and of the signals and their history. This knowledge is used to set up hypotheses and to test whether they hold. Front–back differentiation is one example: the expert system could actively test the hypothesis that the sound source is located in the front by employing "auditory scene analysis (ASA)" cues. It could further analyze the monaural spectrum of the sound source to estimate the influence of the room in which the source is presented, determine the interaural cues, and even evaluate cues from other modalities, for example, visual cues. The expert system would evaluate the reliability of the cues and weight them according to the outcome of this evaluation.
In the future, as computational power increases further and more knowledge of the auditory system is gained, binaural models can be expected to become more complex and to simulate several binaural phenomena jointly rather than only a few specific effects.

Matters become more complicated when the number of sound sources is unknown. The cues then not only have to be weighted properly but also assigned to the corresponding sources. Here one can either take a target-plus-background approach (Nix and Hohmann 2006), in which only the parameters of the target sound are quantified and everything else is treated as noise, or attempt to determine the positions of all sound sources involved (Braasch 2002, 2003; Roman and Wang 2008). In models that segregate individual sounds from a mixture, the source positions are often known a priori, as in Bodden (1993) and Roman et al. (2006).


  1. Akeroyd MA, Summerfield Q (1999) A fully temporal account of the perception of dichotic pitches. Br J Audiol 33:106–107
  2. Barron M, Marshall AH (1981) Spatial impression due to early lateral reflections in concert halls: the derivation of a physical measure. J Sound Vib 77(2):211–232
  3. Baumgartner R, Majdak P, Laback B (2013) Assessment of sagittal-plane sound-localization performance in spatial-audio applications. In: Blauert J (ed) The technology of binaural listening, chapter 4. Springer, Berlin/Heidelberg/New York, pp 93–119
  4. Bernstein L, Trahiotis C (2002) Enhancing sensitivity to interaural delays at high frequencies by using “transposed stimuli”. J Acoust Soc Am 112:1026–1036
  5. Blauert J (1969/1970) Sound localization in the median plane. Acustica 22:205–213
  6. Blauert J (1997) Spatial hearing: the psychophysics of human sound localization (2nd revised edn). MIT Press, Cambridge, MA
  7. Blauert J (1999) Spatial hearing (revised edn). MIT Press, Cambridge, MA
  8. Blauert J, Cobben W (1978) Some consideration of binaural cross correlation analysis. Acustica 39:96–104
  9. Blauert J, Lindemann W (1986) Auditory spaciousness: some further psychoacoustic analysis. J Acoust Soc Am 80:533–542
  10. Bodden M (1992) Binaurale Signalverarbeitung: Modellierung der Richtungserkennung und des Cocktail-Party-Effektes [Binaural signal processing: modelling the recognition of direction and the cocktail-party effect]. Ph.D. thesis, Ruhr-University Bochum, Bochum
  11. Bodden M (1993) Modeling human sound-source localization and the cocktail-party effect. Acta Acust/Acustica 1:43–55
  12. Braasch J (2002) Localization in the presence of a distracter and reverberation in the frontal horizontal plane: II. Model algorithms. Acta Acust/Acustica 88(6):956–969
  13. Braasch J (2003) Localization in the presence of a distracter and reverberation in the frontal horizontal plane: III. The role of interaural level differences. Acta Acust/Acustica 89(4):674–692
  14. Braasch J (2005) Modeling of binaural hearing. In: Blauert J (ed) Communication acoustics. Springer, Berlin, pp 75–108
  15. Braasch J, Blauert J (2011) Stimulus-dependent adaptation of inhibitory elements in precedence-effect models. In: Proceedings of Forum Acusticum 2011, Aalborg, pp 2115–2120
  16. Braasch J, Clapp S, Parks A, Pastore T, Xiang N (2013) A binaural model that analyses acoustic spaces and stereophonic reproduction systems by utilizing head rotations. In: Blauert J (ed) The technology of binaural listening. Springer, Berlin/Heidelberg, pp 201–223
  17. Brungart D, Rabinowitz W (1999) Auditory localization of nearby sources. Head-related transfer functions. J Acoust Soc Am 106:1465–1479
  18. Cai H, Carney LH, Colburn HS (1998a) A model for binaural response properties of inferior colliculus neurons. I. A model with interaural time difference-sensitive excitatory and inhibitory inputs. J Acoust Soc Am 103:475–493
  19. Cai H, Carney LH, Colburn HS (1998b) A model for binaural response properties of inferior colliculus neurons. II. A model with interaural time difference-sensitive excitatory and inhibitory inputs and an adaptation mechanism. J Acoust Soc Am 103:494–506
  20. Carr CE, Konishi M (1990) A circuit for detection of interaural time differences in the brain stem of the barn owl. J Neurosci 10(10):3227–3246
  21. Cherry EC, Sayers BMA (1956) “Human ‘cross-correlator’” – a technique for measuring certain parameters of speech perception. J Acoust Soc Am 28(5):889–895
  22. Colburn HS (1977) Theory of binaural interaction based on auditory-nerve data. II. Detection of tones in noise. J Acoust Soc Am 61:525–533
  23. Dietz M, Ewert SD, Hohmann V, Kollmeier B (2008) Coding of temporally fluctuating interaural timing disparities in a binaural processing model based on phase differences. Brain Res 1220:234–245
  24. Dietz M, Ewert SD, Hohmann V (2011) Auditory model based direction estimation of concurrent speakers from binaural signals. Speech Commun 53(5):592–605
  25. Dizon RM, Colburn HS (2006) The influence of spectral, temporal, and interaural stimulus variations on the precedence effect. J Acoust Soc Am 119:2947–2964
  26. Djelani T (2001) Psychoakustische Untersuchungen und Modellierungsansätze zur Aufbauphase des auditiven Präzedenzeffektes [Psychoacoustic investigations and modeling approaches regarding the buildup phase of the auditory precedence effect]. Ph.D. thesis, Ruhr-Universität Bochum
  27. Duifhuis H (1972) Perceptual analysis of sound. Ph.D. thesis, Techn Hogeschool Eindhoven, Eindhoven
  28. Faller C, Merimaa J (2004) Source localization in complex listening situations: selection of binaural cues based on interaural coherence. J Acoust Soc Am 116:3075–3089
  29. Freyman RL, Zurek PM, Balakrishnan U, Chiang YC (1997) Onset dominance in lateralization. J Acoust Soc Am 101:1649–1659
  30. Grantham DW (1979) Discrimination of dynamic interaural intensity differences. J Acoust Soc Am 76:71–76
  31. Grantham DW (1982) Detectability of time-varying interaural correlation in narrow-band noise stimuli. J Acoust Soc Am 72:1178–1184
  32. Grantham DW, Wightman FL (1978) Detectability of varying interaural temporal differences. J Acoust Soc Am 63:511–523
  33. Grantham DW, Wightman FL (1979) Detectability of a pulsed tone in the presence of a masker with time-varying interaural correlation. J Acoust Soc Am 65:1509–1517
  34. Grothe B, Pecka M, McAlpine D (2010) Mechanisms of sound localization in mammals. Physiol Rev 90(3):983–1012
  35. Hafter E (1997) Binaural adaptation and the effectiveness of a stimulus beyond its onset. In: Gilkey RH, Anderson TR (eds) Binaural and spatial hearing in real and virtual environments. Lawrence Erlbaum, Mahwah, pp 211–232
  36. Hartung K (1998) Modellalgorithmen zum Richtungshören, basierend auf Ergebnissen psychoakustischer und neurophysiologischer Experimente mit virtuellen Schallquellen [Model algorithms regarding directional hearing, based on psychoacoustic and neurophysiological experiments with virtual sound sources]. Ph.D. thesis, Ruhr-University Bochum, Bochum
  37. Hartung K, Trahiotis C (2001) Peripheral auditory processing and investigations of the “precedence effect” which utilize successive transient stimuli. J Acoust Soc Am 110:1505–1513
  38. Imig TJ, Irons WA, Samson FR (1990) Single-unit selectivity to azimuthal direction and sound pressure level of noise bursts in cat high frequency auditory cortex. J Neurophysiol 63:1448–1466
  39. Janko J, Anderson T, Gilkey R (1997) Using neural networks to evaluate the viability of monaural and interaural cues for sound localization. In: Gilkey RH, Anderson TR (eds) Binaural and spatial hearing in real and virtual environments. Lawrence Erlbaum Associates, Mahwah, pp 557–570
  40. Jeffress LA (1948) A place theory of sound localization. J Comp Physiol Psychol 41:35–39
  41. Joris P (1996) Envelope coding in the lateral superior olive. II. Characteristic delays and comparison with responses in the medial superior olive. J Neurophysiol 76:2137–2156
  42. Joris P, Yin T (1995) Envelope coding in the lateral superior olive. I. Sensitivity to interaural time differences. J Neurophysiol 73:1043–1062
  43. Kollmeier B, Gilkey RH (1990) Binaural forward and backward masking: evidence for sluggishness in binaural detection. J Acoust Soc Am 87:1709–1719
  44. Langendijk EHA, Bronkhorst AW (2002) Contribution of spectral cues to human sound localization. J Acoust Soc Am 112(4):1583–1596
  45. Lehn K (2000) Unscharfe zeitliche Clusteranalyse von monauralen und interauralen Merkmalen als Modell der auditiven Szenenanalyse [Fuzzy time-based cluster analysis of monaural and interaural cues as a model of auditory scene analysis]. Ph.D. thesis, Ruhr-University Bochum, Bochum
  46. Lindemann W (1986a) Extension of a binaural cross-correlation model by contralateral inhibition. I. Simulation of lateralization of stationary signals. J Acoust Soc Am 80:1608–1622
  47. Lindemann W (1986b) Extension of a binaural cross-correlation model by contralateral inhibition. II. The law of the first wave front. J Acoust Soc Am 80:1623–1630
  48. McAlpine D (2005) Creating a sense of auditory space. J Physiol 566(1):21–28
  49. McAlpine D, Grothe B (2003) Sound localisation and delay lines – do mammals fit the model? Trends Neurosci 26:347–350
  50. McAlpine D, Jiang D, Palmer AR (2001) A neural code for low-frequency sound localization in mammals. Nat Neurosci 4(4):396–401
  51. Meddis R, Hewitt MJ, Shakleton TM (1990) Implementation details of a computational model of the inner hair-cell auditory-nerve synapse. J Acoust Soc Am 87:1813–1816
  52. Middlebrooks JC, Pettigrew JD (1981) Functional classes of neurons in primary auditory cortex of the cat distinguished by sensitivity to sound localization. J Neurosci 1:107–120
  53. Mills AW (1958) On the minimum audible angle. J Acoust Soc Am 30:237–246
  54. Nix J, Hohmann V (2000) Robuste Lokalisation im Störgeräusch auf der Basis statistischer Referenzen [Robust localization in noise based on statistical references]. In: Fortschr Akust DAGA 2000, pp 474–475
  55. Nix J, Hohmann V (2006) Sound source localization in real sound fields based on empirical statistics of interaural parameters. J Acoust Soc Am 119:463–479
  56. Okano T, Beranek LL, Hidaka T (1998) Relations among interaural cross-correlation coefficient (IACCE), lateral fraction (LFE), and apparent source width (ASW) in concert halls. J Acoust Soc Am 104:255–265
  57. Pecka M, Brand A, Behrend O, Grothe B (2008) Interaural time difference processing in the mammalian medial superior olive: the role of glycinergic inhibition. J Neurosci 28(27):6914–6925
  58. Pulkki V, Hirvonen T (2009) Functional count-comparison model for binaural decoding. Acta Acust/Acustica 95(5):883–900
  59. Raatgever J (1980) On the binaural processing of stimuli with different interaural phase relations. Ph.D. thesis, Delft University of Technology, Delft
  60. Rateitschek K (2000) Ein binauraler Signalverarbeitungsansatz zur robusten maschinellen Spracherkennung in lärmerfüllter Umgebung [A binaural signal processing approach to robust speech recognition in noisy environments]. Ph.D. thesis, Ruhr-University Bochum, Bochum
  61. Rayleigh L (1907) On our perception of sound direction. Phil Mag 13:214–232
  62. Reed M, Blum J (1990) A model for the computation and encoding of azimuthal information by the lateral superior olive. J Acoust Soc Am 88:1442–1453
  63. Roman N, Wang D (2008) Binaural tracking of multiple moving sources. IEEE Trans Audio Speech Lang Process 16(4):728–739
  64. Roman N, Srinivasan S, Wang D (2006) Binaural segregation in multisource reverberant environments. J Acoust Soc Am 120(6):4040–4051
  65. Sayers BM, Cherry EC (1957) Mechanism of binaural fusion in the hearing of speech. J Acoust Soc Am 29:973–987
  66. Shackleton TM, Meddis R, Hewitt MJ (1992) Across frequency integration in a model of lateralization. J Acoust Soc Am 91:2276–2279
  67. Stecker GC, Harrington IA, Middlebrooks JC (2005) Location coding by opponent neural populations in the auditory cortex. PLoS Biol 3(3):520–528
  68. Steinhauser A (1877) The theory of binaural audition. Phil Mag 7:181–197, 261–274
  69. Stern R, Colburn H (1978) Theory of binaural interaction based on auditory-nerve data. IV. A model for subjective lateral position. J Acoust Soc Am 64:127–140
  70. Stern RM, Shear GD (1996) Lateralization and detection of low-frequency binaural stimuli: effects of distribution of internal delay. J Acoust Soc Am 100:2278–2288
  71. Stern RM, Trahiotis C (1995) Models of binaural interaction. In: Moore BCJ (ed) Hearing. Academic Press, New York, pp 347–386
  72. Stern RM, Zeiberg AS, Trahiotis C (1988) Lateralization of complex binaural stimuli: a weighted-image model. J Acoust Soc Am 84:156–165
  73. Stern R, Wang D, Brown G (2006) Binaural sound localization. In: Wang D, Brown G (eds) Computational auditory scene analysis. Wiley Interscience, Hoboken, pp 147–186
  74. Thompson SP (1877) On binaural audition. Phil Mag 4:274–276
  75. Thompson SP (1882) On the function of the two ears in the perception of space. Phil Mag 13:406–416
  76. Wolf S (1991) Untersuchungen zur Lokalisation von Schallquellen in geschlossenen Räumen [Investigations on the localization of sound sources in enclosed rooms]. Ph.D. thesis, Ruhr-University Bochum, Bochum
  77. Zakarauskas P, Cynader M (1993) A computational theory of spectral cue localization. J Acoust Soc Am 94:1323–1331
  78. Zurek PM (1987) The precedence effect. In: Yost WA, Gourevitch G (eds) Directional hearing. Springer, New York, pp 85–105
  79. Zurek PM (1993) A note on onset effects in binaural hearing. J Acoust Soc Am 93:1200–1201

Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  1. School of Architecture, Rensselaer Polytechnic Institute, Troy, USA