
1 Introduction

Numerous studies of pitch perception attempt to identify and distinguish between relevant temporal cues such as envelope and fine structure. These cues are generally viewed as independent. Indeed, mathematically, these cues are orthogonal and can be extracted and separated using a Hilbert transform. Using this mathematical decomposition, some works suggest that musical pitch relies mostly on fine structure and that speech perception relies mostly on envelope (Smith et al. 2002). To empirically distinguish between envelope and fine structure cues in pitch perception experiments, a dedicated signal was proposed as early as 1956 (De Boer 1956). This signal is equivalent to an unresolved harmonic complex tone with all harmonics shifted by the same amount in Hz. As the frequency spacing between adjacent components of such a signal is regular and identical to that of the original harmonic complex tone, such a signal has the same envelope but a different fine structure. As a consequence, any perceptual difference between the harmonic and the shifted complex is often interpreted as a purely fine-structure-based percept.
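
The claim that a shifted complex keeps the envelope of the harmonic complex while changing its fine structure can be checked numerically. The sketch below is only an illustration under stated assumptions: the sampling rate, F0, harmonic numbers and cosine starting phases are arbitrary choices that do not come from the studies cited here, and the analytic signal is built directly from the known components (which, for a sum of positive-frequency cosines, is what a Hilbert transform would return) rather than computed with a numerical Hilbert transform.

```python
import cmath
import math

def analytic_complex(freqs_hz, fs, n_samples):
    """Analytic signal of a sum of unit-amplitude cosines, all in cosine phase."""
    return [
        sum(cmath.exp(2j * math.pi * f * n / fs) for f in freqs_hz)
        for n in range(n_samples)
    ]

fs = 48000                                    # sampling rate in Hz (illustrative)
f0 = 200.0                                    # component spacing in Hz (illustrative)
harmonic = [k * f0 for k in range(10, 15)]    # 2000 ... 2800 Hz
shifted = [f + 0.5 * f0 for f in harmonic]    # every component moved up by 100 Hz

z_h = analytic_complex(harmonic, fs, 960)     # 20 ms of signal
z_s = analytic_complex(shifted, fs, 960)

env_h = [abs(z) for z in z_h]                 # Hilbert envelope = |analytic signal|
env_s = [abs(z) for z in z_s]
wave_h = [z.real for z in z_h]                # waveform = real part
wave_s = [z.real for z in z_s]

# The envelopes agree to rounding error, whereas the waveforms do not.
max_env_diff = max(abs(a - b) for a, b in zip(env_h, env_s))
max_wave_diff = max(abs(a - b) for a, b in zip(wave_h, wave_s))
```

The shifted analytic signal is the harmonic one multiplied by a unit-modulus term exp(j2πΔt), with Δ the shift in Hz, so the two magnitudes coincide while the carrier under the envelope drifts from one envelope period to the next.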

Figure 1 illustrates the temporal cues that can potentially convey pitch-related information. These cues can be classified into three categories of periodicities. First, the periodicities of the carrier, related to the delay dT1 in Fig. 1. This periodicity provides information about the frequency of the pure tone or of the carrier. To extract dT1 periodicities, the auditory system must be able to phase lock to the instantaneous phase of the signal. dT1 periodicities will be called temporal fine structure periodicities (TSFperiod) in the following. Second, the periodicities of the envelope (delay dT2 in Fig. 1), which provide information about the F0 of harmonic complex tones. dT2 periodicities will be called temporal envelope periodicities (TEperiod) in the following. Third, the periodicities of the fine structure modulated by the envelope (delay dT3 in Fig. 1). This type of periodicity corresponds to the time delay between two energetic phases of the signal located near two successive maxima of the envelope. For harmonic complex tones, dT2 is equal to dT3, but for shifted complex tones, dT2 and dT3 are different. As dT3 is necessarily a multiple of dT1, for a shifted complex dT3 will be equal to either n × dT1 or (n + 1) × dT1, where the integer n is chosen to satisfy the inequalities n × dT1 ≤ dT2 < (n + 1) × dT1. As such, dT3 depends on both dT1 (fine structure information) and dT2 (envelope information) and can therefore be described as a periodicity related to an interaction between fine structure and envelope. dT3 periodicities will be called interaction periodicities (TE × TSFperiod) in the following.
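
The relation between dT1, dT2 and dT3 can be written as a short calculation. The function below is a minimal sketch, assuming dT1 = 1/fc and dT2 = 1/fenv for a carrier frequency fc and an envelope rate fenv; the function name and the example frequencies are hypothetical and chosen only for illustration.

```python
def interaction_periods(fc_hz, fenv_hz):
    """Candidate dT3 values in seconds for a carrier fc and an envelope rate fenv.

    With dT1 = 1/fc and dT2 = 1/fenv, the integer n that satisfies
    n * dT1 <= dT2 < (n + 1) * dT1 is floor(fc / fenv).
    """
    n = int(fc_hz // fenv_hz)
    dt1 = 1.0 / fc_hz
    if fc_hz % fenv_hz == 0:
        # Harmonic case: dT2 is an exact multiple of dT1, so dT3 equals dT2.
        return [n * dt1]
    # Shifted case: the two multiples of dT1 that bracket dT2.
    return [n * dt1, (n + 1) * dt1]

shifted_case = interaction_periods(1000, 230)    # n = 4
harmonic_case = interaction_periods(1000, 250)   # n = 4, dT3 = dT2
```

For fc = 1000 Hz and fenv = 230 Hz, dT2 is about 4.35 ms and the two candidates are about 4 ms and 5 ms, that is, pitch candidates near 250 Hz and 200 Hz on either side of the 230-Hz envelope rate. The modulo test is exact for the integer inputs used here; with arbitrary floating-point frequencies a tolerance would be needed.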

Fig. 1

Potential temporal periodicity cues usable for pitch perception of a pure tone (upper waveform) or a shifted complex tone (lower waveform). dT1 is the time delay between two adjacent peaks in the temporal fine structure, dT2 is the time delay between two adjacent peaks in the temporal envelope and dT3 is the time delay between two maxima of the temporal fine structure located near two successive maxima of the temporal envelope. dT2 and dT3 are identical for harmonic complex tones but different for shifted complex tones

However, in most cases, when a harmonic complex tone and a shifted complex tone with the same envelope frequency lead to different pitch percepts, the pitch percept is assumed to be elicited by fine structure only and to be independent of envelope. The point may appear trivial, but the aim of this study is simply to convince the reader that this assumption does not hold.

2 Methods

2.1 Simulation

A very basic simulation was performed to check for the potential use of various types of periodicity cues (i.e. TSFperiod, TEperiod or TE × TSFperiod) with various signals used in the literature. The basic idea was to pass the stimulus through a dynamic compressive gammachirp (the default auditory filter-bank of the Auditory Image Model) (Irino and Patterson 2006) and to transform the output waveform into a spiketrain using a threshold-dependent model. This model is more likely to spike whenever the output exceeds a threshold value (see Eq. 1). Moreover, a reasonable refractory period of 1/2500 s (0.4 ms, i.e. a maximum discharge rate of 2500 Hz) is added to the simulation. The periodicities of the spiketrain are finally extracted with an autocorrelation. The spiketrain generation is thus based on the following formula, which depends on both envelope and fine structure temporal information:

$$\text{spiketrain}(t)=\left\{ \begin{aligned} & 1\ \text{if}\ U(t)\times \text{TSF}_{\text{signal}>0}(t)\times E(t)\times P_{\text{refract}}(t)>\text{Thres} \\ & 0\ \text{if}\ U(t)\times \text{TSF}_{\text{signal}>0}(t)\times E(t)\times P_{\text{refract}}(t)\le \text{Thres} \\ \end{aligned} \right.$$
(1)

Where

U(t) is a uniform internal noise between 0 and 1,

TSF signal > 0 (t) is the positive part of the fine structure at the output of an auditory filter,

E(t) is the envelope at the output of an auditory filter,

P refract (t) is the refractory function. It is equal to either 0 or 1 and implements a refractory period of 1/2500 s: if spiketrain(t0) = 1, then P refract (t) = 0 for t0 < t < t0 + 1/2500 and P refract (t0 + 1/2500) = 1,

Thres is the discharge threshold, set here to 0.5,

Which is exactly equivalent to:

$$\text{spiketrain}(t)=\left\{ \begin{aligned} & 1\ \text{if}\ U(t)\times \text{Signal}_{\text{signal}>0}(t)\times P_{\text{refract}}(t)>\text{Thres} \\ & 0\ \text{if}\ U(t)\times \text{Signal}_{\text{signal}>0}(t)\times P_{\text{refract}}(t)\le \text{Thres} \\ \end{aligned} \right.$$
(2)

where

Signal signal > 0 (t) is the positive part of the input signal at the output of an auditory filter.
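
The equivalence between Eq. 1 and Eq. 2 follows from the Hilbert decomposition of the filter output. As a brief sketch, assume that the output can be written as s(t) = E(t) cos φ(t) with E(t) ≥ 0, and that the fine structure is the term cos φ(t); the symbols s(t) and φ(t) are notation introduced here for illustration and do not appear in the equations above. Because E(t) is non-negative, taking the positive part of the product only affects the fine structure term:

$$\text{Signal}_{\text{signal}>0}(t)=\max \big(E(t)\cos \varphi (t),\,0\big)=E(t)\times \max \big(\cos \varphi (t),\,0\big)=E(t)\times \text{TSF}_{\text{signal}>0}(t)$$

The spiking rule therefore depends on the envelope and the fine structure jointly, through their product, and not on either cue alone.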

As an internal noise U has been added, each signal is passed 300 times through the simulation model to estimate the distribution of the periodicities of the spiketrain. Moreover, for each signal, a single auditory filter output, located in the passband of the input signal and centred on the carrier frequency (fc), has been used.
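
The procedure described in this section can be sketched in a few lines. The code below is not the simulation used in this study: it omits the gammachirp filter-bank entirely, uses a raised-cosine modulator and a peak-normalised waveform as stand-ins, and all function names, the sampling rate, the stimulus frequencies and the random seed are assumptions made for illustration. It only shows the thresholding rule of Eq. 2, the refractory function, the 300 repetitions and an all-order interval histogram, which is the autocorrelation of a binary spiketrain.

```python
import math
import random

def spike_train(signal, fs, threshold=0.5, max_rate=2500.0, rng=None):
    """Sample indices of spikes obtained with the thresholding rule of Eq. 2."""
    if rng is None:
        rng = random.Random(0)
    refractory = math.ceil(fs / max_rate)    # samples before a new spike is allowed
    spikes, last = [], -refractory
    for i, s in enumerate(signal):
        if i - last < refractory:            # P_refract(t) = 0: no spike possible
            continue
        if rng.random() * max(s, 0.0) > threshold:   # U(t) x positive part > Thres
            spikes.append(i)
            last = i
    return spikes

def interval_histogram(spikes, max_lag):
    """All-order inter-spike interval counts for lags of 0..max_lag samples."""
    counts = [0] * (max_lag + 1)
    for a in range(len(spikes)):
        for b in range(a + 1, len(spikes)):
            lag = spikes[b] - spikes[a]
            if lag > max_lag:
                break                        # spikes are sorted, later lags are larger
            counts[lag] += 1
    return counts

fs = 48000                                   # sampling rate in Hz (illustrative)
fc, fenv = 1000.0, 230.0                     # carrier and envelope rate (illustrative)
signal = [
    0.5 * (1.0 + math.cos(2 * math.pi * fenv * i / fs))
    * math.cos(2 * math.pi * fc * i / fs)
    for i in range(int(0.05 * fs))           # 50 ms, peak amplitude 1
]

rng = random.Random(1)                       # fixed seed for reproducibility
max_lag = fs // 100                          # look at lags up to 10 ms
hist = [0] * (max_lag + 1)
all_spikes = []
for _ in range(300):                         # 300 passes, as in the text
    spikes = spike_train(signal, fs, rng=rng)
    all_spikes.append(spikes)
    for lag, count in enumerate(interval_histogram(spikes, max_lag)):
        hist[lag] += count
```

Lags at which `hist` is large are the candidate periodicities of the spiketrain; dividing a lag by `fs` converts it to seconds. Because spikes can only occur where the positive part of the waveform exceeds the threshold, the intervals are tied to carrier peaks that fall near envelope maxima.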

2.2 Stimuli

The stimuli used by Santurette and Dau (2011) and by Oxenham et al. (2011) were generated and processed through the simulation. In Santurette and Dau (2011), the signals were generated by multiplying a pure tone carrier of frequency fc by a half-wave rectified sine wave modulator of modulation frequency fenv that was low-pass filtered by a 4th order Butterworth filter with a cut-off frequency of 0.2 × fc. All signals were generated at 50 dB SPL and mixed with a TEN noise at 34 dB SPL per ERB. All fc and fenv values are indicated in Fig. 2. When fc is not an integer multiple of fenv, this manipulation produces a shifted complex. In Oxenham et al. (2011), harmonic complex tones at various F0 values (indicated in Fig. 3) were generated by adding, in random phase, up to 12 consecutive harmonics beginning with the sixth. Harmonics above 20 kHz were not generated. All harmonics were generated at 55 dB SPL per component and all signals were embedded in a broadband TEN noise at 45 dB per ERB. A shifted version of each harmonic complex tone was also generated by shifting all components of the complex tone by 0.5 × F0.
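
The component frequencies of the harmonic and shifted complexes described above can be listed with a small helper. This is only a sketch: the function name and its parameters are hypothetical, the example F0 of 1200 Hz is chosen for illustration, and applying the 20-kHz limit to the shifted frequencies (rather than to the original harmonics) is an assumption, since the text does not specify it. Levels, random phases and the TEN noise are not modelled.

```python
def complex_components(f0_hz, first_harmonic=6, n_harmonics=12,
                       shift_in_f0=0.0, max_freq_hz=20000.0):
    """Component frequencies (Hz) of a harmonic or frequency-shifted complex."""
    freqs = [
        (k + shift_in_f0) * f0_hz            # shift expressed as a fraction of F0
        for k in range(first_harmonic, first_harmonic + n_harmonics)
    ]
    # Assumption: components above the limit are simply dropped.
    return [f for f in freqs if f <= max_freq_hz]

harmonic = complex_components(1200.0)                   # 7200, 8400, ... Hz
shifted = complex_components(1200.0, shift_in_f0=0.5)   # 7800, 9000, ... Hz

# Adjacent components stay F0 apart in both cases (same envelope rate),
# but the shifted components are no longer integer multiples of F0.
spacing_harmonic = {b - a for a, b in zip(harmonic, harmonic[1:])}
spacing_shifted = {b - a for a, b in zip(shifted, shifted[1:])}
```

With an F0 of 1200 Hz, the twelfth requested component exceeds 20 kHz in both versions, so each complex contains 11 components.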

Fig. 2

Outputs of the simulation fed with the signals used in Santurette and Dau (2011). The distributions of pitch estimates are always related to TE × TSFperiod and never to TEperiod. This is closely consistent with the data reported in Figs. 4 and 6 in Santurette and Dau (2011)

3 Results

The outputs of the simulation provide the distributions of the temporal periodicities of the spiketrain. These distributions are assumed to predict the perceived pitch evoked by the signal.

The results of the simulation plotted in Fig. 2 show that, as in Santurette and Dau (2011), the predicted pitch values are always related to TE × TSFperiod and never to TEperiod. This is strongly consistent with the results reported by the authors.

The results of the simulation plotted in Fig. 3 are strongly consistent with Oxenham et al. (2011). Using a plausible refractory period of 1/2500 s (a maximum discharge rate of 2500 Hz), the simulation is able to extract the TE × TSFperiod even if the TSFperiod are too fast to be correctly encoded. The decrease in performance reported in Oxenham et al. (2011) when increasing the F0 from 1200 Hz up to 2000 Hz is also simulated. This decrease is probably explained by an increase in resolvability (fewer and fewer interactions between harmonic components) when increasing the F0.

Fig. 3

Outputs of the simulation at the auditory filters centred on fc, fed with the signals used in Oxenham et al. (2011). The distributions of pitch estimation are closely consistent with the data reported in Fig. 2B in Oxenham et al. (2011). The gray arrow indicates the decrease in the number of predictions when increasing the resolvability of the complex by increasing the F0

Finally, the results of the simulation plotted in the right column of Fig. 4 are strongly consistent with the results found with the signals used by Santurette and Dau (2011), and also show that the predicted pitch values are always related to TE × TSFperiod and never to TEperiod.

Fig. 4

Output of the simulation fed with the signals used in Oxenham et al. (2011) with harmonic complex tones (left column) and shifted complex tones (right column). The distributions of pitch estimates are always related to TE × TSFperiod, which predicts a peak centred in the gray rectangle on the left and peaks on either side of the gray rectangle on the right. This is closely consistent with the data reported in experiments 1 and 2 in Oxenham et al. (2011)

As a control, the effect of the threshold value used in the simulation was tested with one complex tone having an F0 of 1400 Hz (Fig. 5). Varying the threshold value has a substantial effect on the predictions. A threshold below 0.3 does not provide reliable periodicity estimates. A threshold from 0.4 to 0.8 provides reliable and consistent estimates. A high threshold (above 0.9) prevents any periodicity from being reported. The value of 0.5 used in the previous simulations therefore appears to be a good compromise. It is worth noting that such a threshold model is physiologically plausible and could be related to the thresholds of auditory nerve fibres previously described in the literature (Sachs and Abbas 1974).
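
One way to read this dependence on the threshold is through the spiking probability of a single sample. The sketch below assumes a filter output normalised so that its peak amplitude is 1 (the normalisation is not specified in the text) and ignores the refractory function; the function name is hypothetical. For U uniform between 0 and 1, the probability that U × amplitude exceeds the threshold is 1 − Thres/amplitude when the amplitude is above the threshold, and 0 otherwise.

```python
def spike_probability(amplitude, threshold):
    """P(U * amplitude > threshold) for U uniform on [0, 1], no refractoriness."""
    if amplitude <= 0.0 or amplitude <= threshold:
        return 0.0                           # the product can never exceed the threshold
    return 1.0 - threshold / amplitude

p_peak_mid = spike_probability(1.0, 0.5)     # a peak sample with Thres = 0.5
p_peak_high = spike_probability(1.0, 0.9)    # the same sample with Thres = 0.9
p_weak_mid = spike_probability(0.4, 0.5)     # a weak sample with Thres = 0.5
p_weak_zero = spike_probability(0.2, 0.0)    # a weak sample with Thres = 0
```

Under these assumptions, a threshold near 0 lets almost any positive sample fire, so spikes are not confined to the most intense phases, whereas a threshold near 1 leaves only a small firing probability even at the peaks. This may account for the unreliable estimates at low thresholds and the absence of reported periodicities at high thresholds, although the sketch does not reproduce the simulation of Fig. 5.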

Fig. 5

Output of the simulation fed with a single harmonic complex tone (F0 = 1400 Hz) used in Oxenham et al. (2011). Effect of threshold value from 0 to 1 on periodicity estimations

4 Discussion

4.1 Conclusions

These simulations are of interest for two reasons.

First, they show that any perceptual difference empirically observed between harmonic complex tones and shifted complex tones should not be interpreted as a pure effect of fine structure. In fact, the pitch evoked by a shifted complex is based on interaction cues between envelope and fine structure (TE × TSFperiod). Using these signals to tease apart temporal envelope cues from temporal fine structure cues is therefore a conceptual error. This undermines the conclusion that the pitch of unresolved complex tones is based only on fine structure information.

Second, when pitch perception of unresolved complex tones is considered in terms of an interaction between envelope and fine structure, the limitation of phase locking appears to be much less critical than when it is considered in terms of fine structure only. Indeed, there is no need to encode every phase of the signal in order to encode the most intense phases located near an envelope maximum. This explains why the simulation can extract a periodicity related to pitch when the carrier frequency is above 10 kHz (Fig. 3).

4.2 Limitations

First, the current simulations are not a physiologically based model of pitch perception, and the refractory period used here does not accurately describe the physiological constraints on phase locking. Further work using realistic models of the auditory periphery is needed to explore this question.

Second, this simulation does not explain all the data reported in the literature on pitch perception. For example, experiment 2c in Oxenham et al. (2011) reports some pitch perception using dichotic stimulation, with even-numbered harmonics presented to the right ear and odd-numbered harmonics presented to the left ear. This experimental manipulation increases the resolvability of the signals and prevents our simulation from extracting any temporal periodicities, and therefore from predicting any pitch percept.