Headphone screening to facilitate web-based auditory experiments
Psychophysical experiments conducted remotely over the internet permit data collection from large numbers of participants but sacrifice control over sound presentation and therefore are not widely employed in hearing research. To help standardize online sound presentation, we introduce a brief psychophysical test for determining whether online experiment participants are wearing headphones. Listeners judge which of three pure tones is quietest, with one of the tones presented 180° out of phase across the stereo channels. This task is intended to be easy over headphones but difficult over loudspeakers due to phase-cancellation. We validated the test in the lab by testing listeners known to be wearing headphones or listening over loudspeakers. The screening test was effective and efficient, discriminating between the two modes of listening with a small number of trials. When run online, a bimodal distribution of scores was obtained, suggesting that some participants performed the task over loudspeakers despite instructions to use headphones. The ability to detect and screen out these participants mitigates concerns over sound quality for online experiments, a first step toward opening auditory perceptual research to the possibilities afforded by crowdsourcing.
Keywords: Psychometrics/testing, Stimulus control, Audition
Online behavioral experiments allow investigators to gather data quickly from large numbers of participants. This makes behavioral research highly accessible and efficient, and the ability to obtain data from large samples or diverse populations allows new kinds of questions to be addressed. Crowdsourcing has become popular in a number of subfields within cognitive psychology (Buhrmester et al., 2011; Crump et al., 2013), including visual perception (Brady and Alvarez, 2011; Freeman et al., 2013; Shin and Ma, 2016), cognition (Frank and Goodman, 2012; Hartshorne and Germine, 2015), and linguistics (Sprouse, 2010; Gibson et al., 2011; Saunders et al., 2013). Experimenters in these fields have developed methods to maximize the quality of web-collected data (Meade and Bartholomew, 2012; Chandler et al., 2013). By contrast, auditory psychophysics has not adopted crowdsourcing to the same degree as other fields of psychology, presumably due in part to concerns about sound presentation. Interference from background noise, the poor fidelity of laptop speakers, and environmental reverberation could all reduce control over what a participant hears.
One simple way to improve the control of sound delivery online is to ensure that participants are wearing headphones or earphones (for brevity the term “headphones” will henceforth be used to refer to both). Headphones tend to attenuate external sources by partly occluding the ear, and minimize the distance between eardrum and transducer, thus improving the signal-to-noise ratio in favor of the sounds presented by the experimenter. Headphones also enable presentation of separate signals to the two ears (permitting binaural tests). Here we present a method to help ensure that participants are wearing headphones, along with validation of this method in the lab, where we knew whether participants were listening over headphones or over loudspeakers.
Methods
We used six trials of a 3-AFC “Which tone is quietest?” task. All three tones were 200-Hz pure tones with a duration of 1,000 ms and 100-ms on- and off-ramps (each produced by half of a Hann window). A 3-AFC task (rather than 2-AFC) was chosen to reduce the probability of passing the screen by random guessing. A low frequency (200 Hz) was chosen to produce a broad region of attenuation (Figure S1), intended to make the test robust to variation in head position. One of the tones had a level of −6 dB relative to the other two (which were of equal intensity). In addition, one of the two equal-intensity tones was phase reversed between the stereo channels; the other two tones had no phase difference between stereo channels (starting phases in the L/R channels were therefore 0°/0°, 0°/0°, and 0°/180° for the less intense tone and the two more intense tones, respectively). On each trial, the three tones were presented in random order with an interstimulus interval of 500 ms. The listener was asked to pick the interval containing the quietest tone by selecting one of three buttons labeled “FIRST sound is SOFTEST,” “SECOND sound is SOFTEST,” and “THIRD sound is SOFTEST.”
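As a concrete illustration, stimuli of this form can be synthesized in a few lines. The following is a minimal NumPy sketch, not the authors' released code; the sample rate and function name are our own assumptions.

```python
import numpy as np

def make_screening_tones(fs=44100, f=200.0, dur=1.0, ramp=0.1):
    """Sketch of the three screening stimuli: two equal-level tones
    (one anti-phase across the stereo channels) and one tone 6 dB
    quieter. fs (sample rate) is an illustrative assumption."""
    t = np.arange(int(fs * dur)) / fs
    tone = np.sin(2 * np.pi * f * t)
    # 100-ms on/off ramps, each half of a Hann window
    n_ramp = int(fs * ramp)
    ramp_up = 0.5 * (1 - np.cos(np.pi * np.arange(n_ramp) / n_ramp))
    env = np.ones_like(tone)
    env[:n_ramp] = ramp_up
    env[-n_ramp:] = ramp_up[::-1]
    tone *= env
    quiet = tone * 10 ** (-6 / 20)       # -6 dB re the other two
    quiet_lr = np.stack([quiet, quiet])  # 0°/0° across L/R
    diotic = np.stack([tone, tone])      # 0°/0° across L/R
    antiphase = np.stack([tone, -tone])  # 0°/180° across L/R
    return quiet_lr, diotic, antiphase
```

On each trial, the three stereo signals would be presented in shuffled order; over headphones the anti-phase tone sounds as loud as the diotic one, whereas over loudspeakers it is attenuated by cancellation.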
Participants completed the task on a Mac Mini computer in a quiet office environment using the same Mechanical Turk interface used by online participants. Half of the participants (N = 20, 15 females, mean age = 27.6 years, standard deviation [SD] = 12.7) completed the task while listening to stimuli over Sennheiser HD 280 headphones. The other half (N = 20, 11 females, mean age = 26.5 years, SD = 5.6) completed the task while listening over a pair of Harman/Kardon HK206 free-field speakers. The speakers were placed so that their centers were 40 cm apart and were set 40 cm back from the edge of the table at which the participant was seated (i.e., at approximately ±30° relative to the listener). In both conditions, sound levels were calibrated to present tones at 70 dB SPL at the ear (using a Svantek sound meter connected either to a GRAS artificial ear or to a GRAS free-field microphone). In all other respects, the experiment was identical to the online experiment.
In a separate experiment, we invited participants to bring their own laptops into the lab (N = 22, 13 females, mean age = 27.3 years, SD = 10.6) and tested them over their laptop speakers in four different locations around the building (in random order). These testing spaces were selected to cover a range of room sizes and to offer different reflective surfaces near the listener. For example, in one room (Server room: Adverse) the laptop was surrounded by clutter including cardboard boxes and drinking glasses; in another room (Atrium), the laptop was placed alongside a wall in a very large reverberant space. Two of the spaces (Atrium and Ping-pong room) were open to use by others and had commensurate background noise. Participants were told to use the laptop as they normally would, without moving it from its predetermined location in the room.
The online screening task began with the repeated presentation of a noise sample for loudness calibration. This step was intended to prevent presentation levels that would render stimuli in the main experiment (after screening) uncomfortably loud or inaudible; it was not a calibration of the screening task itself. As such, the calibration noise was spectrally matched to the stimuli used in our experiments (it was a broadband, speech-shaped noise). Participants were asked to adjust their computer volume until the noise sample was at a comfortable level. The rms amplitude of the stored noise waveform was 0.30, as high as possible subject to the constraint of avoiding clipping. Relative to this calibration noise, the levels of the test tones presented in the screening task were −6.5 dB (for the two more intense tones) and −12.5 dB (for the less intense tone). We expect the screening task to be robust to different level settings as long as the (in-phase) test tones are audible. Nonetheless, if the presentation level were set such that the test tones were inaudible, we would expect listeners to perform at chance.
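The dB arithmetic above is straightforward to make explicit. A small helper (ours, not from the paper) converts the stated levels into linear rms values relative to the calibration noise:

```python
import math

def rms_from_db(ref_rms, db):
    """Convert a level in dB re a reference rms to a linear rms value."""
    return ref_rms * 10 ** (db / 20)

# Levels from the text: calibration noise rms 0.30; test tones at
# -6.5 dB (louder pair) and -12.5 dB (quieter tone) re that noise.
loud_rms = rms_from_db(0.30, -6.5)    # ~0.142
quiet_rms = rms_from_db(0.30, -12.5)  # ~0.071
```

Note that the two tone levels differ by exactly 6 dB, matching the level step the listener must detect.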
To pass the headphone screening, participants must correctly answer at least five of the six level discrimination trials. No feedback was provided. Responses are scored only if all trials are completed. Because we use a three-alternative task, correctly answering five or more of the six trials by guessing is unlikely (it should occur with a probability of 0.0178). Most participants who are not engaged with the task should be screened out. If a participant is engaged but is listening over speakers rather than over headphones, then the tone in anti-phase will be heavily attenuated due to cancellation and should be judged (incorrectly) as the least intense of the three tones. In such a situation, the participant is again unlikely to give the correct response on five of six trials and in fact should perform below the chance level of two correct trials.
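The quoted guessing probability follows from the binomial distribution. The sketch below (our own helper, not part of the published screening code) reproduces it and shows how a stricter pass threshold would change it:

```python
from math import comb

def p_pass_by_guessing(n_trials=6, k_min=5, p_correct=1/3):
    """Probability of passing the screen by guessing alone: at least
    k_min correct out of n_trials on a 3-AFC task (p = 1/3 per trial)."""
    return sum(comb(n_trials, k) * p_correct ** k * (1 - p_correct) ** (n_trials - k)
               for k in range(k_min, n_trials + 1))

print(p_pass_by_guessing())          # ~0.0178 (threshold of 5/6 correct)
print(p_pass_by_guessing(k_min=6))   # ~0.0014 (threshold of 6/6 correct)
```

The same function makes explicit the trade-off discussed later: raising the threshold to all six trials correct makes a pass by guessing roughly an order of magnitude less likely.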
The online screening task was run on 5,154 participants (2,590 females, mean age = 34.5 years, SD = 11.1). The 184 (3.6%) reporting hearing impairment were included in our general analysis (i.e., not analyzed separately). Listeners unable to hear the 200-Hz test tone due to hearing loss (or for any other reason) would likely be screened out.
A control task with all three tones in-phase (i.e., no anti-phase tones) was also run online, with 150 participants (75 females, mean age = 38.5 years, SD = 11.7). The three participants (2%) who reported hearing impairment were included in our general analysis.
Simulation and acoustic measurement of anti-phase attenuation
The screening test relies on the attenuation of the anti-phase tone when played in free-field conditions. We thus first evaluated the extent of the attenuation produced by anti-phase tones. We used simulations to choose an appropriate test frequency and then made measurements to assess the degree of attenuation in realistic listening conditions.
Figure 2A shows the expected attenuation over space in ideal free-field conditions (see Supplemental Materials). In simulations, the test frequency used in the screening test (200 Hz) produces consistent attenuation over a broad region of space, making the attenuation effect robust to variations in head position. Higher frequencies produce attenuation that depends sensitively on head position and thus are not ideal for our screening task. Figure 2B shows measurements of attenuation of a 200-Hz anti-phase tone using a head-and-torso simulator placed at various locations relative to the speakers. Attenuation exceeded 20 dB in every case, substantially more than the 6-dB level difference the screening test requires listeners to detect.
Administering the test over laptop speakers (Fig. 4) again produced substantially worse performance than when participants were wearing headphones (Fig. 3, in blue), although it elicited a different pattern of responses than our test with desktop speakers (K-S test between the distributions of Figs. 3 and 4B in red, p < 0.05, D = 0.37), with a greater proportion passing our threshold (>4 correct). The screening test thus failed to detect 4 of 22 participants using laptop speakers, a modest but nonnegligible subset of our sample. The distribution of participants’ mean scores (Fig. 4C) indicates that some participants performed poorly in all rooms (mean scores in the range 0–1) while others performed well in all rooms (mean scores in the range 5–6). Examining scores obtained in each room (Figure S2) also suggests that the testing space had little impact on performance. Instead, the difference in performance could have arisen from variation in laptop speaker designs or in the distance from the speakers to the ears due to user behavior (e.g., leaning in). Some participants (3/22) even reported using vibrations felt on the surface of the laptop to perform the task. Because 200 Hz is within the range of vibrotactile stimulation, and because phase cancellation could also occur in surface vibrations, using touch instead of free-field hearing might not necessarily alter the expected results. However, this strategy could improve performance if vibrations in the laptop case fail to attenuate to the degree they would in the air, for instance if a participant placed their hand close to a speaker.
Figures 3 and 4 suggest that our screening task is more effective (i.e., produces lower scores absent headphones) over desktop speakers than over laptop speakers. This might be expected if desktop speakers generally sit farther from the listener, because anti-phase attenuation with low-frequency tones becomes more reliable as distance to the listener increases (Figure S1B).
The dependence of test effectiveness on hardware raises the question of what sort of listening setup online participants will tend to have. To address this issue, for a portion of our online experiments (described below), we queried participants about their hardware on our online demographics page. We found them split rather evenly between desktops and laptops: of the 212 participants asked, 97 said they were using desktops while 107 said they were using laptops (45.8% and 50.5%, respectively). The remaining 8 participants (3.8%) said they were using other devices (e.g., tablet, smartphone).
The scores obtained from this control version of the screening task are distributed differently from the scores from our standard task (K-S test, p < 0.0001, D = 0.24). In particular, there are far fewer below-chance scores. This result suggests that the preponderance of below-chance scores observed in the standard task (i.e., when anti-phase tones are used; Fig. 4) is not due to confusion of instructions. The control task results also reveal that some proportion of online participants are screened out for poor performance even without anti-phase tones—given a pass threshold of 5 or more trials correct, 18 of 150 participants (12.0%) in this control task would have failed to pass screening (35.3% fail in the standard task with anti-phase tones). In contrast, none of the 20 participants who performed the task in the lab over headphones would have been screened out (Fig. 3). Our procedure appears to act as a screen for a subset of online participants that perform poorly (e.g., due to low motivation, difficulty understanding the instructions, or adverse environmental listening conditions), in addition to screening out those attempting the task over loudspeakers.
We developed a headphone screening task by exploiting phase-cancellation in free-field conditions coupled with dichotic headphone presentation. The screening consisted of six trials of a 3-AFC intensity discrimination task. In the lab, participants with headphones performed very well, whereas participants listening over loudspeakers performed very poorly. When run online (where we cannot definitively verify the listening modality), a distribution of scores was obtained that suggests some participants were indeed listening over loudspeakers despite being instructed to wear headphones and can be screened out effectively with our task.
The effectiveness of our screening task can be considered in terms of two kinds of screening errors: screening out participants who are in fact wearing headphones, or passing participants who are not wearing headphones. The first type of error (excluding participants despite headphone use) can result from poor performance independent of the listening device, because participants unable to perform well on a simple 3-AFC task are screened out. This seems desirable, and the cost of such failures is minimal since participants excluded in this way are easily replaced (especially in online testing). The second type of screening error (including participants who are not wearing headphones) is potentially more concerning since it permits acquisition of data from listeners whose sound presentation may be suspect. The relative rates of each kind of error could be altered depending on the needs of the experimenter by changing the threshold required to pass the screening task. For example, requiring >5 correct instead of >4 correct would result in a screen that is more stringent, and would be expected to increase errors of the first kind while reducing errors of the second kind.
Differences between in-lab and online experiments
We found that online participants were much more likely to fail the headphone check than in-lab participants who were wearing headphones (failure rates were 35.3% vs. 0%, respectively). What accounts for the relatively low pass rate of this task online? As argued above, the tendency for below-chance performance suggests that some participants were not in fact wearing headphones despite the task instructions, but this might not be the only difference. Hearing impairment in online participants seems unlikely to have substantially contributed to the online pass rate, because just 3.6% reported any impairment. It is perhaps more likely that some participants wore headphones but did not understand the task instructions. Prior studies using crowdsourcing have observed that a significant number of participants fail to follow instructions, potentially reflecting differences in motivation or compliance between online and in-lab participants. As such, it is standard for experiments to contain catch trials (Crump et al., 2013). Our screening task may thus serve both to screen out participants who ignored the instructions to use headphones as well as participants who are unwilling or unable to follow the task instructions. Both these functions likely help to improve data quality.
Limitations and possibilities in crowdsourced auditory psychophysics
Although our methods can help to better control sound presentation in online experiments, crowdsourcing obviously cannot replace in-lab auditory psychophysics. Commercially available headphones vary in their frequency response and in how tightly they couple to the ear, so neither the exact spectrum of the stimulus at the eardrum nor the degree of external sound attenuation can be known. This precludes, for instance, testing a participant’s hearing with an audiogram. In addition, soundcards and input devices may have small, unknown time delays, making precise measurement of reaction times difficult. Because environmental noise is likely to remain audible in many situations despite attenuation by headphones, online testing is inappropriate for experiments with stimuli near absolute threshold and may be of limited use when comparing performance across individuals (whose surroundings likely vary). Microphone access could in principle allow experimenters to screen for environmental noise (or even for headphone use), but this may not be possible on some computer setups, and even when possible may be precluded by concerns over participants’ privacy. We have also noted cases in which our screening method could be affected by uncommon loudspeaker setups: for example, subwoofers that broadcast only one audio channel (as may occur in some desktop speaker setups, as well as in high-end “gaming” laptops and recent models of the MacBook Pro), setups that combine stereo channels prior to output (as may occur in devices with just one speaker), or speakers with poor low-frequency response that render the test tones inaudible. In many of these cases participants would be screened out as well, but the mechanism by which the screening operates would not be as intended.
The limitations of online experiments are less restrictive for some areas of research than others. In many situations, precise control of stimulus level and spectrum may not be critical. For instance, experiments from our own lab on attention-driven streaming (Woods and McDermott, 2015) and melody recognition (McDermott et al., 2008) have been successfully replicated online.
Crowdsourcing has the potential to be broadly useful in hearing research because it allows one to ask questions that are difficult to approach with conventional lab-based experiments for practical reasons. For example, some experiments require large numbers of participants (Kidd et al., 2007; McDermott et al., 2010; Hartshorne and Germine, 2015; Teki et al., 2016) and are much more efficiently conducted online, where hundreds of participants can be run per day. Experiments may also require participants from disparate cultural backgrounds (Curtis and Bharucha, 2009; Henrich et al., 2010), who are more readily recruited online than in person. Alternatively, it may be desirable to run only a small number of trials on each participant, or even just a single critical trial (Simons and Chabris, 1999; Shin and Ma, 2016), after which the participant may become aware of the experiment’s purpose. In all of these cases recruiting adequate sets of participants in the lab might be prohibitively difficult, and online experiments facilitated by a headphone check could be a useful addition to a psychoacoustician’s toolbox.
This work was supported by an NSF CAREER award and NIH grant 1R01DC014739-01A1 to J.H.M. The authors thank Malinda McPherson, Alex Kell, and Erica Shook for sharing data from Mechanical Turk experiments, Dorit Kliemann for help recruiting subjects for in-lab validation experiments, and Ray Gonzalez and Kelsey R. Allen for organizing code for distribution.
Code implementing the headphone screening task can be downloaded from the McDermott lab website (http://mcdermottlab.mit.edu/downloads.html).
- Gardner, W. G. (2002). Reverberation algorithms. In Applications of digital signal processing to audio and acoustics (pp. 85–131). Springer US.
- Gutierrez-Parera, P., Lopez, J. J., & Aguilera, E. (2015). On the influence of headphone quality in the spatial immersion produced by binaural recordings. In Audio Engineering Society Convention 138. Audio Engineering Society.
- Hartshorne, J. K., & Germine, L. T. (2015). When does cognitive functioning peak? The asynchronous rise and fall of different cognitive abilities across the life span. Psychological Science, 26.
- Jensen, F. B., Kuperman, W. A., Porter, M. B., & Schmidt, H. (2000). Computational ocean acoustics. Springer Science & Business Media.