Conjoint psychometric field estimation for bilateral audiometry
Abstract
Behavioral testing in perceptual or cognitive domains requires querying a subject multiple times in order to quantify his or her ability in the corresponding domain. These queries must be conducted sequentially, and any additional testing domains are also typically tested sequentially, such as with distinct tests comprising a test battery. As a result, existing behavioral tests are often lengthy and do not offer comprehensive evaluation. The use of active machine-learning kernel methods for behavioral assessment provides extremely flexible yet efficient estimation tools to more thoroughly investigate perceptual or cognitive processes without incurring the penalty of excessive testing time. Audiometry represents perhaps the simplest test case to demonstrate the utility of these techniques. In pure-tone audiometry, hearing is assessed in the two-dimensional input space of frequency and intensity, and the test is repeated for both ears. Although an individual’s ears are not linked physiologically, they share many features in common that lead to correlations suitable for exploitation in testing. The bilateral audiogram estimates hearing thresholds in both ears simultaneously by conjoining their separate input domains into a single search space, which can be evaluated efficiently with modern machine-learning methods. The result is the introduction of the first conjoint psychometric function estimation procedure, which consistently delivers accurate results in significantly less time than sequential disjoint estimators.
Keywords
Psychophysics · Perceptual testing · Audiometry · Hearing · Psychometric function

A psychometric curve represents the probabilistic behavior of a subject in response to a unidimensional perceptual or cognitive task. These curves take the form of monotonically increasing probabilities as a function of increasing task ease, indexed by a single independent variable (Fechner, 1860/1966; Kingdom & Prins, 2016). When tasks are represented by multiple independent variables, a psychometric field results. Estimation procedures for unidimensional psychometric curves are many, varied, and widespread and have a long history. Estimation procedures for multidimensional psychometric fields, on the other hand, are much less advanced. In some ways, psychometric field estimation could be performed like any multiple logistic regression procedure (Hosmer & Lemeshow, 2013). The main issue with this simple formulation, however, is that because data must be accumulated sequentially in these tasks, inefficient inference requiring hours of subject performance is too impractical for any kind of mainstream test. As a result, the relatively few psychometric field estimation procedures tend to have been customized to each application (Bengtsson, Olsson, Heijl, & Rootzén, 1997; Heijl & Krakau, 1975; Lesmes, Jeon, Lu, & Dosher, 2006; Shen & Richards, 2013), with relatively few studies proposing general principles (Cohen, 2003; DiMattina, 2015).
An alternative approach to psychometric field estimation recasts the key inference step from parametric logistic regression to nonparametric or semiparametric probabilistic classification (Song, Garnett, & Barbour, 2017; Song, Sukesan, & Barbour, 2018). This framework naturally scales to multiple input dimensions, provides great flexibility for estimating a wide variety of functions, and sets the stage for other novel theoretical advances. One such advance involves actively learning which stimuli would be most valuable to deliver in order to rapidly converge onto an accurate estimate (Song et al., 2018; Song et al., 2015b). This procedure, referred to as active testing, yields estimation efficiencies at least as high as those of adaptive parametric testing. Another advance allows for multiple stimuli to be delivered simultaneously while retaining the same binary subject responses elicited classically (Gardner, Song, Weinberger, Barbour, & Cunningham, 2015). This procedure, referred to as multiplexed active testing, substantially improves test efficiency beyond that of active testing alone by reducing the total number of stimulus presentations.
Another possible advance includes considering multiple nominally decoupled input domains for incorporation into a single task and a single estimator. This extension is possible when the method used to implement the probabilistic classifier is capable of learning nonlinear interactions between the input dimensions of interest. As long as the input domains share some interrelationship, an active method that can learn and exploit this information should produce accurate overall estimates in less time. This procedure is termed conjoint active testing.
An example of a simple multidimensional perceptual test that could benefit from conjoint testing is the threshold audiogram, which evaluates the hearing function of each ear typically via a staircase method sequentially applied at discrete frequencies (Carhart & Jerger, 1959; Hughson & Westlake, 1944). This clinical standard is a compromise between the necessary acquisition of important diagnostic information and the length of time required to conduct the test, which is on the order of 15 min for both ears. Estimating full psychometric field audiograms at the same frequency resolution using serial logistic regression would require up to 20 h and is rarely performed for obvious reasons (Allen & Wightman, 1994). An appropriately designed active learning procedure can estimate a full audiogram in less time than is required to estimate a threshold audiogram using conventional methods (Song et al., 2018; Song et al., 2015b).
Like all other perceptual tests to date, estimation of a person’s hearing proceeds sequentially, one input domain (i.e., ear) at a time, resulting in two independently measured unilateral audiograms. The information about one ear is not used to infer information about the other ear during testing, though such information could be used to speed estimation of the other ear in two ways. In the first strategy, once the estimation of one ear’s hearing is complete, that ear’s audiogram could be used as a Bayesian prior to initiate testing of the contralateral ear. If the two ears share common features above and beyond what is shared among human ears generally, the contralateral ear could in that case be accurately estimated in less time.
A more compelling strategy, however, would be to use information from each ear to infer the audiograms of both ears in real time as the test is being conducted. This mutually conjoint testing strategy is referred to as the bilateral audiogram, which should be equivalent to the two separately estimated unilateral audiograms, but it should be determined in less time because of shared information between the ears. The purpose of this study was to develop the mathematical theory of conjoint psychometric field estimation, apply it to the bilateral audiogram, and evaluate the efficiency and accuracy of this novel method for determining the hearing thresholds of a subject’s two ears.
Theory
The GP model for audiogram estimation yields probabilistic estimates for the likelihood of tone detection, which is inherently a classification task. To properly construct a framework for GP classification, however, it is convenient to first examine GP regression (Rasmussen & Williams, 2006). The posterior mean and covariance functions reflect both the prior assumptions and the information contained in the observations.
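The regression step can be sketched with the standard GP conditioning formulas. The kernel choice, noise level, and zero prior mean below are illustrative assumptions for the sketch, not the actual configuration of the estimator described in this article:

```python
import numpy as np

def rbf(a, b, length=1.0, var=1.0):
    """Squared exponential kernel k(a, b) = var * exp(-(a - b)^2 / (2 * length^2))."""
    d = a[:, None] - b[None, :]
    return var * np.exp(-0.5 * (d / length) ** 2)

def gp_posterior(x_train, y_train, x_test, noise=1e-3):
    """Posterior mean and covariance of a zero-mean GP given noisy observations."""
    K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
    Ks = rbf(x_train, x_test)          # cross-covariance: train vs. test
    Kss = rbf(x_test, x_test)          # prior covariance at test points
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = Ks.T @ alpha                # posterior mean at test points
    v = np.linalg.solve(L, Ks)
    cov = Kss - v.T @ v                # posterior covariance at test points
    return mean, cov
```

With observations at two inputs, the posterior mean passes near the observed values and the posterior variance collapses there while remaining larger away from the data, which is exactly the behavior the active learning machinery below exploits.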
One notable advantage of the GP model is that its probabilistic predictions enable a set of techniques collectively known as active learning. Active learning, sometimes called “optimal experimental design,” allows a machine-learning model to select the data it samples in order to perform better with less training (Settles, 2009). In contrast to adaptive techniques, queries via active learning are chosen in such a way as to minimize some loss function. For example, an active learning procedure may select a query designed to minimize the expected error of the model against the latent function. In general, the application of active learning proceeds as follows: First, use the existing model to classify unobserved data; next, find the best next point to query on the basis of some objective function, and query the data via an oracle (e.g., a human expert); finally, retrain the classifier, and repeat these steps until satisfied.
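The generic loop just described might look like the following sketch, in which the model-fitting routine, oracle, and acquisition function are all placeholders supplied by the application rather than parts of any particular estimator:

```python
def active_learning_loop(candidates, oracle, fit, acquire, n_queries):
    """Generic active-learning loop: fit, score candidates, query, retrain."""
    data = []                  # accumulated (stimulus, response) observations
    model = fit(data)          # initial model from the prior alone
    for _ in range(n_queries):
        # choose the candidate stimulus that maximizes the objective function
        x = max(candidates, key=lambda c: acquire(model, c))
        y = oracle(x)          # query the oracle (e.g., the human subject)
        data.append((x, y))
        model = fit(data)      # retrain the classifier on the augmented data
    return model, data
```

The choice of `acquire` is what distinguishes the strategies discussed next.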
The most common form of active learning is uncertainty sampling (Lewis & Gale, 1994; Settles, 2009). Models employing uncertainty sampling will query regions in the input domain about which the model is most uncertain. In the case of probabilistic classification, uncertainty sampling corresponds to querying the instances for which the probability of belonging to either adjacent class is closest to 0.5. This method can rapidly identify a class boundary for a target function of interest, but because uncertainty sampling attempts to query exactly where p(y = 1|x) = 0.5 (in the binary case), the model underexplores the input space. In the context of psychometric fields, the transition from one class to another (i.e., the psychometric spread) is not as readily estimated in this case (Song et al., 2018).
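A minimal sketch of uncertainty sampling for a binary classifier, assuming a `predict_prob` callable that returns the model's current P(y = 1) for a candidate stimulus:

```python
def uncertainty_sample(candidates, predict_prob):
    """Query where the predicted probability of detection is nearest 0.5."""
    return min(candidates, key=lambda x: abs(predict_prob(x) - 0.5))
```

Repeatedly querying such points traces out the class boundary quickly but, as noted above, leaves the psychometric spread underconstrained.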
This expression can be computed in linear time, making it easy to work with in practice. BALD selects the x for which the entire model is most uncertain about y (i.e., high H[y|x]) but for which the individual predictions given a setting of the hyperparameters are very confident. This can be interpreted as “seeking the x for which the [hyper]parameters under the posterior disagree about the outcome the most” (Houlsby et al., 2011).
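In sampled form, the BALD objective is the entropy of the averaged prediction minus the average entropy of the individual predictions. The sketch below assumes that predictive probabilities have already been computed for a candidate under several posterior draws of the hyperparameters:

```python
import math

def binary_entropy(p):
    """Entropy (bits) of a Bernoulli variable with success probability p."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bald_score(prob_samples):
    """Entropy of the mean prediction minus the mean entropy of predictions.

    prob_samples: predicted P(y = 1 | x) under different posterior draws
    of the hyperparameters.
    """
    n = len(prob_samples)
    mean_p = sum(prob_samples) / n
    mean_h = sum(binary_entropy(p) for p in prob_samples) / n
    return binary_entropy(mean_p) - mean_h
```

The score is high only when individual draws are each confident yet disagree with one another; it is near zero both when they agree confidently and when each draw is itself maximally uncertain.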
Materials and methods
Simulated subjects
Simulated subjects were assigned distinct ground-truth audiograms for each ear. These audiograms defined the probability of stimulus detection over a two-dimensional input domain consisting of sound frequency and intensity. The audiogram shapes were defined by one of four canonical human audiogram phenotypes: older-normal, sensory, metabolic, and metabolic + sensory (in order of severity of hearing loss). These phenotypic categories are evident in the population data of human audiograms and are informed by etiologies determined via physiological study in animal models (Dubno, Eckert, Lee, Matthews, & Schmiedt, 2013). The average audiogram shape for these categories therefore spans diagnostic categories from normal (older-normal) through the most common pathologic categories that theoretically could affect each ear separately (metabolic, sensory, metabolic + sensory). Three different pairings of ground-truth audiograms (normal–normal, normal–pathologic, and pathologic–pathologic) therefore reflect conditions with varying putative estimation benefit from considering both ears conjointly. These canonical audiogram phenotypes have been used previously to evaluate the accuracy (Song et al., 2017) and efficiency (Song et al., 2018) of disjoint machine-learning audiometry.
Threshold is a function of frequency. The subject response for any particular tone was simulated as a Bernoulli random variable with success probability given by the cumulative Gaussian.
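This simulation scheme can be written compactly. The threshold and spread values used in the test of the sketch are arbitrary illustrations, not the phenotype parameters used in the study:

```python
import math
import random

def detection_prob(intensity, threshold, spread):
    """Cumulative Gaussian psychometric function of tone intensity (dB)."""
    z = (intensity - threshold) / (spread * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))

def simulate_response(intensity, threshold, spread, rng=random):
    """Bernoulli trial: 1 if the simulated subject detects the tone."""
    return 1 if rng.random() < detection_prob(intensity, threshold, spread) else 0
```

At intensities well above threshold the simulated subject almost always responds, at threshold the response rate is 50%, and well below threshold the tone is almost never reported.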
Bilateral audiogram
Traditional pure-tone audiometry involves delivering tones in a two-dimensional continuous input domain indexed by frequency and intensity. The input domain for the bilateral audiogram is augmented to include a third discrete “ear” dimension, yielding x_{i} = (ω_{i}, I_{i}, e_{i}). In querying a simulated subject’s audiogram, the conjoint estimator determines in which ear to deliver the tone as well as the frequency and intensity of tone delivered. Binary responses of “heard” or “unheard” are recorded as described previously for each simulated tone delivery.
A linear kernel incorporates the prior knowledge that higher intensities are more likely to be detected than lower intensities. The precise shape of this detectability can be modeled with the likelihood function, described below.
where ℓ is the length scale.
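The composite covariance described above (linear in intensity, squared exponential in frequency, and discrete over ears) can be sketched as follows. The single cross-ear correlation parameter `rho` is a simplification of the three-hyperparameter discrete kernel used in the study, and all numeric values here are illustrative assumptions only:

```python
import math

def conjoint_kernel(x1, x2, length=1.0, rho=0.7):
    """Illustrative covariance between stimuli x = (frequency, intensity, ear).

    Linear in intensity, squared exponential in frequency (length scale
    `length`), and a discrete ear kernel equal to 1 within an ear and `rho`
    across ears.
    """
    f1, i1, e1 = x1
    f2, i2, e2 = x2
    k_intensity = i1 * i2                                # linear kernel
    k_freq = math.exp(-0.5 * ((f1 - f2) / length) ** 2)  # SE kernel
    k_ear = 1.0 if e1 == e2 else rho                     # discrete kernel
    return k_intensity * k_freq * k_ear
```

Because the ear factor merely scales the covariance, observations in one ear inform the estimate in the other ear to the degree that the learned cross-ear correlation permits.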
Finally, the model uses the cumulative Gaussian likelihood function for binary classification, which is both standard for GP classification and effectively captures the sigmoidal behavior of psychometric functions as described above.
The exact form of the posterior requires computing the product of the likelihood and prior distributions. In the case of GP classification, the product of a Gaussian distribution with a sigmoidal function does not produce a tractable posterior distribution. The model must instead approximate the posterior. The estimator evaluated here uses expectation propagation, which approximates each of the sigmoidal likelihoods with moment-matching Gaussian distributions to derive a Gaussian posterior distribution (Gelman et al., 2014).
Simulations
The GP estimator is fully characterized by its mean and covariance functions. The GP used here has a constant mean function with one hyperparameter, and the kernel function has four hyperparameters: one for the length scale of the squared exponential kernel, and three for the discrete kernel. Because the posterior distribution of the hyperparameters p(Θ|D) may be multimodal, standard gradient descent approaches run the risk of becoming trapped in a local extremum. To circumvent this issue, the estimator performs gradient descent on two sets of hyperparameters after each observation. The first set comes from the most recent results of the model (or the hyperparameter prior in the absence of any data). The second set is drawn from a multivariate Gaussian distribution whose mean is the hyperparameter prior derived below. Gradient descent is performed on both settings of the hyperparameters, and the setting with the higher likelihood p(D|Θ) is retained for the next iteration.
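The two-restart strategy can be sketched generically. Here `optimize` stands in for the gradient descent routine, and the likelihood, optimizer, and perturbation scale are placeholders rather than the study's actual settings:

```python
import random

def two_start_update(log_lik, optimize, theta_prev, theta_prior,
                     scale=0.5, rng=random):
    """Two-restart hyperparameter update for a possibly multimodal posterior.

    Run the optimizer from (a) the most recent hyperparameter setting and
    (b) a Gaussian perturbation of the prior mean, then keep whichever
    restart reaches the higher likelihood.
    """
    draw = [m + rng.gauss(0.0, scale) for m in theta_prior]
    a = optimize(theta_prev)   # restart from the previous optimum
    b = optimize(draw)         # restart from a draw around the prior
    return a if log_lik(a) >= log_lik(b) else b
```

If the previous optimum sits in a weak local mode, the prior-seeded restart provides a chance to escape it on the next iteration.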
The first iterations of the estimator originally performed poorly because of inefficient early sampling. Resolving this issue involved learning reasonable hyperparameter priors to serve as a starting point for sample selection. Each of the four common human phenotypes discussed earlier has at least one optimal setting of its hyperparameters to minimize estimation error. Because the kernel function is symmetric, ten unique pairs of audiogram profiles can be derived from the four phenotypes. Data were collected for each of the ten audiogram profile pairs far in excess of what would be collected in a clinical or experimental setting. First, 400 stimuli were delivered across both ears using Halton sampling (Halton, 1964; Song et al., 2018). Then, an additional 100 stimuli were queried via BALD to gain additional sampling density around the threshold (Houlsby et al., 2011; Song et al., 2018). Hyperparameters were learned using modified gradient descent. The same concept was repeated with varying numbers of Halton and BALD queries, though the hyperparameters converged to within 2% of each other after about 300 samples. The final setting of the hyperparameter priors was computed by taking an average of the hyperparameters learned for each of the ten audiogram pairs, weighted by the prevalence of those phenotypes in human populations (Dubno et al., 2013).
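Halton sampling itself is simple to implement. The following sketch generates low-discrepancy points in the unit square, which would then be rescaled to the frequency and intensity ranges of interest:

```python
def halton(index, base):
    """The index-th element of the one-dimensional Halton sequence for a base."""
    f, r = 1.0, 0.0
    while index > 0:
        f /= base
        r += f * (index % base)
        index //= base
    return r

def halton_points(n, bases=(2, 3)):
    """Quasi-random, low-discrepancy points in the unit square."""
    return [tuple(halton(i, b) for b in bases) for i in range(1, n + 1)]
```

Unlike uniform random draws, consecutive Halton points fill the space evenly from the start, which is why they are well suited to the earliest stimulus deliveries before any model has been learned.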

Estimator 1: Unconstrained mutually conjoint GP audiogram estimation (unconstrained conjoint). This method performs inference using the conjoint audiogram estimation extension of GPs described above, giving the model complete choice over which ear to query, as well as which frequency/intensity pair to deliver. This procedure occasionally results in multiple stimuli being delivered sequentially to the same ear, particularly in cases in which the model is more unsure of the audiogram in one ear than the other.

Estimator 2: Alternating mutually conjoint GP audiogram estimation (alternating conjoint). This method performs inference using the conjoint audiogram estimation extension of GPs described above but is artificially constrained to alternate samples between the left and right ears. Odd samples refer to the first ear and even samples to the second ear.

Estimator 3: Disjoint GP audiogram estimation (disjoint). This method performs inference using two separate models of the existing GP audiogram framework (Song et al., 2017; Song et al., 2018). Information in this approach is not shared between ears. Tone delivery is alternated between left and right ears so that odd samples refer to the first ear and even samples to the second ear.

Case 1: Older normal. This case was defined as having the older-normal phenotype in both ears.

Case 2: Asymmetric hearing loss. This case was defined as having the older-normal phenotype in one ear, and the metabolic + sensory phenotype in the other ear. This situation reflects more severe asymmetric hearing loss than is typical in human populations.

Case 3: Symmetric hearing loss. This case was defined as having metabolic + sensory hearing loss in one ear, and sensory hearing loss in the other. This case is more typical of hearing loss in human subjects. The sensory and metabolic + sensory phenotypes have slightly different thresholds. Distinct phenotypes were selected for this case in order to more accurately reflect presentations of hearing loss in human subjects, in which generally the left and right audiograms are not exactly identical.
One hundred tests of each estimator–case pair were run, for a total of 900 tests. For each test, 100 tones were delivered sequentially to the simulated subjects and responses from the underlying ground truth were recorded. To avoid unstable hyperparameter learning and uninformative early querying, the first 15 tones were delivered via a modified Halton sampling algorithm. The modified Halton sampling algorithm constrained initial tone deliveries to be below 60 dB HL, which is a safeguard for human testing. Subsequent tones were sampled via BALD, with constraints for the disjoint and alternating conjoint cases as discussed above. Hyperparameters were learned via a modified gradient descent algorithm on every iteration starting with Iteration 16. Hyperparameter learning was off for the first 15 iterations of each test to prevent model instability. For each tone delivery, the model posterior distribution across the entire input space was recorded. From here, the abscissa intercept of the latent function was determined to calculate the 50% threshold over semitone frequencies from 0.125 to 16 kHz. The accuracy for a single test was evaluated using the mean absolute error between the estimated threshold and the true threshold at each semitone frequency. The results were then averaged at each iteration across all 100 tests to obtain the mean threshold error per iteration for each of the three estimators in each of the three cases. In addition to comparing mean threshold error per iteration, the mean number of iterations required to achieve less than 5 dB threshold error in each ear and then in both ears was also determined. This value was chosen as a measure of “convergence” because it is the minimum step size in the Hughson–Westlake procedure and is close to the empirical test–retest reliability of that test (Mahomed, Eikelboom, & Soer, 2013).
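The accuracy metric described above is a plain mean absolute error over the semitone frequency grid; a minimal sketch:

```python
def mean_threshold_error(estimated, true):
    """Mean absolute threshold error (dB) across the tested semitone frequencies."""
    assert len(estimated) == len(true)
    return sum(abs(e - t) for e, t in zip(estimated, true)) / len(true)
```

A test is considered converged once this value falls below 5 dB, matching the minimum step size of the Hughson–Westlake procedure.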
Lin’s concordance correlation coefficient measures the agreement between two variables while taking into account the difference in their means (Lin, 1989).
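Lin's coefficient has a simple closed form; a minimal sketch using population (biased) variance estimates:

```python
def concordance_ccc(x, y):
    """Lin's concordance correlation coefficient between paired measurements."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = sum((a - mx) ** 2 for a in x) / n           # variance of x
    sy = sum((b - my) ** 2 for b in y) / n           # variance of y
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n  # covariance
    return 2 * sxy / (sx + sy + (mx - my) ** 2)
```

Unlike the Pearson correlation, a constant offset between the two variables lowers the coefficient even when the pairs are perfectly linearly related.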
Results
GP classification was used to simultaneously estimate the right-ear and left-ear audiograms of simulated subjects. The GP framework produces continuous audiogram estimates across the entire input domain, and tones were actively sampled in order to reduce the number of stimuli required to achieve acceptable error thresholds. The goals of this experiment were (1) to achieve with a single bilateral audiogram estimation procedure accuracy in estimated threshold comparable to conventional unilateral methods and (2) to achieve these results across both ears more quickly than would otherwise be possible with consecutive unilateral audiogram estimations.
To determine how much baseline similarity might be expected within and between the hearing functions of a human’s ears, a large database of audiograms was first analyzed. Over the course of a decade, the National Institute for Occupational Safety and Health (NIOSH) collected discrete Hughson–Westlake audiogram threshold curves in roughly 1.1 million individuals to provide the scientific basis for recommending occupational noise exposure limits (Masterson et al., 2013). Hearing threshold levels at 500, 1000, 2000, 3000, 4000, 6000, and 8000 Hz for both the left and right ears were determined per worker and recorded within the NIOSH database.
The high concordance for adjacent thresholds is presumably a reflection of similar physiology for nearby points on the cochlea. In other words, it is a quantification of the smoothness of the hearing threshold functions. Traditional audiometry makes very little use of such prior knowledge, but disjoint GP audiogram estimation takes advantage of this information across the population, initially by incorporating appropriate prior beliefs, and also, as tests progress, by learning each individual’s frequency concordance. The concordance learned by the GP manifests as a nonlinear interaction between frequency and intensity and results in threshold curves that are diagnostic for hearing ability.
Although the NIOSH database is large, and therefore informative about the underlying population from which its samples are drawn, that underlying population reflects only American working-age individuals at some risk of damaging their hearing on the job. Other populations, including retirees, children, and non-Americans, might be expected to demonstrate other distributions of shared variability in hearing functions between the ears. A method capable of learning this shared variability for each individual would be able to accommodate such differences. Individualization of this sort reflects the design intentions of machine-learning audiometry.
Mean numbers of iterations (with standard deviations) required to achieve the 5-dB threshold error for each ear, Case 1
Ear 1: Older-Normal  Ear 2: Older-Normal  

Unconstrained Conjoint  11.1 ± 4.3^{*}  1 ± 0^{*} 
Alternating Conjoint  16.7 ± 3.5  1.07 ± 0.70^{*} 
Disjoint  16.9 ± 2.5  19.7 ± 3.0 
Mean numbers of iterations (with standard deviations) required to achieve the 5-dB threshold error for each ear, Case 2
Ear 1: Older-Normal  Ear 2: Metabolic + Sensory  

Unconstrained Conjoint  9.72 ± 5.3^{*}  28.0 ± 5.8^{*} 
Alternating Conjoint  9.63 ± 7.7^{*}  31.4 ± 6.6^{*} 
Disjoint  16.6 ± 2.5  46.0 ± 9.3 
Mean numbers of iterations (with standard deviations) required to achieve the 5-dB threshold error for each ear, Case 3
Ear 1: Metabolic + Sensory  Ear 2: Sensory  

Unconstrained Conjoint  30.5 ± 6.1^{*}  29.2 ± 5.0^{*} 
Alternating Conjoint  31.3 ± 5.9^{*}  33.4 ± 5.3^{*} 
Disjoint  42.5 ± 8.6  48.3 ± 9.3 
Mean numbers of tones and percentages of tones, relative to disjoint, required across both ears for each of the three models to achieve better than the 5-dB mean threshold error
Older Normal  Asymmetric  Symmetric  

Unconstrained Conjoint  Tone count  11.5 ± 4.5  27.9 ± 3.9  32.1 ± 4.7 
Relative to disjoint  60.7% ± 22.6%  58.4% ± 8.5%  57.8% ± 8.5%  
Alternating Conjoint  Tone count  17.3 ± 4.9  32.9 ± 5.2  36.1 ± 4.8 
Relative to disjoint  71.5% ± 24.6%  65.6% ± 11.3%  86.9% ± 8.7%  
Disjoint  Tone count  19.9 ± 2.4  46.0 ± 6.1  55.0 ± 8.2 
Also included in Table 4 are time-to-criterion values in relative terms. Regardless of the phenotype pairings, the unconstrained conjoint approach requires on average approximately 60% of the samples required by the disjoint approach. One interesting observation made clear from this analysis is that the performance of the alternating conjoint method is closest to that of the unconstrained conjoint method in the case of asymmetric hearing loss. This outcome occurs because the conjoint model learns that there is relatively little correlation between the two ears and that to learn a good audiogram estimation, it must split its samples more evenly between the two ears. The unconstrained conjoint and alternating conjoint approaches therefore exhibit similar querying strategies in the asymmetric case, whereas in the older-normal and symmetric cases their querying strategies are quite different. This difference illustrates the key advantage of conjoint estimation, in which multidimensional queries are optimized for the bilateral phenotype of each individual as the test progresses.
Discussion
Behavioral assessments of perceptual and cognitive processes represent organ-level system-identification procedures for the brain. Constructing models for individual brains faces the fundamental limitation whereby data to fit the model must be accumulated sequentially. This limitation severely constrains the sophistication of models that can be estimated in reasonable amounts of time. Procedures to improve the efficiency of this process by optimally selecting stimuli in real time have been developed repeatedly in multiple domains, often with substantial improvements in speed of model convergence (Bengtsson et al., 1997; DiMattina, 2015; Kontsevich & Tyler, 1999; Lesmes et al., 2006; Myung & Pitt, 2009; Shen & Richards, 2013; Watson, 2017).
Physiological assessments of neural function represent tissue-level system-identification procedures for the brain, and these procedures have evolved largely independently of behavioral methods. Often the total time of data collection is more constrained for neurophysiology than for behavior, and many advanced system identification procedures for efficiently modeling neuronal function in real time have been developed (Benda, Gollisch, Machens, & Herz, 2007; DiMattina & Zhang, 2013; Lewi, Butera, & Paninski, 2009; Paninski, Pillow, & Lewi, 2007; Pillow & Park, 2016), including the use of GP methods (Park, Horwitz, & Pillow, 2011; Pillow & Park, 2016; Rad & Paninski, 2010; Song et al., 2015a). One of the key advantages GPs provide is the ability to titrate prior beliefs finely under the same estimation architecture, leading to procedures that can be both flexible and efficient (Song et al., 2017; Song et al., 2018; Song et al., 2015b). Applying such tools to behavioral assessment provides similar benefits.
The conventional audiogram represents a relatively simple multidimensional behavioral test typically evaluated with weak prior beliefs. As a result, considerable audiometric information is usually discarded that could be used to improve audiogram test accuracy and speed. The large amounts of paired and unpaired ear data in the NIOSH database indicate that information exists from the contralateral ear that could be quite useful for incorporating into measures of the ipsilateral ear. The Hughson–Westlake procedure provides a very limited mechanism to do so, since the only real flexibility in the test available to the clinician or experimenter is the starting sound level. Machine-learning audiometry, on the other hand, can exploit various degrees of prior information by design. Active GP estimation is able to determine correlations between variables in real time as data are accumulated. Although a person’s two ears are not themselves physiologically linked, they do share many things in common, including the same or similar age, genetics, blood supply, downstream neural processes, accumulated acoustic exposure, accumulated toxin exposure, and so forth. GP inference therefore represents an excellent method for exploiting these correlations for improving test accuracy and efficiency.
The present series of experiments demonstrated that dynamically incorporating subject responses to tones delivered to either ear during a test can reduce the testing time by roughly 40% with an estimator designed to take advantage of this information. All previous audiograms have been designed to evaluate either each ear sequentially or the better ear through free-field sound delivery. The bilateral audiogram delivers clinically diagnostic data for each ear separately with nearly the efficiency of single-ear screening methods. Stimuli are delivered to each ear individually, but inference is drawn for both ears simultaneously. This procedure extends the roving-frequency procedure of machine-learning audiometry, already shown to deliver thresholds equivalent to those of the fixed-frequency Hughson–Westlake procedure (Song et al., 2015b), into a roving-ear procedure. The stimuli themselves are still pure tones, and the instructions to participants are unchanged: Respond every time you hear a tone as soon as you hear it. Therefore, this new testing procedure is completely backward-compatible with current measurement methods. Because it also returns tone detection threshold estimates, the test results and interpretations are also completely backward-compatible. The true value of the technique, however, does not derive simply from its ability to deliver a faster threshold audiogram.
As has been documented previously, machine-learning audiometry delivers continuous threshold estimates as a function of frequency (Song et al., 2015b), and psychometric spread estimates in addition to threshold (Song et al., 2017; Song et al., 2018). The theoretical extension described here of conjoining two stimulus input domains has the distinct advantage of speeding a full estimation procedure considerably. To obtain even a few psychometric threshold and spread estimates from both ears using the best extant estimators requires hours of data collection (Allen & Wightman, 1994; Buss, Hall, & Grose, 2006, 2009). The bilateral machine-learning audiogram obtains these estimates in minutes by defining a middle ground between estimating two sequential 2D psychometric fields and performing a full 4D psychometric field estimation. Both of these alternatives would take longer because they ignore the shared variability among the input dimensions (i.e., interactions among the model predictors) that a conjoint GP can exploit. Sparseness in the data distributions for models of higher dimensionalities may lead to practical limits in estimating such models, although the dynamic dimensionality reduction options available to GPs, such as conjoint estimation, appear to be able to extend practical estimation to relatively high dimensionality.
Future advantages will derive from conjoining additional input domains. In audiometry, for example, if appropriate transducers are connected to a patient, conjoint air conduction and bone conduction audiometry could be performed simultaneously, thereby speeding both tests. Even more significantly, by conjoining tone stimuli with contralateral masking stimuli, every audiogram can become a masked audiogram, automatically accommodating asymmetric hearing loss in real time as data are collected. For individuals with symmetric hearing and no need of masking, the testing time would not be noticeably increased over that of an unmasked audiogram. Conjoint ipsilateral masking makes an even more compelling case, such that hearing capability could potentially be assessed dynamically in suboptimal acoustic environments. Because conjoint methods scale modularly, adding other auditory tests such as DPOAEs and speech intelligibility to standard pure-tone audiometry can be done automatically, as well. Professional human expertise will be critical for ensuring that valid data are collected, but the complexities of high-dimensional diagnostic input domains can be navigated efficiently and effectively by augmenting clinician or experimenter expertise with advanced machine-learning algorithms.
The focus of this study has been on improving psychometric field estimation speed over conventional methods. The nature of the GP estimator at the core of machine-learning audiometry, however, can accommodate other advances yet to be realized. For example, nonstationarities in the underlying function, such as attention lapses or criterion drift, can introduce estimation errors. A standard approach to account for these conditions in parametric estimators involves introducing additional parameters to model them directly (Prins, 2012; Wichmann & Hill, 2001). This method is available to the GP estimator if the likelihood function is modified to accommodate nonstationarities. Other options exist for the GP, however, such as replacing the linear kernel with a squared exponential kernel. This change retains all other features of GP estimation but results in psychometric functions resembling nonparametric estimates (Miller & Ulrich, 2001; Zychaluk & Foster, 2009). Such modularity allows users a tremendous range of options across which to design an effective estimator. In addition, the malleability of the GP for incorporating strong or weak prior beliefs further enables speed to be traded off as desired against robustness, all within the same estimation framework. Together, this modularity and malleability are what enable GP estimation—a semiparametric method as presented here—to achieve both the efficiency of a parametric estimator and the flexibility of a nonparametric estimator. These native capabilities, along with mathematical extensions such as multiplexed and conjoint estimation, yield substantial advantages to estimation procedures for a wide array of behavioral processes.
Conclusion
Bilateral audiometry, the first implementation of conjoint psychometric testing, can achieve its potential to return accurate results in significantly less time than sequential disjoint tests. The time required to acquire audiograms in both ears of a simulated participant conjointly was as little as 60% of the time required to achieve the same degree of accuracy with disjointly acquired data (i.e., a reduction in testing time of about 40%). The sound stimuli and participant instructions are identical to those of current testing procedures, as are the results returned, so test delivery and interpretation are unchanged. Conjoint machine-learning audiometry also has great potential to incorporate additional testing procedures directly into the methodology through alternate kernel design, eventually leading to unified tests that actively customize stimulus delivery and diagnostic inference for each subject. The future of audiologic testing involves working up patients dynamically via multiple subjective and objective testing modalities that mutually reinforce one another to construct a thorough assessment of a patient's hearing. Furthermore, the general principles of conjoint psychometric testing can be applied more broadly in the behavioral sciences to unify the individual components of perceptual or cognitive test batteries into a single efficient test.
Author note
Funding for this project was provided by the Center for Integration of Medicine and Innovative Technology (CIMIT) and the National Center for Advancing Translational Sciences (NCATS) Grant UL1TR002345. D.L.B. has a patent pending on the technology described in this article.
References
Allen, P., & Wightman, F. (1994). Psychometric functions for children’s detection of tones in noise. Journal of Speech, Language, and Hearing Research, 37, 205–215.
American National Standards Institute. (1978). Methods for manual pure-tone threshold audiometry (Standard No. ANSI S3.21-1978). Washington, DC: Author.
Benda, J., Gollisch, T., Machens, C. K., & Herz, A. V. (2007). From response to stimulus: Adaptive sampling in sensory physiology. Current Opinion in Neurobiology, 17, 430–436.
Bengtsson, B., Olsson, J., Heijl, A., & Rootzén, H. (1997). A new generation of algorithms for computerized threshold perimetry, SITA. Acta Ophthalmologica, 75, 368–375.
Buss, E., Hall, J. W., III, & Grose, J. H. (2006). Development and the role of internal noise in detection and discrimination thresholds with narrow-band stimuli. Journal of the Acoustical Society of America, 120, 2777–2788.
Buss, E., Hall, J. W., III, & Grose, J. H. (2009). Psychometric functions for pure-tone intensity discrimination: Slope differences in school-aged children and adults. Journal of the Acoustical Society of America, 125, 1050–1058.
Carhart, R., & Jerger, J. (1959). Preferred method for clinical determination of pure-tone thresholds. Journal of Speech and Hearing Disorders, 24, 330–345.
Cohen, D. J. (2003). Direct estimation of multidimensional perceptual distributions: Assessing hue and form. Perception & Psychophysics, 65, 1145–1160.
Coren, S. (1989). Summarizing pure-tone hearing thresholds—The equipollence of components of the audiogram. Bulletin of the Psychonomic Society, 27, 42–44.
Coren, S., & Hakstian, A. R. (1990). Methodological implications of interaural correlation: Count heads not ears. Perception & Psychophysics, 48, 291–294.
DiMattina, C. (2015). Fast adaptive estimation of multidimensional psychometric functions. Journal of Vision, 15(9), 5. https://doi.org/10.1167/15.9.5
DiMattina, C., & Zhang, K. (2013). Adaptive stimulus optimization for sensory systems neuroscience. Frontiers in Neural Circuits, 7, 101. https://doi.org/10.3389/fncir.2013.00101
Divenyi, P. L., & Haupt, K. M. (1992). In defense of the right and left audiograms: A reply to Coren (1989) and Coren and Hakstian (1990). Perception & Psychophysics, 52, 107–110.
Dubno, J. R., Eckert, M. A., Lee, F.-S., Matthews, L. J., & Schmiedt, R. A. (2013). Classifying human audiometric phenotypes of age-related hearing loss from animal models. Journal of the Association for Research in Otolaryngology, 14, 687–701.
Duvenaud, D. (2014). Automatic model construction with Gaussian processes (Doctoral dissertation). University of Cambridge, Cambridge, UK.
Fechner, G. T. (1966). Elements of psychophysics (H. E. Adler, Trans.; D. H. Howes & E. C. Boring, Eds.). New York, NY: Holt, Rinehart & Winston. (Original work published 1860)
Fisher, R. A. (1925). Intraclass correlations and the analysis of variance. In Statistical methods for research workers (pp. 177–207). Edinburgh, UK: Oliver & Boyd.
Gardner, J. R., Song, X., Weinberger, K. Q., Barbour, D., & Cunningham, J. P. (2015). Psychophysical detection testing with Bayesian active learning. In Uncertainty in Artificial Intelligence: Proceedings of the Thirty-First Conference (pp. 286–295). Corvallis, OR: AUAI Press.
Garnett, R., Osborne, M. A., & Hennig, P. (2013). Active learning of linear embeddings for Gaussian processes. arXiv:1310.6740
Gelman, A., Vehtari, A., Jylänki, P., Robert, C., Chopin, N., & Cunningham, J. P. (2014). Expectation propagation as a way of life. arXiv:1412.4869v2
Halton, J. H. (1964). Algorithm 247: Radical-inverse quasi-random point sequence. Communications of the ACM, 7, 701–702.
Heijl, A., & Krakau, C. (1975). An automatic static perimeter, design and pilot study. Acta Ophthalmologica, 53, 293–310.
Hosmer, D. W., & Lemeshow, S. (2013). The multiple logistic regression model. In Applied logistic regression (3rd ed., pp. 35–48). Hoboken, NJ: Wiley.
Houlsby, N., Huszár, F., Ghahramani, Z., & Lengyel, M. (2011). Bayesian active learning for classification and preference learning. arXiv:1112.5745
Hughson, W., & Westlake, H. (1944). Manual for program outline for rehabilitation of aural casualties both military and civilian. Transactions of the American Academy of Ophthalmology and Otolaryngology, 48, 1–15.
International Organization for Standardization. (2010). ISO 8253-1:2010: Acoustics—Audiometric test methods—Part 1: Pure-tone air and bone conduction audiometry. Geneva, Switzerland: ISO.
Kingdom, F. A. A., & Prins, N. (2016). Psychophysics: A practical introduction (2nd ed.). London, UK: Elsevier Academic Press.
Kontsevich, L. L., & Tyler, C. W. (1999). Bayesian adaptive estimation of psychometric slope and threshold. Vision Research, 39, 2729–2737.
Kujala, J. V., & Lukka, T. J. (2006). Bayesian adaptive estimation: The next dimension. Journal of Mathematical Psychology, 50, 369–389.
Lesmes, L. L., Jeon, S. T., Lu, Z. L., & Dosher, B. A. (2006). Bayesian adaptive estimation of threshold versus contrast external noise functions: The quick TvC method. Vision Research, 46, 3160–3176.
Lewi, J., Butera, R., & Paninski, L. (2009). Sequential optimal design of neurophysiology experiments. Neural Computation, 21, 619–687.
Lewis, D. D., & Gale, W. A. (1994). A sequential algorithm for training text classifiers. In W. B. Croft & C. J. van Rijsbergen (Eds.), Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 3–12). New York, NY: Springer.
Lin, L. I. (1989). A concordance correlation coefficient to evaluate reproducibility. Biometrics, 45, 255–268. https://doi.org/10.2307/2532051
Mahomed, F., Eikelboom, R. H., & Soer, M. (2013). Validity of automated threshold audiometry: A systematic review and meta-analysis. Ear and Hearing, 34, 745–752.
Masterson, E. A., Tak, S., Themann, C. L., Wall, D. K., Groenewold, M. R., Deddens, J. A., & Calvert, G. M. (2013). Prevalence of hearing loss in the United States by industry. American Journal of Industrial Medicine, 56, 670–681.
Miller, J., & Ulrich, R. (2001). On the analysis of psychometric functions: The Spearman–Karber method. Perception & Psychophysics, 63, 1399–1420.
Minka, T. P. (2001). Expectation propagation for approximate Bayesian inference. In J. Breese & D. Koller (Eds.), Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence (pp. 362–369). Corvallis, OR: AUAI Press. arXiv:1301.2294
Myung, J. I., & Pitt, M. A. (2009). Optimal experimental design for model discrimination. Psychological Review, 116, 499–518. https://doi.org/10.1037/a0016104
Paninski, L., Pillow, J., & Lewi, J. (2007). Statistical models for neural encoding, decoding, and optimal stimulus design. Progress in Brain Research, 165, 493–507.
Park, M., Horwitz, G., & Pillow, J. W. (2011). Active learning of neural response functions with Gaussian processes. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, & K. Q. Weinberger (Eds.), NIPS ’11: Proceedings of the 24th International Conference on Neural Information Processing Systems (pp. 2043–2051). New York, NY: Curran Associates.
Pillow, J. W., & Park, M. J. (2016). Adaptive Bayesian methods for closed-loop neurophysiology. In A. El Hady (Ed.), Closed loop neuroscience (pp. 3–18). Amsterdam, The Netherlands: Elsevier.
Prins, N. (2012). The psychometric function: The lapse rate revisited. Journal of Vision, 12(6), 25. https://doi.org/10.1167/12.6.25
Rad, K. R., & Paninski, L. (2010). Efficient, adaptive estimation of two-dimensional firing rate surfaces via Gaussian process methods. Network, 21, 142–168.
Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian processes for machine learning. Cambridge, MA: MIT Press.
Settles, B. (2009). Active learning literature survey (Computer Sciences Technical Report 1648). Madison, WI: University of Wisconsin–Madison. Retrieved from burrsettles.com/publications
Shen, Y., & Richards, V. M. (2013). Bayesian adaptive estimation of the auditory filter. Journal of the Acoustical Society of America, 134, 1134–1145.
Song, X. D., Garnett, R., & Barbour, D. L. (2017). Psychometric function estimation by probabilistic classification. Journal of the Acoustical Society of America, 141, 2513–2525.
Song, X. D., Sukesan, K. A., & Barbour, D. L. (2018). Bayesian active probabilistic classification for psychometric field estimation. Attention, Perception, & Psychophysics, 80, 798–812. https://doi.org/10.3758/s13414-017-1460-0
Song, X. D., Sun, W., & Barbour, D. L. (2015a). Rapid estimation of neuronal frequency response area using Gaussian process regression. Presented at the annual conference of the Society for Neuroscience, Chicago, IL.
Song, X. D., Wallace, B. M., Gardner, J. R., Ledbetter, N. M., Weinberger, K. Q., & Barbour, D. L. (2015b). Fast, continuous audiogram estimation using machine learning. Ear and Hearing, 36, e326–e335.
Watson, A. B. (2017). QUEST+: A general multidimensional Bayesian adaptive psychometric method. Journal of Vision, 17(3), 10:1–27. https://doi.org/10.1167/17.3.10
Wichmann, F. A., & Hill, N. J. (2001). The psychometric function: I. Fitting, sampling, and goodness of fit. Perception & Psychophysics, 63, 1293–1313. https://doi.org/10.3758/BF03194544
Williams, C. K., & Barber, D. (1998). Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 1342–1351.
Zychaluk, K., & Foster, D. H. (2009). Model-free estimation of the psychometric function. Attention, Perception, & Psychophysics, 71, 1414–1425. https://doi.org/10.3758/APP.71.6.1414