Correcting “confusability regions” in face morphs
The visual system represents summary statistical information from a set of similar items, a phenomenon known as ensemble perception. In exploring various ensemble domains (e.g., orientation, color, facial expression), researchers have often employed the method of continuous report, in which observers select their responses from a gradually changing morph sequence. However, given their current implementation, some face morphs unintentionally introduce noise into the ensemble measurement. Specifically, some facial expressions on the morph wheel appear perceptually similar even though they are far apart in stimulus space. For instance, in a morph wheel of happy–sad–angry–happy expressions, an expression between happy and sad may not be discriminable from an expression between sad and angry. Without accounting for this confusability, observer ability will be underestimated. In the present experiments we accounted for this by delineating the perceptual confusability of morphs of multiple expressions. In a two-alternative forced choice task, eight observers were asked to discriminate between anchor images (36 in total) and all 360 facial expressions on the morph wheel. The results were visualized on a “confusability matrix,” depicting the morphs most likely to be confused for one another. The matrix revealed multiple confusable images between distant expressions on the morph wheel. By accounting for these “confusability regions,” we demonstrated a significant improvement in performance estimation on a set of independent ensemble data, suggesting that high-level ensemble abilities may be better than has been previously thought. We also provide an alternative computational approach that may be used to determine potentially confusable stimuli in a given morph space.
Keywords: Ensemble perception, Faces, Morphs, Discriminability
The tendency to consolidate crowds of similar objects into summary representations, a phenomenon known as ensemble perception, is an area of active research. Work in this area has broad intuitive appeal, since it may be the means by which the visual system overcomes traditional limits of visual consciousness (Alvarez & Oliva, 2008; Demeyere, Rzeskiewicz, Humphreys, & Humphreys, 2008; Fischer & Whitney, 2014; Haberman & Whitney, 2011), such as inattentional blindness (Simons & Levin, 1998) and crowding (Whitney & Levi, 2011). Recent work has even suggested that ensembles may serve to bind information across visual scenes (Fischer & Whitney, 2014; Manassi, Liberman, Chaney, & Whitney, 2017), providing a sense of visual stability in an inherently dynamic environment (Whitney, Haberman, & Sweeny, 2014).
The methods employed to explore the mechanisms of ensemble perception have varied from psychophysical (e.g., Ariely, 2001; Haberman & Whitney, 2007) to neuropsychological (e.g., Leib et al., 2012) to neuroimaging (e.g., Cant & Xu, 2012). One approach of particular relevance here is the method of continuous report, a psychophysical technique in which an observer adjusts a test stimulus to match the perceived average of the preceding set. This approach is useful because it can characterize the full distribution of ensemble representation abilities (Haberman, Lee, & Whitney, 2015b; Haberman & Whitney, 2010) and is supported by an array of robust analytical procedures (e.g., circular statistics, mixture modeling; Berens, 2009; Suchow, Brady, Fougnie, & Alvarez, 2013). In continuous report, observers select a response from a continuous distribution on each trial; the difference between what the observer selects and the correct response is used as an index of precision. Continuous report in ensemble perception has effectively been used to address a number of theoretical questions, including how the visual system integrates deviant items across a scene (Haberman & Whitney, 2010) and how the cognitive architecture of ensemble perception is organized (Brady & Alvarez, 2011; Haberman, Brady, & Alvarez, 2015a).
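The error index used in continuous report can be made concrete. The sketch below (ours, not code from the cited studies) computes the signed difference between a response and the correct answer along the shorter arc of a 360-unit wheel:

```python
def circular_error(response, target, wheel_size=360):
    """Signed error on a circular stimulus wheel, in wheel units.

    Takes the shorter arc, so the result lies in (-wheel_size/2, wheel_size/2].
    """
    diff = (response - target) % wheel_size
    if diff > wheel_size / 2:
        diff -= wheel_size  # wrap around: 340 units clockwise = 20 counterclockwise
    return diff
```

The absolute value of this quantity, averaged over trials, is the precision index; the wrap-around step is what makes the statistics circular rather than linear.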
Although continuous report has yielded fruitful results in understanding ensemble representations, an inherent concern exists with its implementation in perceiving average faces. Although this concern does not undermine the conclusions to date regarding how humans perceive crowds of faces, it may ultimately lead to an underestimation of face ensemble abilities, therefore making it difficult to detect subtle differences between conditions. Often in continuous report, the stimuli span a circular distribution (this is not a requirement, but it is often the case; Haberman & Whitney, 2010). This stimulus design works exceptionally well for domains such as orientation, in which the stimulus space naturally lies along a circular continuum. In face space, however, the circular distribution must be artificially constructed. Typically, this entails morphing between multiple exemplars from a single person (e.g., a single individual displaying a happy, sad, and angry expression, if the domain of interest is facial expression). Mathematically, the relationship between any two morphs on the continuum is well characterized, since the morphs are simple linear interpolations. Perceptually, however, the morph space may be heterogeneous, such that some elements may be more difficult to discriminate than others (note that this concern is true even within orientation space, in which vertical and horizontal orientations are more easily discriminable than orientations around oblique meridians; Andrews, 1967). The more critical concern, which the present article seeks to mitigate, is that some faces along one section of the morph wheel may be perceptually confusable with faces from an entirely different section of the wheel.
The purpose of this experiment was to better estimate ensemble expression ability by accounting for faces that might be confused with one another within a commonly used stimulus set. The first step was to identify any and all such face regions along the morph continuum by having observers evaluate whether any two images displayed were the same or different.
Eight observers participated in this experiment (average age = 21.3 years), seven of whom were naïve to its purposes. All participants gave informed consent and had normal or corrected-to-normal vision. This research and all research described herein was approved by and conducted in accordance with the Institutional Review Board at Rhodes College.
Stimuli and design
The stimulus set originated from a single individual taken from the Karolinska Directed Emotional Faces database (KDEF; Lundqvist et al., 1998) displaying three emotional expressions: angry, happy, and sad. The images were first gray-scaled and then morphed from one expression to the next using linear interpolation (MorphAge, version 4.1.3, Creaceed). This morphing procedure generated a circular distribution of 360 images going from angry to happy to sad and back to angry again (Fig. 2).
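The wheel's construction can be illustrated schematically. The sketch below is a simple pixel-wise cross-fade between three grayscale expression images (supplied as float arrays), which captures the linear-interpolation logic only; the actual MorphAge procedure also warps facial geometry between expressions, so this is an approximation, not the published pipeline:

```python
import numpy as np

def make_morph_wheel(angry, happy, sad, n_total=360):
    """Build a circular angry-happy-sad-angry continuum of n_total images
    by linear interpolation (cross-fade) between grayscale image arrays.

    Sketch only: true expression morphing also warps facial geometry.
    """
    segments = [(angry, happy), (happy, sad), (sad, angry)]
    per_seg = n_total // 3  # 120 steps per expression pair
    wheel = []
    for start, end in segments:
        for i in range(per_seg):
            t = i / per_seg  # 0 at `start`, approaching 1 at `end`
            wheel.append((1 - t) * start + t * end)
    return wheel
```

Because the last segment interpolates back toward the first expression, index 0 and index 360 coincide, giving the closed circular distribution described above.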
To test for confusability, 36 “anchor images” (every 10th face on the wheel starting from “Image 1,” including all three pure expressions) were compared to every other face on the continuum. All observers judged the same set of anchor images in blocks (i.e., one anchor image per block). The order of the blocks was randomized for each observer.
On each trial, the anchor image was presented adjacent to an identical face or a different face. The order of each trial type was randomized for each participant, without trial order optimization (i.e., the same trial type could occur multiple times in succession). In the “different” condition, the anchor image was presented with a randomly selected face (without replacement) from the morph wheel, and in the “same” condition, the anchor image was compared to itself. Each image subtended 6.5° × 8.2° of visual angle and was displayed 4.2° to either side of fixation along the horizontal meridian. Participants judged whether the images in the presented pair were the “same” or “different.”
For each trial, observers viewed a pair of faces from the morph wheel and had to judge whether the faces were the same or different. One face served as the anchor image for the entirety of a given block (i.e., the standard by which all other faces would be compared). The other face was either identical to the anchor image or one of the other 359 faces from the continuum.
Observers each participated in 36 blocks over the course of several months. Each block consisted of 720 trials (360 “same” and 360 “different”). With 36 blocks (i.e., 36 anchor images), this amounted to 25,920 trials per observer.
An example of a confusable face pair may be seen in Fig. 1. In stimulus space, these images are far from one another, well beyond what should be the just noticeable difference (JND)¹ between any two images. In an ensemble task in which observers are asked to select the average expression of a set, the selection of Image 155 when the correct response was Image 80 would grossly overestimate the perceptual error. That is, even though in stimulus space these two images (the “correct” image and the observer’s selection) are separated by 75 units, their perceptual similarity is much closer. The confusability matrix allows for the correction of items that are distant in stimulus space but close in perceptual space, which can provide a more accurate assessment of ensemble abilities.
To test whether accounting for confusability regions significantly improves ensemble performance, we compared accuracy before and after implementing a correction to an unpublished ensemble dataset that utilized the same morph continuum. Variants of this task have been published and extensively described elsewhere (e.g., Haberman & Whitney, 2010).
Nine naïve participants from the Harvard University community participated for course credit or cash compensation. This research was approved by and conducted in accordance with the Institutional Review Board at Harvard University.
Stimuli and design
The same face morphs described in Experiment 1 were used to generate ensembles in this experiment. The mean was randomly selected on every trial. Sets were composed of four faces ± 10 and ± 30 emotional units from the mean. Each face within a set subtended 1.9° × 2.5° of visual angle and was presented radially 4.7° from fixation in a square formation. Following each set, observers adjusted a single test face, randomly selected from the morph wheel, presented in the center of the screen at the same size as the faces in the set.
On each trial, observers viewed a set of four faces for 750 ms and then adjusted a test face to match the average of the preceding set using continuous report. Observers altered the appearance of the test face by moving the mouse along the x-axis. This movement was yoked to the morph wheel. Observers scrolled through the morph wheel until they found what they perceived to be the average expression and locked in their selection by pushing the space bar.
The confusability matrix (Fig. 4) was used as the basis for the model implementation (i.e., correcting for the confusable faces). The 36 anchor images were used to define a 10-unit range on which a particular correction was applied—if the anchor image was Face 10, the correction derived from the confusability matrix was applied to Faces 6 through 15. This range was selected because ± 5 faces falls well within one JND, such that if face X was confusable with face Y, it should follow that face X – 5 would also be confusable with face Y. Without this assumption, we would have had to collect an untenable 259,200 trials per participant (i.e., 360 anchor images rather than 36). The basic approach for implementation was to correct for all regions in the morph space that were perceptually confusable. This approach rests upon the assumption that the observer intended to select a morph that was closer to the actual mean (i.e., one that was perceptually similar), and not the one that was more distant in stimulus space. Note that there were multiple decision points in the implementation of this model—we were not committed to this particular instantiation, but rather sought to demonstrate proof of concept. In this instance, we adopted a conservative approach.
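The window assignment can be sketched as a small helper. Our reading of the convention (each anchor a covers faces a − 4 through a + 5, matching the "Faces 6 through 15 for anchor 10" example, with circular wrap-around) is an assumption for illustration:

```python
def anchor_for(face, spacing=10, wheel=360):
    """Map a face index to the anchor whose correction window covers it.

    Assumed convention: each anchor a covers faces a-4 through a+5
    (so anchor 10 covers Faces 6-15), wrapping around the wheel.
    """
    return ((face + spacing // 2 - 1) // spacing) * spacing % wheel
```

Under this mapping, every face on the wheel inherits the confusability profile measured for its nearest anchor, which is what reduces the required data by a factor of ten.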
First, we identified trials with an error greater than 30° in the independent ensemble task, since these were the candidate trials likely to reveal a large performance disconnect between the stimulus and perceptual spaces (i.e., large error according to the morph wheel, and small error according to our visual system). The confusability matrix (Fig. 4) was used to identify whether the observer response was perceptually confusable with the correct response (i.e., whether the morph selected by the observer had been incorrectly identified as the same as the correct response by at least 50% of the participants in Exp. 1). If so, that response was replaced with the correct answer plus Gaussian noise with a standard deviation of 20 emotional units. After all large, confusable errors had been identified and replaced, the error for the independent ensemble task was recalculated.
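The correction rule just described can be summarized in a short sketch. This is our rendering of the stated procedure, not the authors' code; the `confusable` lookup stands in for the Experiment 1 results and is hypothetical in form:

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility

def circ_error(resp, target, wheel=360):
    """Signed error along the shorter arc of the morph wheel."""
    d = (resp - target) % wheel
    return d - wheel if d > wheel / 2 else d

def correct_trial(response, target, confusable, threshold=30, sd=20, wheel=360):
    """Apply the described correction to a single ensemble trial.

    `confusable(a, b)` is a hypothetical lookup returning True when
    morphs a and b were called "same" by at least 50% of observers.
    """
    err = circ_error(response, target)
    if abs(err) > threshold and confusable(response, target):
        # Replace the response with the correct answer plus Gaussian
        # noise (SD = 20 emotional units), per the described procedure
        return round(target + rng.normal(0, sd)) % wheel
    return response
```

Recomputing the mean absolute error over all trials after this substitution yields the corrected performance estimate reported below.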
These experiments were designed to provide a more precise estimation of high-level ensemble perception ability. Although continuous report has been effectively used to make strong conclusions about the nature of ensemble perception (e.g., Haberman et al., 2015a), the stimuli traditionally used to assess ensemble face representation introduce noise into the estimation. For example, within a particular morph continuum, we identified multiple “confusability regions,” whereby faces far apart in morph space were indistinguishable from one another in perceptual space. We corrected for confusability regions in an independent ensemble task by replacing the observer responses with the responses they were commonly confused with (and that were closer to the correct answer). In other words, our model assumed that while observers selected one face morph, they actually meant to select a different, more accurate face morph. This is akin to distorting the morph continuum, in essence reshaping it on the basis of the perceptual relations among the faces, not the mathematical ones. After model implementation, participants showed an average improvement of 3.7° (SD = 1.4°) in ensemble estimation, pointing to the importance of accounting for perceptual error.
Perceptual similarity versus physical similarity
It is clear that accounting for stimulus confusability has an impact on the precision of ensemble ability estimation. However, the process by which confusability regions may be identified is laborious, and perhaps even impractical. Might there be a method by which researchers can efficiently estimate potential confusability regions, without having to collect tens of thousands of trials worth of data? A substantial body of work has examined the relationship between perceptual and physical similarity (e.g., Folstein, Gauthier, & Palmeri, 2012; Yue, Biederman, Mangini, von der Malsburg, & Amir, 2012). If we could determine the physical similarity of the morphs in our continuum, and this physical similarity correlated highly with the results of our discrimination task (Exp. 1), it could provide an alternative method by which morph confusability could be measured.
Even with this analysis, one is still faced with the difficult task of choosing the level at which two images may be considered “confusable.” In our behavioral task, we labeled any two images that were incorrectly identified as “the same” 50% of the time as confusable—a rather conservative criterion. With the physical similarity analysis, however, the units are arbitrary, and thus the same principled approach is not available. For our purposes, we looked for a similarity value that resulted in a rate of confusability comparable to the one in the behavioral task. Specifically, the model implementation for our behavioral data identified 21% of the image pairs as confusable. To match this rate of confusability in the physical similarity analysis, values below 60% of the maximum dissimilarity value were labeled as confusable. That is, if the maximum dissimilarity between two morphs was 340 units, every pair with a dissimilarity score less than 204 units would be considered confusable.
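This thresholding step is simple to express. The sketch below (illustrative, with a hypothetical function name) flags pairs whose dissimilarity falls below a fraction of the matrix maximum:

```python
import numpy as np

def confusable_pairs(dissim, frac=0.6):
    """Flag pair (i, j) as confusable when its physical dissimilarity
    falls below `frac` of the maximum dissimilarity in the matrix.

    frac = 0.6 is the value that matched the ~21% behavioral
    confusability rate in the analysis described above.
    """
    return dissim < frac * dissim.max()
```

With a maximum dissimilarity of 340 units, the cutoff works out to 204 units, reproducing the worked example in the text.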
This approach proved effective. Correcting the same data described in Experiment 2 using the physical similarity analysis yielded an improvement in ensemble performance estimation comparable to the correction based on the behavioral data (the differences between the corrected and uncorrected data from the behavioral analysis and the physical similarity analysis were 3.7° and 3.8°, respectively). Overall, these results suggest that this particular image similarity analysis (Gabor filters simulating tiled simple cells) was a reasonable proxy for our behavioral analysis. Of course, the criterion for confusability was guided by the behavioral results, without which deciding what image pairs to label as confusable would have been challenging. One possible solution to this would be to derive rough discrimination thresholds for a given morph sequence and to use these to guide the selection of what level of dissimilarity to label as confusable in the physical similarity analysis.
Choosing a model
The choice of how to implement the correction is not as important as implementing some correction. Detection of potentially small effects, such as ones that might emerge using attentional manipulations (e.g., Attarha, Moore, & Vecera, 2014; Emmanouil & Treisman, 2008), becomes difficult if stimulus confusability is not accounted for. As we noted above, however, the choice of how to implement a correction is flexible. There are multiple possible decision points in accounting for confusability, ranging from what level of performance to call “confusable” (we chose 50%), to the size of the error to correct for (we chose greater than 30°), to the distribution of noise to apply to the correction (we chose a Gaussian with a standard deviation of approximately 20°). These particular decision points are fairly conservative, and other choices would be justifiable. For example, rather than correcting for regions of confusability, one could simply remove trials in which responses fell within a confusable region.
Although there are many decision points in the implementation of this model, one should be careful to make such decisions on principled grounds, choosing model parameters before examining the end result. To do so, it is important to understand ahead of time how such decisions might impact a given dataset. For example, in our implementation we chose to label as “confusable” any morph pair that 50% of our observers incorrectly identified as the same. This resulted in 21% of the total morph pairs being potentially confusable. When we relaxed the “confusable” criterion to 30% (i.e., observers incorrectly identified morphs as identical 30% of the time), the total percentage of confusable morph pairs increased to 25%, which increased the likelihood that a given trial would be corrected.
Each of the parameters available to tweak will affect the likelihood that a given trial will undergo correction, and the direction of the impact is fairly straightforward (i.e., less conservative criteria will result in more model corrections). However, the model will never reverse the direction of an effect (since it is applied to all conditions), nor will it turn typically unusable data into something usable. We verified this last point by simulating a randomly responding observer and subjecting those data to our model. Although ensemble performance improved slightly after correction, it still did not pass our criterion for inclusion, which tested whether an observer’s response distribution differed significantly from uniform (see Fig. 5 for reference: that participant’s responses are visibly centered on the mean, whereas a random guesser’s responses would be uniformly distributed).
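The inclusion criterion requires a test of circular uniformity. The text does not name the specific test used, but the Rayleigh test (implemented, e.g., in the CircStat toolbox cited above; Berens, 2009) is one standard choice; the sketch below uses a common closed-form approximation to its p value:

```python
import math

def rayleigh_test(errors_deg):
    """Rayleigh test for non-uniformity of circular data.

    Returns (r_bar, p): r_bar is the mean resultant length; a small p
    rejects uniformity, i.e., the responses cluster around some value.
    """
    n = len(errors_deg)
    rads = [math.radians(e) for e in errors_deg]
    c = sum(math.cos(r) for r in rads) / n
    s = sum(math.sin(r) for r in rads) / n
    r_bar = math.hypot(c, s)
    z = n * r_bar ** 2
    # Widely used approximation to the Rayleigh p value (Zar, 1999)
    p = math.exp(math.sqrt(1 + 4 * n + 4 * (n ** 2 - z ** 2)) - (1 + 2 * n))
    return r_bar, p
```

A real observer's errors cluster near zero and yield a tiny p; a random responder's errors spread around the wheel and yield a p near 1, failing the inclusion criterion regardless of whether the correction is applied.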
We note that our results, although limited to the specific face morph used here, highlight the importance of accounting for perceptual confusability when estimating ensemble representation abilities—even for paradigms that do not utilize continuous report (e.g., the method of constant stimuli). We acknowledge that the amount of data required in order to implement such corrections is daunting (each of our observers performed psychophysics for approximately 12 h), and as such have made our data publicly available at https://jasonmarchaberman.wordpress.com/zeeabrahamsen-and-haberman-2017-vss-abstract/. Additionally, our physical similarity analysis (Yue et al., 2012) revealed a strong correlation with psychophysical performance, providing an alternative and efficient means of assessing morph confusability.
Despite the potential challenges associated with accounting for perceptual errors, we contend that future research should endeavor to do so, particularly when attempting to make comparisons across stimulus domains (e.g., average orientation vs. average expression) or exploring questions that typically yield small effect sizes. Critically, these considerations do not apply exclusively to research on ensemble perception—any domain using nonlinearized morph sequences would benefit from characterizing its perceptual space whenever measurement noise is a serious consideration.
¹ JNDs were determined in a separate pilot experiment. Six observers were asked to determine which of two images differed from a template image. The images were simultaneously displayed in a triangle formation for 2 s. The template was randomly selected on every trial, and the comparison images varied from among the faces 10–60 emotional units from the template. Observers performed 240 trials. The average 75% JND across observers was 27°.
- Attarha, M., Moore, C. M., & Vecera, S. P. (2014). Summary statistics of size: Fixed processing capacity for multiple ensembles but unlimited processing capacity for single ensembles. Journal of Experimental Psychology: Human Perception and Performance, 40, 1440–1449. https://doi.org/10.1037/a0036206
- Lundqvist, D., Flykt, A., & Öhman, A. (1998). The Karolinska Directed Emotional Faces (KDEF) (CD ROM). Stockholm: Karolinska Institutet, Department of Clinical Neuroscience, Psychology section.
- Whitney, D., Haberman, J., & Sweeny, T. D. (2014). From textures to crowds: Multiple levels of summary statistical perception. In J. S. Werner & L. M. Chalupa (Eds.), The new visual neurosciences (pp. 685–709). Cambridge, MA: MIT Press.