In early neuroscience textbooks, vision was described as a hierarchical process that starts at the lowest processing levels in the brain and propagates to higher areas that compute progressively more detail and complexity. But we have also known for a long time that there are massive descending feedback connections in the brain, modulating processing at lower levels (Felleman & Van Essen, 1991). More generally, a range of so-called contextual effects indicates that neurons’ responses are influenced by information outside their receptive fields, in ways that are difficult to explain by purely bottom-up processing. Srinivasan, Laughlin, and Dubs (1982) were the first to propose a computational model based on a predictive-coding mechanism, in order to explain the center–surround antagonism exhibited by retinal interneurons. Predictive coding originally referred to an algorithm for removing redundancy in image compression. It exploits the fact that neighboring pixel intensities tend to be correlated; consequently, an image can be represented by the deviations of the actual intensities from those predicted from their neighbors, and thus stored more efficiently. There are considerable statistical regularities in the visual images and scenes that biological organisms encounter, too (Torralba & Oliva, 2003). Rao and Ballard (1999) proposed a computational model that explains extra-classical receptive-field effects in the visual cortex on the basis of similar mechanisms. They proposed that cortico-cortical feedback connections provide predictions about sensory input, and that only the deviations, or prediction errors, are fed forward in the visual hierarchy for further processing.
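
To make the compression intuition concrete, the following sketch (a minimal illustration of redundancy removal, not a model of retinal or cortical circuitry; the left-neighbor predictor and all variable names are our own choices) encodes each pixel of an image row as its deviation from a prediction based on the preceding pixel, so that a highly correlated signal yields mostly near-zero residuals.

```python
import numpy as np

def predictive_encode(row: np.ndarray) -> np.ndarray:
    """Encode a row of pixel intensities as prediction errors.

    Each pixel is predicted from its left neighbor; only the first
    pixel and the deviations from prediction (residuals) are kept.
    """
    residuals = np.empty_like(row, dtype=float)
    residuals[0] = row[0]                # no neighbor: store as is
    residuals[1:] = row[1:] - row[:-1]   # prediction error per pixel
    return residuals

def predictive_decode(residuals: np.ndarray) -> np.ndarray:
    """Recover the original intensities by accumulating the residuals."""
    return np.cumsum(residuals)

row = np.array([120, 121, 122, 122, 200, 201], dtype=float)
errors = predictive_encode(row)
print(errors)                            # residuals are mostly near zero
assert np.allclose(predictive_decode(errors), row)
```

Because most residuals cluster near zero, they can be stored with fewer bits than the raw intensities; this is the redundancy-removal principle that predictive-coding models transfer to neural processing.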

The scope of predictive coding quickly grew beyond the cellular level and promoted large-scale proposals that suggested that the minimization of surprise might be the brain’s major computational goal (Friston, 2005, 2010). Nowadays, the idea of a predictive brain is influential in many domains in cognitive science—for example, perception (Bar, 2007; Kok, Jehee, & de Lange, 2012; Summerfield & de Lange, 2014), action control (Makris & Urgesi, 2015), interoception (Seth & Critchley, 2013; Seth, Suzuki, & Critchley, 2011), language (DeLong, Urbach, & Kutas, 2005; Kuperberg & Jaeger, 2015; Weber, Lau, Stillerman, & Kuperberg, 2016), and affective and social neuroscience (Barrett & Bar, 2009; Koster-Hale & Saxe, 2013). Lastly, clinical disorders, such as anxiety, autism, and schizophrenia, have also been connected to deficits in anticipatory processes (Fletcher & Frith, 2009; Grupe & Nitschke, 2013; Pellicano & Burr, 2012; Van de Cruys et al., 2014).

Just like objects or any other sensory information, faces are perceived within fractions of a second (e.g., Liu, Harris, & Kanwisher, 2002), despite the key challenges of visual recognition, namely variations in illumination and viewpoint and partial occlusion by other objects. Studies have suggested that robust representations are involved in the recognition of familiar faces, whereas the recognition of once-seen unfamiliar faces relies more heavily on single-image information (Burton, 2013; Burton, Jenkins, & Schweinberger, 2011; Eger, Schweinberger, Dolan, & Henson, 2005; Hancock, Bruce, & Burton, 2000). This efficiency and robustness raise the question of whether and how predictive processes shape face recognition to mitigate the computational burden.

Several studies have used faces as stimuli to address questions regarding the predictive-coding framework in the past. A major reason for this is that faces are known to be processed in certain well-defined and widely studied dedicated areas of the brain. Although functional imaging studies first established the fusiform face area (FFA; Kanwisher, McDermott, & Chun, 1997) as a face-sensitive brain region, the “core” system for face processing includes at least two further regions, in the inferior occipital gyri (IOG) and the superior temporal sulcus (STS) areas (Gobbini & Haxby, 2007; Haxby, Hoffman, & Gobbini, 2000). These face-sensitive regions have been related to early visual analysis (IOG), the processing of changeable aspects of a face (STS), and the processing of more time-invariant aspects of face information (FFA). In parallel, electroencephalography (EEG) and magnetoencephalography (MEG) studies have identified a number of face-sensitive brain responses, including the N/M170, P/M200, N/M250r, and N/M400, which are thought to reflect different stages of face perception (e.g., Schweinberger & Burton, 2003).

Overall, this well-described network makes it possible to test the predictions of complex predictive-coding models directly: using faces allows the effects of prediction manipulations to be probed in precisely these regions or in well-characterized EEG/MEG components. Probably the most important evidence connecting the face-processing network and predictive coding comes from neuroimaging studies of repetition suppression (RS). RS refers to a reduced neural response to repeated as compared to alternating stimuli and has been explained by neuronal fatigue or adaptation effects (Grill-Spector, Henson, & Martin, 2006). Summerfield et al. (2008) presented pairs of faces, which could show either two different individuals (alternation trial, AltT) or the same image twice (repetition trial, RepT). RepT and AltT were not presented randomly; rather, they were grouped into short blocks with either a high (repetition block, RepB) or a low (alternation block, AltB) probability of stimulus repetition. The authors found that RS in the FFA was enhanced in blocks in which repetitions were frequent. They interpreted this probabilistic modulation as the result of a top-down influence on RS; in other words, they suggested that participants predicted the frequently occurring repetitions and that this expectation enhanced RS in those blocks. Such modulations of RS by repetition expectation have since been found in almost every member of the core network, namely in the occipital face area (OFA) and the lateral occipital complex (LO) (for a summary, see Grotheer & Kovács, 2016). Interestingly, this effect seems to be specific to faces and familiar characters, and it does not occur for other stimuli (Grotheer & Kovács, 2014; but see Utzerath, St. John-Saaltink, Buitelaar, & de Lange, 2017, for a different conclusion).
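
The logic of this design can be sketched as follows (an illustrative reconstruction only; the trial counts and repetition probabilities below are placeholders rather than the published parameters of Summerfield et al., 2008): blocks differ solely in P(repetition), the variable assumed to drive participants’ expectations and, in turn, the top-down modulation of RS.

```python
import random

def make_block(p_repetition: float, n_trials: int = 20) -> list[str]:
    """Generate one block of face-pair trials.

    Each trial is either a repetition (the same face shown twice, 'RepT')
    or an alternation (two different faces, 'AltT'), drawn with the
    block-specific repetition probability.
    """
    return ["RepT" if random.random() < p_repetition else "AltT"
            for _ in range(n_trials)]

# Placeholder probabilities: repetitions frequent in RepB, rare in AltB.
rep_block = make_block(p_repetition=0.75)
alt_block = make_block(p_repetition=0.25)

print("RepB:", rep_block.count("RepT"), "repetitions out of", len(rep_block))
print("AltB:", alt_block.count("RepT"), "repetitions out of", len(alt_block))
```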

This functional magnetic resonance imaging (fMRI) RS modulation does not seem to be attenuated for inverted faces, suggesting that the prediction effect operates at an early level, that is, before the configural or holistic processing of facial features (Grotheer, Hermann, Vidnyánszky, & Kovács, 2014). In an EEG study, however, Schweinberger, Huddy, and Burton (2004) found that the N250r repetition effect was strongly attenuated for inverted faces, possibly supporting the idea that prediction effects (1) should be considered at multiple levels of processing and (2) should be investigated not just with fMRI but also with EEG. De Gardelle, Waszak, Egner, and Summerfield (2013) used a repetition suppression paradigm and examined the neural correlates of repetition enhancement (RE) and suppression (RS) in face-sensitive regions with multivoxel pattern analysis; they found that both coexist in the same region but differ in latency and in connectivity with lower- and higher-level surrounding areas. Egner, Monti, and Summerfield (2010) reported that the FFA was activated when faces were highly expected on the basis of a preceding cue, regardless of whether the actual stimulus was a face or a house. They fitted the fMRI data with either a (traditional) feature-detection model or a predictive-coding model and found that the latter provided the better fit. Bell, Summerfield, Morin, Malecek, and Ungerleider (2016) exposed a macaque monkey to either fruit or face stimuli and recorded activity from single neurons in the inferior temporal cortex. They found reduced signals for expected face stimuli, greater accuracy in decoding the stimulus from multivariate population signals, and the encoding of probabilistic information about face occurrence in the inferior temporal cortex. Brodski, Paasch, Helbling, and Wibral (2015) manipulated expectations of faces and provided further evidence that the bottom-up propagation of prediction-error signals is reflected in gamma-band activity. Apps and Tsakiris (2013) directly addressed whether face recognition can be reconciled with predictive coding. They aimed to understand why familiarity with a face increases regardless of changes in viewpoint. They provided evidence that familiarity is constantly updated and takes into account contextual information that renders familiarity more or less likely (with context being defined here as a specific situation in which familiar faces tend to occur). In their study, participants’ behavioral responses were in line with a computational model built on predictive-coding principles.
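
The model comparison reported by Egner et al. (2010) can be made concrete with a simplified sketch (our own rendering, with arbitrary equal weights rather than the authors’ fitted parameters or exact regressors): a feature-detection account predicts FFA responses that depend only on whether a face is physically present, whereas a predictive-coding account sums a face-expectation term and a face-surprise (prediction-error) term.

```python
# Conditions: stimulus type crossed with the cued probability of a face.
conditions = [("face", 0.25), ("face", 0.75), ("house", 0.25), ("house", 0.75)]

def feature_detection(stimulus: str, p_face: float) -> float:
    """FFA responds only to the physical presence of a face."""
    return 1.0 if stimulus == "face" else 0.0

def predictive_coding(stimulus: str, p_face: float,
                      w_expect: float = 0.5, w_error: float = 0.5) -> float:
    """FFA response = face expectation + face prediction error (surprise)."""
    face_present = 1.0 if stimulus == "face" else 0.0
    surprise = face_present * (1.0 - p_face)   # unexpected face evidence
    return w_expect * p_face + w_error * surprise

for stim, p in conditions:
    print(f"{stim:5s}  p(face)={p:.2f}  "
          f"feature detection={feature_detection(stim, p):.2f}  "
          f"predictive coding={predictive_coding(stim, p):.2f}")
```

With these illustrative weights, the predictive-coding account yields comparable responses to expected and unexpected faces but responses to houses that grow with face expectation, a qualitative pattern that a pure feature detector cannot produce.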

Despite these considerable efforts to put the predictive-coding model under empirical scrutiny, the studies mentioned above (with the exception of Apps & Tsakiris, 2013) made little reference to established cognitive models of face perception. Instead, they used face stimuli as a means to address questions in the context of predictive processing; in other words, theoretical questions regarding cognitive face models have remained untouched. Conversely, with respect to face recognition, early studies investigating the cognitive and neuronal mechanisms of priming (both repetition and semantic priming) provided crucial information on the ways in which faces could be represented and processed (e.g., Bruce & Valentine, 1985, 1986; Schweinberger, Pfütze, & Sommer, 1995). However, in studies of priming, reference to the role of expectations during face processing remained only indirect.

We argue here that embedding cognitive face models in a predictive framework will not only address open issues in these models but also shed light on questions related to the predictive-processing framework. In the next section, we briefly review dominant cognitive models of face perception.

Cognitive models of face perception

Cognitive models of face perception distinguish several levels of sensory face processing. During low-level visual analysis (image encoding), the visual stimulus and its elements are analyzed in detail, before this basic information (e.g., about edges, coloration, and shape) is integrated into a unified, holistic representation of a face. So-called first-order configural information (i.e., the fact that a face contains two eyes above a nose, above a mouth) may be crucial for the categorization of a stimulus as a face. So-called second-order configural information (the relative metric distances between features) can then be encoded (Maurer, Le Grand, & Mondloch, 2002). Although spatial configural information is likely important for learning new faces, recent evidence suggests that the representations mediating the recognition of familiar faces depend to a great extent on the texture/reflectance information in the image, rather than on spatial configuration per se (e.g., Burton, Schweinberger, Jenkins, & Kaufmann, 2015; Itz, Schweinberger, Schulz, & Kaufmann, 2014). Once an encoded representation has been matched with a more permanently stored representation of a familiar face, the face may be recognized, and semantic or episodic information (e.g., occupation, most recent encounter) about the depicted person may subsequently become available. The existence of temporary or permanent difficulties in retrieving personal names (e.g., Diaz, Lindin, Galdo-Alvarez, Facal, & Juncos-Rabadan, 2007) is one of several arguments for separating name retrieval from general semantic processes in cognitive models of familiar face identification.
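
Read as a processing architecture, these stages form a rough pipeline; the sketch below is only a schematic enumeration (the stage labels paraphrase the description above and are not drawn from any specific implementation), but it makes the assumed sequential dependencies explicit.

```python
# Schematic stages of familiar face identification, in the order described
# above; each stage operates on the output of the preceding one.
PIPELINE = [
    ("image encoding",      "low-level analysis of edges, coloration, and shape"),
    ("structural encoding", "integration into a holistic representation, "
                            "including configural information"),
    ("recognition",         "match against stored representations of familiar faces"),
    ("semantic access",     "retrieval of episodic and semantic person knowledge"),
    ("name retrieval",      "access to the person's name, separable from semantics"),
]

for step, (stage, role) in enumerate(PIPELINE, start=1):
    print(f"{step}. {stage}: {role}")
```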

Even though face recognition and identification are the results of successful processing at multiple levels (see also Fig. 1), the level at which faces are represented has received particularly strong scientific attention. One important framework for modeling the representational level of face perception is the face space (Valentine, 1991, 2001; Valentine, Lewis, & Hills, 2016), in which a face is conceived of as a point within a multidimensional face space (MDFS) made up of the entire set of faces with which one is familiar. Each dimension of the space represents a visible characteristic along which faces can be differentiated. Two versions of the face space have been widely debated. The first suggests that faces are processed in terms of their deviation from the average face, or “norm” (Leopold, Bondar, & Giese, 2006; Rhodes, Brennan, & Carey, 1987; Rhodes & Jeffery, 2006). The competing proposal, the exemplar-based model, suggests that each face is encoded as a specific point within the space and is differentiated from the others as a whole, without specific reference to the center of the space (Nosofsky, 1988, 1991; Valentine, 1991). Although the MDFS model has its merits, the metaphor of a face being represented as a point in MDFS has limitations and could be misleading with respect to the representation of familiar faces. This becomes evident when one considers the evidence for qualitative differences between the mental representations of familiar and unfamiliar faces (Megreya & Burton, 2006). Whereas the visual characteristics of the image are crucial for processing unfamiliar faces (Hancock et al., 2000), familiar faces are characterized by robust representations that are much less image-dependent. In that sense, familiar face representations may be better conceived of as covering an area, rather than a point, in MDFS, where the area covers all “possible” images of a specific familiar face (e.g., Burton et al., 2011; Burton et al., 2015).
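
The two coding schemes can be contrasted in a toy face space. In the sketch below (a minimal illustration with invented two-dimensional “faces”; the dimensions, values, and similarity function are arbitrary choices, not a fitted model), the norm-based code describes a probe face by its deviation from the average face, whereas the exemplar-based code describes it by its graded similarity to each stored exemplar.

```python
import numpy as np

# Toy face space: each row is a known face located on two arbitrary
# dimensions (e.g., eye distance and face width, scaled to 0-1).
known_faces = np.array([[0.2, 0.8],
                        [0.6, 0.4],
                        [0.9, 0.7],
                        [0.3, 0.3]])
norm_face = known_faces.mean(axis=0)        # the "average" face

probe = np.array([0.7, 0.6])                # an incoming face to be coded

# Norm-based code: direction and distance of the probe from the norm.
norm_code = probe - norm_face

# Exemplar-based code: graded similarity to every stored exemplar
# (here a Gaussian function of Euclidean distance).
distances = np.linalg.norm(known_faces - probe, axis=1)
exemplar_code = np.exp(-distances ** 2 / 0.1)

print("deviation from norm:", norm_code)
print("exemplar activations:", exemplar_code.round(3))
```

A robustly represented familiar face might then correspond not to one such probe point but to a whole region of nearby points, in line with the area metaphor above.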

Fig. 1

Simplified cognitive model of face perception, which depicts an approximate time course of the subprocesses involved in face perception and recognition and indicates some of the event-related potential components that are sensitive to different types of face repetition effects. The subprocesses are sensitive to predictive-coding mechanisms, as indicated by the bidirectional arrows between subsequent levels. From “Repetition Effects in Human ERPs to Faces,” by S. R. Schweinberger and M. F. Neumann, 2016, Cortex, 80, pp. 141–153. Copyright 2016 by Elsevier. Adapted with permission, license number 4262480623312

Face space and robust-representation models are appealing because they provide frameworks for considering the many different ways in which faces differ from one another. The face space was originally proposed by Valentine (1991) to account for findings such as facial distinctiveness and the other-race effect (Hayward, Crookes, & Rhodes, 2013); more recent investigations have used perceptual adaptation to examine the ways in which face perception is distorted following prolonged exposure to a face. In face adaptation (e.g., Leopold, O’Toole, Vetter, & Blanz, 2001; Rhodes & Leopold, 2011), exposure to a face with certain characteristics (for example, femininity) makes a subsequent face appear to take on the opposite characteristics (in this case, to look more masculine). Adaptation effects seem to derive naturally from a face space model, because changes in perceived appearance can be conceptualized as temporary changes in position within the space. Face aftereffects are generally interpreted as being more favorable to norm-based than to exemplar-based accounts, since adaptation effects typically show sensitivity to the center of the space. For example, Rhodes and Jeffery (2006) created adaptor–test pairs that in some cases formed dimensions running through a central face norm and in other cases formed dimensions that did not run through the centroid of face space. The authors argued that a norm-based coding account would predict greater adaptation when the norm was involved, since any changes in perceived appearance should be driven maximally toward or away from the norm. They further argued that an exemplar-based account would predict relatively similar levels of adaptation in the two conditions, since a trajectory can be drawn between any two points in face space, and there is no a priori reason to assume that one pair would produce stronger aftereffects than another. Rhodes and Jeffery found support for the norm-based account, in that stronger adaptation effects were observed when the adaptation trajectory passed through the center of the space than when it followed a different trajectory. Recently, however, Ross, Deroche, and Palmeri (2014) demonstrated that such arguments may be less compelling than was previously claimed. They modeled the results of face adaptation experiments using three different architectures: (1) exemplar-based (e.g., Valentine, 1991, 2001), (2) norm-based (using a centroid norm; e.g., Leopold et al., 2001), and (3) opponent-coding-based (a variant of the norm-based approach; e.g., Rhodes & Jeffery, 2006). According to previous conceptions, the norm-based and opponent-coding approaches, but not exemplar-based models, could account for face adaptation effects such as those demonstrated by Rhodes and Jeffery. However, Ross et al. demonstrated that an exemplar-based model could also account for such results, whereas the norm-based model using the centroid of the space as the norm was unable to do so. Overall, the controversy between exemplar- and norm-based views is likely to continue. Although norm-based opponent coding may not be a good model for explaining adaptation to selected facial signals such as eye gaze (Calder, Jenkins, Cassel, & Clifford, 2008), the norm-based account appears to remain dominant as an explanation of facial adaptation effects for a variety of important social signals, including identity, gender, and expression (Rhodes et al., 2017).
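
Under a norm-based account, the aftereffect can be pictured as a temporary shift of the norm toward the adaptor, so that subsequent faces are re-coded relative to the displaced norm. The snippet below is a deliberately coarse, one-dimensional illustration (the axis, the adaptor value, and the strength parameter are invented for the example) of how a physically average face comes to be coded as masculine after adaptation to a feminine face.

```python
# One arbitrary axis of face space: negative = masculine, positive = feminine.
norm = 0.0            # pre-adaptation norm (the average face)
adaptor = 0.8         # a strongly feminine adaptor face
test_face = 0.0       # a physically average test face

# Assume adaptation pulls the norm some fraction of the way toward the adaptor.
adaptation_strength = 0.5
adapted_norm = norm + adaptation_strength * (adaptor - norm)

perceived_before = test_face - norm           # coded relative to the original norm
perceived_after = test_face - adapted_norm    # coded relative to the shifted norm

print(f"before adaptation: {perceived_before:+.2f} (neutral)")
print(f"after adaptation:  {perceived_after:+.2f} (reads as masculine)")
```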

The (missing) link between predictive frameworks and cognitive face models

We have outlined that cognitive models of face perception emphasize that the successful processing of a face (i.e., the recognition of a given person or the correct perception of an emotion on a given face) is mediated by multiple subprocesses within a network of occipito-temporal and frontal cortical areas. In brief, these models typically include the following processing steps, each allocated to certain cortical areas of the network: low-level, image-based pictorial processing of visual features; detection of a face in a scene; perceptual encoding of a facial representation; comparison with the representations of familiar faces stored in memory; and retrieval of semantic information and naming (in the case of familiar faces) (Bruce & Young, 1986; Gobbini & Haxby, 2007; Haxby et al., 2000; Schweinberger & Burton, 2003). In principle, predictive coding may occur at all these levels of processing (e.g., Schweinberger & Neumann, 2016; see Fig. 1).

In current conceptualizations of face perception, processing starts with the incoming sensory information. This information is carried up to the cortex via thalamo-cortical sensory pathways and enters the cortex via V1. After early, local, feature-based processing in early cortical areas such as V1 and V2, the face stimulus, like every type of shape-related information, enters the ventral visual pathway. Current models of face perception (e.g., Haxby et al., 2000) split the processing into two basic steps. The first step performs a relatively low-level visual feature analysis of faces and is associated with the so-called “core network” of face perception, composed of occipital (OFA, LO), fusiform (FFA), and superior temporal sulcus (STS) areas. A second, final step complements this early processing and is associated with an extended network of areas: regions in the parietal and auditory cortices, the amygdala, the insula, the anterior tip of the temporal lobe, and the frontal cortex link the currently observed face to attention, emotion, identity-specific and speech-related information, and various social factors. At this stage of processing, the stimulus is compared with an existing template (and with mnemonic information). In other words, the processing of faces proceeds from simple and “sketchy” toward complex, within a hierarchical system.

Although the idea of a cortical network behind face perception is very attractive, and many studies support its existence (Fairhall & Ishai, 2007), several researchers doubt that such a model can explain every aspect of face perception. These doubts rest on two caveats. First, many areas that were originally thought to be involved only in low-level processing have in fact been shown to be necessary for higher-level tasks such as person recognition (Ambrus, Dotzer, Schweinberger, & Kovács, 2017; Davies-Thompson & Andrews, 2012; Gschwind, Pourtois, Schwartz, Van De Ville, & Vuilleumier, 2012; Rossion, 2008, 2014). Second, many areas of the extended part of the “face network” are not face-specific but are involved in general cognitive mechanisms such as attention, emotion processing, or executive control. How can these observations be reconciled with current cognitive face models? If the role of expectations in face perception is taken into account, cortical feedback descending from higher-level to lower-level areas may resolve this issue. Indeed, this perspective is currently attracting more attention in face recognition research. For example, it has been argued that these feedback connections may be important for integrating the processes that precede the incoming face stimulus, that is, expectations that prepare the system for the interpretation of specific information. This is a processing step that has been examined thoroughly in the context of predictive processing.

What happens in the system when a face is not yet presented, but is expected? Several studies have addressed this question by manipulating prediction with a cue, for example, an arrow or a preceding symbol that participants learn to associate with an upcoming stimulus. These studies have shown that such cues produce a baseline shift in the areas that process the face stimulus (Esterman & Yantis, 2010; Puri, Wojciulik, & Ranganath, 2009). This anticipatory activity in the FFA was even sensitive to the likelihood of the upcoming face (Trapp, Lepsien, Kotz, & Bar, 2016).

However, to the best of our knowledge, studies demonstrating predictive effects in face perception have not incorporated functional predictive-coding steps into currently available face-processing theories. The only exception is the study by Apps and Tsakiris (2013), who directly examined the neural bases of face perception, familiarity, and prediction. They showed that FFA activity signals how much information can be extracted about a face (and thus how much more familiar it can become) and that the posterior STS signals the current level of familiarity of the faces in the environment (i.e., a key prior), which would influence how face space might be processed. This was one of the first studies to suggest how face processing and predictive models might interact. It should also be noted that self-perception has been linked to predictive-processing accounts (Apps & Tsakiris, 2014).

Not only can the prediction perspective address current discussions in face perception with regard to the brain areas involved, but it may also help us understand difficulties with face perception in specific clinical disorders, for example, in autism spectrum disorder (for a review, see Weigelt, Koldewyn, & Kanwisher, 2012). From the predictive perspective, it may be that although comparison with existing face templates works well, deficits occur during the prediction process, thereby slowing the comparison process or adding noise to it. This aligns with the recent suggestion that autistic individuals might generally perceive the world with fewer expectations (Pellicano & Burr, 2012; Van de Cruys et al., 2014). A recent study revealed that children with autism spectrum disorder showed higher activation of (dorsolateral) frontal regions during successful face recognition, which was accompanied by increased response times (Herrington, Riley, Grupe, & Schultz, 2015). The authors suggested that this activation might act as a compensatory mechanism; from the predictive perspective, it could be interpreted as more effort being required to activate predictive information elsewhere in the cortex.

How cognitive face models can inform predictive frameworks

The rise of predictive coding in cognitive neuroscience has stimulated a plethora of studies, and much work has been dedicated to identifying the correlates of prediction and prediction errors, as well as the neural networks that support these processes (e.g., Alink, Schwiedrzik, Kohler, Singer, & Muckli, 2010; Bendixen, Schwartze, & Kotz, 2015; de Gardelle et al., 2013; Kimura, Kondo, Ohira, & Schröger, 2012; Kok, Failing, & de Lange, 2014; Todorovic, van Ede, Maris, & de Lange, 2011). As yet, however, predictive frameworks have remained relatively silent on the representational format of the “prediction.” For example, do we expect a specific chair, or rather an abstract category? And what would such an abstract category look like? It is here that we see a role for (neuro)cognitive face models in informing predictive frameworks: As we have outlined in this article, face models are explicit about the representational format of the template (which may be a norm or an exemplar) and may thereby generate ideas about the format of the prior in other domains of perception. In addition, it remains to be specified how exactly the comparison between the prior and the incoming sensory information is accomplished. The lack of psychological process models within abstract mathematical frameworks such as Bayesian models has been discussed extensively elsewhere (Jones & Love, 2011). Notably, through decades of research, the comparison process between face templates and incoming sensory information has been specified in detail, and this knowledge may guide the way we conceptualize and investigate questions in the context of the “predictive brain.”