Validity refers to the degree to which (empirical) evidence and theory support the interpretation(s) of test scores for the intended uses of those scores, or the degree to which empirical evidence and theory support the interpretation of results derived from experiments.
Within psychology, the concept of validity is used in two different yet related areas. First, validity is an important aspect of the psychometric quality of scores derived from diagnostic measures (e.g., psychological tests, questionnaires, interviews, and behavior observations). Second, validity is also used to gauge the quality of result interpretations of psychological experiments. The definition presented above is based on the Standards for Educational and Psychological Testing (AERA and APA 2014) but extends it to also accommodate the second area where validity is relevant, i.e., experiments.
Validity in Psychological Experiments
In psychology, experiments are used to test causal relationships between psychological constructs. To this end, experimental paradigms are created that standardize the setting and thereby allow researchers to vary so-called independent variables (IV) directly and specifically. This variation is the experimental manipulation of the standardized situation. Thus, this variation is the only aspect in which the situations differ. The impact of such variations on other variables, so-called dependent variables (DV), is measured and interpreted.
For example, Maaß and Ziegler (2017) were interested in how situational demand affects the relation between narcissism (IV) and self-promotion (DV). The experimental setup was standardized such that participants worked on the task in the same room, receiving the same standard instructions and filling out the same questionnaires. The task for all participants was to write a short self-description in order to introduce themselves to a person they had just met in a fitness studio. These self-descriptions were later rated by expert raters regarding self-promotion (DV). The situational demand (IV) was manipulated by (a) directly requesting participants to present themselves in a very positive way or (b) subliminally priming them to present themselves in a positive way. Two further control conditions were also set up. The analyses revealed that the relation between narcissism and self-promotion was not affected by situational demand. To put faith in this judgment (Open Science Collaboration 2015), it is necessary to consider the internal and external validity of the result interpretations.
Kemper (Internal Validity, in this encyclopedia) puts forward the following definition: “Internal validity refers to the degree to which a conclusion concerning the causal effect of one variable (independent or treatment variable) on another variable (dependent variable) based on an experimental study is warranted. Such warrant is constituted by ruling out possible alternative explanations of the observed causal effect that are due to systematic error involved in the study design.”
Thus, internal validity implies that the experimental standardizations conducted are sufficient to trust the results, i.e., to trust the causal link between IV and DV. In other words, there are no uncontrolled variables (confounders) which are actually driving the effects observed. Such confounders can be rather technical. In the example stated above, the time window for the subliminal priming could accidentally have been set too short so that no perception was possible at all. Confounders can also be differences in other constructs not controlled for. In the example, it might just be the case that self-esteem was actually driving the effects (Maaß and Ziegler did control for self-esteem). What these examples should show is that an internally valid experiment makes sure that alternative explanations for the causal link under investigation can either be ruled out due to standardization or be statistically controlled for by including additional measures in the analyses.
According to Kemper (External Validity, in this encyclopedia): “External validity refers to the degree to which conclusions from experimental scientific studies can be generalized from the specific set of conditions under which the study is conducted to other populations, settings, treatments, measurements, times, and experimenters.”
To some extent, internal and external validity are difficult to reconcile. The more researchers try to standardize or the more they add confounders to their analyses, the less generalizable the results. One specific argument often brought forward and potentially limiting external validity is the widespread use of the so-called WEIRD (Western, educated, industrialized, rich, and democratic) samples in psychological research (Henrich et al. 2010). What those authors refer to is the extensive use of psychology students as participants in psychological experiments and studies. To some degree, it can certainly be argued that generalizability and thus external validity is severely limited for studies with such samples.
Validity and Psychometric Tests
Psychometric tests (e.g., cognitive ability tests) and questionnaires (e.g., Big Five questionnaires) as well as interviews (e.g., employment interviews) and behavior observations (e.g., assessment centers) belong to the standard tools utilized in psychological research. A lot of trust is put into the interpretations based on such psychometric tools. For example, companies hire new employees or select internally based on such tools (Sackett and Lievens 2008); researchers advocate the impact of interindividual differences in a number of personality and ability constructs on a wide range of life outcomes (Deary et al. 2004; Roberts et al. 2007; Whalley et al. 2005). This clearly shows that the interpretation of a test score as a measure of intelligence or extraversion needs to be trustworthy, i.e., valid. Unfortunately, it is not easy to test the validity of a test score interpretation straightforwardly. (Usually the test score is the sum or the mean of all item answers. Thus, in an intelligence test, it could be the sum of all items solved or the percentage of items solved. In questionnaires using a Likert-type rating scale, each rating category is assigned a number, and these numbers are then added or averaged across items to determine the test score.) The debate on what constitutes good validity evidence enriched the early psychometric literature (e.g., Campbell and Fiske 1959; Cronbach and Meehl 1955; Loevinger 1957). As a result, there is no such thing as the one validity. Instead, different approaches focusing on different sources of validity have been proposed. Based on this tradition, evidence regarding those different sources of validity has to be presented before a test score interpretation can be regarded as valid.
It is important to stress that a test per se is not valid. Rather, it is the interpretation based on the test score that is valid or, as Messick (1989) put it: “…validity is an inductive summary of both the existing evidence for and the actual as well as potential consequences of score interpretation and use” (p. 5). Such a compilation of validity evidence should not be random. It should rather result from careful planning and hypothesizing. As Ziegler (2014) stated in the ABC of test construction, there should be a clear connection between the intended uses of a test as well as its theoretical foundation and the validity evidence presented.
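The scoring rule mentioned above (a test score as the sum or mean of item answers, or as the percentage of items solved) can be sketched in a few lines of Python. This is only an illustration: the function names and the example responses are hypothetical and not taken from any specific test.

```python
from statistics import mean


def sum_score(item_responses):
    """Sum of numeric item responses (e.g., 1 = solved, 0 = not solved)."""
    return sum(item_responses)


def mean_score(item_responses):
    """Mean of numeric item responses (e.g., Likert categories coded 1 to 5)."""
    return mean(item_responses)


def percent_solved(item_responses):
    """Percentage of items solved, assuming responses are coded 0 or 1."""
    return 100 * sum(item_responses) / len(item_responses)


# Hypothetical responses of one person (illustrative values only).
ability_items = [1, 0, 1, 1, 0, 1, 1, 1]  # eight ability items, 1 = solved
likert_items = [4, 5, 3, 4, 2]            # five Likert items, coded 1 to 5

print(sum_score(ability_items))       # 6
print(percent_solved(ability_items))  # 75.0
print(sum_score(likert_items))        # 18
print(mean_score(likert_items))       # 3.6
```

Whichever rule is chosen, the resulting number is only the starting point: the validity question concerns how that score is interpreted and used.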
For many years, the terms construct validity, criterion validity, and content validity were used to label those sources. Within the following passages, we will outline the ideas behind different sources of validity and discuss their implications for test construction. Before doing this, it should be mentioned that the described classical approach has not remained without critique (Borsboom et al. 2004; see Evaluating Validity below). We will base the following passages on the Standards for Educational and Psychological Testing mentioned above (AERA et al. 2014).
Sources of Validity Evidence
As already mentioned, there is no such thing as the validity for a test score interpretation. Theoretically this would be the correlation between the test score and the true score. However, if we knew the true score, we would not need a test in the first place. So this solution remains a hypothetical one. Thus, different scholars tried to circumvent the problem of the missing true score by making use of auxiliary propositions, i.e., they constructed a theoretical structure that can be tested. A metaphor for this is the following. Imagine a researcher looks at an apple and interprets its color and form as evidence for its sweetness. Now this interpretation needs to be validated. Unfortunately, the researcher has no idea how to measure the amount of sugar (true score) through biochemical methods. Thus, instead of directly measuring the amount of sugar apples have, the researcher sets up an experiment. The hypothesis to be tested is: If the color and form of the apple can indeed be interpreted as a sign of sweetness, the apple should taste sweet. The researcher then asks a lot of people to eat apples and later to rate their sweetness. The researcher further assumes that if the average rating is above a certain threshold, the apple can be considered sweet and thus color and form would be validly interpreted as indicators of sweetness. Determining the validity of test score interpretations is similar.
A seminal paper underlying many of these ideas is the paper by Cronbach and Meehl (1955) where the basic ideas of a nomological network were laid out. Those authors said: “Scientifically speaking, to ‘make clear what something is’ means to set forth the laws in which it occurs. We shall refer to the interlocking system of laws which constitute a theory as a nomological network” (p. 290). They went on and stated several other important features, of which the idea that a nomological net may relate theoretical constructs to observables and different theoretical constructs to one another is especially important for the ideas of validity brought forward by other scholars.
The Standards’ definitions and explanations of sources of validity evidence which are often cited today no longer directly refer to a nomological network approach. The historical roots are evident though. Thus, when explaining the ideas behind different sources of validity, we will try to present both the historical roots and the present interpretation, whenever possible.
Evidence Based on Test Content
The basic idea here is that the construct to be measured should be reflected in the test content. The easiest way to test this is to look at the items of a test and to determine the degree to which the items reflect the construct. More abstractly speaking, this would mean that the items are a representative sample of the universe of items the construct could be measured with. This more abstract definition reveals the difficulties in approaching this task. What is the item universe? How can we determine representativeness of the item sample? Oftentimes expert ratings are utilized. More specifically, subject experts are asked to rate the extent to which the items reflect the construct. Importantly, the interpretations and especially the inferences which are supposed to be based on the test scores should be considered. For example, if the test score is supposed to reflect the construct of conscientiousness, it is important to consider the intended test use. If the test were to be used in an occupational context, it should be determined whether the items reflect manifestations of conscientiousness relevant for the occupational context.
Evidence Based on Response Processes
This approach to validity is strongly related to cognitive surveys conducted during test construction (Ziegler et al. 2015). The underlying idea is that based on the construct, hypotheses can be generated regarding the response process. Cognitive surveys, such as a think-aloud technique, can then be used to collect empirical evidence supporting the hypothesized response process. Other approaches to this idea also use quantitative methods. For example, in educational research, response processes oftentimes can be directly attributed to different skills (e.g., a complex math item in the form of a text can require addition, subtraction, and modeling; see Kunina-Habenicht et al. 2009). Using so-called diagnostic classification models, it can then be tested whether the assumed processes really contribute to the response process. A nice example where qualitative and quantitative approaches were mixed to investigate the response process was presented by Carpenter et al. (1990), who demonstrated in which aspects persons high vs. low on ability differ when solving matrix items in a standard intelligence test.
Another interesting approach related to this idea was suggested by Krumm et al. (2017). Those authors suggested experimentally manipulating specific features of the items which are supposedly related to the construct measured and substantially contribute to the validity of the test score interpretation. Krumm et al. (2015) used this approach and could show that the situational vignettes employed in situational judgment tests are not really necessary for the actual scores people achieve.
Evidence Based on Internal Structure
This source of evidence has a historical root. Loevinger (1957) made the term structural validity popular. In the following years, especially with the widespread use of factor analytic methods, this source of validity evidence was strongly promoted. The underlying idea is that a construct has a specific structure. In the simplest form, there is only one latent variable like conscientiousness causing differences in behavior. More complex models assume intermediate levels or strata of latent variables between behaviors and personality traits, e.g., personality facets such as achievement striving or order. This assumed structure should then be reflected in the structure of the items or more specifically the interitem correlations (Ziegler and Bäckström 2016). Confirmatory factor analysis (CFA) is often used to test whether the assumed structure of a construct is reflected in the correlations between items. It should be stressed here that while this source of evidence surely is important, the validation approach described here would not assume validity of a test score interpretation simply based on the fact that a CFA model provides evidence for structural validity.
Evidence Based on Relations to Other Variables
Again, a historical root for this approach can be identified. The seminal paper by Campbell and Fiske (1959) outlined the ideas of convergent and discriminant validity as cornerstones of construct validity. This approach exemplifies how auxiliary propositions were utilized. Convergent validity means that tests measuring the same construct should yield converging results. The Standards softened this original proposition and also include similar constructs. Either way, the basic idea is the same: If the same (or a similar) construct is reflected in the test scores, the results should be similar. For example, results from different Big Five tests should converge (but see Pace and Brannick 2010). This is usually tested by correlating test scores derived from different tests assessing the same (or similar) constructs. It is important to note here that convergent validity can also be determined by using different raters instead of different methods. In that sense, raters are methods. This traditional approach also means that when ratings come from others and the self, their agreement (between others and self as well as among others) might not only be a sign of the psychometric adequacy of a test score but might also indicate a certain “rateability” of traits: some traits are more observable than others (Yalch and Hopwood 2016). Discriminant validity means that test scores from measures capturing different constructs should not converge. For example, results in a grit measure should be distinct from results in measures of other constructs such as conscientiousness (but see Credé et al. 2016). As before, these assumptions are often tested by correlating scores from different tests. Modern approaches to analyzing such multitrait-multimethod data, and thus convergent and discriminant validity, use confirmatory factor analysis (see Koch et al. 2017 in this encyclopedia).
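The correlational logic described above can be illustrated with a minimal Python sketch. It computes a plain Pearson correlation between scores from two measures of the same construct (convergent) and between measures of different constructs (discriminant). The function name and all score values are hypothetical and chosen only for illustration; the sketch is not a substitute for the multitrait-multimethod models mentioned in the text.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant scores")
    return sxy / sqrt(sxx * syy)


# Hypothetical scores of six persons (illustrative values only).
extraversion_test_a = [12, 15, 9, 20, 17, 11]
extraversion_test_b = [14, 16, 10, 19, 18, 12]   # same construct, other test
unrelated_construct = [30, 22, 27, 25, 29, 23]   # different construct

convergent_r = pearson_r(extraversion_test_a, extraversion_test_b)
discriminant_r = pearson_r(extraversion_test_a, unrelated_construct)

print(f"convergent r:   {convergent_r:.2f}")    # expected to be high
print(f"discriminant r: {discriminant_r:.2f}")  # expected to be near zero
```

With real data, the size of each coefficient would still need to be judged against the similarity of the measures, their reliabilities, and the sample, rather than against a fixed cutoff.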
Another popular source of validity evidence which is subsumed here refers to test-criterion correlations. The basic idea is that the inferences based on the test score interpretations should hold true in a specific context. For example, if employees are selected based on the scores in a test battery, those scores should be correlated with actual job success. As before, correlations, here between test scores and criteria, are often used. In that sense, the test scores should predict a criterion. This criterion should be relevant for the intended test uses. It should be noted here that many test applications require decisions regarding individuals. Such intended test uses should be backed up by showing the test score’s sensitivity and specificity.
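Sensitivity and specificity for an individual-level decision can be sketched as follows. The example assumes a hypothetical cutoff on the test score and a binary criterion (e.g., later job success); the cutoff, the scores, and the function name are illustrative assumptions, not values from the source.

```python
def sensitivity_specificity(test_positive, criterion_positive):
    """Return (sensitivity, specificity) for paired boolean decisions.

    sensitivity: share of criterion-positive cases the test flags as positive
    specificity: share of criterion-negative cases the test flags as negative
    """
    tp = fn = tn = fp = 0
    for predicted, actual in zip(test_positive, criterion_positive):
        if actual and predicted:
            tp += 1
        elif actual and not predicted:
            fn += 1
        elif not actual and not predicted:
            tn += 1
        else:
            fp += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need both positive and negative criterion cases")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical selection data (illustrative values only).
test_scores = [62, 48, 55, 71, 39, 58, 44, 66]
succeeded = [True, False, True, True, False, False, True, True]
CUTOFF = 50  # hypothetical decision threshold

selected = [score >= CUTOFF for score in test_scores]
sensitivity, specificity = sensitivity_specificity(selected, succeeded)

print(f"sensitivity: {sensitivity:.2f}")  # 0.80
print(f"specificity: {specificity:.2f}")  # 0.67
```

Moving the cutoff trades one quantity against the other, so both values are best reported for the threshold actually used in the intended decision.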
Besides “rateability,” other ratings also have implications for test-criterion correlations. Meta-analyses showed that for some traits, for example, conscientiousness and emotional stability, other ratings have higher predictive values on academic achievement and job performance than self-ratings (Connelly and Ones 2010; Poropat 2014). Three possible explanations for this have been proposed. The first explanation is a reduced response bias in other ratings. The second explanation is the specific context the others know and rate which is also relevant for the criterion. Finally, the others’ focus on the reputation component rather than identity and internally held motives has been suggested as an explanation (Connelly and Ones 2010). Thus, ratings from different perspectives potentially enhance test-criterion correlations.
Some researchers also include the prediction of the test score by antecedents as validity evidence. While this may not be covered by the Standards or similar works, the basic idea is appealing. If the developmental mechanisms underlying variation in a construct are known, test scores should be predictable by those antecedents. Considering the state of affairs in terms of knowing developmental antecedents of constructs, this approach might be considered to be in its infancy.
Implications for Test Construction: The ABC of Test Construction
Ziegler (2014) proposed the ABC of test construction, referring to a construction strategy that informs the validation strategy. In short, three basic questions need to be answered by the test constructor: (A) What is the construct to be measured, (B) what is the targeted population, and (C) what are the intended uses? The second question mainly informs sampling strategies but is also of relevance for content validity. The targeted population inhabits a specific context which should then be reflected in the items as detailed above. Answers to the first and the second question directly inform the validation strategy. The definition of the construct to be measured should include a description of its nomological network. Thus, it should be possible to select measures that assess the same construct (convergent validity) or a different construct (discriminant validity). Especially the latter becomes more meaningful when the selection of discriminant measures is based on the nomological net. In that sense, it is of little relevance to show that scores in an intelligence test do not correlate with agreeableness scores. However, the nomological net of most intelligence tests will show that constructs such as working memory or sustained attention are close, potentially overlapping but still distinct. Consequently, measures assessing those constructs would be more apt for testing discriminant validity. Finally, the definition of the nomological network helps to ensure content validity in yet another way. By defining the construct space, the range of the possible item universe has also been set.
The third question, the intended uses of the test, mainly informs the validation strategy related to test-criterion correlations. Clearly, if a test is supposed to be used to make decisions regarding entrance to a specific school or university, success in this kind of institution should be predictable by the test scores of the measure in focus.
All of these consequences for the validation strategy also demonstrate that clear, theory-based hypotheses about the kind of evidence expected to support validity should be stated.
To sum up, the ABC of test construction provides guidelines for the test construction process with an emphasis on the validation strategy.
Evaluating Validity
Above it was mentioned that the classical approach to testing validity has not remained without critique. One of the criticisms brought forward by Borsboom et al. (2004) was that this classic approach puts too much weight on correlations (instead of causality). Within the passages regarding sources of validity evidence, this should have become evident. For example, convergent validity is shown when test scores from measures assessing the same construct correlate. But how strong does this correlation need to be? The EFPA Board of Assessment provides detailed advice on how to judge a convergent correlation (EFPA Board of Assessment 2013). For example, it is stressed that correlations larger than 0.60 can be expected only if very similar measures are used and data are collected concurrently. The more dissimilar the measures and the larger the time gap between assessments, the smaller the expected and accepted correlation. It is further stated that correlations which are too large (exceeding 0.90) can be problematic if the intended use of the measure was to add to the already existing measures. Furthermore, it is recommended to consider the reliability estimates of both measures. This list of aspects to consider when judging a validity correlation coefficient could be extended by, for example, level of symmetry, range restriction, criterion contamination, and criterion deficiency (for a detailed discussion, see Ziegler and Brunner 2016). Importantly, the EFPA board has highlighted an important principle here, i.e., the uselessness of strict cutoffs. It is necessary to consider the quality of the measures, the context of data collection, and the sample composition when judging a correlation coefficient.
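One way to see why the reliability estimates of both measures matter is Spearman's classical correction for attenuation, which divides the observed correlation by the square root of the product of the two reliabilities. The text above does not prescribe this formula; it is used here only as a familiar illustration, and the function name and numeric values are hypothetical.

```python
from math import sqrt


def disattenuated_correlation(r_xy, reliability_x, reliability_y):
    """Spearman's correction for attenuation.

    r_xy: observed correlation between the two measures
    reliability_x, reliability_y: reliability estimates, each in (0, 1]
    """
    if not (0 < reliability_x <= 1 and 0 < reliability_y <= 1):
        raise ValueError("reliabilities must lie in the interval (0, 1]")
    return r_xy / sqrt(reliability_x * reliability_y)


# Hypothetical values: an observed convergent correlation of 0.48 between
# two measures that each have a reliability estimate of 0.80.
observed_r = 0.48
corrected_r = disattenuated_correlation(observed_r, 0.80, 0.80)

print(f"observed r:  {observed_r:.2f}")   # 0.48
print(f"corrected r: {corrected_r:.2f}")  # 0.60
```

The same observed coefficient can therefore mean different things depending on how reliably the two measures capture their constructs, which is one reason strict cutoffs are of little use.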
To sum up, the different approaches to attest validity are traditionally based on ideas utilizing auxiliary propositions. Validity for a test score interpretation should be based on a compilation of evidence covering several sources and being in line with the intended test use.
References
- AERA, & APA. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association & American Psychological Association.
- AERA, APA, & NCME. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association & American Psychological Association.
- Borsboom, D., Mellenbergh, G. J., & Van Heerden, J. (2004). The concept of validity. Psychological Review, 111(4), 1061–1071. Retrieved from http://psycnet.apa.org/journals/rev/111/4/1061.pdf.
- Credé, M., Tynan, M. C., & Harms, P. D. (2016). Much ado about grit: A meta-analytic synthesis of the grit literature. Journal of Personality and Social Psychology. http://dx.doi.org/10.1037/pspp0000102
- EFPA Board of Assessment. (2013). EFPA review model for the description and evaluation of psychological and educational tests. Washington, DC: American Educational Research Association & American Psychological Association.
- Koch, T., Holtmann, J., Bohn, J., & Eid, M. (2017). Multitrait-multimethod analysis. In Encyclopedia of personality and individual differences. Springer.
- Krumm, S., Hüffmeier, J., & Lievens, F. (2017). Experimental test validation. European Journal of Psychological Assessment, Advance on, 1–8. doi: 10.1027/1015-5759/a000393.
- Kunina-Habenicht, O., Rupp, A. A., & Wilhelm, O. (2009). A practical illustration of multidimensional diagnostic skills profiling: Comparing results from confirmatory factor analysis and diagnostic classification models. Studies in Educational Evaluation, 35(2–3), 64–70. doi: 10.1016/j.stueduc.2009.10.003.
- Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251). doi: 10.1126/science.aac4716.
- Roberts, B. W., Kuncel, N. R., Shiner, R., Caspi, A., & Goldberg, L. R. (2007). The power of personality: The comparative validity of personality traits, socioeconomic status, and cognitive ability for predicting important life outcomes. Perspectives on Psychological Science, 2(4), 313–345. doi: 10.1111/j.1745-6916.2007.00047.x.
- Yalch, M. M., & Hopwood, C. J. (2016). Target-, informant-, and meta-perceptual ratings of maladaptive traits. Psychological Assessment. doi: 10.1037/pas0000417.
- Ziegler, M., & Brunner, M. (2016). Test standards and psychometric modeling. In A. A. Lipnevich, F. Preckel, & R. Roberts (Eds.), Psychosocial skills and school systems in the 21st century (pp. 29–55). Göttingen: Springer.