The diversity effect in inductive reasoning depends on sampling assumptions
A key phenomenon in inductive reasoning is the diversity effect, whereby a novel property is more likely to be generalized when it is shared by an evidence sample composed of diverse instances than a sample composed of similar instances. We outline a Bayesian model and an experimental study that show that the diversity effect depends on the assumption that samples of evidence were selected by a helpful agent (strong sampling). Inductive arguments with premises containing either diverse or nondiverse evidence samples were presented under different sampling conditions, where instructions and filler items indicated that the samples were selected intentionally (strong sampling) or randomly (weak sampling). A robust diversity effect was found under strong sampling, but was attenuated under weak sampling. As predicted by our Bayesian model, the largest effect of sampling was on arguments with nondiverse evidence, where strong sampling led to more restricted generalization than weak sampling. These results show that the characteristics of evidence that are deemed relevant to an inductive reasoning problem depend on beliefs about how the evidence was generated.
Keywords: Category-based induction; Evidence diversity; Bayesian modeling; Relevance theory; Sampling assumptions
Philosophers of science have suggested that diverse evidence leads to more robust generalization (e.g., Hempel, 1966). The “diversity effect” in category-based induction suggests that most adults share this intuition: people are more likely to generalize a novel property to other category members when that property is shared by a diverse set of categories rather than a nondiverse set. For example, knowing that lions and cows have some property p is generally seen as a stronger basis for generalizing that property to other mammals than knowing that lions and tigers have property p (Osherson, Smith, Wilkie, Lopez, & Shafir, 1990). This diversity effect is robust, having been replicated across a range of reasoning tasks and category stimuli (e.g., Feeney & Heit, 2011; Liew, Grisham, & Hayes, 2018; Osherson et al., 1990). Moreover, diverse samples of evidence have been shown to facilitate hypothesis testing (e.g., López, 1995) and promote conceptual change (Hayes, Goodhew, Heit, & Gillan, 2003). Early accounts of the diversity effect in category-based induction emphasized the crucial role of similarity between those categories known to have a property (premise categories) and the categories to which the property could be generalized (conclusion categories). Osherson et al.’s (1990) influential similarity-coverage model, for example, attributes the diversity effect to the fact that diverse premise categories (e.g., lions and cows) have greater “coverage” of broader conclusion categories such as mammals (i.e., diverse premise categories are similar to more members of a superordinate like mammals than are nondiverse categories).
There is a growing consensus in the field, however, that similarity alone is insufficient to explain property induction (e.g., Kemp & Tenenbaum, 2009; Medin, Coley, Storms, & Hayes, 2003). Inductive arguments involving premise and conclusion categories (e.g., lions and cows have p, therefore mammals have p) are often communicative acts, designed to influence the beliefs of the reasoner, and as such, pragmatic inferences can shape the perceived strength of the inductive argument (Goodman & Frank, 2016; Grice, 1975). Experimental manipulations of the communicative context influence how people interpret an inductive argument (Ransom, Perfors, & Navarro, 2016; Voorspoels, Navarro, Perfors, Ransom, & Storms, 2015), in a manner consistent with Bayesian theories of inductive reasoning (Navarro, Dry, & Lee, 2012; Sanjana & Tenenbaum, 2003; Tenenbaum & Griffiths, 2001). Within the Bayesian framework, these effects are seen as reflecting changes in sampling assumptions—assumptions that a reasoner makes about how an inductive argument was constructed.
Much of the literature on sampling assumptions has focused on the effect of adding new evidence (e.g., additional premise categories) to an inductive argument (e.g., Fernbach, 2006; Ransom et al., 2016). However, to the extent that these findings reflect the operation of more general principles of Bayesian reasoning (Sanjana & Tenenbaum, 2003; Tenenbaum & Griffiths, 2001), one might wonder if sampling assumptions also shape the value people assign to the diversity of evidence in inductive arguments. Our goal in this article is to address this question. Is the diversity effect in inductive reasoning purely a similarity-driven effect, or does it depend on how the reasoner believes the inductive argument was constructed?
Reasoning as Bayesian inference
The central characteristic of this belief revision is that it is driven by the likelihood P(x|h) that the reasoner would have encountered the evidence x if the hypothesis h correctly described the true extension of the property p. Importantly, this likelihood is subjective: It is based on the reasoner’s personal theory about how the inductive argument was constructed, referred to as the sampling assumption (e.g., Fernbach, 2006; Navarro et al., 2012; Tenenbaum & Griffiths, 2001).
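In symbols, the belief revision described here is simply Bayes' rule applied over the hypothesized extensions of the property, with the sampling assumption entering only through the likelihood term:

```latex
P(h \mid x) = \frac{P(x \mid h)\,P(h)}{\sum_{h'} P(x \mid h')\,P(h')}
```

The prior P(h) captures the reasoner's beliefs before seeing the argument; the sampling assumption determines P(x|h), and so determines how the same evidence shifts those beliefs.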
A key implication of our approach is that sampling assumptions matter more for inferences based on nondiverse evidence. To illustrate, suppose that the learner is now told that dogs and wolves both produce leptine. How should a Bayesian reasoner behave? The answer depends on what the reasoner believes about why they were informed about dogs and wolves, specifically. One possibility—known as weak sampling—is that these two animals were chosen at random, and by chance it happened to be two canines, and (also by chance) the two canines do produce leptine. Because the items are chosen at random, irrespective of whether or not they have the property in question, the likelihood takes on a constant value P(x|h) ∝ 1 for every hypothesis that is consistent with the evidence (i.e., canines, placentals, mammals), and P(x|h) = 0 for all hypotheses that are not (ursines, macropods, marsupials). The posterior distribution is therefore evenly spread across the three still-plausible hypotheses—that is, P(h|x) = 1/3 (see Fig. 1a).
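The weak-sampling computation can be sketched in a few lines; the hypothesis names are from the running example, and the uniform prior over six candidate extensions is an illustrative assumption:

```python
from fractions import Fraction

# Six candidate extensions for the novel property (illustrative).
hypotheses = ["canines", "ursines", "macropods", "placentals", "marsupials", "mammals"]
# Hypotheses consistent with the evidence {dog, wolf}.
consistent = {"canines", "placentals", "mammals"}

prior = {h: Fraction(1, len(hypotheses)) for h in hypotheses}

# Weak sampling: items are chosen at random, so P(x|h) is constant
# (here set to 1) for every consistent hypothesis and 0 otherwise.
likelihood = {h: (1 if h in consistent else 0) for h in hypotheses}

unnormalized = {h: prior[h] * likelihood[h] for h in hypotheses}
z = sum(unnormalized.values())
posterior = {h: unnormalized[h] / z for h in hypotheses}

print(posterior["canines"])  # Fraction(1, 3): belief spreads evenly
```

Because the constant likelihood cancels in the normalization, the posterior is just the prior restricted to the surviving hypotheses, which is why each of the three gets probability 1/3.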
Another alternative in the literature is known as strong sampling, and describes situations where the premise categories x are selected precisely because they possess the property p. Perhaps a helpful teacher looked up a list of leptine-producing animals and then randomly chose two illustrative animal items from this list (e.g., dog and wolf). This produces a model in which the probability of sampling item x is given by P(x|h) = 1/|h|, where |h| denotes the size of the hypothesis. Importantly, this leads to a change in the reasoning process. If the learner believes there are 36 species of canine in the list, then for h = canines, the probability of choosing a wolf is 1/36, and the probability of choosing a wolf and a dog (without replacement) is 1/36 × 1/35 ≈ 7.9 × 10−4. In contrast, if the true extension of the category is all mammals (h = mammals), the chance of selecting a wolf and a dog is extremely small, say 1/5,000 × 1/4,999 ≈ 4.0 × 10−8. Taking the ratio of these two probabilities, P(wolf, dog | canines) : P(wolf, dog | mammals) = 7.9 × 10−4 : 4.0 × 10−8 ≈ 19,837:1, we see that the evidence is much more likely under the smaller hypothesis (h = canines). Repeating the exercise for the case of canines versus placentals, we find a similarly large ratio: P(wolf, dog | canines) : P(wolf, dog | placentals) = 7.9 × 10−4 : 6.3 × 10−8 ≈ 12,692:1. Thus, after eliminating those hypotheses that are inconsistent with the evidence (ursines, macropods, and marsupials), the posterior distribution overwhelmingly favors the canine hypothesis over the placental or mammal hypothesis (see Fig. 1b). Specifically, P(canines | wolf, dog) = (0.16 × 7.9 × 10−4) / ((0.16 × 7.9 × 10−4) + (0.16 × 4.0 × 10−8) + (0.16 × 6.3 × 10−8)) ≈ 0.99. By comparison, P(mammals | wolf, dog) ≈ 5.0 × 10−5, and P(placentals | wolf, dog) ≈ 7.9 × 10−5.
The strong sampling model therefore embodies a size principle in which the reasoner comes to prefer the smallest or most specific hypothesis that is consistent with the evidence.
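The same posterior computation under strong sampling reproduces the worked numbers; the hypothesis sizes are the illustrative assumptions from the example (36 canines, 5,000 mammals, and 4,000 placentals, which recovers the 6.3 × 10−8 figure):

```python
from fractions import Fraction

# Sizes |h| for the three hypotheses consistent with {dog, wolf} (illustrative).
sizes = {"canines": 36, "placentals": 4000, "mammals": 5000}

def strong_likelihood(n):
    # P(wolf, dog | h): two items drawn without replacement
    # from a list of n property-bearing species.
    return Fraction(1, n) * Fraction(1, n - 1)

prior = Fraction(1, 6)  # uniform over the six candidate hypotheses (~0.16 each)
unnormalized = {h: prior * strong_likelihood(n) for h, n in sizes.items()}
z = sum(unnormalized.values())
posterior = {h: unnormalized[h] / z for h in sizes}

print(float(posterior["canines"]))  # > 0.99: the size principle at work
print(float(posterior["mammals"]))  # ~5.0e-5
```

As in the text, the uniform prior cancels in the normalization; the size principle is driven entirely by the 1/|h| likelihoods.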
To illustrate the implications for the diversity effect, consider how the previous example plays out if the reasoner is given diverse evidence, say, that dogs and koalas produce leptine. In this situation, the sampling model is largely irrelevant: The evidence is only consistent with a single hypothesis (mammals), so the reasoner will strongly endorse an argument generalizing from dogs and koalas to all mammals, regardless of the sampling assumption (see Fig. 1c–d). This leads to our key prediction about the impact of sampling assumptions on the diversity effect—the effect will be far larger under strong sampling assumptions (compare Fig. 1b and d) than under weak sampling assumptions (compare Fig. 1a and c).
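The prediction can be checked directly by running both sampling models on both kinds of evidence; with diverse evidence only one hypothesis survives, so the sampling assumption cannot matter (hypothesis names and sizes are again the illustrative assumptions from the example):

```python
from fractions import Fraction

sizes = {"canines": 36, "placentals": 4000, "mammals": 5000}

def posterior(consistent, strong):
    # Uniform prior cancels; likelihood is 1/(n(n-1)) under strong sampling,
    # constant under weak sampling, and 0 for inconsistent hypotheses.
    lik = {h: (Fraction(1, n * (n - 1)) if strong else Fraction(1))
           for h, n in sizes.items() if h in consistent}
    z = sum(lik.values())
    return {h: p / z for h, p in lik.items()}

# Nondiverse evidence {dog, wolf}: three consistent hypotheses.
nondiverse = {"canines", "placentals", "mammals"}
# Diverse evidence {dog, koala}: only 'mammals' is consistent.
diverse = {"mammals"}

# Nondiverse case: the sampling model matters a great deal.
print(float(posterior(nondiverse, strong=False)["mammals"]))  # 1/3 under weak
print(float(posterior(nondiverse, strong=True)["mammals"]))   # tiny under strong

# Diverse case: P(mammals | evidence) = 1 regardless of sampling model.
print(posterior(diverse, strong=False) == posterior(diverse, strong=True))
```

This is the interaction tested in the experiment: generalization from nondiverse evidence is restricted under strong sampling but not under weak sampling, while generalization from diverse evidence is unaffected.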
We carried out an experimental test of these predictions in a property-induction experiment in which target arguments containing diverse or nondiverse premises were presented to groups under conditions that promoted an assumption of either strong or weak sampling. Each group received instructions that described the process by which premises were selected (selected by a helpful agent vs. selected randomly), together with a set of filler arguments, designed to reinforce this description. In the strong sampling group, fillers resembled target items and contained diverse and nondiverse arguments with the same conclusion category. In the weak sampling group, the fillers conveyed the impression that the premises had been generated randomly. This combination of instructional and item manipulation has been successful in previous work in shifting people toward a belief in strong or weak sampling (Ransom et al., 2016; Voorspoels et al., 2015), and has been more effective than cover-story manipulations alone (see Navarro et al., 2012).
One hundred and eighty-seven participants from the United States were recruited through Amazon Mechanical Turk (AMT) and paid US$1.00. All had high approval status (≥95% approval for previous tasks). Three were excluded because they failed the attention check administered at the end of the procedures (see below for details). The final sample total was 184 (81 female, 103 male; age: M = 35.97 years, SD = 10.92), with equal numbers randomly assigned to strong or weak sampling groups.
Table 1. The inductive arguments used in the task
(a) Target arguments (diverse)
(b) Target arguments (nondiverse)
dogs, rats, whales → all mammals
rabbits, raccoons, squirrels → all mammals
octopi, eels, trout → all sea creatures
sardines, herring, anchovies → all sea creatures
flies, termites, millipedes → all insects
bees, wasps, hornets → all insects
(c) Filler arguments (strong sampling condition)
cows, mice, seals → all mammals
zebras, giraffes, camels → all mammals
pigeons, hens, ostriches → all birds
ducks, swans, pelicans → all birds
apples, peaches, papaya → all fruit
strawberries, blueberries, raspberries → all fruit
(d) Filler arguments (weak sampling condition)
chickens, condors, coconuts → all mammals
geese, skunks, NOT carp → all mammals
elephants, moths, pineapples → all birds
robins, salmon, NOT cod → all sea creatures
spiders, finches, NOT worms → all insects
NOT tigers, NOT bananas, locusts → all fruit
(e) List of properties used
the chemical didymium
traces of magnesium
Because property induction is affected by the typicality of premises (i.e., the extent to which each premise category is seen as representative of the broader conclusion category; Osherson et al., 1990), it was important that this was controlled. Premises for target arguments were chosen in order to match the mean premise typicality across diverse and nondiverse versions, as rated by 162 participants recruited through AMT who were paid US$0.50 but did not participate in the main study.
The two sampling groups received six different filler items. In strong sampling, the fillers were three arguments with diverse premises and three with nondiverse premises (see Table 1c for examples). In weak sampling, each filler contained three premises, drawn from two or three different superordinate categories of living things (see Table 1d for examples). To further reinforce the impression of randomness, four of the six fillers in this condition contained at least one premise which was said to “NOT have” the property (see https://osf.io/fpx9k/ for all experimental materials and data, including premise typicality ratings).
On each trial you will see three instances of living things that have a particular property. Note that the instances were deliberately chosen to best illustrate the variety of living things that have the property.
On each trial you will see three instances of living things that have a particular property. We asked a student to open a book on plants and animals at random pages and note the first three living things they came across and whether or not those living things have the property in question. This means the information you receive may not be the most helpful for making your judgment—by chance, the student will sometimes select very dissimilar items, and sometimes very similar ones.
They then saw 12 test trials (three diverse targets, three nondiverse targets, six fillers) in random order. On each trial, three premises were listed as having a shared novel property (or, for fillers in the weak sampling condition, some premises were shown not to have the property). Participants then rated the likelihood that all members of the conclusion category had the property (1 = not very likely, 7 = very likely; hereafter "argument strength"). For each participant, the property attached to each argument was drawn randomly from the 12 fictitious biological properties shown in Table 1e, with a different property used on each trial. After the test phase, there was an attention check in which participants had to identify the largest integer in a random sequence.
Ratings of argument strength were first averaged across the three diverse and three nondiverse targets for each participant in the strong and weak sampling groups. Mean group argument strength ratings and within-group standard errors for diverse and nondiverse arguments are plotted in Fig. 2c. There is a clear diversity effect: Properties shared by diverse premises were more likely to be generalized (M = 5.08, SE = .09) than properties shared by less diverse premises (M = 4.48, SE = .08, BF10 > 1,000, ηp² = 0.25).1 The sampling manipulation also influenced ratings of argument strength in the expected fashion, with participants in the weak sampling condition giving higher ratings overall (M = 5.23, SE = .11) than those in the strong sampling condition (M = 4.33, SE = .11, BF10 > 1,000, ηp² = 0.15). Most importantly, there is strong evidence for an interaction: As predicted by our theoretical analysis, the diversity effect is attenuated under weak sampling relative to strong sampling (BF10 = 36.0, ηp² = 0.07). To confirm that the form of this interaction is an attenuation of the diversity effect in the weak sampling condition (as opposed to a disappearance of the effect), we ran a Bayesian paired-samples t test for this condition alone and found strong evidence that the effect persists (BF10 = 136.0). Taken together, the higher overall level of generalization in the weak sampling condition and the fact that a modest diversity effect remains in this condition suggest that people are not simply ignoring similarity among categories as a source of evidence; rather, they appear to assign it a different evidentiary value.
The effect of evidential diversity on property induction is one of the most widely replicated findings in the field of inductive reasoning. When introducing their Bayesian generalization model, Tenenbaum and Griffiths (2001) argued that it naturally accommodates the effect of diversity on inductive argument strength. In this article, we extend their analysis. We have shown empirically that the magnitude of the diversity effect depends on participants’ assumptions about how the evidence has been selected. As predicted by the Bayesian model, when led to believe strong sampling applies, a robust diversity effect appeared. However, when the context suggested that evidence was generated randomly (weak sampling), the diversity effect was attenuated.
Notably, this attenuation meant that overall ratings of property generalization were higher under weak than under strong sampling. As predicted, the largest effect of sampling was on inferences from evidence with low diversity where strong sampling prompted more restricted property generalization than weak sampling. In all crucial respects, the group empirical results were consistent with the ordinal predictions of the Bayesian model.
In regard to the generality of these effects, the predicted difference in the magnitude of the diversity effect under weak and strong sampling assumptions was obtained consistently across a variety of inductive arguments (see Fig. 3). Although our experiment only examined the results of a single operationalization of diversity (diverse vs. nondiverse premises), our simulation results (see Fig. 2a) show that the same qualitative prediction about the effects of sampling assumptions holds across a range of possible levels of evidence diversity. The relationship between diversity effects and sampling assumptions should therefore be seen as a generic prediction of Bayesian inductive reasoning models. There was, however, suggestive evidence (see Fig. 4) for some heterogeneity in the effects of sampling assumptions across subjects. Although a majority in the strong sampling condition showed a robust diversity effect, some showed little effect of evidence diversity. This could reflect individual differences in belief in the cover story used to manipulate sampling assumptions, in knowledge of biological categories, or a more fundamental difference in the way that different individuals generate inductive hypotheses from diverse or nondiverse evidence (cf. Navarro et al., 2012; Ransom, Hendrickson, Perfors, & Navarro, 2018).
Our theoretical analysis and results make an important contribution by highlighting the central role played by sampling assumptions in important inductive phenomena like the diversity effect. Previous theoretical explanations of this effect (e.g., Heit, Hahn, & Feeney, 2005; Osherson et al., 1990) have focused on how diverse sample content promotes property generalization. The Osherson et al. (1990) model, for example, assumes that more diverse samples support broader generalization because they provide more coverage of the category of interest. In contrast, our approach suggests that the strength of the diversity effect depends on one's assumptions about how premise information is selected—especially for nondiverse samples. The fact that many previous studies (Feeney & Heit, 2011; Liew et al., 2018; Osherson et al., 1990) have demonstrated robust diversity effects in property induction without explicit manipulation of sampling assumptions suggests that strong sampling of the presented evidence may be the default for a majority of subjects. Notably, the assumption of strong sampling may be more widespread amongst adults than children. Rhodes, Gelman, and Brickman (2010) found that diverse evidence affected 5-year-olds' inferences when it was presented by a knowledgeable domain "expert," but not when it was presented by a domain "novice." In contrast, diverse evidence affected adults' inferences in both conditions.
Our results add to a growing body of evidence highlighting the central role of sampling assumptions in determining what characteristics of an argument are deemed relevant to an inductive reasoning problem. For instance, when introducing the relevance theory perspective on inductive reasoning, Medin et al. (2003) demonstrated a premise nonmonotonicity effect, in which adding premises that share a distinctive relation (e.g., adding the premise black bears to grizzly bears) weakened belief that the premise properties generalized to a conclusion category (mammals). By casting this in an explicitly Bayesian framework, Ransom et al. (2016) showed that this effect arises naturally from a strong sampling assumption, and can be reversed when learners are encouraged to adopt a weak sampling perspective. A similar effect of sampling assumptions was found when learners were presented with combinations of positive and negative evidence (Voorspoels et al., 2015). Whether considering the quantity of evidence (Ransom et al., 2016), the kind of evidence (Voorspoels et al., 2015), or, as we show here, the diversity of evidence, the inferences people make are highly dependent on their beliefs about the sampling mechanisms involved.
This study highlights that category-based induction, like other tasks that involve drawing conclusions from data (Gweon, Tenenbaum, & Schulz, 2010; Shafto, Goodman, & Griffiths, 2014), is highly sensitive to sampling assumptions. It also raises questions about the precise sampling assumptions involved. Consistent with many previous studies (e.g., Gweon et al., 2010; Navarro et al., 2012; Ransom et al., 2016), we framed the question as one of "strong" versus "weak" sampling. In many other studies, however, the key difference is characterized as a contrast between "helpful" (or pedagogical) and "random" sampling (e.g., Shafto et al., 2014; Voorspoels et al., 2015), suggesting that the social context is critical to these effects. Although there are some contexts where the distinction between strong and helpful sampling leads to different kinds of inferences (e.g., Navarro et al., 2012), the distinction is not crucial for understanding the diversity effect. More generally, the current work highlights a need to investigate how learners' beliefs about evidence generation and transmission affect the range of other inductive phenomena (see Hayes & Heit, 2018, for a review) that have been central to building theories of category-based inference.
Constraints on generality
Our work shows that the diversity effect in property induction depends, in part, on an assumption that the evidence presented in the experiment (i.e., the argument premises) was not selected randomly. Our target population for this work was adult reasoners. Because the diversity effect has been replicated in adult samples from a range of cultural backgrounds (e.g., United States, Belgium, Australia, China, Korea; see Choi, Nisbett, & Smith 1997; Medin et al., 2003), we expect that our results will have considerable cross-cultural generality. A constraint on generality is that we only examined diversity using categories and properties drawn from the domain of biology. It remains to be shown whether our results extend to reasoning about other domains (e.g., artifacts, social categories). Within the biological domain, we assume that our results apply to people with a modest amount of knowledge about biological kinds. However, they most likely do not apply to those with expert domain knowledge—who often do not show diversity effects when reasoning about objects within their area of expertise (e.g., Shafto & Coley, 2003).
Bayes factors were calculated using a mixed-effects Bayesian ANOVA, conducted using the BayesFactor package in R with default Cauchy priors.
This research was supported by Australian Research Discovery Grant DP150101094 to the first author.
- Fernbach, P. M. (2006). Sampling assumptions and the size principle in property induction. In R. Sun & N. Miyake (Eds.), Proceedings of the 28th annual conference of the Cognitive Science Society (pp. 1287–1292). Austin, TX: Cognitive Science Society.
- Grice, H. P. (1975). Logic and conversation. In P. Cole & J. L. Morgan (Eds.), Syntax and semantics (Vol. 3, pp. 41–58). New York, NY: Academic Press.
- Hayes, B. K., & Heit, E. (2018). Inductive reasoning 2.0. Wiley Interdisciplinary Reviews: Cognitive Science, 9(3), 1–13.
- Heit, E., Hahn, U., & Feeney, A. (2005). Defending diversity. In W.-k. Ahn, R. L. Goldstone, B. C. Love, A. B. Markman, & P. Wolff (Eds.), Categorization inside and outside of the laboratory: Essays in honor of Douglas Medin (pp. 87–99). Washington, DC: American Psychological Association.
- Hempel, C. G. (1966). Philosophy of natural science. Englewood Cliffs, NJ: Prentice Hall.
- Ransom, K., Hendrickson, A. T., Perfors, A., & Navarro, D. J. (2018). Representational and sampling assumptions drive individual differences in single category generalization. In T. T. Rogers, M. Rau, X. Zhu, & C. W. Kalish (Eds.), Proceedings of the 40th annual conference of the Cognitive Science Society (pp. 931–935). Austin, TX: Cognitive Science Society.
- Sanjana, N. E., & Tenenbaum, J. B. (2003). Bayesian models of inductive generalization. In M. I. Jordan, Y. LeCun, & S. A. Solla (Eds.), Advances in neural information processing systems (pp. 59–66). Cambridge, MA: MIT Press.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.