Introduction

When we think of "cat", we think of an animal that has a tail and fur and is kept as a pet. These properties are how the brain stores and represents the concept "cat" in memory; expressed in language, they are called the semantic features of the concept. Memory and language are important high-level cognitive functions in humans, and researchers in many fields, including psychology, neuroscience, and linguistics, have long studied their underlying mechanisms. According to the conceptual structure account (Taylor, Devereux, & Tyler, 2011; Tyler & Moss, 2001), concepts can be described by their semantic features (e.g., "is hairy", "is an animal") and by statistical measures computed over those features, which are called conceptual structure statistics. Conceptual structure statistics provide information about the parent category of a concept (for example, a cat is an animal) and about how distinctive a concept is within that category (for example, the camel is distinguished by its hump, which other animals do not have).

Concepts and their semantic features are important contents of memory. Knowledge about each attribute of a concept has been found to be represented in a distributed fashion, in the brain areas that process the corresponding modality (Binder, Desai, Graves, & Conant, 2009; Fernandino et al., 2016; Lambon Ralph, Jefferies, Patterson, & Rogers, 2017; Pulvermuller, 2013). For instance, visual attributes (such as shape and color) are represented in the ventral visual processing pathway, including the occipital cortex, the ventral occipito-temporal cortex, the lateral occipital complex, and the fusiform gyri. Sound attributes (such as the sound made by an object) are associated with activation of sound processing areas, including the superior temporal gyrus and the left ventrolateral prefrontal cortex (vlPFC). Whether a concept can be manipulated is represented in motion processing areas, including the junction of the left posterior middle temporal gyrus (pMTG) and the anterior occipital cortex (Fernandino et al., 2016; Lambon Ralph et al., 2017). Moreover, the human brain has two conceptual representation systems, one based on sensation and one based on language (Wang, Men, Gao, Caramazza, & Bi, 2020). When people recall information, they also search through the semantic network. Therefore, words that are more closely connected semantically with other words (that is, share greater semantic similarity with them) are easier to remember (Xie, Bainbridge, Inati, Baker, & Zaghloul, 2020). However, it is still not clear how the human brain represents the relationship between concepts.

Several computational approaches construct semantic spaces to represent the relationship between concepts. The word co-occurrence approach infers the relationship between words from how frequently they occur together in text (Van Rensbergen, De Deyne, & Storms, 2016). The word2vec algorithm uses a neural network model to learn word associations from a large text corpus and represents each distinct word as a vector of numbers, so that a mathematical function (such as cosine similarity) can be used to calculate the semantic similarity between words (Mikolov, Chen, Corrado, & Dean, 2013). In addition, several semantic databases have been established. WordNet is a lexical database of semantic relations between words in more than 200 languages, which links words through relations including synonymy, hyponymy, and meronymy (George, 1995). HowNet is an online common-sense knowledge base that puts more emphasis than WordNet on the relationships between concepts (including inter-conceptual and inter-attribute relationships) (Dong, Dong, & Hao, 2006). These semantic networks, built from large-scale Internet data, have contributed to many application fields of artificial intelligence, such as natural language processing (Skelac & Jandrić, 2020), information retrieval (Leydesdorff & Vaughan, 2006), and intelligent question answering (Yilmaz & Toklu, 2020), and they also provide rich data and modeling support for research on how the human brain represents the relationships between pieces of knowledge (Armeni, Willems, & Frank, 2017).

However, in research that investigates the neural mechanisms of knowledge representation in humans, researchers more often choose data evaluated directly by humans as experimental materials (De Deyne, Navarro, Perfors, Brysbaert, & Storms, 2019; Jouravlev & Mcrae, 2016; Scott, Keitel, Becirspahic, Yao, & Sereno, 2019). Although collecting such data is more time-consuming than generating norms computationally, the semantic space reflected by these materials may be the closest to the representation of knowledge in the human brain, so it is more widely used in psychology and neuroscience. Two methods are generally used to investigate semantic representation in humans. One is to give a word as a cue and let participants freely associate other words, which yields the relationships between concepts (De Deyne et al., 2019); an example is the English Small World of Words project (SWOW-EN; see https://smallworldofwords.org). This word association approach reveals mental representations that cannot be reduced to lexical usage patterns. The other method measures the relationship between concepts based on the features they share: the more features two concepts share, the higher their similarity (Cree & McRae, 2003; Griffiths, Steyvers, & Tenenbaum, 2007; Jones & Mewhort, 2007; Vigliocco, Vinson, Lewis, & Garrett, 2004). Generally, studies on semantic memory ask participants to list the features of concepts, producing what are called feature production norms (Toglia, 2009), and then investigate how the concepts are represented in memory according to the listed features. Although much related research has been done, most researchers collected concepts and features on their own instead of using a normed dataset (Ashcraft, 1978; Collins & Loftus, 1988; Mcnorgan, Kotack, Meehan, & Mcrae, 2007; Mcwilliams & Schmitter-Edgecombe, 2008; Toglia, 2009). Because the selected concepts and participants differ, such self-normed studies cannot be compared with each other directly. Therefore, the limited availability of stimulus materials greatly restricts related research.

The study of knowledge representation in the human brain needs to be based on a standardized, large-scale conceptual semantic feature database. At present, several English conceptual semantic feature datasets have been established. The earliest was established by McRae, Cree, Seidenberg, and McNorgan (2005), who asked participants to describe 541 nouns, including animate objects (such as crocodiles) and inanimate objects (such as airplanes). Subsequently, Vinson and Vigliocco (2008) added verbs and collected semantic features for 456 nouns and verbs. Buchanan, Holmes, Teasley, and Hutchison (2013) collected 1,808 nouns, verbs, and adjectives. More recently, Buchanan, Valentine, and Maxwell (2019) released the latest and largest conceptual semantic feature dataset to date, containing 4,436 concepts, including nouns, verbs, adjectives, and other types of vocabulary. All of these English datasets adopted a free description method: given a word, participants described the features of the concept without any hints and wrote down whatever features they thought of.

Devereux, Tyler, Geertzen, and Randall (2014) added concepts to the dataset established by McRae et al. (2005), expanding it to 638 concepts. Instead of letting participants describe freely, they prompted them with guiding words such as "is", "has", "does", and "made of". Participants filled in the corresponding features after each guiding word. Based on their results, they concluded that collecting data with guiding words yielded more information than the free description method, and they designed a web system that could expand the dataset continuously.

Although English conceptual semantic feature datasets are few, Chinese datasets are even scarcer. Currently, only two Chinese datasets have been established. One, established by Xiang, Lin, and Jiang (2015), contains only 50 nouns. The other, established by Gao, Lin, Jiang, and Lu (2016), contains only 30 verbs. Both used the free description method. Work on Chinese datasets therefore lags far behind that on English datasets.

Languages differ greatly from one another, and in the study of memory, people from different cultural backgrounds also differ greatly in how they encode concepts. First, different categories of concepts have different typical representatives. For example, for the concept “bird”, Chinese people are more familiar with the sparrow, while North Americans may first think of the robin. Second, the familiarity of a concept differs across cultural backgrounds. For example, “food” is a familiar superior category to everyone, but its subordinate categories (various foods) vary greatly among cultures, and people tend to be familiar mainly with the food of their own culture (“粽子-zongzi” is familiar to Chinese people but not to Americans). Finally, people from different cultural backgrounds may encode the same concept differently, such as “the cross (十字架)”. Therefore, studies of the Chinese language system, and of the language, memory, and other cognitive functions of native Chinese speakers, need to use a Chinese conceptual semantic feature dataset as stimulus material rather than an English one. Given the limitations of the existing Chinese datasets, it is necessary to establish a Chinese conceptual semantic feature dataset to support research on language and memory in Chinese, which is the main purpose of this study.

The present study collected semantic features of 1,410 concepts in Chinese by using guiding words. We provide the features of each concept (including the frequency of each feature) and calculated the similarity between every pair of concepts. Concepts were further grouped into categories and compared between categories. Hierarchical clustering was then used to examine the relationship between concepts. We also compared this dataset with a widely used English conceptual semantic feature dataset (Devereux et al., 2014) to check whether the present data are reasonable and to examine the differences between the Chinese and English versions.

Methods

Participants

Two hundred and four subjects (44 males) participated in this experiment, aged between 18 and 57 (M = 23.495, SD = 4.806). All the participants were Chinese, and their native language was Chinese. We also collected the region of each participant, and the participants came from 25 provinces (116 from the north of China, 88 from the south of China, see Fig. S1 in the “Supplementary materials_1_Figures” for detailed geographical distribution of participants). On average, each participant completed the assessment of 202 concepts. The participants were paid after completing the tasks.

Materials

A total of 1,410 concepts were selected as experimental materials in this study. To cover common concrete concepts as comprehensively as possible, we integrated multiple word datasets: 1) concepts in the dataset created by Devereux et al. (2014) (hereinafter referred to as CSLB); 2) concepts in ImageNet (Deng, Dong, Socher, Li, & Li, 2009); 3) concepts in the open Chinese lexicon (THUOCL) (Han et al., 2016) with word frequency exceeding 10,000 in the IT, finance and economics, place names, historical celebrities, medicine, diet, law, automobile, and animal categories. The English concepts (from CSLB and ImageNet) were translated into Chinese, and duplicated concepts were deleted, which left 11,160 concepts. Three graduate students (all native Chinese speakers) were then asked to rate the familiarity (from 1 to 9, 1 for completely unfamiliar and 9 for very familiar) and concreteness (from 1 to 9) of each concept. Concreteness was defined as whether the concept was perceptible (for example, it can be seen or touched) and easy to imagine (the more concrete a concept is, the easier it is to imagine), so 1 means very abstract, imperceptible, and hard to imagine, while 9 means very specific, perceptible, and easy to imagine. The familiarity and concreteness ratings of each concept are shown in the “Supplementary materials_2_Table1”. We selected the 1,410 concepts with the highest familiarity (M = 8.85, SD = 0.53) and concreteness (M = 8.97, SD = 0.22) scores as the formal experimental materials.

Procedure

Devereux et al. (2014) showed that the guiding word method can collect more comprehensive data than the free description method. Before the formal data collection, we ran a pre-experiment to compare the two methods. Two groups of participants (N = 40) were recruited: one group (n = 20) used the guiding word method, while the other group (n = 20) used the free description method. The instruction for the guiding word method is shown below. The only difference between the two methods was that no guiding words were used in the free description group. We tested 15 concepts. The guiding word group spent more time (15 min) than the free description group (12 min) and provided more features per concept (n = 25.87) than the free description group (n = 17.00). Thus, guiding words may give participants more cues for retrieving features from memory, and the guiding word method was used in the present study.

All data collection was carried out over the Internet. First, the participants were presented with an introduction to the experimental task in simplified Chinese. The instruction was as follows:

Hello, thanks for your participation. You will see some words, and please describe each word, such as its appearance, sound, smell, touch, or how to use it, when and where to use it. You can also describe where it comes from and the feelings it brings to you. Please fill in the corresponding content under each guiding word (is- 是, has- 有, can- 可以, need- 需要, like- 像, ......- 其他), and you can fill in any content you want in the box below “......” (ellipsis).

Note:

  1. Please fill according to your intuition;
  2. Multiple words can be filled under each guiding word, separated by spaces;
  3. Please give all the features or attributes you can think of as comprehensively as possible;
  4. If you don’t know what this word means, or you don’t know what feature it has, you can leave it blank.

Once the participants understood the experimental task, the formal experiment started. The interface shown for each concept is presented in Fig. 1. Participants filled in the blanks with the features of each concept, without a time limit. Following the data collection method of CSLB (Devereux et al., 2014), this study used six guiding words: "is- 是, has- 有, can- 可以, need- 需要, like- 像, ......- 其他". Participants filled in the corresponding features under each guiding word. "……" means "other", under which any features could be filled in. The concepts were presented in a completely random order across participants. Each concept was described by at least 30 participants.

Fig. 1 Schematic diagram of data collection

Data analysis

Preprocessing

For the 1,410 concepts, we obtained a total of 381,119 original descriptions. Each original description was checked individually and, if necessary, preprocessed as follows:

  (1) Correct typos. Because the data were collected electronically, most participants used a pinyin input method, which led to typos such as "香蕉是职务 (banana is a job)" where "香蕉是植物 (banana is a plant)" was intended. (In Chinese, “job- 职务” and “plant- 植物” are both pronounced “zhiwu”.)

  (2) Delete completely wrong descriptions (those that go against common sense), punctuation marks, strange characters, etc. For example, "电灯是食物 (electric light is food)".

  (3) Match the feature description to the correct guiding word. There were multiple guiding words during data collection, and some participants did not fill in features under the appropriate one. For example, "猫像动物 (cats are like animals)" was changed to "猫是动物 (cats are animals)".

  (4) Delete adverbs of degree, such as "very". For example, "热狗是非常香的 (hot dogs are very fragrant)" was changed to "热狗是香的 (hot dogs are fragrant)".

  (5) Unify expressions. To facilitate subsequent analysis, we followed the data preprocessing methods of CSLB (Devereux et al., 2014) and unified the expressions in the checked data. Many features were described in different words with the same actual meaning (synonyms), and these were unified into a single statement. For instance, "是食物 (is food)", "是食品 (is food)", "是吃的 (is for eating)", and "是食用的 (is edible)" were all unified into "是食物 (is food)".

  (6) Feature segmentation. Combined features were split into independent features, following the method used in CSLB (Devereux et al., 2014). Some participants used the form "adjective + noun" to describe concepts, such as "玫瑰是红色的花 (rose is a red flower)", which contains two features and can be split into "玫瑰是红色的 (rose is red)" and "玫瑰是花 (rose is a flower)". Another example is "汽车有四个轮子 (car has four wheels)", which can be split into "汽车有轮子 (car has wheels)" and "汽车有四个轮子 (car has four wheels)".
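Steps (4) and (5) lend themselves to a small illustration. The following Python sketch is only a hypothetical approximation: the preprocessing described here was done by checking each description individually, and the adverb list and synonym table below are invented for the example rather than taken from the study.

```python
# Hypothetical lookup tables (assumptions for illustration only); the actual
# preprocessing relied on manual inspection of every description.
DEGREE_ADVERBS = ["非常", "特别"]
CANONICAL_FORM = {
    "是食品": "是食物",
    "是吃的": "是食物",
    "是食用的": "是食物",
}


def strip_degree_adverbs(feature):
    """Step (4): remove adverbs of degree from a feature description."""
    for adverb in DEGREE_ADVERBS:
        feature = feature.replace(adverb, "")
    return feature


def unify_expression(feature):
    """Step (5): map synonymous descriptions onto one canonical statement."""
    return CANONICAL_FORM.get(feature, feature)


def preprocess(feature):
    """Apply steps (4) and (5) to a single guiding-word + feature string."""
    return unify_expression(strip_degree_adverbs(feature.strip()))


cleaned = [preprocess(f) for f in ["是非常香的", "是吃的", "有轮子"]]
```

Plain substring replacement is a simplification: a short adverb can also occur inside an unrelated word, which is one reason manual checking remains necessary.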

After preprocessing, 378,533 descriptions remained (2,586 fewer than the original descriptions, a reduction of about 0.68%). All subsequent processing and results are based on the preprocessed data.

Within concepts

After preprocessing, we calculated summary information for the Chinese Conceptual semantic Feature Dataset (CCFD), including the number of valid feature descriptions and the average number of features per concept. We then counted, for each concept, its features, the number of participants who gave valid descriptions (after invalid descriptions were deleted), and the frequency of each feature (that is, how many participants mentioned it). In addition, the sample coverage of each concept was calculated (Canessa, Chaigneau, Lagos, & Medina, 2020). The higher the sample coverage, the more complete the set of features. Because the number of participants describing each concept was not the same after preprocessing, we normalized the frequency information of each concept as "(frequency of each feature of the concept / number of participants describing the concept) × 30".
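The normalization formula can be written out as a short function. This is a minimal sketch; the function name, parameter names, and the example counts are illustrative assumptions, and only the formula itself comes from the text.

```python
def normalize_frequencies(feature_counts, n_describers, target_n=30):
    """Rescale raw feature counts to a common base of `target_n` participants.

    Implements (frequency / number of participants describing the concept) * 30.
    """
    if n_describers <= 0:
        raise ValueError("n_describers must be positive")
    return {
        feature: count / n_describers * target_n
        for feature, count in feature_counts.items()
    }


# Made-up counts for one concept described by 32 participants.
raw_counts = {"是动物": 28, "有毛": 16, "可以爬树": 4}
normalized = normalize_frequencies(raw_counts, n_describers=32)
```

A feature mentioned by 16 of 32 participants is thus treated as if 15 of 30 participants had mentioned it, which makes frequencies comparable across concepts with different numbers of valid describers.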

Between concepts

In this study, we investigated the features common to multiple concepts (shared features) and the features specific to certain concepts (unique features). We calculated the sharing degree of each feature, that is, the number of concepts that have this feature. Features belonging to only one or two concepts were defined as unique features, and features possessed by three or more concepts were defined as shared features; these criteria are the same as in CSLB (Devereux et al., 2014). We then investigated the similarity between concepts, which was a key point of this study. The 1,410 concepts have a total of 10,059 features; thus, each concept has a 10,059-dimensional feature vector in which each element is the number of participants who described the corresponding feature. Principal component analysis (PCA) was used to center the feature vectors and reduce their dimension to 645, which explained 95% of the variance, and cosine similarity was then used to calculate the similarity between concepts.
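The sharing-degree classification and the similarity computation can be sketched on toy data. The concepts, features, and counts below are invented for illustration. The sketch centers the feature vectors and computes cosine similarity directly; because a full-rank PCA is a centering followed by an orthogonal rotation, this gives the same cosine values as the untruncated PCA space, and only the truncation to 645 components is omitted here.

```python
import math
from collections import Counter

# Toy concept-by-feature counts (made-up numbers, not CCFD data).
concept_features = {
    "cat":     {"is_animal": 30, "has_fur": 25, "has_tail": 20},
    "dog":     {"is_animal": 29, "has_fur": 22, "has_tail": 18, "can_bark": 27},
    "sparrow": {"is_animal": 28, "has_wings": 26, "can_fly": 24},
    "car":     {"has_wheels": 30, "can_drive": 21},
}

# Sharing degree: the number of concepts in which each feature occurs.
sharing_degree = Counter(f for feats in concept_features.values() for f in feats)
# Threshold from the text: one or two concepts = unique, three or more = shared.
unique_features = {f for f, n in sharing_degree.items() if n <= 2}
shared_features = {f for f, n in sharing_degree.items() if n >= 3}

# Dense feature vectors in a fixed feature order.
feature_order = sorted(sharing_degree)
vectors = {
    c: [feats.get(f, 0) for f in feature_order]
    for c, feats in concept_features.items()
}

# Center every feature dimension on its mean across concepts.
n_concepts = len(vectors)
means = [
    sum(v[i] for v in vectors.values()) / n_concepts
    for i in range(len(feature_order))
]
centered = {c: [x - m for x, m in zip(v, means)] for c, v in vectors.items()}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


sim_cat_dog = cosine_similarity(centered["cat"], centered["dog"])
sim_cat_car = cosine_similarity(centered["cat"], centered["car"])
```

Centering is what allows negative similarities: two concepts whose features deviate from the average in opposite directions, such as "cat" and "car" in this toy example, end up with a cosine below zero.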

Between categories

We investigated whether the concepts differed within and between categories. There is no uniform standard for the classification of concepts. Some studies on patients with brain injury have found difficulties in identifying specific categories of concepts, namely selective semantic impairment (Gainotti, 2000, 2005; Gainotti, Spinelli, Scaricamazza, & Marra, 2013). For example, in the vast majority of patients with semantic disorders who have bilateral lesions in the mid-to-anterior ventral and medial temporal regions, the impairment is mainly restricted to the animal domain and does not extend to plants or artificial objects (Gainotti, 2010), while damage to the left medial and inferior temporal regions hinders the recognition of plants (Samson & Pillon, 2003). In addition to these categories of living things (animals, plants) and non-living things (artificial objects, natural objects), body parts are usually treated as a separate category (Guido, 2015). Based on these selective semantic impairment studies, we identified five superior categories: animal, artificial object, body part, plant, and natural object. During the actual classification, we found that food did not fit any of these categories (it could be sorted into artificial object, but many plants are also food), so we made it a separate category. Although transportation could be classified directly as artificial object, it can also be further divided into vehicles, boats, and aircraft. Therefore, for the convenience of subsequent classification, we also took transportation as a superior category. As a result, all concepts were manually divided into seven superior categories: animal, transportation, artificial object, body part, food, plant, and natural object. To distinguish more detailed conceptual categories, we manually divided the seven superior categories into 28 subcategories (see Table 4 for details; “Supplementary materials_2_Table1” shows the category of each concept). The classifications into both superior categories and subcategories were determined by discussion among the authors.

Moreover, the Ward minimum variance algorithm was used for hierarchical clustering. Unlike the manual categorization by domain, this is a data-driven approach based on the features of the concepts. We compared the results of the two classifications.
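To make the Ward criterion concrete, the following is a naive pure-Python sketch on made-up two-dimensional points. It is not the implementation used in the study, which is not specified here; in practice a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` with `method="ward"` would typically be used on the concept feature vectors.

```python
def ward_clustering(points, n_clusters):
    """Naive agglomerative clustering with Ward's minimum variance criterion.

    Repeatedly merges the pair of clusters whose union gives the smallest
    increase in within-cluster sum of squares:
        cost(A, B) = |A| * |B| / (|A| + |B|) * ||mean(A) - mean(B)||^2
    """
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    # Each cluster is stored as (member indices, centroid).
    clusters = [([i], list(p)) for i, p in enumerate(points)]

    def merge_cost(a, b):
        (members_a, centroid_a), (members_b, centroid_b) = a, b
        size_a, size_b = len(members_a), len(members_b)
        squared_dist = sum((x - y) ** 2 for x, y in zip(centroid_a, centroid_b))
        return size_a * size_b / (size_a + size_b) * squared_dist

    while len(clusters) > n_clusters:
        # Find the cheapest pair of clusters to merge.
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda pair: merge_cost(clusters[pair[0]], clusters[pair[1]]),
        )
        members_i, centroid_i = clusters[i]
        members_j, centroid_j = clusters[j]
        size_i, size_j = len(members_i), len(members_j)
        # The merged centroid is the size-weighted mean of the two centroids.
        merged_centroid = [
            (size_i * x + size_j * y) / (size_i + size_j)
            for x, y in zip(centroid_i, centroid_j)
        ]
        # Delete j first (j > i) so that index i stays valid.
        del clusters[j]
        clusters[i] = (members_i + members_j, merged_centroid)

    return [sorted(members) for members, _ in clusters]


# Toy 2-D points (made up): two tight groups far apart.
toy_points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9)]
two_clusters = ward_clustering(toy_points, n_clusters=2)
```

Recording the order of merges instead of stopping at a fixed number of clusters would give the full hierarchy (dendrogram); the sketch stops early only to keep the example short.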

Studies have shown that different categories of objects have different numbers of shared features. For example, animals have more shared features (such as having eyes, ears, and a nose), while tools have fewer shared features but more unique features (Clarke & Tyler, 2015). Humans use different brain regions to recognize different objects. Recognizing objects with more shared features requires more participation of the lateral posterior fusiform gyrus, while recognizing objects with fewer shared features relies more on the medial posterior fusiform gyrus (Tyler et al., 2013). Concepts that share more features and are prone to confusion, such as 海象 (walrus) and 海狮 (sea lion), are difficult to distinguish and require more involvement of the entorhinal cortex (Bruffaerts et al., 2013). We compared the number of total features, the number (and proportion) of shared features, and the number (and proportion) of unique features between subcategories and between superior categories. Because the subcategories were nested within the superior categories, we used the general linear model to analyze the nested data.

Results

Within concepts

In CCFD, there are 378,533 valid descriptions for 1,410 concepts, with an average of 268 descriptions per concept and an average of nine descriptions per concept from each participant. The sample coverage of each concept is presented in the “Supplementary material_2_Table1”, and the average sample coverage is 68.84%. The features of each concept and the frequency of each feature are presented in the “Supplementary material_3_Table2_All concepts and features”. Features with a frequency of 1 were deleted because they are not universal, following the processing method of CSLB (Devereux et al., 2014). Table 1 shows the result for the concept "熊猫 (panda)" as an example. The average numbers of features per concept in CCFD and CSLB are shown in Table 2.

Table 1 The features and the corresponding frequency of the concept “panda” as an example
Table 2 Dataset overview of CCFD and CSLB (Devereux et al., 2014)

Between concepts

In this study, the concepts and features generated by normalized results were used to form a concept feature matrix (see the “Supplementary material_4_Table3_Concept feature matrix” for details). The 1,410 concepts have a total of 10,059 features. On average, each feature is shared by 5.17 (SD = 22.71) concepts. Table 2 shows the average number of shared features and unique features for each concept of both CCFD and CSLB. We divided the number of shared features according to the sharing degree, as shown in Table 3.

Table 3 The number of features with different sharing degree

The similarity results between concepts are presented in the “Supplementary material_5_Table4_Concept similarity matrix”. The concepts used in this study partially overlap with those of CSLB. The overlapping concepts cover a total of 19 subcategories. We randomly selected six concepts from each subcategory to calculate the similarity between concepts, as shown in Fig. 2. Figure 2a shows the result for CCFD, and Fig. 2b shows the result for CSLB. We calculated the similarities within and between superior categories for both CCFD and CSLB (see Fig. S2 in “Supplementary materials_1_Figures” for details). Compared with CSLB, CCFD had higher average intra-category similarity (CCFD: 0.26 > CSLB: 0.23) and lower average inter-category similarity (CCFD: −0.04 < CSLB: 0.03). The cosine measure was also used to compare CCFD and CSLB: the correlation between the two similarity matrices, based on off-diagonal pairs, was 58.29% (including all the shared concepts).
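The comparison of intra-category and inter-category similarity can be illustrated with a small sketch. It assumes that the averages are taken over all concept pairs, which may differ from the exact averaging scheme used for the reported values; the concepts, categories, and similarity values are invented.

```python
from itertools import combinations


def mean_within_between(similarity, category):
    """Average similarity over concept pairs within vs. between categories.

    `similarity` maps frozenset({a, b}) to the similarity of concepts a and b;
    `category` maps each concept to its category label.
    """
    within, between = [], []
    for a, b in combinations(sorted(category), 2):
        value = similarity[frozenset((a, b))]
        (within if category[a] == category[b] else between).append(value)
    return sum(within) / len(within), sum(between) / len(between)


# Made-up similarities for four concepts in two categories.
toy_category = {
    "cat": "animal", "dog": "animal",
    "car": "transportation", "bus": "transportation",
}
toy_similarity = {
    frozenset(("cat", "dog")): 0.6,
    frozenset(("car", "bus")): 0.5,
    frozenset(("cat", "car")): -0.1,
    frozenset(("cat", "bus")): 0.0,
    frozenset(("dog", "car")): 0.1,
    frozenset(("dog", "bus")): 0.0,
}
intra, inter = mean_within_between(toy_similarity, toy_category)
```

A larger gap between the two averages indicates that the feature vectors separate the categories more cleanly.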

Fig. 2 Similarity matrix of randomly selected concepts among 19 subcategories in CCFD (a) and CSLB (b). Each cell in the picture represents the similarity between two concepts. The concepts from the same subcategory (six randomly selected concepts from each subcategory) gathered together, and the text on the left and bottom of the picture represents the name of the subcategory (black words). The concepts from the same superior category also gathered together. The upper and right sides of the picture represent the names of the superior categories (colored words)

Between categories

We also hierarchically clustered all concepts based on their features. The results are detailed in the “Supplementary materials_6_Table5_Results of hierarchical clustering”; for an overall view, see “Supplementary materials_7_Results of hierarchical clustering”. All concepts were grouped into two large clusters, which can be labeled living things and non-living things. Living things were further divided into two clusters: animals and plants. We compared the clustering results with the human category judgments. The “animal” and “plant” categories were the same under the two methods of classification. However, the concepts judged by humans to belong to “transportation, artificial object, body part, food and natural object” were placed in one big cluster by hierarchical clustering. Interestingly, some associations emerged in the hierarchical clustering results. For example, many kinds of vehicles (cars, ships, planes) and many kinds of roads and bridges gathered in one cluster, all of which are related to transportation. The various types of pens and hand-related concepts (hand, arm, finger, palm) also clustered together, reflecting the association that pens are used by hand.

We also compared the hierarchical clustering results between CCFD and CSLB. Because there were too many concepts, it was impossible to visualize them all here. Therefore, we only visualized the hierarchical clustering results of the animal category (Fig. 3). The results showed that CCFD had a similar hierarchical structure in the animal category as CSLB, which could be well divided into 鸟 (bird) (green lines and words), 昆虫 (insect) (red lines and words), 陆生动物 (terrestrial animal) (blue lines and words) and 水生动物 (aquatic animal) (purple lines and words). As we can see from Fig. 3b, some animals that live in the water (such as shark and dolphin) were clustered with terrestrial animals but not with aquatic animals in the CSLB.

Fig. 3 The results of hierarchical clustering of animals. (a) The results of CCFD. (b) The results of CSLB. The concepts chosen by the two are the same.

The number of concepts in each category, the number of shared features, unique features, and total features of the concepts were analyzed in 28 subcategories and seven superior categories, as shown in Table 4.

Table 4 The number of concepts, shared features, unique features, and total features of the concepts in 28 subcategories and seven superior categories

Considering the nested relationship between subcategory and superior category, nested analysis of UNIANOVA was used to examine the difference in the number of total features, number (and proportion) of shared features and unique features of the concepts of different subcategories as well as superior categories. It was found that there were significant differences in the number of shared features (F(21,1382) = 7.173, p < 0.001, η2 = 0.098), proportion of shared features (F(21,1382) = 6.572, p < 0.001, η2 = 0.091), number of unique features (F(21,1382) = 4.253, p < 0.001, η2 = 0.061), proportion of unique features (F(21,1382) = 6.572, p < 0.001, η2 = 0.091) and the total number of features (F(21,1382) = 3.956, p < 0.001, η2 = 0.057) among subcategories. The results also showed significant differences in the number of shared features (F(6,1382) = 72.330, p < 0.001, η2 = 0.239), proportion of shared features (F(6,1382) = 81.682, p < 0.001, η2 = 0.262), number of unique features (F(6,1382) = 51.622, p < 0.001, η2 = 0.183), proportion of unique features (F(6,1382) = 81.682, p < 0.001, η2 = 0.262) and the total number of features (F(6,1382) = 27.817, p < 0.001, η2 = 0.108) among superior categories. Figure 4 shows the proportion of shared features and unique features of each subcategory (a) and superior category (b).

Fig. 4 Proportion of shared features and unique features of each subcategory (a) and superior category (b). The color in this figure is the same as the color index in Fig. 2

Because there are too many subcategories in this study (n = 28), we did not report the post hoc LSD multiple comparison results of subcategory but only reported the results of superior category. For the total number of features, animals ≥ transportation ≥ plants > food ≥ natural objects ≥ artificial objects ≥ body parts (“>” means significant difference, p < 0.05, “≥” means no significant difference). There was no significant difference between animals and transportation (p = 0.078), transportation and plants (p = 0.113), food and natural objects (p = 0.658), food and artificial objects (p = 0.055), natural objects and artificial objects (p = 0.677), artificial objects and body parts (p = 0.481), body parts and food (p = 0.115), body parts and natural objects (p = 0.421), but there were significant differences between the other pairs.

For the proportion of shared features, plants ≥ animals (p = 0.426) > food > transportation > artificial objects > natural objects ≥ body parts (p = 0.143). The results for the proportion of unique features were exactly the reverse, as expected given that the two proportions are complementary. Thus, animals and plants had the highest proportion of shared features and the lowest proportion of unique features, while natural objects and body parts had the highest proportion of unique features and the lowest proportion of shared features.
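The relationship between the two proportions can be illustrated with a minimal Python sketch. It assumes that a feature counts as "shared" when it is listed for more than one concept and as "unique" otherwise; this threshold, the function name, and the toy feature lists are all hypothetical and are not taken from the CCFD. Under this assumption, the two proportions of any concept sum to one, which is why the ordering of categories for unique features mirrors the ordering for shared features.

```python
from collections import Counter

# Hypothetical toy feature lists for illustration; not CCFD data.
concept_features = {
    "cat": {"is an animal", "has fur", "has a tail", "meows"},
    "dog": {"is an animal", "has fur", "has a tail", "barks"},
    "camel": {"is an animal", "has a hump", "lives in the desert"},
}


def feature_proportions(concept_features):
    """Return {concept: (proportion_shared, proportion_unique)}.

    Assumption: a feature is "shared" if more than one concept lists it.
    """
    # Count how many concepts list each feature.
    counts = Counter(f for feats in concept_features.values() for f in feats)
    result = {}
    for concept, feats in concept_features.items():
        shared = sum(1 for f in feats if counts[f] > 1)
        unique = len(feats) - shared
        result[concept] = (shared / len(feats), unique / len(feats))
    return result


proportions = feature_proportions(concept_features)
for concept, (p_shared, p_unique) in proportions.items():
    print(f"{concept}: shared {p_shared:.2f}, unique {p_unique:.2f}")
```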

Discussion

This study established the first Chinese conceptual semantic feature dataset, comprising 1,410 concepts. All the concepts are very familiar to native Chinese speakers. The CCFD covers seven superior categories and 28 subcategories, and each concept has about 37 features. The feature-based similarity analysis showed that concepts within the same category were similar, whereas concepts from different categories were less similar but not completely unrelated, which is consistent with our everyday understanding of objects. Even for objects from different categories, we can still form associations through non-categorical relationships: for example, chopsticks and noodles often appear in the same scene, although the former is tableware and the latter is food.

Although there are several English conceptual semantic feature datasets (Buchanan et al., 2013; Buchanan et al., 2019; Devereux et al., 2014; McRae et al., 2005; Vinson & Vigliocco, 2008), the CCFD proposed in this study is the first large-scale conceptual semantic feature dataset in Chinese. Compared with the existing standardized English datasets, the CCFD is substantial in terms of both data quantity and quality. The biggest difference between the CCFD and the English datasets lies in the specific concepts and features, especially in the category of food. This category contains concepts that people are very familiar with, but familiarity with specific foods differs across cultures: concepts such as 臭豆腐 (stinky tofu), 皮蛋 (preserved eggs) and 冰糖葫芦 (iced sugar gourd) are familiar only to Chinese people. Therefore, the CCFD is more suitable for research in Chinese. By comparing the Chinese and English conceptual feature datasets, researchers can also systematically investigate cultural differences, including differences in the concepts themselves and differences in the features of the same concepts.

By analyzing how frequently each feature is mentioned for a concept, we can see which features are core features and which are marginal features. How to define a concept according to these features is also an important issue for language and cognitive psychologists. Based on their features, concepts can be classified into different categories or even into hierarchical categories. However, how the human brain represents the category of concepts and the hierarchical relationships among concepts remains to be explored. The CCFD can provide data support for such studies. Our study also found that the numbers of shared features and unique features differed across categories of concepts. Natural organisms such as animals and plants have more shared features, while concepts of artificial things such as tools have fewer shared features, which is consistent with previous research results (Clarke & Tyler, 2015).

Much research on memory and language uses the semantic priming paradigm. Researchers manipulate the priming effect by varying the relatedness between prime words and target words. Compared with unrelated cues, related cues enable participants to recognize target words more quickly (Schacter, Dobbins, & Schnyer, 2004). Determining the degree of relatedness between concepts, and then choosing appropriate prime and target words, is therefore key to such research. In this study, the similarity between concepts was calculated from their features. Future research using the semantic priming paradigm can use the present results as one indicator of the relatedness between prime words and target words. In short, the CCFD provides rich data support for the semantic priming paradigm.
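To make the idea of feature-based relatedness concrete, the sketch below computes a similarity score between concepts from their feature lists. It is only an illustration under stated assumptions: cosine similarity over feature production frequencies is used here as a common choice and is not claimed to be the measure used in this study, and all concepts, features, and frequencies are hypothetical toy values rather than CCFD data.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse feature vectors (dicts of feature -> weight)."""
    dot = sum(a[f] * b[f] for f in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical production frequencies (how many participants listed each feature).
concepts = {
    "cat": {"is an animal": 30, "has fur": 25, "meows": 20},
    "dog": {"is an animal": 30, "has fur": 22, "barks": 24},
    "chopsticks": {"is tableware": 28, "found in the kitchen": 10},
    "noodles": {"is a food": 30, "found in the kitchen": 8},
}

# Rank candidate primes for a target word by their similarity to it.
target = "cat"
ranking = sorted(
    ((cosine_similarity(concepts[target], feats), name)
     for name, feats in concepts.items() if name != target),
    reverse=True,
)
for score, name in ranking:
    print(f"{name}: {score:.3f}")
```

In this toy example, a same-category pair (cat, dog) scores much higher than a cross-category pair, while chopsticks and noodles still receive a small non-zero score through a shared contextual feature, in line with the pattern described in the Discussion.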

In addition to the standardization of datasets in language and memory research, computational modeling of memory also needs large-scale datasets to describe semantic memory accurately. The field of artificial intelligence also pays attention to the semantic features of concepts. Researchers try to use a knowledge graph as an abstract carrier to represent the cognitive content of knowledge, so as to serve fields such as machine translation and intelligent question answering (Balaid, Abd Rozan, Hikmi, & Memon, 2016). The number of concepts in a knowledge graph dataset is much larger than that in a human-rated dataset, but the features of these concepts are usually extracted from massive Internet data, and the relationships between them are represented by logical structures. This is different from how the human brain learns and represents concepts. The human brain does not acquire and store all the features of a concept simultaneously, but acquires each feature of the concept at different times in life. Moreover, learning can be divided into fast and slow processes. When newly learned information is consistent with the existing knowledge structure, it can be quickly integrated into the neocortex. When it is inconsistent with existing knowledge, it is first processed over a long period in the medial temporal cortex and hippocampus, and then slowly integrated into the neocortex (Kumaran, Hassabis, & McClelland, 2016). The features of a concept in different modalities are represented in a distributed manner in the neocortex (Binder et al., 2009), such as the image of a dog in the visual cortex, its bark in the auditory cortex, and the semantic representation of the dog's features in the anterior temporal lobe (Wang et al., 2020).
Although the human brain's representation of concepts is not completely accurate, it is highly flexible, which is also an important manifestation of human intelligence. Therefore, although massive datasets exist in the field of artificial intelligence and can also be used to construct a meaningful semantic space, they may not be suitable for studying the representation of knowledge in the human brain. Instead, the way the human brain represents knowledge may guide the establishment of a brain-like knowledge map to better serve the field of artificial intelligence.

We hope the CCFD can provide standardized data to support researchers in linguistics, psychology, neuroscience, artificial intelligence, and other fields. We will continue to expand the content of the CCFD. On the one hand, we will extend data collection to different age groups, such as children of all ages, so as to investigate how concepts are learned in a natural setting; on the other hand, we will expand the number and types of concepts, including verbs and adjectives as well as abstract concepts, to further investigate the cognitive processes of language.