Introduction: Tracing the History of a Discipline Through Quantitative and Qualitative Analyses of Scientific Literature

Abstract

The chapters of this book are concerned with learning about the evolution of ideas (theories, concepts, methods, and application domains) and the history of a discipline by means of the temporal evolution of word occurrences in papers published by scientific journals. The work carried out for each of the areas involved in the project (philosophy, sociology, psychology, linguistics, statistics) pursued several objectives: to obtain a first overview of the relationship between time and contents in order to observe latent temporal patterns; to identify relevant keywords; to cluster keywords portraying similar temporal patterns; to identify latent dynamics of cluster keywords; and to identify relevant topics as groups of related words. The contributions identified and analysed the main subject matters that, at the time of publication, were considered relevant by mainstream journals, and they offer new viewpoints from which to read and understand the evolution of a discipline. The interdisciplinary debate triggered by this research is innovative because quantitative methods for text analysis have been applied in areas of the human and social sciences that are traditionally studied through qualitative approaches. It also represents a positive experience, since new paths have been explored by pooling the qualitative and quantitative research methods, traditions, and expertise of different disciplines.

Acknowledgements

To the members of the research team and co-authors of this book, which I had the honour to lead and coordinate, go all my respect and gratitude for having chosen to follow me in this challenging adventure and to join the small group of brave researchers who for some time shared my interest in this matter. I would like to recognize the open minds of our most senior colleagues, and their vision and desire to get involved on truly exceptional, unfamiliar terrain, and I am very satisfied with the work of my younger colleagues for the desire to learn which they have shown, and for the great enthusiasm that they dedicated to the project and for having become the real “research engine” of the group.

Author information

Corresponding author

Correspondence to Arjuna Tuzzi.

Appendix

1.1.1 A Brief Overview on Correspondence Analysis

Correspondence Analysis (CA) is an exploratory data analysis (EDA) technique that has proven useful in studying the joint distribution of two (or more) categorical variables. CA portrays the structure of association between the variables by means of simple plots that position the categories of the variables on a plane.

The quantitative perspective adopted by the contributions of this volume is based on words and word counts, i.e. on the observation of occurrences of relevant keywords over time. In this perspective, CA can be used for content mapping, as it represents the system of relationships among years (e.g. volumes of the journals), among words (e.g. relevant keywords), and between years and words. Although CA cannot describe all the relevant linguistic features of a set of texts, it helps to highlight latent patterns. In our case, for example, it makes it possible to verify whether the volumes of a journal show a clear temporal pattern in their main contents.

In the simplest version, CA works on a two-way contingency table in which the rows represent keywords (e.g. m word-types w1, …, wm) and the columns represent the volumes of the journal (e.g. p time-points t1, …, tp). Each cell of this (lexical) contingency table contains the number nij of occurrences of the i-th keyword (the i-th row) in the volume published at the j-th time-point (the j-th column) (Table 1.1).

Table 1.1 Example of (lexical) contingency table words × time-points
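
The distance formulas in this appendix rely on the marginal totals of the table. Assuming the usual dot notation for the margins of a contingency table (our reading of the symbols, stated here only for convenience), they can be written as:

\( {n}_{i.}=\sum \limits_{j=1}^p{n}_{ij},\quad {n}_{.j}=\sum \limits_{i=1}^m{n}_{ij},\quad n=\sum \limits_{i=1}^m\sum \limits_{j=1}^p{n}_{ij} \)

Under this notation, the profile of word wi is the vector of relative frequencies nij/ni. across the p time-points, and the profile of time-point tj is the vector of relative frequencies nij/n.j across the m words.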

CA provides the best simultaneous representation of row profiles and column profiles on each axis (and on each plane generated by a pair of axes). The purpose of CA is to translate the similarities between categories (words and volumes) into a graph in which the most similar categories are placed close to each other in the space defined by the Cartesian axes. For words, it is fairly intuitive that the similarity between two words depends on how much the occurrences in the corresponding two rows of the table “resemble each other”, that is, how similar the words are in terms of presence, absence, or frequency in the journal volumes: if two words tend to be used in the same volumes and with similar relative frequency, they have a similar profile over time. Two words with an identical profile have zero distance between them, that is, they are represented on the graph as two overlapping points.

The intuitive notion of similarity between the profiles of two words wi and wk is translated into a distance (chi-square distance) that can be calculated for each pair of words:

\( {d}_{ik}^2=\sum \limits_{j=1}^p\frac{n}{n_{.j}}{\left(\frac{n_{ij}}{n_{i.}}-\frac{n_{kj}}{n_{k.}}\right)}^2 \)
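
As a minimal sketch of the formula above, the following Python function computes the squared chi-square distance between two row profiles. The function name and the small table of counts are hypothetical and serve only as an illustration; the table is assumed to have no empty rows or columns.

```python
def chi_square_row_distance(table, i, k):
    """Squared chi-square distance between the profiles of rows i and k."""
    n = sum(sum(row) for row in table)               # grand total n
    col_totals = [sum(col) for col in zip(*table)]   # column totals n_.j
    n_i, n_k = sum(table[i]), sum(table[k])          # row totals n_i. and n_k.
    return sum(
        (n / col_totals[j]) * (table[i][j] / n_i - table[k][j] / n_k) ** 2
        for j in range(len(col_totals))
    )


# Hypothetical counts: 3 words (rows) observed at 3 time-points (columns).
counts = [
    [4, 2, 0],  # word A
    [8, 4, 0],  # word B: twice as frequent as A, but with the same profile
    [0, 1, 5],  # word C: concentrated at the last time-point
]

d_ab = chi_square_row_distance(counts, 0, 1)  # identical profiles
d_ac = chi_square_row_distance(counts, 0, 2)  # very different profiles
print(d_ab, d_ac)
```

Words A and B differ in absolute frequency but not in profile, so their distance vanishes, which matches the remark that words with identical profiles overlap on the graph.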

All the reasoning can be repeated by taking into consideration the similarity between pairs of volumes and considering the profiles of the two columns. Two volumes of the journal (time-points tj and tk) resemble each other if they have a similar lexical profile, i.e. if they include the same words with a similar relative frequency (Fig. 1.1).

Fig. 1.1 Profiles in terms of relative frequencies and positions on the plane of three time-points

The distance between two time-points tj and tk is given as:

\( {d}_{jk}^2=\sum \limits_{i=1}^m\frac{n}{n_{i.}}{\left(\frac{n_{ij}}{n_{.j}}-\frac{n_{ik}}{n_{.k}}\right)}^2 \)

From another viewpoint, the rows and the columns of this matrix are considered as vectors, i.e. as points in a multidimensional space, and the distance between two vectors is measured through a weighted Euclidean distance that compares the corresponding lexical profiles taking into account the size of the subcorpora (volumes) at each time-point and the occurrences of each word in the corpus as a whole.

Following the calculation of the pairwise distances for words and for volumes, the next step is to transform the space generated by the original variables into a Euclidean space generated by new orthogonal variables (components or axes). The multidimensional space generated by the matrix is reduced to orthogonal dimensions (axes) that are displayed as Cartesian axes. The number of dimensions of this new space (i.e. the number of orthogonal axes) is equal to the number of linearly independent variables (the rank of the matrix), which in our context is the number of time-points minus one (p − 1; more generally, min(m, p) − 1).
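
The weighted Euclidean reading can be checked numerically. The sketch below, on a hypothetical table of counts, rescales each coordinate of a column profile by the square root of the corresponding word mass ni./n and verifies that the ordinary squared Euclidean distance between the rescaled profiles equals the chi-square distance between the two time-points. All names are illustrative.

```python
import math

# Hypothetical counts: rows are words, columns are time-points.
counts = [
    [4, 2, 0],
    [8, 4, 1],
    [0, 1, 5],
]

n = sum(map(sum, counts))                         # grand total n
row_totals = [sum(row) for row in counts]         # word totals n_i.
col_totals = [sum(col) for col in zip(*counts)]   # time-point totals n_.j


def column_profile(j):
    """Relative frequencies of all words at time-point j."""
    return [counts[i][j] / col_totals[j] for i in range(len(counts))]


def chi_square_col_distance(j, k):
    """Squared chi-square distance between time-points j and k."""
    pj, pk = column_profile(j), column_profile(k)
    return sum(
        (n / row_totals[i]) * (pj[i] - pk[i]) ** 2 for i in range(len(counts))
    )


def rescaled_profile(j):
    """Column profile with each coordinate divided by sqrt(n_i. / n)."""
    return [
        x / math.sqrt(row_totals[i] / n) for i, x in enumerate(column_profile(j))
    ]


d2 = chi_square_col_distance(0, 2)
euclid2 = sum(
    (a - b) ** 2 for a, b in zip(rescaled_profile(0), rescaled_profile(2))
)
print(d2, euclid2)  # the two values agree up to rounding error
```

The rescaling gives less weight to differences on very frequent words, so a few dominant words do not drive the comparison between volumes.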

The starting points of this transformation are the m × m square matrix containing the pairwise distances between words and the p × p square matrix containing the pairwise distances between volumes. The coordinates on each axis are computed by means of the singular value decomposition (SVD). The orthogonal factorial axes are sorted according to the amount of inertia they account for (i.e. according to the degree of association), so they are in order of relevance: the first axis is the most important and accounts for the largest portion of the information contained in the contingency table, the second accounts for the largest portion of the information not explained by the first, and so on. The Cartesian plane constructed with the first two factorial axes is the two-dimensional space that best represents the structure of association shown in the contingency table.
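
As a hedged sketch of how the SVD step is commonly set up (a standard textbook formulation, not necessarily the exact computation used in this volume), let N be the m × p table of counts, r and c the vectors of row and column masses, and Dr and Dc the diagonal matrices built from them. The labels N, P, r, c, S, F, and G are our own:

\( P=\frac{1}{n}N,\quad {r}_i=\frac{n_{i.}}{n},\quad {c}_j=\frac{n_{.j}}{n},\quad S={D}_r^{-1/2}\left(P-r{c}^{\top}\right){D}_c^{-1/2}=U\Sigma {V}^{\top } \)

\( F={D}_r^{-1/2}U\Sigma, \quad G={D}_c^{-1/2}V\Sigma, \quad \sum \limits_k{\sigma}_k^2=\frac{\chi^2}{n} \)

In this formulation, F and G hold the coordinates of the words and of the time-points on the factorial axes, and the squared singular values are the inertias by which the axes are ranked. They add up to the total inertia, i.e. the chi-square statistic of the table divided by n.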

Unlike other analyses that start from a cases × variables matrix, in CA the contingency table can be read in two ways: as m row vectors in the (p − 1)-dimensional space generated by the columns, i.e. m words in the space of p time-points (volumes), and as p column vectors in the (m − 1)-dimensional space generated by the rows, i.e. p time-points in the space of m words. This observation makes it immediately possible to obtain two separate graphs: one with the words and one with the volumes. Because of the geometric properties of the two spaces (duality), the dimensions are the same and the two graphs can be overlaid. This makes it possible to observe the system of relations between all the categories in play, although we must be very careful in interpreting the joint graphical representation of the two variables. To briefly summarize how to read the graphs obtained from CA, we should remember that the position of a word or a volume has a role only in the overall context of the graph, i.e. it does not have any meaning by itself, but it does have meaning in comparison with the positions of all the other points in the solution with respect to the barycentre at the origin of the axes. If two words are close on the graph, they have similar profiles and, analogously, if two volumes are close, they have similar lexical profiles. The mutual position of a word and a volume cannot be evaluated directly and must be assessed with reference to the positions of all the other elements. In this sense, it is useful to refer to the quadrants of the Cartesian plane, and the proximity can be evaluated by taking into account the angles that the points form with the axes (the more similar these angles are, the more the word and the volume can be considered associated).

The words or the volumes that contributed the most to the solution and which, therefore, can be considered the most important in the reconstructed context of the graph, are those that are distant from the origin of the axes. A concentration of categories (modalities) in an area of the graph that stands out from the rest as a cluster might be interpreted as a semantic area, and for this purpose one often chooses to partition the points into clusters. The clusters of words or volumes should be as homogeneous as possible within each group and as heterogeneous as possible between groups. In the analysis of the lexical contingency table, a cluster analysis based on the CA groups together the volumes on the basis of their lexical similarity (which is usually also visible in terms of proximity of the points on the graph).

1.1.2 An Example

To understand how CA works, an example based on a very simplified fictional corpus may be useful. Suppose we have 11 texts that cover the topics of a statistics journal and constitute a small text corpus:

  • text01 regression analysis; linear regression

  • text02 regression model; linear and non-linear model

  • text03 generalized linear model; parameter estimation

  • text04 sampling methods; random sampling; survey design and sampling methods

  • text05 survey design; finite populations

  • text06 methods for sampling elusive populations

  • text07 Normal distribution

  • text08 z-scores and Normal distribution

  • text09 Gamma distribution

  • text10 p-value: Normal distribution and Gamma—exponential family

  • text11 regression analysis; Normal distribution

There are 53 word-tokens and 25 word-types in the corpus. Taking into account only the words that occur at least twice, namely “distribution” (5 occurrences); “and”, “linear”, “Normal”, “regression”, and “sampling” (4 each); “methods” and “model” (3 each); and “analysis”, “design”, “Gamma”, “populations”, and “survey” (2 each), we can construct a contingency table words × texts (Table 1.2), in which we see, for example, that the word “survey” occurs once in each of texts 04 and 05.

Table 1.2 Contingency table words × texts
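
The table can be rebuilt from the 11 texts with a few lines of Python. This is only a sketch under a stated assumption: “non-linear” is split into “non” and “linear”, while “z-scores” and “p-value” are kept whole, because this tokenization reproduces the counts quoted above (53 word-tokens, 25 word-types, 4 occurrences of “linear”). The variable names are illustrative.

```python
from collections import Counter

# The 11 fictional texts, with punctuation removed.
# Assumption: "non-linear" is written as "non linear" (two tokens), while
# "z-scores" and "p-value" stay whole; this matches the counts in the text.
texts = {
    "text01": "regression analysis linear regression",
    "text02": "regression model linear and non linear model",
    "text03": "generalized linear model parameter estimation",
    "text04": "sampling methods random sampling survey design and sampling methods",
    "text05": "survey design finite populations",
    "text06": "methods for sampling elusive populations",
    "text07": "Normal distribution",
    "text08": "z-scores and Normal distribution",
    "text09": "Gamma distribution",
    "text10": "p-value Normal distribution and Gamma exponential family",
    "text11": "regression analysis Normal distribution",
}

tokens = {name: body.split() for name, body in texts.items()}
totals = Counter(tok for toks in tokens.values() for tok in toks)

# Keep the words that occur at least twice, most frequent first.
kept = sorted(
    (w for w, c in totals.items() if c >= 2),
    key=lambda w: (-totals[w], w.lower()),
)

# Rows are words, columns are texts: table[word][text] = occurrences.
table = {w: {name: toks.count(w) for name, toks in tokens.items()} for w in kept}

print(sum(totals.values()), "word-tokens,", len(totals), "word-types")
print("word".ljust(13) + " ".join(name[-2:] for name in texts))
for w in kept:
    print(w.ljust(13) + " ".join(str(table[w][name]).rjust(2) for name in texts))
```

Each printed row is the count profile of one word across the 11 texts, and each column is the lexical profile of one text.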

The CA of the contingency table (13 words × 11 texts) results in 10 factorial axes, i.e. min(13, 11) − 1. The first two axes collect 55% of the information (explained inertia) and the first factorial plane is shown in Fig. 1.2.
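
For readers who wish to retrace the computation, the following sketch runs a CA on the words × texts table with NumPy, using the common SVD of standardized residuals. It is an illustration, not the software used for the chapter; it assumes the same tokenization as the counts quoted above (“non-linear” split into two tokens), and the printed share of inertia can be compared with the value reported in the text.

```python
import numpy as np

# The 11 fictional texts (assumption: "non-linear" counted as "non" + "linear").
texts = [
    "regression analysis linear regression",
    "regression model linear and non linear model",
    "generalized linear model parameter estimation",
    "sampling methods random sampling survey design and sampling methods",
    "survey design finite populations",
    "methods for sampling elusive populations",
    "Normal distribution",
    "z-scores and Normal distribution",
    "Gamma distribution",
    "p-value Normal distribution and Gamma exponential family",
    "regression analysis Normal distribution",
]
tokens = [t.split() for t in texts]
vocab = sorted({w for toks in tokens for w in toks})
full = np.array([[toks.count(w) for toks in tokens] for w in vocab], dtype=float)
N = full[full.sum(axis=1) >= 2]       # keep words with frequency >= 2

P = N / N.sum()                       # relative frequencies
r = P.sum(axis=1)                     # row (word) masses
c = P.sum(axis=0)                     # column (text) masses
expected = np.outer(r, c)
S = (P - expected) / np.sqrt(expected)   # standardized residuals

U, sv, Vt = np.linalg.svd(S, full_matrices=False)
keep = sv > 1e-10                     # drop the trivial null dimension(s)
inertia = sv[keep] ** 2               # inertia of each factorial axis
share = inertia / inertia.sum()

# Principal coordinates of words (rows) and texts (columns).
word_coords = (U[:, keep] * sv[keep]) / np.sqrt(r)[:, None]
text_coords = (Vt.T[:, keep] * sv[keep]) / np.sqrt(c)[:, None]

print("number of axes:", inertia.size)
print("share of inertia on the first plane:", round(float(share[:2].sum()), 3))
```

The first two columns of `word_coords` and `text_coords` are the coordinates that would be plotted on the first factorial plane; the orientation of each axis is arbitrary, so a plot may appear mirrored with respect to a published figure.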

Fig. 1.2 First plane of correspondence analysis. Visualization of texts (a) and of both texts and words with frequency ≥2 (b)

Figure 1.2 clearly shows the three latent patterns present in the texts, which refer to linear model (regression, analysis), sampling methods (survey design, populations), and distribution (Normal, Gamma). Texts 01, 02, and 03 are found together in the area of linear model (second quadrant, upper left), while texts 07, 08, 09, and 10 are in the area of distribution (third quadrant, bottom left). Text 11 lies between the linear model and distribution areas because it includes both topics. Texts 04, 05, and 06 are in the area of sampling methods (first quadrant, upper right). It is interesting to note that the conjunction “and” is found near the origin of the axes because it is used in different contexts (though slightly more often in the texts about distributions).

Copyright information

© 2018 Springer Nature Switzerland AG

Cite this chapter

Tuzzi, A. (2018). Introduction: Tracing the History of a Discipline Through Quantitative and Qualitative Analyses of Scientific Literature. In: Tuzzi, A. (eds) Tracing the Life Cycle of Ideas in the Humanities and Social Sciences. Quantitative Methods in the Humanities and Social Sciences. Springer, Cham. https://doi.org/10.1007/978-3-319-97064-6_1
