Abstract
Before turning to multivariate summarization, this chapter first considers the concept of a feature and its summarization into histograms, density functions, and centers. Two perspectives, the probabilistic and the vector-space one, are used to define the concepts of feature center and spread. Current views on the types of measurement scales are also described, leading to the conclusion that binary scales are both quantitative and categorical. The core of the chapter describes principal component analysis (PCA) as a method for fitting a data-driven data summarization model. The model assumes that the data entries are, up to errors, (sums of) products of hidden factor scores and feature loadings. Together with the least-squares fitting criterion, this turns out to be equivalent to finding part of what is known in mathematics as the singular value decomposition (SVD) of a rectangular matrix. Three applications of the method are described: (1) scoring hidden aggregate factors, (2) visualization of the data, and (3) Latent Semantic Indexing. The conventional, and equivalent, formulation of PCA via the eigenvalues of covariance matrices is also described. The main difference between the two formulations is that the property of principal components to be linear combinations of features is postulated in the conventional approach but derived in the SVD-based one. The issue of interpreting the results is discussed as well. A novel, promising approach based on a postulated linear model of stratification is presented via a project. The issue of data standardization in data summarization problems, which remains unsolved, is discussed at length at the beginning. A powerful application that uses eigenvectors to score node importance in networks and pair comparison matrices, the Google PageRank approach, is also described.
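The relation between the factor model, the SVD, and the covariance-based formulation can be illustrated with a small numerical sketch. It is not taken from the chapter: the function name `principal_components`, the random test data, and the choice of mean-centering as the only standardization step are assumptions made for illustration.

```python
import numpy as np

def principal_components(X, k=1):
    """Fit k principal components to X by SVD (hypothetical helper).

    Returns the centered data Y, factor scores Z (N x k),
    feature loadings C (V x k), and singular values mu (k,).
    """
    # Assumption: centering each feature is the only standardization applied.
    Y = X - X.mean(axis=0)
    # Thin SVD: Y = U @ diag(s) @ Vt, singular values in descending order.
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return Y, U[:, :k], Vt[:k].T, s[:k]

# Illustrative random data: 50 entities, 4 correlated features.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))

Y, Z, C, mu = principal_components(X, k=2)

# Model: each entry is, up to error, a sum of products mu_t * z_it * c_vt.
Y_hat = (Z * mu) @ C.T
residual = ((Y - Y_hat) ** 2).sum()

# Least-squares property: the unexplained scatter equals the total
# scatter minus the sum of the squared singular values retained.
scatter = (Y ** 2).sum()
explained_share = (mu ** 2).sum() / scatter

# Conventional formulation: eigenvalues of Y^T Y (the covariance matrix
# up to a constant factor) are the squared singular values.
eigvals = np.linalg.eigvalsh(Y.T @ Y)[::-1]

# Derived, not postulated: scores are linear combinations of features.
Z_from_features = Y @ C / mu

print(f"explained share of data scatter: {explained_share:.3f}")
```

In this sketch the factor scores come out as linear combinations of the centered features (`Z_from_features` coincides with `Z`), which is the property that the SVD-based formulation derives and the conventional formulation postulates.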
© 2019 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Mirkin, B. (2019). Quantitative Summarization. In: Core Data Analysis: Summarization, Correlation, and Visualization. Undergraduate Topics in Computer Science. Springer, Cham. https://doi.org/10.1007/978-3-030-00271-8_2
DOI: https://doi.org/10.1007/978-3-030-00271-8_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00270-1
Online ISBN: 978-3-030-00271-8
eBook Packages: Computer Science, Computer Science (R0)