Statistical Analysis and Modeling of Data

Part of the book series: Graduate Texts in Physics (GTP)

Abstract

Outcomes of physical measurements are most often recorded as signals that must be processed statistically in order to infer their overall properties and features before they are compared to theoretical or phenomenological models. This chapter starts with an introduction to the basic statistical techniques for computing the averages and moments of distributions and their uncertainties. Particular emphasis is placed on ways of identifying outliers and on robust estimates of location (averages) and scale (dispersion). We introduce commonly used methods for computing confidence intervals for sample means and variances, comparing the means of samples with equal or different variances, comparing two distributions, and computing correlations. Simple linear, multiple linear, and non-linear regression techniques are explained, again with attention to robust measures. The powerful multivariate methods of principal component analysis, cluster analysis, linear discriminant analysis, and factor analysis are each discussed in a separate section. The illustrations in the Problems include the study of Raman spectra in fabric yarns, the analysis of geyser eruptions, radar reflections in the ionosphere, and the correlation analysis of astrophysical objects.
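
As a rough illustration of what a robust estimate of location and scale can look like, the sketch below compares the mean and standard deviation with the median and the scaled median absolute deviation (MAD) on a small sample containing one outlier. The function name `robust_location_scale`, the sample values, and the choice of the median/MAD pair are assumptions made for this example; they are not taken from the chapter, which may use other estimators.

```python
import statistics

# Factor that makes the MAD a consistent estimate of the standard
# deviation for normally distributed data.
MAD_TO_SIGMA = 1.4826


def robust_location_scale(sample):
    """Return (median, scaled MAD) of a sequence of numbers.

    Hypothetical helper for illustration only.
    """
    values = list(sample)
    med = statistics.median(values)
    # Median of the absolute deviations from the median.
    mad = statistics.median([abs(x - med) for x in values])
    return med, MAD_TO_SIGMA * mad


if __name__ == "__main__":
    # Made-up measurements; the last value plays the role of an outlier.
    data = [9.8, 10.1, 10.0, 9.9, 10.2, 55.0]

    print("mean   =", statistics.mean(data))
    print("stdev  =", statistics.stdev(data))

    location, scale = robust_location_scale(data)
    print("median =", location)
    print("MAD*   =", scale)
```

With this sample the single outlier pulls the mean far from the bulk of the data and inflates the standard deviation, while the median and the scaled MAD stay close to the values suggested by the other five points.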

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Širca, S., Horvat, M. (2012). Statistical Analysis and Modeling of Data. In: Computational Methods for Physicists. Graduate Texts in Physics. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-32478-9_5
