Probability, Statistics, and Related Methods

  • Boris L. Milman


This chapter briefly reviews the probabilistic and statistical methods used for identification. Its basic premise is that many phenomena and procedures in qualitative analysis are probabilistic in nature. The probability of yes/no responses in target detection is described by the binomial distribution. Values of the quantities required for identification, such as retention times in chromatography, wavelengths and frequencies in optical spectroscopy, masses in mass spectrometry, and intensities (heights, areas) of any analytical signals, are treated as normally distributed (or t-distributed). The parameters of these distributions enter the calculations built into detection and identification procedures. Multivariate statistics, as used in chemometrics, is essential for the classification/authentication of samples, i.e., qualitative analysis II. Bayesian statistics takes into account the prior probability that an analyte is present in a sample.
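As an illustration of the two simplest ideas above, the following minimal sketch computes a binomial probability for repeated yes/no detection responses and a Bayesian posterior probability that an analyte is present after a positive response. The function names and all numerical values (prior, sensitivity, false positive rate) are hypothetical and chosen only for illustration; they are not taken from the chapter.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k positive responses in n independent trials,
    each giving a positive response with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def posterior_present(prior, sensitivity, false_positive_rate):
    """Bayes' theorem: P(analyte present | positive response).

    prior               -- P(analyte present) before the measurement
    sensitivity         -- P(positive response | analyte present)
    false_positive_rate -- P(positive response | analyte absent)
    """
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence


# Hypothetical example values (assumptions, not data from the chapter).
p_nine_of_ten = binomial_pmf(9, 10, 0.95)
post = posterior_present(prior=0.01, sensitivity=0.95, false_positive_rate=0.05)
print(f"P(9 of 10 positive) = {p_nine_of_ten:.3f}")
print(f"P(present | positive) = {post:.3f}")
```

With these assumed values the posterior is only about 0.16, which shows why a low prior probability matters: even a fairly selective test yields mostly false positives when the analyte is rarely present.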

The second part of this chapter considers the setting up, testing, and screening of hypotheses, which are the core processes of qualitative analysis. The simplest hypotheses are those for a detection operation, e.g., ‘\( {H_0} \): an analyte is absent in the sample’. In identification, analogous hypotheses are set up and tested: ‘\( {H_0} \): the analyte is compound A’ and ‘\( {\overline H_0} \): the analyte is not compound A’. Identification hypotheses are transformed into experimental and statistical ones, which are accepted or rejected on the basis of corresponding criteria, both range/tolerance and statistical. False acceptance or rejection of a hypothesis leads to false positive or false negative results of identification or detection, the probability of which can be estimated.
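The two kinds of criteria mentioned above can be sketched as follows for a single normally distributed quantity such as a retention time. This is a minimal sketch under stated assumptions, not the chapter's procedure: the measurement standard deviation is assumed known, a two-sided z-test stands in for the statistical criterion, and all function names and numbers are hypothetical.

```python
from statistics import NormalDist


def within_tolerance(measured, reference, tolerance):
    """Range/tolerance criterion: accept H0 ('the analyte is compound A')
    if the measured value lies within reference +/- tolerance."""
    return abs(measured - reference) <= tolerance


def z_test_accepts(measured, reference, sigma, alpha=0.05):
    """Statistical criterion: two-sided z-test with known sigma.
    Returns True if H0 is not rejected at significance level alpha,
    where alpha is the probability of falsely rejecting a true H0
    (a false negative identification)."""
    z = abs(measured - reference) / sigma
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return z <= z_crit


def false_positive_probability(other_mean, reference, sigma, tolerance):
    """Probability that a different compound, whose values are normally
    distributed around other_mean, falls inside the acceptance window
    reference +/- tolerance and is falsely accepted as compound A."""
    dist = NormalDist(mu=other_mean, sigma=sigma)
    return dist.cdf(reference + tolerance) - dist.cdf(reference - tolerance)


# Hypothetical retention times in minutes (assumptions for illustration).
print(within_tolerance(5.02, 5.00, 0.05))
print(z_test_accepts(5.02, 5.00, sigma=0.02))
print(false_positive_probability(5.08, 5.00, sigma=0.02, tolerance=0.05))
```

Widening the tolerance window lowers the chance of falsely rejecting compound A but raises the chance of falsely accepting a neighbouring compound, which is the trade-off between false negative and false positive results.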


Keywords: Null Hypothesis; Analytical Signal; Prior Probability; Statistical Hypothesis; Identification Result



Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  1. D.I. Mendeleyev Inst. for Metrology (VNIIM) and Cent. for Ecol. Saf. of Russ. Acad. of Sciences, St. Petersburg, Russia
