# Probability, Statistics, and Related Methods

Boris L. Milman

## Abstract

This chapter briefly considers the probabilistic and statistical methods used for identification. The basic premise is that many phenomena and procedures in qualitative analysis are probabilistic in nature. The probability of yes/no responses in target detection is described by the binomial distribution. Quantities required for identification, such as retention times in chromatography, wavelengths and frequencies in optical spectroscopy, masses in mass spectrometry, and intensities (heights, areas) of analytical signals, are treated as normally distributed (or t-distributed). The parameters of these distributions enter the calculations built into detection and identification procedures. Multivariate statistics, as used in chemometrics, is essential for the classification/authentication of samples, i.e., qualitative analysis II. Bayesian statistics takes into account the prior probability that an analyte is present in a sample.
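The binomial and Bayesian ideas mentioned above can be illustrated with a minimal Python sketch. It is not taken from the chapter: the function names `binomial_pmf` and `posterior_present`, and the example sensitivity, false positive rate, and prior, are assumptions chosen only for illustration.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k 'yes' responses in n independent
    detection trials, each giving 'yes' with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def posterior_present(prior: float, sensitivity: float,
                      false_positive_rate: float) -> float:
    """Bayes' theorem: probability that the analyte is present,
    given a positive response.

    prior               -- assumed P(analyte present) before the test
    sensitivity         -- assumed P(positive | present)
    false_positive_rate -- assumed P(positive | absent)
    """
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence


if __name__ == "__main__":
    # Hypothetical numbers: 10 replicate detections, each 'yes' with p = 0.9.
    print(binomial_pmf(10, 10, 0.9))
    # Hypothetical rare analyte: a low prior strongly limits the posterior.
    print(posterior_present(prior=0.01, sensitivity=0.95,
                            false_positive_rate=0.05))
```

With these assumed numbers, a positive response for an analyte with a 1% prior still leaves the probability of presence well below one half, which is why the prior matters in the Bayesian treatment.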

The second part of the chapter considers setting up, testing, and screening hypotheses, the core processes of qualitative analysis. The simplest hypotheses are those for a detection operation, e.g., ‘$H_0$: an analyte is absent in the sample’. In identification, analogous hypotheses are set up and tested: ‘$H_0$: the analyte is compound A’ and ‘$\overline{H}_0$: the analyte is not compound A’. Identification hypotheses are transformed into experimental and statistical hypotheses, which are accepted or rejected on the basis of corresponding criteria, both range/tolerance criteria and statistical criteria. False acceptance or rejection of a hypothesis leads to a false positive or false negative result of identification or detection, and the probability of such a result can be estimated.
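As a rough illustration of testing an identification hypothesis, the sketch below applies a two-sided statistical criterion to a single measured quantity such as a retention time. It is an assumption-laden example, not the chapter's procedure: it assumes a normally distributed measurement with known standard deviation `sigma`, and the names `accept_identity` and `false_positive_probability` are hypothetical.

```python
from statistics import NormalDist


def accept_identity(measured: float, reference: float,
                    sigma: float, alpha: float = 0.05) -> bool:
    """Accept H0 ('the analyte is compound A') if the measured value lies
    inside the two-sided (1 - alpha) interval around A's reference value.
    By construction, a true H0 is falsely rejected with probability alpha
    (a false negative)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return abs(measured - reference) <= z_crit * sigma


def false_positive_probability(reference: float, other: float,
                               sigma: float, alpha: float = 0.05) -> float:
    """Probability that a different compound, whose true value is `other`,
    falls inside the acceptance window of compound A (a false positive)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    dist = NormalDist(mu=other, sigma=sigma)
    half_width = z_crit * sigma
    return dist.cdf(reference + half_width) - dist.cdf(reference - half_width)


if __name__ == "__main__":
    # Hypothetical retention times in minutes, assumed sigma = 0.02 min.
    print(accept_identity(10.02, 10.00, sigma=0.02))
    print(false_positive_probability(10.00, 10.05, sigma=0.02))
```

The closer the interfering compound's value lies to the reference, the larger the false positive probability, so narrowing the window lowers false positives at the cost of more false negatives.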

## Keywords

Null Hypothesis, Analytical Signal, Prior Probability, Statistical Hypothesis, Identification Result
