Classification of Acute Leukemia Based on DNA Microarray Gene Expressions Using Partial Least Squares

  • Danh V. Nguyen
  • David M. Rocke


Analysis of microarray data that starts from raw gene expression intensities typically proceeds in two main steps. First, pre-process the data by rescaling and standardizing so that overall intensities for each array are equivalent. Second, apply statistical methodologies to answer the scientific questions of interest. In this paper, for the pre-processing step, we introduce a thresholding algorithm for rescaling each array. The second step involves statistical classification and dimension reduction methodologies. For this we introduce the method of partial least squares (PLS) and apply it to the leukemia microarray data set of Golub et al. (1999). We also discuss the use of principal components analysis (PCA), quadratic discriminant analysis (QDA) and logistic discrimination (LD). Finally, we discuss other potential applications of PLS to gene expression data: prediction of a target gene, prediction of the reaction in cell lines, assessment of patient survival, and generalisation to the prediction of multiple classes.
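As a rough illustration of the second step, the sketch below reduces many gene columns to a few PLS components and then applies logistic discrimination to those components. It is a minimal sketch under stated assumptions, not the authors' implementation: the data are simulated rather than taken from Golub et al. (1999), the helper names (`pls1_fit`, `pls1_transform`, `fit_logistic`, `predict_logistic`, `simulate`) are hypothetical, PLS is computed with a textbook single-response NIPALS-style recursion, and the logistic model is fitted by plain gradient descent with a small penalty. The thresholding pre-processing step is not shown.

```python
import numpy as np


def pls1_fit(X, y, n_components):
    """Single-response PLS: return the training mean and a projection matrix R."""
    x_mean = X.mean(axis=0)
    E = X - x_mean                       # residual (deflated) predictor matrix
    yc = y - y.mean()
    p = X.shape[1]
    W = np.zeros((p, n_components))      # weight vectors
    P = np.zeros((p, n_components))      # loading vectors
    for k in range(n_components):
        w = E.T @ yc                     # direction of maximal covariance with y
        w /= np.linalg.norm(w)
        t = E @ w                        # scores of the k-th PLS component
        load = E.T @ t / (t @ t)
        E = E - np.outer(t, load)        # deflate X before the next component
        W[:, k], P[:, k] = w, load
    R = W @ np.linalg.inv(P.T @ W)       # (X - x_mean) @ R reproduces the scores
    return x_mean, R


def pls1_transform(X, x_mean, R):
    """Project samples onto the fitted PLS components."""
    return (X - x_mean) @ R


def fit_logistic(T, y, lr=1.0, n_iter=500, l2=1e-2):
    """Gradient descent on the penalised logistic log-loss (intercept penalised too, for brevity)."""
    A = np.column_stack([np.ones(len(T)), T])
    beta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-A @ beta))
        beta -= lr * (A.T @ (prob - y) / len(y) + l2 * beta)
    return beta


def predict_logistic(T, beta):
    """Predict class 1 when the fitted linear predictor is positive."""
    A = np.column_stack([np.ones(len(T)), T])
    return (A @ beta > 0).astype(int)


def simulate(n, p, n_informative, shift, rng):
    """Simulated expression matrix: a few columns differ in mean between two classes."""
    y = np.arange(n) % 2
    X = rng.normal(size=(n, p))
    X[:, :n_informative] += shift * y[:, None]
    return X, y.astype(float)


# Simulated stand-in for an expression matrix: 40 samples, 500 "genes".
rng = np.random.default_rng(0)
X_train, y_train = simulate(40, 500, 20, 1.5, rng)

x_mean, R = pls1_fit(X_train, y_train, n_components=2)
T_train = pls1_transform(X_train, x_mean, R)
scale = T_train.std(axis=0)              # standardise scores before the logistic fit
beta = fit_logistic(T_train / scale, y_train)
train_acc = np.mean(predict_logistic(T_train / scale, beta) == y_train)
print("training accuracy on 2 PLS components:", train_acc)
```

Training accuracy is optimistic when the number of genes far exceeds the number of samples, so in practice the components and the classifier would be assessed on held-out samples or by cross-validation, with the PLS fit repeated inside each fold.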

Key words

Dimension reduction; Logistic discrimination; Prediction; Quadratic discriminant analysis



References

  1. Alon et al. (1999), “Patterns of Gene Expression Revealed by Clustering Analysis of Tumor and Normal Colon Tissues Probed by Oligonucleotide Arrays,” Proceedings of the National Academy of Sciences, 96, 6745–6750.
  2. Alizadeh et al. (2000), “Distinct Types of Diffuse Large B-Cell Lymphoma Identified by Gene Expression Profiling,” Nature, 403, 503–511.
  3. Bittner et al. (2000), “Molecular Classification of Cutaneous Malignant Melanoma by Gene Expression Profiling,” Nature, 406, 536–540.
  4. de Jong, S. (1993), “SIMPLS: An Alternative Approach to Partial Least Squares Regression,” Chemometrics and Intelligent Laboratory Systems, 18, 251–263.
  5. Dudoit, S., Fridlyand, J., Speed, T.P. (2000), “Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data,” Technical Report #576, Department of Statistics, U. C. Berkeley.
  6. Flury, B. (1997), A First Course in Multivariate Analysis. Springer-Verlag, New York.
  7. Frank, I.E., and Friedman, J.H. (1993), “A Statistical View of Some Chemometric Regression Tools” (with discussion), Technometrics, 35, 109–148.
  8. Garthwaite, P.H. (1994), “An Interpretation of Partial Least Squares,” Journal of the American Statistical Association, 89, 122–127.
  9. Geladi, P., and Kowalski, B.R. (1986), “Partial Least Squares Regression: Tutorial,” Analytica Chimica Acta, 185, 1–17.
  10. Golub et al. (1999), “Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring,” Science, 286, 531–537.
  11. Hand, D.J. (1981), Discrimination and Classification. John Wiley & Sons, Chichester, England.
  12. Hand, D.J. (1997), Construction and Assessment of Classification Rules. John Wiley & Sons, Chichester, England.
  13. Helland, I.S. (1988), “On the Structure of Partial Least Squares,” Communications in Statistics-Simulation and Computation, 17, 581–607.
  14. Helland, I.S., and Almoy, T. (1994), “Comparison of Prediction Methods When Only a Few Components are Relevant,” Journal of the American Statistical Association, 89, 583–591.
  15. Hoskuldsson, A. (1988), “PLS Regression Methods,” Journal of Chemometrics, 2, 211–228.
  16. Johnson, R.A. and Wichern, D.W. (1992), Applied Multivariate Analysis. Prentice-Hall, New Jersey, 4th edition.
  17. Jolliffe, I.T. (1986), Principal Component Analysis. Springer-Verlag, New York.
  18. Lorber, A., Wangen, L.E., and Kowalski, B.R. (1987), “A Theoretical Foundation for the PLS Algorithm,” Journal of Chemometrics, 1, 19–31.
  19. Mardia, K.V., Kent, J.T., and Bibby, J.M. (1979), Multivariate Analysis. Academic Press, London.
  20. Martens, H. and Naes, T. (1989), Multivariate Calibration. John Wiley & Sons, New York.
  21. Massey, W.F. (1965), “Principal Components Regression in Exploratory Statistical Research,” Journal of the American Statistical Association, 60, 234–246.
  22. Nguyen, D.V. and Rocke, D.M. (2000), “Classification in High Dimension with Application to DNA Microarray Data,” manuscript.
  23. Nguyen, D.V. and Rocke, D.M. (2001), “Tumor Classification by Partial Least Squares Using Microarray Gene Expression Data,” to appear in Bioinformatics.
  24. Nguyen, D.V. and Rocke, D.M. (2001b), “Partial Least Squares Proportional Hazard Regression for Application to DNA Microarray Data,” manuscript.
  25. Nguyen, D.V. and Rocke, D.M. (2001c), “Multi-Class Cancer Classification Via Partial Least Squares Using Gene Expression Profiles,” manuscript.
  26. Perou et al. (2000), “Molecular Portrait of Human Breast Tumors,” Nature, 406, 747–752.
  27. Perou et al. (1999), “Distinctive Gene Expression Patterns in Human Mammary Epithelial Cells and Breast Cancer,” Proceedings of the National Academy of Sciences, USA, 96, 9212–9217.
  28. Phatak, A., Reilly, P.M., and Penlidis, A. (1992), “The Geometry of 2-Block Partial Least Squares,” Communications in Statistics-Theory and Methods, 21, 1517–1553.
  29. Press, S.J. (1982), Applied Multivariate Analysis: Using Bayesian and Frequentist Methods of Inference. Robert E. Krieger Publishing Company Inc., Malabar, Florida, 2nd edition.
  30. Rocke, D.M. and Durbin, B. (2000), “A Model for Measurement Error for Gene Expression Arrays,” to appear in Journal of Computational Biology.
  31. Ross et al. (2000), “Systematic Variation in Gene Expression Patterns in Human Cancer Cell Lines,” Nature Genetics, 24, 227–235.
  32. Scherf et al. (2000), “A Gene Expression Database for the Molecular Pharmacology of Cancer,” Nature Genetics, 24, 236–244.
  33. Stone, M., and Brooks, R. J. (1990), “Continuum Regression: Cross-validated Sequentially Constructed Prediction Embracing Ordinary Least Squares, Partial Least Squares, and Principal Components Regression” (with discussion), Journal of the Royal Statistical Society, Series B, 52, 237–269.

Copyright information

© Springer Science+Business Media New York 2002

Authors and Affiliations

  • Danh V. Nguyen (1)
  • David M. Rocke (1, 2)

  1. Center for Image Processing and Integrated Computing, University of California, Davis, USA
  2. Department of Applied Science, University of California, Davis, USA