Conclusions

KNNimpute is a fast, robust, and accurate method of estimating missing values for microarray data. Both KNNimpute and SVDimpute far surpass the currently accepted solutions (filling missing values with zeros or with the row average) because they exploit the structure of microarray data to estimate missing expression values.

We recommend KNNimpute over SVDimpute for several reasons. First, KNNimpute is more robust than SVDimpute to the type of data for which estimation is performed, performing better on non-time series or noisy data. Second, while both methods are robust to an increasing fraction of missing data, KNN-based imputation shows less deterioration in performance as the percentage of missing entries grows. Third, KNNimpute is less sensitive to the exact parameter used (the number of nearest neighbors), whereas the SVD-based method shows sharp deterioration in performance when a non-optimal fraction of eigengenes is used. In addition, KNNimpute has the advantage of providing accurate estimation for missing values in genes that belong to small, tight expression clusters. Such genes may not be similar to any of the eigengenes used for regression in SVDimpute, and their missing values could thus be estimated poorly by SVD-based estimation.
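The neighbor-based idea discussed above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `knn_impute`, the use of `None` to mark missing entries, the root-mean-square distance over jointly observed experiments, and the inverse-distance weighting are all assumptions made for illustration.

```python
import math


def knn_impute(matrix, k=2):
    """Return a copy of `matrix` with missing entries (None) estimated.

    Rows are genes, columns are experiments. Each missing entry is
    estimated from the k most similar genes that have a value in that
    experiment (hypothetical sketch, not the published implementation).
    """
    filled = [row[:] for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for other_index, other in enumerate(matrix):
                # A neighbor must be a different gene with experiment j observed.
                if other_index == i or other[j] is None:
                    continue
                # Compare only experiments observed in both genes.
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                # Assumption: RMS difference, so overlap size does not bias distance.
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                candidates.append((dist, other[j]))
            if not candidates:
                continue  # no usable neighbor: leave the entry missing
            candidates.sort(key=lambda pair: pair[0])
            nearest = candidates[:k]
            # Assumption: inverse-distance weights; small constant avoids division by zero.
            weights = [1.0 / (dist + 1e-9) for dist, _ in nearest]
            total = sum(w * v for w, (_, v) in zip(weights, nearest))
            filled[i][j] = total / sum(weights)
    return filled


expression = [
    [1.0, 2.0, 3.0],
    [1.0, 2.0, None],
    [10.0, 20.0, 30.0],
]
estimated = knn_impute(expression, k=1)
```

In this toy matrix the second gene matches the first gene exactly in the two experiments they share, so with `k=1` its missing entry is estimated from the first gene alone and comes out at approximately 3.0. A gene in a small, tight cluster is handled the same way: only its few close neighbors contribute, regardless of the global structure of the data.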

KNNimpute is a robust and sensitive approach to estimating missing data for gene expression microarrays. However, scientists should exercise caution when drawing critical biological conclusions from partially imputed data. The goal of this and other estimation methods is to provide an accurate way of estimating missing data points in order to minimize the bias introduced in the performance of microarray analysis methods. Estimated data should be flagged where possible, and their impact on the discovery of biological results should be assessed in order to avoid drawing unwarranted conclusions.
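One way to flag estimated data is to keep a Boolean mask alongside the filled matrix. The sketch below is illustrative only: the names `impute_with_flags`, `row_mean`, and `imputed_fraction` are hypothetical, and the row-average estimator is used purely as a simple placeholder for whichever estimation method is actually applied.

```python
def impute_with_flags(matrix, estimate):
    """Fill None entries using `estimate(matrix, i, j)` and record which were filled.

    Returns (filled, flags), where flags[i][j] is True only for entries
    that were estimated rather than measured.
    """
    filled = [row[:] for row in matrix]
    flags = [[False] * len(row) for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is None:
                guess = estimate(matrix, i, j)
                if guess is not None:
                    filled[i][j] = guess
                    flags[i][j] = True
    return filled, flags


def row_mean(matrix, i, j):
    """Placeholder estimator: average of the observed values in row i."""
    observed = [v for v in matrix[i] if v is not None]
    return sum(observed) / len(observed) if observed else None


def imputed_fraction(flags):
    """Fraction of estimated entries per gene (row)."""
    return [sum(row) / len(row) if row else 0.0 for row in flags]


data = [
    [1.0, None, 3.0],
    [4.0, 5.0, 6.0],
]
filled, flags = impute_with_flags(data, row_mean)
fractions = imputed_fraction(flags)
```

With the mask in hand, a downstream analysis can be repeated after excluding genes whose `imputed_fraction` exceeds a chosen threshold, which gives a rough check of how much a result depends on estimated values.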

Parts of the work presented in this chapter were originally published in Bioinformatics (Troyanskaya et al., 2001).

References

  • Alizadeh A.A., Eisen M.B., Davis R.E., Ma C., Lossos I.S., Rosenwald A., Boldrick J.C., Sabet H., Tran T., Yu X., Powell J.I., Yang L., Marti G.E., Moore T., Hudson J., Jr., Lu L., Lewis D.B., Tibshirani R., Sherlock G., Chan W.C., Greiner T.C., Weisenburger D.D., Armitage J.O., Warnke R., Staudt L.M. et al. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403: 503–11.

  • Alter O., Brown P.O., Botstein D. (2000). Singular value decomposition for genome-wide expression data processing and modeling. Proc Natl Acad Sci USA 97: 10101–6.

  • Anderson T.W. (1984). An introduction to multivariate statistical analysis. Wiley, New York.

  • Bar-Joseph Z., Gerber G., Gifford D.K., Jaakkola T.S., Simon I. (2002). A new approach to analyzing gene expression time series data. Proceedings of the Sixth Annual International Conference on Computational Biology (RECOMB), Washington DC, USA, ACM Press.

  • Brown M.P., Grundy W.N., Lin D., Cristianini N., Sugnet C.W., Furey T.S., Ares M., Jr., Haussler D. (2000). Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc Natl Acad Sci USA 97: 262–7.

  • Butte A.J., Ye J. et al. (2001). Determining significant fold differences in gene expression analysis. Pacific Symposium on Biocomputing 6: 6–17.

  • Chu S., DeRisi J., Eisen M., Mulholland J., Botstein D., Brown P.O., Herskowitz I. (1998). The transcriptional program of sporulation in budding yeast. Science 282: 699–705.

  • DeRisi J.L., Iyer V.R., Brown P.O. (1997). Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 278: 680–6.

  • Eisen M.B., Spellman P.T., Brown P.O., Botstein D. (1998). Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 95: 14863–8.

  • Gasch A.P., Spellman P.T., Kao C.M., Carmel-Harel O., Eisen M.B., Storz G., Botstein D., Brown P.O. (2000). Genomic expression programs in the response of yeast cells to environmental changes. Mol Biol Cell, in press.

  • Golub G.H., Van Loan C.F. (1996). Matrix computations. Johns Hopkins University Press, Baltimore.

  • Golub T.R., Slonim D.K., Tamayo P., Huard C., Gaasenbeek M., Mesirov J.P., Coller H., Loh M.L., Downing J.R., Caligiuri M.A., Bloomfield C.D., Lander E.S. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286: 531–7.

  • Hastie T., Tibshirani R., Eisen M., Alizadeh A., Levy R., Staudt L., Chan W.C., Botstein D., Brown P. (2000). “Gene shaving” as a method for identifying distinct sets of genes with similar expression patterns. Genome Biol 1: research0003.1-research0003.21.

  • Heyer L.J., Kruglyak S., Yooseph S. (1999). Exploring expression data: identification and analysis of coexpressed genes. Genome Res 9: 1106–15.

  • Little R.J.A., Rubin D.B. (1987). Statistical analysis with missing data. Wiley, New York.

  • Loh W., Vanichsetakul N. (1988). Tree-structured classification via generalized discriminant analysis. Journal of the American Statistical Association 83: 715–725.

  • Perou C.M., Sorlie T., Eisen M.B., van de Rijn M., Jeffrey S.S, Rees C.A., Pollack J.R., Ross D.T., Johnsen H., Akslen L.A., Fluge O., Pergamenschikov A., Williams C., Zhu S.X., Lonning P.E., Borresen-Dale A.L., Brown P.O., Botstein D. (2000). Molecular portraits of human breast tumours. Nature 406: 747–52.

  • Raychaudhuri S., Stuart J.M., Altman R.B. (2000). Principal components analysis to summarize microarray experiments: application to sporulation time series. Pacific Symposium on Biocomputing: 455–66.

  • Spellman P.T., Sherlock G., Zhang M.Q., Iyer V.R., Anders K., Eisen M.B., Brown P.O., Botstein D., Futcher B. (1998). Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell 9: 3273–97.

  • Tamayo P., Slonim D., Mesirov J., Zhu Q., Kitareewan S., Dmitrovsky E., Lander E.S., Golub T.R. (1999). Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci USA 96: 2907–12.

  • Troyanskaya O., Cantor M., Sherlock G., Brown P., Hastie T., Tibshirani R., Botstein D., Altman R.B. (2001). Missing value estimation methods for DNA microarrays. Bioinformatics 17(6): 520–5.

  • Wilkinson G.N. (1958). Estimation of missing values for the analysis of incomplete data. Biometrics 14: 257–286.

  • Yates F. (1933). The analysis of replicated experiments when the field results are incomplete. Emp. J. Exp. Agric. 1: 129–142.

Copyright information

© 2003 Kluwer Academic Publishers

About this chapter

Cite this chapter

Troyanskaya, O.G., Botstein, D., Altman, R.B. (2003). Missing Value Estimation. In: Berrar, D.P., Dubitzky, W., Granzow, M. (eds) A Practical Approach to Microarray Data Analysis. Springer, Boston, MA. https://doi.org/10.1007/0-306-47815-3_3

  • DOI: https://doi.org/10.1007/0-306-47815-3_3

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-1-4020-7260-4

  • Online ISBN: 978-0-306-47815-4

  • eBook Packages: Springer Book Archive
