Missing Value Estimation

Troyanskaya, Olga G.; Botstein, David; Altman, Russ B.

doi:10.1007/0-306-47815-3_3

Conclusions

KNNimpute is a fast, robust, and accurate method of estimating missing values for microarray data. Both KNNimpute and SVDimpute far surpass the currently accepted solutions (filling missing values with zeros or with the row average) because they take advantage of the structure of microarray data when estimating missing expression values.
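This section does not restate the algorithm, so the following is only a minimal sketch of KNN-based imputation under stated assumptions: rows are genes, columns are experiments, missing entries are NaN, neighbors are ranked by a root-mean-square Euclidean distance over the columns both genes have measured, and the estimate is an inverse-distance weighted average. The function name `knn_impute`, the parameters `k` and `eps`, and the row-mean fallback are illustrative choices, not the authors' implementation.

```python
import numpy as np

def knn_impute(data, k=10, eps=1e-6):
    """Fill NaNs in a genes x experiments matrix from the k most similar genes."""
    data = np.asarray(data, dtype=float)
    missing = np.isnan(data)
    filled = data.copy()
    for g, j in zip(*np.nonzero(missing)):
        # Columns measured in both gene g and each other gene.
        shared = ~missing[g] & ~missing            # shape (genes, experiments)
        n_shared = shared.sum(axis=1)
        # Candidate neighbors must have a measured value in column j
        # and share at least one measured column with gene g.
        candidates = np.nonzero(~missing[:, j] & (n_shared > 0))[0]
        if candidates.size == 0:
            # Assumed fallback: row average (NaN if the whole row is missing).
            filled[g, j] = np.nanmean(data[g])
            continue
        diff = np.where(shared[candidates], data[candidates] - data[g], 0.0)
        dist = np.sqrt((diff ** 2).sum(axis=1) / n_shared[candidates])
        nearest = np.argsort(dist)[:k]
        # Closer genes contribute more; eps avoids division by zero.
        weights = 1.0 / (dist[nearest] + eps)
        filled[g, j] = np.average(data[candidates[nearest], j], weights=weights)
    return filled
```

Distances are always computed from the original measurements, so one estimated value never feeds into another.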

We recommend KNNimpute over SVDimpute for several reasons. First, KNNimpute is more robust than SVDimpute to the type of data being estimated, and it performs better on non-time-series or noisy data. Second, while both methods are robust to an increasing fraction of missing data, KNN-based imputation deteriorates less as the percentage of missing entries grows. Third, KNNimpute is less sensitive to the exact parameters used (the number of nearest neighbors), whereas the SVD-based method deteriorates sharply when a non-optimal fraction of eigengenes is used. In addition, KNNimpute has the advantage of providing accurate estimates for missing values in genes that belong to small, tight expression clusters. Such genes may not be similar to any of the eigengenes used for regression in SVDimpute, and their missing values could thus be estimated poorly by SVD-based estimation.
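For comparison, the regression on eigengenes mentioned above can be sketched as follows. This is an illustrative reading rather than the authors' code: it assumes missing entries are NaN, starts from row averages, takes the leading right singular vectors as eigengenes, regresses each incomplete gene on them by least squares, and repeats until the estimates stop changing. The names `svd_impute`, `n_eigengenes`, `max_iter`, and `tol` are hypothetical, and every gene is assumed to have at least one measured value.

```python
import numpy as np

def svd_impute(data, n_eigengenes=2, max_iter=100, tol=1e-6):
    """Iteratively estimate NaNs by regressing each gene on the top eigengenes."""
    data = np.asarray(data, dtype=float)
    missing = np.isnan(data)
    # Initial guess: replace each missing entry with its row average.
    row_means = np.nanmean(data, axis=1, keepdims=True)
    filled = np.where(missing, row_means, data)
    incomplete = np.nonzero(missing.any(axis=1))[0]
    for _ in range(max_iter):
        # Rows of vt are the eigengenes of the current completed matrix.
        _, _, vt = np.linalg.svd(filled, full_matrices=False)
        eig = vt[:n_eigengenes]
        updated = filled.copy()
        for g in incomplete:
            obs = ~missing[g]
            # Fit gene g to the eigengenes using only its measured columns.
            coef, *_ = np.linalg.lstsq(eig[:, obs].T, data[g, obs], rcond=None)
            # Predict the missing columns from the fitted coefficients.
            updated[g, ~obs] = coef @ eig[:, ~obs]
        change = np.max(np.abs(updated - filled))
        filled = updated
        if change < tol:
            break
    return filled
```

The result depends directly on `n_eigengenes`, which is the parameter sensitivity discussed above; a gene from a small, tight cluster that none of the retained eigengenes resembles will be fitted poorly by the regression step.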

KNNimpute is a robust and sensitive approach to estimating missing data for gene expression microarrays. However, scientists should exercise caution when drawing critical biological conclusions from partially imputed data. The goal of this and other estimation methods is to provide an accurate way of estimating missing data points in order to minimize the bias introduced in the performance of microarray analysis methods. Estimated data should be flagged where possible, and their impact on the discovery of biological results should be assessed in order to avoid drawing unwarranted conclusions.
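One simple way to act on this advice is to keep a boolean mask of estimated entries next to the completed matrix and to recompute a downstream statistic with and without them. The sketch below is an assumption-laden illustration (NaN marks missing data; Pearson correlation between two genes stands in for whatever analysis is actually being run), and the helper names `flag_imputed` and `correlation_sensitivity` are hypothetical.

```python
import numpy as np

def flag_imputed(original):
    """Boolean mask that is True wherever the original matrix had no measurement."""
    return np.isnan(np.asarray(original, dtype=float))

def correlation_sensitivity(completed, flags, gene_a, gene_b):
    """Correlation of two genes using all columns vs. measured columns only."""
    completed = np.asarray(completed, dtype=float)
    # Keep only experiments in which both genes were actually measured.
    measured = ~(flags[gene_a] | flags[gene_b])
    r_all = np.corrcoef(completed[gene_a], completed[gene_b])[0, 1]
    r_measured = np.corrcoef(completed[gene_a, measured],
                             completed[gene_b, measured])[0, 1]
    return r_all, r_measured

# Usage: a deliberately poor estimate (0.0) changes the apparent relationship.
original = np.array([[1.0, 2.0, 3.0, np.nan],
                     [2.0, 4.0, 6.0, 8.0]])
flags = flag_imputed(original)
completed = np.where(flags, 0.0, original)
r_all, r_measured = correlation_sensitivity(completed, flags, 0, 1)
```

A large gap between the two correlations suggests that a conclusion rests on estimated values rather than on measurements.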

Parts of the work presented in this chapter were originally published in Bioinformatics (Troyanskaya et al., 2001).

References

Alizadeh A.A., Eisen M.B., Davis R.E., Ma C., Lossos I.S., Rosenwald A., Boldrick J.C., Sabet H., Tran T., Yu X., Powell J.I., Yang L., Marti G.E., Moore T., Hudson J., Jr., Lu L., Lewis D.B., Tibshirani R., Sherlock G., Chan W.C., Greiner T.C., Weisenburger D.D., Armitage J.O., Warnke R., Staudt L.M. et al. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403: 503–11.
Alter O., Brown P.O., Botstein D. (2000). Singular value decomposition for genome-wide expression data processing and modeling. Proc Natl Acad Sci USA 97: 10101–6.
Anderson T.W. (1984). An introduction to multivariate statistical analysis. Wiley, New York.
Bar-Joseph Z., Gerber G., Gifford D.K., Jaakkola T.S., Simon I. (2002). A new approach to analyzing gene expression time series data. Proceedings of the Sixth Annual International Conference on Computational Biology (RECOMB), Washington DC, USA, ACM Press.
Brown M.P., Grundy W.N., Lin D., Cristianini N., Sugnet C.W., Furey T.S., Ares M., Jr., Haussler D. (2000). Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc Natl Acad Sci USA 97: 262–7.
Butte A.J., Ye J. et al. (2001). Determining significant fold differences in gene expression analysis. Pacific Symposium on Biocomputing 6: 6–17.
Chu S., DeRisi J., Eisen M., Mulholland J., Botstein D., Brown P.O., Herskowitz I. (1998). The transcriptional program of sporulation in budding yeast. Science 282: 699–705.
DeRisi J.L., Iyer V.R., Brown P.O. (1997). Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 278: 680–6.
Eisen M.B., Spellman P.T., Brown P.O., Botstein D. (1998). Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 95: 14863–8.
Gasch A.P., Spellman P.T., Kao C.M., Carmel-Harel O., Eisen M.B., Storz G., Botstein D., Brown P.O. (2000). Genomic expression programs in the response of yeast cells to environmental changes. Mol Biol Cell, in press.
Golub G.H., Van Loan C.F. (1996). Matrix computations. Johns Hopkins University Press, Baltimore.
Golub T.R., Slonim D.K., Tamayo P., Huard C., Gaasenbeek M., Mesirov J.P., Coller H., Loh M.L., Downing J.R., Caligiuri M.A., Bloomfield C.D., Lander E.S. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286: 531–7.
Hastie T., Tibshirani R., Eisen M., Alizadeh A., Levy R., Staudt L., Chan W.C., Botstein D., Brown P. (2000). “Gene shaving” as a method for identifying distinct sets of genes with similar expression patterns. Genome Biol 1: research0003.1-research0003.21.
Heyer L.J., Kruglyak S., Yooseph S. (1999). Exploring expression data: identification and analysis of coexpressed genes. Genome Res 9: 1106–15.
Little R.J.A., Rubin D.B. (1987). Statistical analysis with missing data. Wiley, New York.
Loh W., Vanichsetakul N. (1988). Tree-structured classification via generalized discriminant analysis. Journal of the American Statistical Association 83: 715–725.
Perou C.M., Sorlie T., Eisen M.B., van de Rijn M., Jeffrey S.S, Rees C.A., Pollack J.R., Ross D.T., Johnsen H., Akslen L.A., Fluge O., Pergamenschikov A., Williams C., Zhu S.X., Lonning P.E., Borresen-Dale A.L., Brown P.O., Botstein D. (2000). Molecular portraits of human breast tumours. Nature 406: 747–52.
Raychaudhuri S., Stuart J.M., Altman R.B. (2000). Principal components analysis to summarize microarray experiments: application to sporulation time series. Pacific Symposium on Biocomputing: 455–66.
Spellman P.T., Sherlock G., Zhang M.Q., Iyer V.R., Anders K., Eisen M.B., Brown P.O., Botstein D., Futcher B. (1998). Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell 9: 3273–97.
Tamayo P., Slonim D., Mesirov J., Zhu Q., Kitareewan S., Dmitrovsky E., Lander E.S., Golub T.R. (1999). Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci USA 96: 2907–12.
Troyanskaya O., Cantor M., Sherlock G., Brown P., Hastie T., Tibshirani R., Botstein D., Altman R.B. (2001). Missing value estimation methods for DNA microarrays. Bioinformatics 17(6): 520–5.
Wilkinson G.N. (1958). Estimation of missing values for the analysis of incomplete data. Biometrics 14: 257–286.
Yates F. (1933). The analysis of replicated experiments when the field results are incomplete. Emp J Exp Agric 1: 129–142.

Author information

Authors and Affiliations

Department of Genetics, Stanford University School of Medicine, Stanford, CA, 94305, USA
Olga G. Troyanskaya, David Botstein & Russ B. Altman

Editor information

Editors and Affiliations

School of Biomedical Sciences, University of Ulster at Coleraine, Northern Ireland
Daniel P. Berrar
Faculty of Life and Health Science and Faculty of Informatics, University of Ulster at Coleraine, Northern Ireland
Werner Dubitzky
4T2consulting, Weingarten, Germany
Martin Granzow

About this chapter

Cite this chapter

Troyanskaya, O.G., Botstein, D., Altman, R.B. (2003). Missing Value Estimation. In: Berrar, D.P., Dubitzky, W., Granzow, M. (eds) A Practical Approach to Microarray Data Analysis. Springer, Boston, MA. https://doi.org/10.1007/0-306-47815-3_3

DOI: https://doi.org/10.1007/0-306-47815-3_3
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4020-7260-4
Online ISBN: 978-0-306-47815-4
eBook Packages: Springer Book Archive
