Weighted K-Means Clustering with Observation Weight for Single-Cell Epigenomic Data

Zhang, Wenyu; Wangwu, Jiaxuan; Lin, Zhixiang

doi:10.1007/978-3-030-33416-1_3

Part of the book series: Emerging Topics in Statistics and Biostatistics (ETSB)

Abstract

Recent advances in single-cell technologies have made it possible to profile genomic features at unprecedented resolution. Multiple types of genomic features can now be measured at single-cell resolution, including gene expression, protein binding, methylation, and chromatin accessibility. One major goal in single-cell genomics is to identify and characterize novel cell types, and clustering methods are essential for this goal. The distinct characteristics of single-cell genomic datasets pose challenges for methodology development. In this work, we propose a weighted K-means algorithm. By down-weighting cells with low sequencing depth, we show that the proposed algorithm can improve the detection of rare cell types in single-cell chromatin accessibility data. The weight of noisy cells is tuned adaptively. In addition, we incorporate sparsity constraints in the proposed method for simultaneous clustering and feature selection. We also evaluate the proposed methods through simulation studies.
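
As a rough illustration of the idea of observation weights in K-means, the following Python sketch runs a Lloyd-style iteration that minimizes the weighted within-cluster sum of squares, so that low-weight cells pull less on the cluster centers. It is a minimal sketch and not the chapter's algorithm: the function names `depth_weights` and `weighted_kmeans`, the rule of scaling weights by the maximum sequencing depth, and the toy data are all assumptions made for illustration, and neither the adaptive weight tuning nor the sparsity constraints described in the abstract are included.

```python
import numpy as np

def depth_weights(depth):
    """Hypothetical weighting rule: scale each cell's weight by its
    sequencing depth relative to the deepest cell (assumes depth > 0)."""
    depth = np.asarray(depth, dtype=float)
    return depth / depth.max()

def weighted_kmeans(X, weights, k, n_iter=50, seed=0):
    """Lloyd-style K-means minimizing sum_i w_i * ||x_i - mu_c(i)||^2."""
    X = np.asarray(X, dtype=float)
    w = np.asarray(weights, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct observations, sampled in proportion to weight.
    idx = rng.choice(len(X), size=k, replace=False, p=w / w.sum())
    centers = X[idx]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each cell goes to its nearest center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments are stable, so the centers are too
        labels = new_labels
        # Update step: each center becomes the weighted mean of its cells.
        for j in range(k):
            mask = labels == j
            if mask.any():
                centers[j] = np.average(X[mask], axis=0, weights=w[mask])
    return labels, centers

# Toy data (illustrative only): two groups of cells, one low-depth cell.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
depth = [1000, 1000, 1000, 100]
w = depth_weights(depth)
labels, centers = weighted_kmeans(X, w, k=2)
```

In this toy setup the low-depth cell still receives a cluster label, but it contributes only a tenth as much as the other cells to the position of its cluster's center.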

Author information

Authors and Affiliations

Department of Statistics, The Chinese University of Hong Kong, Sha Tin, Hong Kong
Wenyu Zhang, Jiaxuan Wangwu & Zhixiang Lin

Corresponding author

Correspondence to Zhixiang Lin.

Editor information

Editors and Affiliations

Math and Statistics, 1342, Georgia State University, Atlanta, GA, USA
Yichuan Zhao
School of Social Work, University of North Carolina, Chapel Hill, NC, USA
Ding-Geng (Din) Chen

About this chapter

Cite this chapter

Zhang, W., Wangwu, J., Lin, Z. (2020). Weighted K-Means Clustering with Observation Weight for Single-Cell Epigenomic Data. In: Zhao, Y., Chen, DG. (eds) Statistical Modeling in Biomedical Research. Emerging Topics in Statistics and Biostatistics. Springer, Cham. https://doi.org/10.1007/978-3-030-33416-1_3

DOI: https://doi.org/10.1007/978-3-030-33416-1_3
Published: 20 March 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-33415-4
Online ISBN: 978-3-030-33416-1
eBook Packages: Mathematics and Statistics; Mathematics and Statistics (R0)
