PeakPass: Automating ChIP-Seq Blacklist Creation

Wimberley, Charles E.; Heber, Steffen

doi:10.1007/978-3-030-20242-2_20

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 11490)

Included in the following conference series:

International Symposium on Bioinformatics Research and Applications

Abstract

ChIP-Seq blacklists contain genomic regions that frequently produce artifacts and noise in ChIP-Seq experiments. To improve signal-to-noise ratio, ChIP-Seq pipelines often remove data points that map to blacklist regions. Existing blacklists have been compiled in a manual or semi-automated way. In this paper we describe PeakPass, an efficient method to generate blacklists, and present evidence that blacklists can increase ChIP-Seq data quality. PeakPass leverages machine learning and attempts to automate blacklist generation. PeakPass uses a random forest classifier in combination with genomic features such as sequence, annotated repeats, complexity, assembly gaps, and the ratio of multi-mapping to uniquely mapping reads to identify artifact regions. We have validated PeakPass on a large dataset and tested it for the purpose of upgrading a blacklist to a new reference genome version. We trained PeakPass on the ENCODE blacklist for the hg19 human reference genome, and created an updated blacklist for hg38. To assess the performance of this blacklist we tested 42 ChIP-Seq replicates from 24 experiments using the Relative Strand Correlation (RSC) metric as a quality measure. Using the blacklist generated by PeakPass resulted in a statistically significant increase in RSC over the existing ENCODE blacklist for hg38 – average RSC was increased by 50% over the ENCODE blacklist, while only filtering an average of 0.1% of called peaks.
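
The filtering step mentioned in the abstract, removing called peaks that map to blacklist regions, can be illustrated with a minimal sketch. This is not the PeakPass implementation: the `Interval` type, the `filter_peaks` helper, the half-open BED-style coordinates, and the toy data are all assumptions made for illustration.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (BED-style convention, assumed here)
    end: int    # exclusive


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two half-open intervals on the same chromosome share a base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def filter_peaks(peaks, blacklist):
    """Split peaks into (kept, removed) by overlap with any blacklist region.

    Quadratic scan for clarity; a real pipeline would use an interval index.
    """
    kept, removed = [], []
    for peak in peaks:
        if any(overlaps(peak, region) for region in blacklist):
            removed.append(peak)
        else:
            kept.append(peak)
    return kept, removed


# Hypothetical toy data, not taken from the paper.
blacklist = [Interval("chr1", 1000, 2000), Interval("chr2", 500, 800)]
peaks = [
    Interval("chr1", 1500, 1700),  # inside a blacklisted region
    Interval("chr1", 2000, 2200),  # abuts the region but does not overlap
    Interval("chr2", 100, 300),    # on chr2, but outside its region
    Interval("chr3", 1000, 2000),  # same coordinates, different chromosome
]

kept, removed = filter_peaks(peaks, blacklist)
fraction_removed = len(removed) / len(peaks)
print(f"kept {len(kept)} peaks, removed {len(removed)} ({fraction_removed:.0%})")
```

The fraction of removed peaks computed at the end corresponds to the kind of figure the abstract reports (an average of 0.1% of called peaks filtered), although the toy numbers here are unrelated to the paper's results.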

Acknowledgments

This work was supported in part by the National Science Foundation (grant no. IOS1355019). We thank Robert G. Franks and Miguel Flores-Vergara (both NC State University) for extremely valuable discussions and advice. We are grateful for the ENCODE datasets we used for validating PeakPass. These data were produced by: Michael Snyder, Richard Myers, Sherman Weissman, Xiang-Dong Fu, Kevin Struhl, Bradley Bernstein, John Stamatoyannopoulos, Peggy Farnham, and Vishwanath Iyer.

Author information

Authors and Affiliations

NC State University, Raleigh, NC, 27606, USA
Charles E. Wimberley & Steffen Heber

Corresponding authors

Correspondence to Charles E. Wimberley or Steffen Heber.

Editor information

Editors and Affiliations

Georgia State University, Atlanta, GA, USA
Zhipeng Cai
Georgia State University, Atlanta, GA, USA
Pavel Skums
Central South University, Changsha, China
Min Li

About this paper

Cite this paper

Wimberley, C.E., Heber, S. (2019). PeakPass: Automating ChIP-Seq Blacklist Creation. In: Cai, Z., Skums, P., Li, M. (eds) Bioinformatics Research and Applications. ISBRA 2019. Lecture Notes in Computer Science, vol 11490. Springer, Cham. https://doi.org/10.1007/978-3-030-20242-2_20

DOI: https://doi.org/10.1007/978-3-030-20242-2_20
Published: 09 May 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-20241-5
Online ISBN: 978-3-030-20242-2
eBook Packages: Computer Science (R0)
