Abstract
ChIP-Seq blacklists contain genomic regions that frequently produce artifacts and noise in ChIP-Seq experiments. To improve signal-to-noise ratio, ChIP-Seq pipelines often remove data points that map to blacklist regions. Existing blacklists have been compiled in a manual or semi-automated way. In this paper we describe PeakPass, an efficient method to generate blacklists, and present evidence that blacklists can increase ChIP-Seq data quality. PeakPass leverages machine learning and attempts to automate blacklist generation. PeakPass uses a random forest classifier in combination with genomic features such as sequence, annotated repeats, complexity, assembly gaps, and the ratio of multi-mapping to uniquely mapping reads to identify artifact regions. We have validated PeakPass on a large dataset and tested it for the purpose of upgrading a blacklist to a new reference genome version. We trained PeakPass on the ENCODE blacklist for the hg19 human reference genome, and created an updated blacklist for hg38. To assess the performance of this blacklist we tested 42 ChIP-Seq replicates from 24 experiments using the Relative Strand Correlation (RSC) metric as a quality measure. Using the blacklist generated by PeakPass resulted in a statistically significant increase in RSC over the existing ENCODE blacklist for hg38 – average RSC was increased by 50% over the ENCODE blacklist, while only filtering an average of 0.1% of called peaks.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Degner, J., et al.: Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data. Bioinformatics 25(24), 3207–3212 (2009)
Kundaje, A.: A comprehensive collection of signal artifact blacklist regions in the human genome. http://mitra.stanford.edu/kundaje/akundaje/release/blacklists/hg19-human/hg19-blacklist-README.pdf. Accessed 28 Mar 2019
Dolgalev, I., Sedlazeck, F., Busby, B.: DangerTrack: A scoring system to detect difficult-to-assess regions. F1000Research. 6(443) (2017)
Wimberley, C.: PeakPass: a machine learning approach for ChIP-Seq blacklisting. Master’s thesis, North Carolina State University (2018)
Carroll, T.S., Liang, Z., Salama, R., Stark, R., de Santiago, I.: Impact of artifact removal on ChIP quality metrics in ChIP-seq and ChIP-exo data. Front. Genet. 5, 75 (2014)
Ramachandran, P., Palidwor, G., Porter, C., Perkins, T.: MaSC: mappability-sensitive cross-correlation for estimating mean fragment length of single-end short-read sequencing data. Bioinformatics 29(4), 444–450 (2013)
The ENCODE Project Consortium: An integrated encyclopedia of DNA elements in the human genome. Nature 489(7414), 57–74 (2012)
Ho, T.: Random decision forests. In: Proceedings of the Third International Conference on Document Analysis and Recognition, vol. 1, pp. 278–282 (1995)
Fix, E., Hodges, J.: Discriminatory analysis nonparametric discrimination: consistency properties (1951)
Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)
Farley, B., Clark, W.: Simulation of self-organizing systems by digital computer. Trans. IRE Prof. Group Inf. Theory 4(4), 76–84 (1954)
John, G., Langley, P.: Estimating continuous distributions in Bayesian classifiers. In: Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pp. 338–345 (2013)
Derrien, T., et al.: Fast computation and applications of genome mappability. PLoS One 7(1), e30377 (2012)
Smit, A., Hubley, R., Green, P.: RepeatMasker Open-4.0 (2013-2015). http://www.repeatmasker.org
Kuhn, M.: Building predictive models in R using the caret package. J. Stat. Softw. 28(5), 1–26 (2008)
The ENCODE Project Consortium: Transcription Factor ChIP-seq Data Standards and Processing Pipeline. https://www.encodeproject.org/chip-seq/transcription_factor/
Holm, S.: A simple sequentially rejective multiple test procedure. Scand. J. Stat. 6(2), 65–70 (1979)
Landt, S., et al.: ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Res. 22(9), 1813–1831 (2012)
Altemose, N., Miga, K.H., Maggioni, M., Willard, H.F.: Genomic characterization of large heterochromatic gaps in the human genome assembly. PLOS Comput. Biol. 10(5), e1003628 (2014)
Kojima, K.: Human transposable elements in Repbase: genomic footprints from fish to humans. Mobile DNA. 9(2) (2018)
Acknowledgments
This work was supported in part by the National Science Foundation (grant no. IOS1355019). We thank Robert G. Franks and Miguel Flores-Vergara (both NC State University) for extremely valuable discussions and advice. We are grateful for the ENCODE datasets we used for validating PeakPass. These data were produced by: Michael Snyder, Richard Myers, Sherman Weissman, Xiang-Dong Fu, Kevin Struhl, Bradley Bernstein, John Stamatoyannopoulos, Peggy Farnham, and Vishwanath Iyer.
Author information
Authors and Affiliations
Corresponding authors
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Wimberley, C.E., Heber, S. (2019). PeakPass: Automating ChIP-Seq Blacklist Creation. In: Cai, Z., Skums, P., Li, M. (eds) Bioinformatics Research and Applications. ISBRA 2019. Lecture Notes in Computer Science(), vol 11490. Springer, Cham. https://doi.org/10.1007/978-3-030-20242-2_20
Download citation
DOI: https://doi.org/10.1007/978-3-030-20242-2_20
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-20241-5
Online ISBN: 978-3-030-20242-2
eBook Packages: Computer ScienceComputer Science (R0)