PeakPass: Automating ChIP-Seq Blacklist Creation

  • Conference paper
Bioinformatics Research and Applications (ISBRA 2019)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 11490)

Abstract

ChIP-Seq blacklists contain genomic regions that frequently produce artifacts and noise in ChIP-Seq experiments. To improve signal-to-noise ratio, ChIP-Seq pipelines often remove data points that map to blacklist regions. Existing blacklists have been compiled in a manual or semi-automated way. In this paper we describe PeakPass, an efficient method to generate blacklists, and present evidence that blacklists can increase ChIP-Seq data quality. PeakPass leverages machine learning and attempts to automate blacklist generation. PeakPass uses a random forest classifier in combination with genomic features such as sequence, annotated repeats, complexity, assembly gaps, and the ratio of multi-mapping to uniquely mapping reads to identify artifact regions. We have validated PeakPass on a large dataset and tested it for the purpose of upgrading a blacklist to a new reference genome version. We trained PeakPass on the ENCODE blacklist for the hg19 human reference genome, and created an updated blacklist for hg38. To assess the performance of this blacklist we tested 42 ChIP-Seq replicates from 24 experiments using the Relative Strand Correlation (RSC) metric as a quality measure. Using the blacklist generated by PeakPass resulted in a statistically significant increase in RSC over the existing ENCODE blacklist for hg38 – average RSC was increased by 50% over the ENCODE blacklist, while only filtering an average of 0.1% of called peaks.
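The filtering step mentioned in the abstract, removing called peaks that map to blacklist regions and reporting what fraction was removed, can be sketched as follows. This is a minimal illustration and not PeakPass's own code: the function names (`build_index`, `overlaps_blacklist`, `filter_peaks`) are hypothetical, and the sketch assumes peaks and blacklist regions are given as BED-style `(chrom, start, end)` tuples with half-open coordinates.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(blacklist):
    """Group blacklist intervals by chromosome, then sort and merge them.

    Intervals are assumed to be half-open (chrom, start, end) tuples,
    as in BED files. Hypothetical helper for illustration only.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in blacklist:
        by_chrom[chrom].append((start, end))

    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        merged = [intervals[0]]
        for start, end in intervals[1:]:
            last_start, last_end = merged[-1]
            if start <= last_end:
                # Overlapping or adjacent: extend the previous interval.
                merged[-1] = (last_start, max(last_end, end))
            else:
                merged.append((start, end))
        index[chrom] = ([s for s, _ in merged], merged)
    return index


def overlaps_blacklist(index, chrom, start, end):
    """Return True if [start, end) overlaps any blacklist interval."""
    if chrom not in index:
        return False
    starts, merged = index[chrom]
    # Find the last merged interval that begins before the peak ends.
    i = bisect_right(starts, end - 1) - 1
    # Merged intervals are disjoint, so only this one can overlap.
    return i >= 0 and merged[i][1] > start


def filter_peaks(peaks, blacklist):
    """Split peaks into (kept, removed) according to blacklist overlap."""
    index = build_index(blacklist)
    kept, removed = [], []
    for peak in peaks:
        chrom, start, end = peak[:3]
        if overlaps_blacklist(index, chrom, start, end):
            removed.append(peak)
        else:
            kept.append(peak)
    return kept, removed


if __name__ == "__main__":
    # Made-up toy coordinates, not taken from any real blacklist.
    toy_blacklist = [("chr1", 100, 200), ("chr1", 150, 300)]
    toy_peaks = [("chr1", 50, 100), ("chr1", 90, 110), ("chr2", 10, 20)]
    kept, removed = filter_peaks(toy_peaks, toy_blacklist)
    print(f"kept {len(kept)}, removed {len(removed)} "
          f"({len(removed) / len(toy_peaks):.1%} filtered)")
```

In a production pipeline this step is commonly delegated to an interval tool such as `bedtools intersect -v`; the pure-Python version above is only meant to make the overlap logic explicit.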

Acknowledgments

This work was supported in part by the National Science Foundation (grant no. IOS1355019). We thank Robert G. Franks and Miguel Flores-Vergara (both NC State University) for extremely valuable discussions and advice. We are grateful for the ENCODE datasets we used for validating PeakPass. These data were produced by: Michael Snyder, Richard Myers, Sherman Weissman, Xiang-Dong Fu, Kevin Struhl, Bradley Bernstein, John Stamatoyannopoulos, Peggy Farnham, and Vishwanath Iyer.

Author information

Corresponding authors

Correspondence to Charles E. Wimberley or Steffen Heber.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Wimberley, C.E., Heber, S. (2019). PeakPass: Automating ChIP-Seq Blacklist Creation. In: Cai, Z., Skums, P., Li, M. (eds) Bioinformatics Research and Applications. ISBRA 2019. Lecture Notes in Computer Science, vol 11490. Springer, Cham. https://doi.org/10.1007/978-3-030-20242-2_20

  • DOI: https://doi.org/10.1007/978-3-030-20242-2_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-20241-5

  • Online ISBN: 978-3-030-20242-2

  • eBook Packages: Computer Science (R0)
