Analyzing ChIP-seq Data: Preprocessing, Normalization, Differential Identification, and Binding Pattern Characterization

  • Cenny Taslim
  • Kun Huang
  • Tim Huang
  • Shili Lin (corresponding author)
Part of the Methods in Molecular Biology book series (MIMB, volume 802)


Chromatin immunoprecipitation followed by sequencing (ChIP-seq) is a high-throughput, antibody-based method for studying genome-wide protein–DNA binding interactions. Compared with older ChIP-chip experiments, ChIP-seq yields more accurate data with genome-wide coverage while requiring less starting material and less time. Here we describe a step-by-step guideline for analyzing ChIP-seq data that covers data preprocessing, nonlinear normalization to enable comparison between different samples and experiments, a statistical method for identifying differential binding sites using mixture modeling and local false discovery rates (fdrs), and binding pattern characterization. We also provide a sample analysis of ChIP-seq data that follows the steps in the guideline.
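To make the local fdr idea concrete, the following is a minimal sketch and not the chapter's actual model: it assumes the normalized differences between two samples follow a two-component Gaussian mixture with already-known parameters. The values of `PI0`, `NULL`, and `ALT`, the 0.1 threshold, and all function names are hypothetical choices for illustration. The local fdr at a value x is the posterior probability that x belongs to the null (non-differential) component, pi0 * f0(x) / f(x).

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def local_fdr(x, pi0, null, alt):
    """Posterior probability that x comes from the null component.

    null and alt are (mean, sd) tuples; pi0 is the null proportion.
    """
    f0 = normal_pdf(x, *null)
    f1 = normal_pdf(x, *alt)
    mixture = pi0 * f0 + (1.0 - pi0) * f1
    return pi0 * f0 / mixture


# Hypothetical mixture parameters (assumed, not estimated from data).
PI0 = 0.9            # assumed proportion of non-differential regions
NULL = (0.0, 1.0)    # assumed null component: mean 0, sd 1
ALT = (3.0, 1.0)     # assumed differential component: mean 3, sd 1


def call_differential(values, threshold=0.1):
    """Keep the values whose local fdr is at or below the threshold."""
    return [x for x in values if local_fdr(x, PI0, NULL, ALT) <= threshold]
```

In practice the mixture parameters would be estimated from the data (for example by an EM algorithm) rather than fixed in advance, and the component distributions need not be Gaussian.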

Key words

ChIP-seq; Finite mixture model; Model-based classification; Nonlinear normalization; Differential analysis



Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  • Cenny Taslim (1, 2)
  • Kun Huang (3)
  • Tim Huang (1)
  • Shili Lin (2), corresponding author

  1. Department of Molecular Virology, Immunology & Medical Genetics, The Ohio State University, Columbus, USA
  2. Department of Statistics, The Ohio State University, Columbus, USA
  3. Department of Biomedical Informatics, The Ohio State University, Columbus, USA
