The Poisson Margin Test for Normalisation Free Significance Analysis of NGS Data
Motivation: Current methods for determining the statistical significance of peaks and regions in NGS data require an explicit normalisation step to compensate for (global or local) imbalances in the sizes of sequenced and mapped libraries. There is no canonical method for performing such compensation, so a number of different procedures serving this goal in different ways can be found in the literature. Unfortunately, the choice of normalisation has a significant impact on the final results: different methods yield very different numbers of detected “significant peaks” even in the simplest scenario of a ChIP-Seq experiment which compares the enrichment in a single sample relative to a matching control. The issue becomes even more acute in the more general case of comparing multiple samples, where a number of arbitrary design choices are required at the data analysis stage, each of which may lead to significantly different outcomes.
Results: In this paper we investigate a principled statistical procedure which eliminates the need for a normalisation step. We outline its basic properties, in particular how it scales with the depth of sequencing. For the sake of illustration and comparison, we report the results of re-analysing a ChIP-Seq experiment for transcription factor binding site detection. To quantify the differences between outcomes, we use a novel method based on the accuracy of in silico prediction by SVM models trained on part of the genome and tested on the remainder.
Availability: The supplementary material is available at .
- 1. Kowalczyk, A., Bedo, J., Conway, T., Beresford-Smith, B.: Poisson Margin Test for Normalisation Free Significance Analysis of NGS Data - Supplementary Materials (2009), http://www.genomics.csse.unimelb.edu.au/peakfiltsup
- 4. Robertson, G., Hirst, M., Bainbridge, M., Bilenky, M., Zhao, Y., Zeng, T., Euskirchen, G., Bernier, B., Varhol, R., Delaney, A., et al.: Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nature Methods 4, 651–657 (2007)
- 5. Kowalczyk, A.: Some Formal Results for Significance of Short Read Concentrations (2009), http://www.genomics.csse.unimelb.edu.au/shortreadtheory
- 10. Keeping, E.: Introduction to Statistical Inference. Dover, New York (1995) ISBN 0-486-68502-0; Reprint of 1962 edition by D. Van Nostrand Co., Princeton, New Jersey
- 11. Zhang, Y., Liu, T., Meyer, C., Eeckhoute, J., Johnson, D., Bernstein, B., Nussbaum, C., Myers, R., Brown, M., Li, W., Liu, X.S.: Model-based analysis of ChIP-Seq (MACS). Genome Biology 9(9), R137 (2008)
- 13. Sonnenburg, S., Zien, A., Rätsch, G.: ARTS: accurate recognition of transcription starts in human. Bioinformatics 22, e423–e480 (2006)
- 14. Abeel, T., Van de Peer, Y., Saeys, Y.: Toward a gold standard for promoter prediction evaluation. Bioinformatics 25, i313–i320 (2009)
- 15. Bedo, J., MacIntyre, G., Haviv, I., Kowalczyk, A.: Simple SVM based whole-genome Segmentation (2009), Available from Nature Precedings http://dx.doi.org/10.1038/npre.2009.3811.1