Combinatorial Algorithms for Structural Variation Detection in High Throughput Sequenced Genomes
Recent studies show that, along with single nucleotide polymorphisms and small indels, larger structural variants among human individuals are common. These studies have typically been based on high-cost library generation and Sanger sequencing; however, the recent introduction of next-generation sequencing (NGS) technologies is significantly changing how research in this area is conducted. High-throughput sequencing technologies such as 454, Illumina, Helicos, and AB SOLiD produce shorter reads than traditional capillary sequencing, yet they reduce the cost (and/or the redundancy) by a factor of 10 to 100, and perhaps even more. Those NGS technologies capable of sequencing the paired ends (or mate pairs) of a clone insert, whose length follows a tight distribution, have made it feasible to perform detailed and comprehensive studies of genome variation and rearrangement. Unfortunately, the few existing algorithms for identifying structural variation among individuals using paired-end reads were not designed to handle the short read lengths and the errors introduced by these platforms. Here, we describe, for the first time, algorithms for identifying various forms of structural variation between a paired-end NGS sequenced genome and a reference genome.
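To illustrate why a tight insert-length distribution makes paired-end reads informative, the following minimal sketch flags a mapped read pair as a candidate signature of structural variation when its span on the reference departs from the expected insert size. It is not the algorithm described in this paper; the names `PairedEndMapping` and `classify_pair`, the cutoff `k`, and the simplified orientation flag are assumptions introduced purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairedEndMapping:
    """Hypothetical record of one read pair mapped to the reference."""
    left_start: int                 # reference coordinate where the leftmost read begins
    right_end: int                  # reference coordinate where the rightmost read ends
    same_orientation: bool = False  # True if both ends map to the same strand


def classify_pair(mapping, mean_insert, sd_insert, k=3.0):
    """Label a pair by comparing its mapped span with the expected insert size.

    The cutoff of k standard deviations is an assumed, illustrative threshold.
    """
    span = mapping.right_end - mapping.left_start
    lower = mean_insert - k * sd_insert
    upper = mean_insert + k * sd_insert

    if mapping.same_orientation:
        # Ends mapping to the same strand suggest an inversion breakpoint.
        return "inversion_candidate"
    if span > upper:
        # Ends map farther apart on the reference than the insert allows:
        # the sequenced genome may lack reference sequence (deletion).
        return "deletion_candidate"
    if span < lower:
        # Ends map closer together than expected:
        # the sequenced genome may carry extra sequence (insertion).
        return "insertion_candidate"
    return "concordant"


if __name__ == "__main__":
    pairs = [
        PairedEndMapping(1000, 1200),
        PairedEndMapping(1000, 1500),
        PairedEndMapping(1000, 1100),
        PairedEndMapping(1000, 1200, same_orientation=True),
    ]
    for p in pairs:
        print(p, classify_pair(p, mean_insert=200, sd_insert=10))
```

A single discordant pair is weak evidence, since short reads can map ambiguously and carry sequencing errors; in practice many pairs supporting the same event would need to be combined before calling a variant.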