Recognition of Coding Regions in Genome Alignment
Gene recognition is an old and important problem. Statistical and homology-based methods work relatively well for finding long exons or full genes, but they are unable to recognize relatively short coding fragments. Genome alignments, together with the study of synonymous and non-synonymous substitutions, offer a way to overcome this drawback. Our aim is to propose a criterion that distinguishes short coding fragments from non-coding fragments of a genome alignment, and to create an algorithm that locates aligned coding regions. We have developed a method to locate aligned exons in a given alignment. First, we scan the alignment with a window of fixed size (∼ 40 bp) and assign a score to each window position. The score reflects whether the number KS of synonymous substitutions, the number KN of non-synonymous substitutions, and the number D of deleted symbols resemble those observed in coding regions. Second, we mark the ‘qualified exon-like’ regions (QELRs), i.e., sequences of consecutive high-scoring windows. Presumably, each QELR contains one exon. Third, we locate an exon within every QELR. All of these steps are performed twice, independently for the direct and the reverse complement chains. Finally, we compare the predictions for the two chains to exclude ‘exon shadows’, i.e., predictions made on the complementary chain instead of the real exons. Tests have shown that ∼ 93 % of the marked QELRs intersect real exons and ∼ 93 % of the aligned annotated exons intersect the marked QELRs. The total length of the marked QELRs is ∼ 1.30 times the total length of the annotated exons. About 85 % of the total length of the predicted exons belongs to annotated exons. The runtime of the algorithm is proportional to the length of the genome alignment.
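The first two steps (window scanning and QELR marking) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the scoring formula, so `toy_score` is a hypothetical placeholder, and the function names, the 39 bp window (a multiple of 3 close to 40), the 3 bp step, and the threshold are all assumptions. The sketch also simplifies counting by treating each differing codon pair as a single substitution, and it expects a pairwise alignment in upper-case A, C, G, T with `-` for deleted symbols, read in frame 0.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def count_window(seq_a, seq_b):
    """Return (KS, KN, D) for two aligned fragments of equal length, frame 0.

    Simplification (an assumption of this sketch): a differing codon pair
    counts as one substitution, synonymous if both codons encode the same
    amino acid. D is the number of gap symbols in codons that contain gaps.
    """
    ks = kn = d = 0
    usable = len(seq_a) - len(seq_a) % 3  # ignore a trailing partial codon
    for i in range(0, usable, 3):
        codon_a, codon_b = seq_a[i:i + 3], seq_b[i:i + 3]
        gaps = codon_a.count("-") + codon_b.count("-")
        if gaps:
            d += gaps
        elif codon_a != codon_b:
            if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
                ks += 1
            else:
                kn += 1
    return ks, kn, d


def toy_score(ks, kn, d):
    """Hypothetical placeholder score, NOT the criterion from the paper."""
    return ks - kn - d


def scan(seq_a, seq_b, window=39, step=3):
    """Score every window position along the alignment (window/step assumed)."""
    return [
        toy_score(*count_window(seq_a[s:s + window], seq_b[s:s + window]))
        for s in range(0, len(seq_a) - window + 1, step)
    ]


def mark_qelrs(scores, threshold):
    """Return (first, last) window indices of runs of consecutive
    windows whose score is at least `threshold`."""
    regions, start = [], None
    for i, score in enumerate(scores):
        if score >= threshold:
            if start is None:
                start = i
        elif start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:
        regions.append((start, len(scores) - 1))
    return regions
```

Calling `mark_qelrs(scan(seq_a, seq_b), threshold)` yields candidate regions for one chain in a single pass over the alignment; a fuller version would repeat the scan for the other reading frames and for the reverse complement chain before comparing predictions.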
Key words: coding region, gene recognition, genome alignment, synonymous and non-synonymous substitution