Heterochromatic sequences in a Drosophila whole-genome shotgun assembly
Most eukaryotic genomes include a substantial repeat-rich fraction termed heterochromatin, which is concentrated in centric and telomeric regions. The repetitive nature of heterochromatic sequence makes it difficult to assemble and analyze. To better understand the heterochromatic component of the Drosophila melanogaster genome, we characterized and annotated portions of a whole-genome shotgun sequence assembly.
WGS3, an improved whole-genome shotgun assembly, includes 20.7 Mb of draft-quality sequence not represented in the Release 3 sequence spanning the euchromatin. We annotated this sequence using the methods employed in the re-annotation of the Release 3 euchromatic sequence. This analysis predicted 297 protein-coding genes and six non-protein-coding genes, including known heterochromatic genes, and regions of similarity to known transposable elements. Bacterial artificial chromosome (BAC)-based fluorescence in situ hybridization analysis was used to correlate the genomic sequence with the cytogenetic map in order to refine the genomic definition of the centric heterochromatin; on the basis of our cytological definition, the annotated Release 3 euchromatic sequence extends into the centric heterochromatin on each chromosome arm.
Whole-genome shotgun assembly produced a reliable draft-quality sequence of a significant part of the Drosophila heterochromatin. Annotation of this sequence defined the intron-exon structures of 30 known protein-coding genes and 267 protein-coding gene models. The cytogenetic mapping suggests that an additional 150 predicted genes are located in heterochromatin at the base of the Release 3 euchromatic sequence. Our analysis suggests strategies for improving the sequence and annotation of the heterochromatic portions of the Drosophila and other complex genomes.
Keywords: Bacterial Artificial Chromosome; Transposable Element; Additional Data File; Bacterial Artificial Chromosome Contig; Transposable Element Sequence
Heterochromatin was first distinguished from euchromatin cytologically, on the basis of differential staining properties . Molecular and genetic properties that further distinguish heterochromatin from euchromatin include DNA sequence composition, replication timing, condensation throughout the cell cycle, and the ability to silence gene expression [2,3,4]. In addition to genes required for viability and fertility , heterochromatin contains essential cis-acting chromosome inheritance loci, including elements required for centromere function , meiotic pairing [7,8,9], and sister chromatid cohesion [10,11]. A significant fraction of the fly and human genomes are heterochromatic, yet our current understanding of the sequence and organization of heterochromatin is very limited. Heterochromatin is concentrated in megabase-sized tracts in the centric and subtelomeric regions of the chromosomes. It contains tandemly repeated short sequences (satellite DNAs), middle repetitive elements (for example, transposable elements), and some single-copy sequences . Progress has been made in the analysis of the non-satellite component of Drosophila, Arabidopsis and human heterochromatin [12,13,14,15,16,17,18,19,20]. Less progress has been made in analysis of satellite sequences, although recent studies have revealed the structure and composition of centromeric satellites [21,22].
The transition between heterochromatin and euchromatin appears to be gradual rather than abrupt. For example, a hallmark of heterochromatin is a high density of transposable elements, and the density of these elements in the genomic sequence increases continuously toward the centric ends of the euchromatic portions of the chromosome arms [12,31,32]. This trend continues in the centric heterochromatin, which contains large blocks of specific types of middle-repetitive sequences [22,27].
Heterochromatic genes have been defined by mutations that affect viability or fertility . Genetic screens, reviewed in , have identified 14 vital loci in the heterochromatin of chromosome 2 [34,35] and 12 vital loci in the heterochromatin of chromosome 3 . Although no vital loci have been identified in the proximal heterochromatin of the X chromosome , several identified loci map near the boundary between the centric heterochromatin and the euchromatin (see ). There are six Y-linked loci required for male fertility (, reviewed in ). Thus, there are at least 32 identified genetic loci required for viability or fertility in the centric heterochromatin. This is likely to be an underestimate, because saturating genetic screens have not been done for all of the heterochromatin.
Molecularly characterized Drosophila heterochromatic genes encode diverse proteins and functions. Examples include light (post-Golgi protein trafficking), concertina (α-like G protein subunit), Nipped-B (morphogenesis), rolled (MAP kinase), poly(ADP-ribose) polymerase (chromatin structure), bobbed (ribosomal RNA), and the Y-linked fertility factors kl-2, kl-3 and kl-5 (dynein heavy chains) . Genes have been localized on the cytogenetic map through analysis of chromosomal rearrangements and by fluorescence in situ hybridization (FISH) [25,40,41]. The genomic structures of several of these genes have been determined, and they differ from those of euchromatic genes. Their introns and regulatory regions are composed of clusters of partial and complete transposable elements, and some introns are hundreds of kilobases in length [42,43,44,45,46].
The Release 1 whole-genome shotgun (WGS) assembly of Drosophila  included 3.8 Mb of short, unmapped scaffolds representing heterochromatic sequence. The most recent version, WGS3 , assembled a total of 137.7 Mb of the Drosophila genome, using an improved assembly algorithm and the same trace data used for Release 1. A high-quality sequence of 116.9 Mb that spans the euchromatic portions of the chromosome arms is reported in . For the sake of consistency, we refer to this 116.9-Mb sequence as the 'Release 3 euchromatic sequence' even though, on the basis of the cytological criteria for defining the boundary between euchromatin and heterochromatin described below, we believe this sequence extends into the centric heterochromatin of each chromosome arm. The annotation of genes and transposable elements in this approximately 2 Mb of heterochromatin-derived DNA is reported in [32,48]. Here, we characterize and annotate the 20.7 Mb of WGS3 sequence, the 'WGS heterochromatic sequence', that is not represented in the 116.9-Mb Release 3 euchromatic sequence.
Annotation of WGS3 heterochromatic sequences
We annotated the 12.0 Mb of WGS3 heterochromatic sequence in the 85 scaffolds longer than 40 kb, plus a 133-kb sequence at the centric end of the Release 3 euchromatic sequence of the X chromosome that was not annotated by Misra et al. . We arbitrarily excluded the 8.7 Mb of WGS3 heterochromatic sequence in the 2,512 scaffolds shorter than 40 kb from detailed annotation. Preliminary analysis suggested that gene identification in these small scaffolds is hampered in part by the separation of exons onto different scaffolds. This is illustrated by our annotation of seven 'super-scaffolds' that were constructed by linking together 25 short WGS3 heterochromatic scaffolds using cDNA evidence (see below). Thus, the scaffolds shorter than 40 kb do contain genes, but a reliable annotation of these sequences will require further analysis.
Identifying genes within heterochromatin presents challenges not encountered when annotating euchromatic sequences. Open reading frames (ORFs) in transposable elements can interfere with the identification of single-copy protein-coding genes, particularly when transposable elements are nested within introns. Also, heterochromatic genes can have large introns separating relatively small exons [42,43,44,45,46]. Therefore, transposable element and low-complexity sequences were masked; then the lengths of masked regions and sequence gaps were reduced to a maximum of 70 bp, which is the median intron length of euchromatic protein-coding genes in Drosophila [48,50].
We annotated the masked scaffolds using the computational annotation pipeline developed by Mungall et al.  and the annotation tool Apollo . The pipeline generates, stores and filters alignments of expressed sequence tags (ESTs), cDNAs, and the results of protein similarity searches and gene-prediction algorithms. Apollo displays the filtered results of the pipeline in tiers of evidence, and allows human curators to evaluate and use the evidence to construct gene models. The guidelines used to define gene models in the euchromatin  were modified slightly, to deal with the unique properties of heterochromatic sequence (see Materials and methods). We generated 351 preliminary gene models on the masked scaffolds.
Next, the preliminary gene models were re-curated on the unmasked WGS3 heterochromatic sequence scaffolds. The unmasked scaffolds were run through the computational annotation pipeline, and the preliminary gene models were aligned to this genomic sequence. Twenty-five preliminary gene models could not be aligned to the unmasked scaffolds using sim4 . After further examination, 11 of these were accepted as curated gene models, six were similar to transposable elements and were rejected, and eight could not be reconciled with the unmasked sequence. A higher-quality genomic sequence may be required to verify these eight models, and they have not been included here. The remaining preliminary gene models aligned in a consistent manner. After re-examination of the evidence, a total of 292 preliminary gene models were accepted, including 287 protein-coding gene models and five non-protein-coding gene models.
Because they are present on WGS3 heterochromatic scaffolds shorter than 40 kb, a number of previously known Y-linked protein-coding genes were missed by our analysis. We annotated six of these Y-linked genes (kl-2, kl-3, kl-5, Ory, Pp1-Y2 and Ppr-Y). The WGS sequence data were generated from clone libraries made from mixed-sex populations, so the male Y chromosome is represented by a four-fold lower density of sequence reads than the autosomes. In addition, Y-linked genes can have very large introns . Consequently, sequences on the Y chromosome are represented by shorter scaffolds in the WGS assemblies . In fact, five of the six Y-linked genes that we annotated were first characterized by analysis of short WGS scaffolds [55,56]. We used cDNA sequences to identify short WGS3 scaffolds bearing fragments of each of the six genes, concatenated these scaffolds into larger scaffolds ('super-scaffolds'; see Materials and methods), and annotated the resulting sequences to produce gene models. We produced and annotated one additional super-scaffold (linked_7) using EST evidence. The seven super-scaffolds contain 10 protein-coding gene models and one non-protein-coding gene model.
Evidence for the gene models
The WGS3 heterochromatic gene models are supported by fewer data than the Release 3 euchromatic gene models. In the WGS3 heterochromatin, 45% of gene models have an overlapping EST, compared to 78% in the Release 3 euchromatin. Twenty percent of gene models in the WGS3 heterochromatin are based on full-insert sequences of cDNAs, compared to over 70% in the Release 3 euchromatin. However, this observation is biased because half of the cDNAs in the Drosophila Gene Collection were selected for full-insert sequencing based on EST alignments to Release 2 gene models , and the WGS3 heterochromatin annotation preserves few of the Release 2 models and adds many new models. Many WGS3 heterochromatic gene models are based solely on gene predictions. Despite generating a large number of models, gene-finding algorithms were less successful at predicting heterochromatic genes than euchromatic genes. For example, only 72% of the heterochromatic gene models are supported by Genscan predictions, as opposed to 96% of the Release 3 euchromatic gene models.
Finally, we annotated six non-protein-coding RNA genes. WGS3 scaffold 211000022279294 contains rDNA sequences, including two complete copies each of the 5.8S and 28S rRNAs, a truncated 18S rRNA sequence that extends into a sequence gap, and a truncated 2S rRNA sequence that extends beyond the end of the scaffold. This region probably represents a portion of one of the two bobbed loci, which map to the X and Y chromosomes [58,59]. Two additional non-protein-coding gene models were annotated by similarity to euchromatic genes (see Supplementary Table 1 in the Additional data file).
Comparison to the Release 2 annotation
The annotation of WGS3 heterochromatic sequence has increased both the number and quality of gene models in the heterochromatin, relative to the corresponding portion of Release 2 (see Supplementary Table 2 in the additional data files). During the curation of the 297 protein-coding gene models in WGS3 heterochromatic sequence, 79 gene models from the corresponding portion of Release 2 were deleted, and 250 new gene models were created. Many of the deleted Release 2 annotations represent ORFs that overlap transposable elements. In annotating WGS3 heterochromatin, 10 Release 2 gene models were merged into five new models, one Release 2 model was split into two new models, and one Release 2 annotation was split into two models, one of which was merged with another model. Only 30 of the 130 Release 2 protein-coding gene models were preserved intact in the new annotation; 21 previous models were preserved with modifications. Thus, a much higher fraction of Release 2 gene models was modified in the WGS3 heterochromatin annotation than in the Release 3 euchromatic annotation, in which nearly two-thirds of predicted ORFs were unchanged .
As in the annotation of the Release 3 euchromatic sequence, the increased numbers of ESTs and cDNA sequences available for alignment to the WGS3 heterochromatic sequence resulted in significant improvements in the annotation of untranslated regions (UTRs), alternative transcripts, and intron-exon structures (see Supplementary Table 2 in additional data files). Twice as many genes in WGS3 heterochromatin have annotated 5' and 3' UTRs as in the corresponding portion of Release 2. The average 5' UTR length is 258 bp and the average 3' UTR length is 335 bp, both of which are close to the average UTR length of Release 3 euchromatic genes. There are 49 gene models with more than one transcript; only three were annotated in the corresponding portion of Release 2. There are 377 predicted protein-coding transcripts encoding 1,096 distinct exons, three times as many as were annotated in Release 2. The average number of introns per gene model increased from 2 to 2.7, but remains below the 3.6 average in the Release 3 euchromatin. The average length of introns increased significantly, from 892 bp in the corresponding portion of Release 2 to 3,743 bp in WGS3 heterochromatin. The longest annotated intron in WGS3 heterochromatin is 119,217 bp, dwarfing a 17,613 bp intron that was the longest in the corresponding portion of Release 2. Finally, only two introns longer than 10 kb were annotated in Release 2, but 76 gene models in WGS3 heterochromatin have introns longer than 10 kb. Whereas the majority of annotated introns in both the WGS3 heterochromatin and the Release 3 euchromatin are in the range of 50-70 bp, there are clearly more long introns in heterochromatic genes.
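Intron statistics of the kind quoted above can be derived directly from exon coordinates. The following is a minimal sketch, not the analysis code used for this annotation: the function names, the half-open coordinate convention, and the toy gene models are all assumptions made for illustration.

```python
LONG_INTRON = 10_000  # bp; the 10-kb threshold used in the text


def intron_lengths(exons):
    """Return intron lengths for one transcript.

    `exons` is a list of (start, end) pairs in half-open coordinates
    (an assumed convention); an intron is the gap between adjacent exons.
    """
    ordered = sorted(exons)
    return [
        next_start - prev_end
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]


def summarize_introns(models):
    """Summarize introns over a dict of {model_name: exon list}.

    Assumes at least one model contains an intron.
    """
    per_model = {name: intron_lengths(exons) for name, exons in models.items()}
    all_introns = [n for lengths in per_model.values() for n in lengths]
    return {
        "introns_per_model": len(all_introns) / len(per_model),
        "mean_intron_length": sum(all_introns) / len(all_introns),
        "longest_intron": max(all_introns),
        "models_with_long_intron": sum(
            any(n > LONG_INTRON for n in lengths) for lengths in per_model.values()
        ),
    }


# Toy gene models (made-up coordinates, not from the annotation).
toy_models = {
    "model_1": [(100, 200), (270, 400), (15400, 15500)],
    "model_2": [(0, 50), (110, 300)],
}
summary = summarize_introns(toy_models)
```

Counting gene models rather than individual introns in `models_with_long_intron` mirrors the way the 10-kb comparison is phrased for WGS3 heterochromatin.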
Annotations of selected regions
We carried out a preliminary analysis of the transposable element sequences found in the WGS3 heterochromatic sequence. We used a database of transposable elements  and RepeatMasker  to measure the amount of sequence that was derived from each transposable element family. Many of the sequences we identified represent only portions of elements; such fragmentary elements are often generated when transposable elements insert into one another to form complex nests . Despite this complication, we were able to estimate the contribution of each transposable element class.
The most striking observation is the high fraction of the WGS3 heterochromatic sequence that is derived from transposable elements. We found that 52% of the 20.7-Mb WGS3 heterochromatic sequence had similarity to known transposable elements. Using similar analyses, transposable elements account for just 5.0% of the Release 3 euchromatic sequence; a slightly lower value of 3.9% was obtained in the analyses reported in , which required a higher level of sequence conservation. There were also some differences in the relative contributions made by different classes of elements in heterochromatin and euchromatin. LTR elements represent 61% of euchromatic transposable elements and approximately 78% of heterochromatic elements. LINE elements represent 24% of the euchromatic and 17% of the heterochromatic transposable element sequence. TIR elements represent 15% in euchromatin and 5% in heterochromatin. No FB elements were identified using RepeatMasker; a more targeted search identified 12 kb (0.1%) of FB element sequence.
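A transposable element fraction of this kind amounts to merging overlapping repeat hits and dividing the covered bases by the total sequence length. The sketch below illustrates that calculation under stated assumptions: hits are given as half-open (start, end) intervals per scaffold, and the function names and toy data are hypothetical. It is not the procedure actually used with RepeatMasker output here.

```python
def covered_bases(intervals):
    """Count bases covered by a set of possibly overlapping intervals.

    Intervals are half-open (start, end) pairs; overlapping or abutting
    intervals are merged so that no base is counted twice.
    """
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


def repeat_fraction(hits_by_scaffold, scaffold_lengths):
    """Fraction of all scaffold sequence covered by repeat hits."""
    covered = sum(
        covered_bases(hits_by_scaffold.get(name, [])) for name in scaffold_lengths
    )
    return covered / sum(scaffold_lengths.values())


# Toy input: two 500-bp scaffolds with made-up hit coordinates.
toy_hits = {
    "scaffold_a": [(0, 100), (50, 150), (300, 400)],
    "scaffold_b": [(10, 280)],
}
toy_lengths = {"scaffold_a": 500, "scaffold_b": 500}
fraction = repeat_fraction(toy_hits, toy_lengths)
```

Merging before counting matters in heterochromatin, where nested elements often produce overlapping hits to more than one family.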
Although we found a much higher density of transposable element sequences in the heterochromatin than in the euchromatin, it is likely that we missed many heterochromatic transposable elements. In fact, we found ORFs with similarity to transposable elements, such as those encoding transposases, outside those regions we annotated as transposable elements (see Figure 7a, for example), suggesting the existence of novel transposable element families. Finally, many of the sequence gaps within scaffolds probably correspond to regions of the genome with very high transposable element density. Thus, our analysis almost certainly represents an underestimate of the total transposable element content of the WGS3 heterochromatic scaffolds. As repetitive elements are difficult to assemble using the WGS strategy, an accurate estimate of their contribution to the heterochromatic sequence awaits a more finished version of the sequence.
Cytological boundaries of centric heterochromatin
Table 1. Localization of BACs to the mitotic and polytene cytogenetic maps
Mitotic map location entries: distal to h26 (three entries); just distal to h26; h26, distal edge (two entries); h35, distal edge; h47, distal edge; distal to h58 (two entries).
For the purposes of defining the gene content of the heterochromatic portion of the genomic sequence, we provisionally designate the distal ends of the indicated BACs (Table 1) as defining the boundaries of the centric heterochromatin within the finished genomic sequence. However, the transition from euchromatin to heterochromatin appears to be gradual rather than sharp, and the resolution of these cytological mapping experiments appears to be on the order of 100 kb. Therefore, we have approximately localized the heterochromatin-euchromatin boundaries with respect to the genomic sequence, and have defined precise boundaries here simply as a convenience in discussing the genome annotation data. By this definition, the Release 3 sequence that spans the euchromatin includes 2.1 Mb of sequence in centric heterochromatin, and this sequence includes 150 curated genes described in Misra et al. . These genes are in addition to those identified in our analysis, which was restricted to the WGS3 heterochromatic scaffolds described above.
The BACs that we localized on the cytological map of the mitotic chromosomes have also been positioned using in situ hybridization to salivary gland polytene chromosomes [47,60]. Although banding in the proximal regions of the polytene chromosomes is not as distinct as in the euchromatic regions, comparison of the two datasets shows the approximate extent of overlap between the mitotic and polytene cytogenetic maps (Table 1). The boundary of the centric heterochromatin of the X chromosome at the distal edge of band h26 corresponds approximately to polytene division 20C, band h35 on chromosome arm 2L corresponds to polytene division 40A, band h45 on 2R corresponds to polytene division 41F, band h47 on 3L corresponds to polytene division 80A, and band h58 on 3R corresponds to polytene division 82C.
Our work has resulted in a substantially improved view of the sequence, organization, and gene content of the Drosophila heterochromatin. The 20.7-Mb heterochromatic WGS sequence we describe here, together with the essentially finished 116.9-Mb euchromatic sequence described in Celniker et al.  and Misra et al. , constitute the 137.7-Mb Release 3 version of the annotated D. melanogaster genomic sequence.
We have demonstrated the efficiency and utility of WGS sequencing in assembling the single-copy and middle-repetitive regions within the heterochromatic portion of a complex genome. WGS sequencing samples at random the entire portion of the genome that is clonable in 2-kb segments. The ability to clone genomic regions not clonable in BACs or other large-insert vectors makes WGS sequencing essential to study the heterochromatic regions of complex genomes. We also describe a successful annotation strategy for these highly repetitive regions of the Drosophila genome.
The heterochromatic portion of the genome has a far higher content of repetitive sequences and a lower gene density than the euchromatin. Nevertheless, the number and importance of heterochromatic genes are significant. Although the gene models are supported by fewer data than those in euchromatin, our analysis has identified 297 predicted protein-coding genes and six non-protein-coding genes in the WGS3 heterochromatic sequence, and suggests that approximately 150 genes in the Release 3 euchromatic sequence annotated by Misra et al.  are also located in the cytologically defined heterochromatin. The organization and composition of heterochromatic and euchromatic genes appear to differ; heterochromatic genes in general contain larger transcription units with some unusually large introns, and introns consist predominantly of transposable elements. Although heterochromatic genes appear to differ from euchromatic genes in some aspects of gene structure, they do not appear to be segregated in any obvious way based on function. The predicted products of the approximately 450 predicted heterochromatic genes represent diverse biochemical activities that are likely to be involved in a wide range of essential functions.
Annotation of the 2.9-Mb Adh region identified 55 vital loci and 218 protein-coding gene models (25% essential genes) in a presumed typical euchromatic region . Here, we describe 447 protein-coding gene models in the heterochromatin, including 150 models annotated in , but there are only 32 identified heterochromatic genes required for viability or fertility (7.2%) (see Background). This difference may or may not be significant, given that different euchromatic regions appear to have different ratios of essential genes to total genes . There are several possible reasons for the apparent discrepancy between our results and the genetic analyses. First, saturating genetic screens have not been reported for all of the heterochromatin, so the number of essential loci is underestimated. Second, the centric heterochromatin is defined more narrowly in the genetic analyses than in our analysis. For example, the WGS3 heterochromatic sequence includes suppressor of forked (su(f)) and the dicistronic stoned locus (stnA+stnB) (see Supplementary Table 1 in the additional data file), which map near the boundary on the X chromosome and have not been described previously as heterochromatic loci. Third, we may have predicted too many genes. In our annotation, predicted proteins encoded by gene models without full-length cDNA sequence data are shorter on average (297 amino acids) than those encoded by gene models based on full-length cDNA sequences (376 amino acids). Thus, we expect additional cDNA sequence data will result in merges of adjacent gene models, reducing the number of predicted genes. In addition, gene models with low levels of supporting evidence may not represent valid genes. In this context, it is important to note that annotation of the WGS3 scaffolds shorter than 40 kb will probably result in the identification of more heterochromatic genes. Thus, resolution of this issue will require further experimentation.
We have described assembled sequences representing 22.8 Mb of the heterochromatin, including 20.7 Mb in 'WGS3 heterochromatic sequence' and approximately 2.1 Mb in 'Release 3 euchromatic sequence'. Because most of the WGS3 scaffolds have not been mapped to chromosomes, we do not yet know how the assembled sequences are distributed within the 59 Mb of heterochromatin in the female genome and the additional 41 Mb of heterochromatin in the male genome. In addition to the 20.7-Mb sequence assembled in scaffolds, WGS3 includes 181,686 sequence traces clustered into 35,039 'degenerate scaffolds' representing repetitive sequences that were not assigned to unique locations in the assembly (E.W. Myers et al., unpublished work). These sequences include transposable elements and satellite sequences (unpublished data). Satellite sequences represent approximately 20% of the genome and can be cloned in plasmids, but such clones are inefficiently recovered and unstable [22,65]. Thus, we do not know what fraction of the remaining heterochromatic sequence is sampled by these additional, unassembled sequence traces. Therefore, WGS data cannot be used to estimate accurately the fraction of the heterochromatin that can be recovered in stable plasmid clones.
Improvements to the annotation
The annotation of the heterochromatic portion of the Drosophila genome described here is a work in progress. We have annotated protein-coding genes, and summarized preliminary observations on non-protein-coding genes and transposable elements. Our analysis was limited by the high repeat content of heterochromatin and by the unfinished quality of the WGS3 heterochromatic sequence. Our decision to delay annotation of the scaffolds shorter than 40 kb has probably resulted in failure to identify some genes, especially on the Y chromosome. Despite these limitations, the protein-coding gene annotations are generally reliable, as demonstrated by the identification of previously known heterochromatic genes, and the alignment of cDNA sequences to the draft genomic sequence and the annotated gene models. Nevertheless, the quality of the annotations will be greatly improved by the addition of more full-length cDNA sequences of heterochromatic genes and by comparative analysis using the mosquito  and D. pseudoobscura  WGS sequences.
Future analyses of the differences between euchromatic and heterochromatic sequence may lead to improvements in the performance of computational gene-prediction algorithms on heterochromatic sequence. Our observation that the gene-prediction tools Genie  and Genscan  performed relatively poorly in identifying heterochromatic genes suggests that these programs could be modified to improve their performance on heterochromatic sequence. Processing the genomic sequence, by masking repeats and reducing the distances between potential coding exons, improved the performance of the gene-prediction tools, and improved gene identification during subsequent re-annotation of the unmasked sequence. Optimization of these preprocessing steps should lead to improved performance. The annotated gene models that are supported by cDNA and/or high-quality TBLASTX matches provide a useful dataset for training and testing gene-prediction algorithms on heterochromatic sequence.
A collection of approximately 600 P transposable element insertions in heterochromatin has recently been generated  (A.Y. Konev, C.M. Yan, E. O'Hagan, S. Tickoo, G.H.K., unpublished data). These P element insertions will provide tools for the analysis of heterochromatic genes and manipulation of the heterochromatic portion of the genome. For example, P-element-mediated deletions of centric heterochromatin have been used to map genes and regions responsible for controlling gene expression and replication .
Improvements to the genomic sequence
We plan to improve the WGS3 heterochromatic sequence by filling sequence gaps and correcting assembly errors. The quality of the WGS assembly suggests a strategy for bringing these sequences to high quality: first, select a tiling path of 10-kb genomic clones from the WGS that span each scaffold; second, sequence each clone to high quality; third, assemble these 10-kb sequences to reconstruct the genomic sequence; and fourth, verify the assembly by comparison to cDNA sequence alignments and to restriction digests of genomic DNA, assayed if necessary on Southern blots. cDNA alignments will also be useful in linking separate scaffolds in cases in which the exons of a single gene lie in more than one scaffold. We gained extensive experience in each of these steps during our finishing of the euchromatic portion of the genome , and no new technology is required to bring the WGS scaffolds we have described here to finished quality.
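The first step of this strategy, choosing a tiling path, can be illustrated with a standard greedy interval-cover algorithm. This is only a sketch under assumptions: clone placements are taken as known half-open (start, end) coordinates on a scaffold, the function name and toy clones are hypothetical, and a real selection would also require a minimum overlap between neighbouring clones and weigh clone quality.

```python
def select_tiling_path(clones, scaffold_length):
    """Pick a minimal set of clones that covers [0, scaffold_length).

    `clones` is a list of (name, start, end) tuples. At each step the clone
    that starts within the covered region and reaches furthest is chosen.
    Raises ValueError if some position is not spanned by any clone.
    """
    ordered = sorted(clones, key=lambda clone: clone[1])
    path = []
    covered = 0
    i = 0
    while covered < scaffold_length:
        best = None
        # Consider every clone that starts at or before the covered frontier.
        while i < len(ordered) and ordered[i][1] <= covered:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered:
            raise ValueError(f"no clone spans position {covered}")
        path.append(best)
        covered = best[2]
    return path


# Toy 10-kb clones on a 28-kb scaffold (made-up placements).
toy_clones = [
    ("clone_a", 0, 10000),
    ("clone_b", 4000, 14000),
    ("clone_c", 9000, 19000),
    ("clone_d", 15000, 25000),
    ("clone_e", 18000, 28000),
]
tiling = [name for name, _, _ in select_tiling_path(toy_clones, 28000)]
```

In this toy case three of the five clones suffice; positions not spanned by any clone surface as an error, which corresponds to a gap that would need another approach.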
Some regions of the heterochromatin are clonable in BACs. Three small BAC contigs from the genome physical map are located in the centric heterochromatin of chromosome arms 2L, 2R and 3L . Draft sequences of BACs spanning these contigs were produced during the Release 1 phase of the genome-sequencing project [12,71]. The small BAC contig on 2L corresponds to the WGS3 scaffold that includes light and concertina (see Figure 6) , and the contigs on 2R and 3L also align to WGS3 scaffolds. We have also identified BACs containing the rolled, PARP and SNAP25 genes in pilot STS content mapping experiments in the heterochromatin. Doubtless other regions of the heterochromatin will be represented in large-insert libraries, and BACs will be useful for linking and orienting short WGS3 heterochromatic scaffolds. However, we were unable to identify BACs containing the Y-linked genes ccy and kl-5 in available Drosophila BAC libraries [60,72], perhaps due to high satellite DNA content. This suggests that not all heterochromatic regions assembled in WGS3 will be represented in BACs. We also do not yet know whether the highly repetitive nature of heterochromatin decreases the stability of sequences cloned in BACs. For these reasons, we favor a sequence-finishing strategy based on the 10-kb clones generated in the WGS.
Materials and methods
Genomic sequence alignments
The alignment of WGS3 to the essentially finished Release 3 sequence spanning the euchromatin is described in Celniker et al. . WGS3 scaffolds that did not show significant alignment were defined to be heterochromatic. In addition, five WGS3 scaffolds aligned to the centric ends of the Release 3 euchromatic sequence and extended beyond them. We used sim4  and MUMmer  to realign these five scaffolds, and the sequence extending beyond the aligned portions of the Release 3 euchromatic sequence contigs was extracted with a 60-bp overlap and included in the WGS3 heterochromatic sequence. Finally, a 133-kb region including BACR48D21 at the centric end of the Release 3 X-chromosome sequence  was not included in the Release 3 annotation of the euchromatin . This sequence was included in the analysis of WGS3 heterochromatic sequence.
The WGS3 heterochromatic sequence scaffolds were aligned to the corresponding Release 2 sequence scaffolds using BLAST2 . The alignment results were carefully examined, and the two assemblies were found to have few discrepancies.
Masking sequence scaffolds
We masked WGS3 heterochromatic sequences before annotation. We used RepeatMasker , with the default settings, to mask transposable elements  and low-complexity sequences. Next, we shortened all sequence gaps and repeat-masked regions to 70 bp.
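The gap-shortening step can be sketched in a few lines. This is an illustration rather than the code used in the pipeline: it assumes that repeat-masked bases and sequence gaps are both written as runs of `N`, and the function name is hypothetical. Runs of 70 bp or fewer are left unchanged, consistent with the 70-bp maximum described earlier.

```python
import re

MAX_RUN = 70  # bp; cap applied to masked regions and sequence gaps


def shorten_masked_runs(seq, max_run=MAX_RUN):
    """Collapse every run of N longer than `max_run` down to `max_run` Ns.

    Assumes masked repeats and gaps are both represented by 'N'
    (an input convention chosen for this sketch).
    """
    pattern = re.compile(r"[Nn]{%d,}" % (max_run + 1))
    return pattern.sub("N" * max_run, seq)


# Toy sequence: a 500-bp masked block and a 10-bp gap.
toy = "ACGT" + "N" * 500 + "GGCC" + "N" * 10 + "TT"
shortened = shorten_masked_runs(toy)
```

Because coordinates shift after shortening, gene models built on the processed sequence have to be mapped back to the unmasked scaffolds, which is why the re-alignment step with sim4 follows.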
Databases and tools
To annotate the WGS3 heterochromatic sequence, we used the computational analysis pipeline and databases described in Mungall et al. [51]. This pipeline aligns Drosophila ESTs, cDNA sequences, and other sequences in GenBank using sim4, performs DNA and protein sequence similarity searches of the GenBank and SwissProt/TrEMBL databases using BLASTX and TBLASTX, and executes the gene-prediction algorithms Genie and Genscan. The results generated by the pipeline were filtered using the Bioinformatics Output Parser (BOP), and the filtered results were curated using the tool Apollo.
To determine the intron-exon structure of gene models, curators visually inspected the alignment of computational evidence types to the WGS3 heterochromatic sequences using Apollo. Alternative transcripts supported by EST evidence and UTRs were annotated. The criteria used to curate the gene models in the WGS3 sequence were identical to those used in the annotation of the euchromatin, with two exceptions. First, computational results derived from Genscan were not judged by their scores, as it was empirically determined that low-scoring Genscan results often correctly predicted the intron-exon structures suggested by other evidence types such as cDNAs. Second, the predicted proteins of the WGS3 annotation were compared to a curated transposable element dataset. Gene models with an alignment of at least 50 nucleotides with at least 95% identity to transposable elements were rejected.
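The rejection rule for transposable-element-like gene models can be expressed as a simple threshold filter. The sketch below is illustrative only: the `TEHit` record and the function names are hypothetical, and it assumes that each alignment to the transposable element dataset has already been reduced to an aligned length in nucleotides and a percent identity.

```python
from dataclasses import dataclass
from typing import Iterable, List

MIN_LENGTH = 50       # nucleotides, from the text
MIN_IDENTITY = 95.0   # percent identity, from the text


@dataclass(frozen=True)
class TEHit:
    """Hypothetical summary of one gene-model alignment to a TE sequence."""
    model_id: str
    aligned_length: int
    percent_identity: float


def is_te_like(hit: TEHit) -> bool:
    """True when a hit meets both the length and the identity thresholds."""
    return hit.aligned_length >= MIN_LENGTH and hit.percent_identity >= MIN_IDENTITY


def filter_gene_models(model_ids: Iterable[str], hits: Iterable[TEHit]) -> List[str]:
    """Return the gene models with no qualifying TE alignment, in input order."""
    rejected = {hit.model_id for hit in hits if is_te_like(hit)}
    return [model_id for model_id in model_ids if model_id not in rejected]


if __name__ == "__main__":
    hits = [TEHit("m1", 120, 98.0), TEHit("m2", 40, 99.0), TEHit("m3", 200, 90.0)]
    print(filter_gene_models(["m1", "m2", "m3", "m4"], hits))
```

Both conditions must hold for rejection, so a short high-identity match or a long low-identity match does not by itself remove a gene model.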
Gene models produced on the masked scaffolds were mapped to the unmasked WGS3 sequence using sim4. Gene models that aligned with less than 95% identity to the unmasked sequence were not preserved. Gene models were checked to ensure that they did not overlap transposable elements and that they did not have major alterations of their intron-exon structure due to the presence of unmasked data. Gene models were refined using evidence that had not aligned to the masked WGS3 sequence.
Linking scaffolds with cDNAs
In WGS3, exons of the kl-5 gene are distributed over four scaffolds. These scaffolds were concatenated and annotated to produce a kl-5 gene model. Similar 'super-scaffolds' (linked_1 to linked_6) were constructed for the genes kl-2, kl-3, kl-5, Ory, Pp1-Y2, and Ppr-Y. An additional super-scaffold (linked_7) was constructed based on EST evidence. The super-scaffolds were constructed as follows, with each WGS3 scaffold indicated by the last five digits of the scaffold ID, and relative orientation indicated by F (forward) or R (reverse): linked_1 (80774R-78545F-78270R-78383F-79796R-78126F-78519R); linked_2 (80705F-80550F-80543F-80769R-80048R); linked_3 (80569R-79234F-79561F-80324F); linked_4 (80310F-80189F-80349R-80306F); linked_5 (78068R-78764R-80118R); linked_6 (80329F-78590F); and linked_7 (78279R-78567F). Because the gaps between individual scaffolds in a super-scaffold are not spanned by identified genomic clones, their lengths are undefined. These gaps are represented with a string of 1,000 Ns, following the convention established for the Release 1 genomic sequence.
Fluorescence in situ hybridization (FISH)
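The super-scaffold notation lends itself to a short worked example. The following Python sketch parses a layout string such as that of linked_5, reverse-complements scaffolds marked R, and joins the oriented scaffolds with 1,000 Ns. The function names and the toy sequences are hypothetical; the sketch assumes that R denotes the reverse complement of the scaffold as stored, and it is not the procedure actually used to build the published super-scaffolds.

```python
from typing import Dict, List, Tuple

GAP = "N" * 1000  # undefined-length gap convention described in the text

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def parse_layout(layout: str) -> List[Tuple[str, str]]:
    """Split '78068R-78764R' into [('78068', 'R'), ('78764', 'R')]."""
    parts = []
    for token in layout.split("-"):
        scaffold_id, orientation = token[:-1], token[-1:]
        if orientation not in ("F", "R") or not scaffold_id:
            raise ValueError(f"unrecognized layout token: {token!r}")
        parts.append((scaffold_id, orientation))
    return parts


def build_super_scaffold(layout: str, sequences: Dict[str, str]) -> str:
    """Orient each scaffold, then join the scaffolds with 1,000-N gaps."""
    oriented = []
    for scaffold_id, orientation in parse_layout(layout):
        seq = sequences[scaffold_id]
        oriented.append(seq if orientation == "F" else reverse_complement(seq))
    return GAP.join(oriented)


if __name__ == "__main__":
    # Toy sequences standing in for real scaffolds (illustrative only).
    toy = {"78068": "AACG", "78764": "TTGA", "80118": "GGGC"}
    linked_5 = build_super_scaffold("78068R-78764R-80118R", toy)
    print(len(linked_5))
```

Because the gap length is a convention rather than a measurement, coordinates within a super-scaffold are meaningful only within each component scaffold.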
Mitotic chromosomes from third instar larval neuroblasts were obtained by standard procedures. Slides were aged at room temperature for 24 h or for 2 h at 60°C, pretreated in 100 μg/ml RNaseA/2x SSC, pH 7 at 37°C for 30-60 min, immersed in a 70/95/100% ethanol series for 2 min each, then air-dried and kept on a slide warmer at 45°C. BAC DNA (1 μg) was labeled with biotin-16-dUTP (Roche) or digoxigenin-11-dUTP (Roche) by nick translation. Labeled BAC DNA (200-300 ng per 22 × 22 mm hybridization area) was precipitated at -80°C for 30 min or overnight at -20°C with salmon sperm DNA (2 μg), 1/10th vol sodium acetate, and 2 vol cold 100% ethanol. Probes were centrifuged for 20 min at 14,000 rpm and 4°C, washed in 70% ethanol, and briefly dried in a Sorvall SpeedVac. Hybridization mix (10-15 μl) was added to each dried pellet. All probes were initially hybridized in a solution containing 55% formamide/2x SSC, 20% dextran sulfate, 1% Tween-20, incubated overnight at 37°C, and washed in 55% formamide/2x SSC at 42°C for 20 min, followed by four washes in 2x SSC at 37°C (2 min each) and 1-3 washes in 0.1x SSC at 60°C (1 min each). For those BACs that demonstrated significant cross-hybridization to other chromosomal regions, FISH was repeated using higher stringency (60% formamide) hybridization solution and post-hybridization washes. In single-color hybridization experiments, biotinylated or digoxigenin-labeled BAC probes were detected using FITC-avidin (Vector Laboratories) or FITC anti-digoxigenin (Roche), respectively. Detections were performed overnight at 4°C or 1-3 h at room temperature. For multi-BAC (two-color) FISH, biotinylated probes were detected with FITC-avidin and digoxigenin-labeled probes were detected with Rhodamine anti-digoxigenin (Roche). After incubation with avidin or anti-digoxigenin, slides were washed in Coplin jars on a rotating shaker for three 5-min washes in 4x SSC/0.1% Tween-20.
DNA was counterstained with Vectashield (Vector Laboratories) containing 1-5 μg/ml 4',6-diamidino-2-phenylindole (DAPI). The location of the BAC signals relative to the DAPI banding pattern on the heterochromatic map was determined by visual analysis in Photoshop (Adobe). In addition, an independent quantitative analysis using IP Labs (Signal Analytics, Vienna, VA) and a fluorescence quantitation script was performed on each image. The fluorescence levels along lines drawn through the chromosome axis were plotted for the DAPI and BAC signals (Figure 8b), which produces a more precise localization than is possible by visual inspection of the images. A minimum of 10 prometaphase chromosomes were analyzed for each BAC, and localizations were determined by consensus.
Heterochromatin in the 'Release 3 euchromatic sequence'
The BACs at the boundaries between euchromatin and centric heterochromatin (see Table 1) identify 2.1 Mb (150 curated gene models) of heterochromatic sequence at the centric ends of the high-quality Release 3 sequence spanning the euchromatin. The distal ends of the boundary BACs were identified using BAC end sequences. The sequence proximal to these positions is provisionally defined to be heterochromatic: X, 21,561,835 to 21,912,668 bp (0.351 Mb, six genes); 2L, 21,834,050 to 22,217,931 bp (0.384 Mb, 25 genes); 2R, 1 to 467,915 bp (0.468 Mb, 44 genes); 3L, 22,879,753 to 23,352,213 bp (0.472 Mb, 11 genes); and 3R, 1 to 378,655 bp (0.379 Mb, 64 genes).
WGS3 heterochromatic sequence and annotations will be deposited in GenBank and the FlyBase GadFly database [75], and the corresponding Release 2 sequences will be subsumed into the new sequence accessions. The WGS3 heterochromatic sequence, the annotations, and the evidence that supports them will be made available at FlyBase [33].
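The per-arm figures can be checked against the stated totals with a few lines of arithmetic. The Python sketch below assumes that the coordinates are 1-based and inclusive, which is consistent with the rounded lengths given for each arm; the variable names are illustrative.

```python
# (start, end, curated gene models) for each arm, copied from the text.
INTERVALS = {
    "X": (21_561_835, 21_912_668, 6),
    "2L": (21_834_050, 22_217_931, 25),
    "2R": (1, 467_915, 44),
    "3L": (22_879_753, 23_352_213, 11),
    "3R": (1, 378_655, 64),
}


def interval_length(start: int, end: int) -> int:
    """Length in bp of a 1-based, inclusive interval (an assumption)."""
    return end - start + 1


lengths = {arm: interval_length(s, e) for arm, (s, e, _) in INTERVALS.items()}
total_bp = sum(lengths.values())
total_genes = sum(genes for _, _, genes in INTERVALS.values())

if __name__ == "__main__":
    for arm, bp in lengths.items():
        print(f"{arm}: {bp / 1e6:.3f} Mb")
    print(f"total: {total_bp / 1e6:.1f} Mb, {total_genes} genes")
```

Under this assumption the five intervals sum to 2,053,747 bp, which rounds to the 2.1 Mb quoted above, and the gene counts sum to 150.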
Additional data files
An additional data file containing Supplementary Tables 1 and 2 is available as a Word document.
We thank Sima Misra and Casey Bergman for helpful discussions; Andrew Skora, Christopher Yan, and David Acevedo for assistance with FISH experiments; Robert Svirskas, John Tupy, Pavel Hradecky, Colin Wiel, Bruno Ribeiro, Marcelo Alvim and Maria Vibranovski for assistance with informatics; Erwin Frise, Eric Smith and Dave Hurley for computer systems support; and Catherine Nelson for editing the manuscript. This work was supported by Celera Genomics, the Howard Hughes Medical Institute, NSF grant MCB0213163 to B.T.W., fellowships from the CNPq and the Pew Latin American Fellows Program to A.B.C., NIH grant R01 HG00747 to G.H.K. and NIH grant P50 HG00750 to G.M.R. The work supported by P50-HG00750 was carried out under Department of Energy Contract DE-AC0376SF00098, University of California.
- 1. Heitz E: Das Heterochromatin der Moose. I Jahrb Wiss Botanik. 1928, 69: 762-818.
- 2. John B: The biology of heterochromatin. In Heterochromatin: Molecular and Structural Aspects. Edited by: Verma RS. 1988, Cambridge: Cambridge University Press, 1-147.
- 22. Sun X, Le H, Wahlstrom J, Karpen GH: Sequence analysis of a functional Drosophila centromere. Genome Res.
- 23. Heitz E: Uber α- und β-Heterochromatin sowie Konstanz und Bau der Chromomeren bei Drosophila. Biol Zentbl. 1934, 54: 588-609.
- 32. Kaminker JS, Bergman CM, Kronmiller B, Carlson J, Svirskas R, Patel S, Frise E, Wheeler DA, Lewis SE, Rubin GM, et al: The transposable elements of the Drosophila melanogaster euchromatin: a genomics perspective. Genome Biol. 2002, 3: research0084.1-0084.20. 10.1186/gb-2002-3-12-research0084.
- 33. FlyBase. [http://flybase.bio.indiana.edu]
- 47. Celniker SE, Wheeler DA, Kronmiller B, Carlson JW, Halpern A, Patel S, Adams M, Champe M, Dugan SP, Frise E, et al: Finishing a whole-genome shotgun sequence assembly: Release 3 of the Drosophila euchromatic sequence. Genome Biol. 2002, 3: research0079.1-0079.14. 10.1186/gb-2002-3-12-research0079.
- 51. Mungall CJ, Misra S, Berman BP, Carlson J, Frise E, Harris N, Marshall B, Shu S, Kaminker JS, Prochnik SE, et al: An integrated computational pipeline and database to support whole-genome sequence annotation. Genome Biol. 2002, 3: research0081.1-0081.11. 10.1186/gb-2002-3-12-research0081.
- 54. Carvalho AB, Vibranovski MD, Carlson JW, Celniker SE, Hoskins RA, Rubin GM, Sutton GG, Adams MD, Myers EW, Clark AG: Y chromosome and other heterochromatic sequences of the Drosophila melanogaster genome: how far can we go?. Genetica.
- 61. Yasuhara J, Marchetti M, Fanti L, Pimpinelli S, Wakimoto BT: A strategy for mapping the heterochromatin of chromosome 2 of Drosophila melanogaster. Genetica.
- 63. RepeatMasker. [http://ftp.genome.washington.edu/RM/RepeatMasker.html]
- 67. Human Genome Sequencing Center: Baylor College of Medicine. [http://hgsc.bcm.tmc.edu]
- 71. Berkeley Drosophila Genome Project. [http://www.fruitfly.org]
- 72. BACPAC Resources. [http://www.chori.org/bacpac]
- 75. FlyBase GadFly Genome Annotation Database. [http://www.fruitfly.org/cgi-bin/annot/query]