Retrotransposition of gene transcripts leads to structural variation in mammalian genomes
Retroposed processed gene transcripts are an important source of material for new gene formation on evolutionary timescales. Most prior work on gene retrocopy discovery compared copies in reference genome assemblies to their source genes. Here, we explore gene retrocopy insertion polymorphisms (GRIPs) that are present in the germlines of individual humans, mice, and chimpanzees, and we identify novel gene retrocopy insertions in cancerous somatic tissues that are absent from patient-matched non-cancer genomes.
Through analysis of whole-genome sequence data, we found evidence for 48 GRIPs in the genomes of one or more humans sequenced as part of the 1,000 Genomes Project and The Cancer Genome Atlas, but which were not in the human reference assembly. Similarly, we found evidence for 755 GRIPs at distinct locations in one or more of 17 inbred mouse strains but which were not in the mouse reference assembly, and 19 GRIPs across a cohort of 10 chimpanzee genomes, which were not in the chimpanzee reference genome assembly. Many of these insertions are new members of existing gene families whose source genes are highly and widely expressed, and the majority have detectable hallmarks of processed gene retrocopy formation. We estimate the rate of novel gene retrocopy insertions in humans and chimps at roughly one new gene retrocopy insertion for every 6,000 individuals.
We find that gene retrocopy polymorphisms are a widespread phenomenon, present a multi-species analysis of these events, and provide a method for their ascertainment.
Keywords: Acute Myeloid Leukemia, Reference Genome, Source Gene, Feline Immunodeficiency Virus, Inbred Mouse Strain
Abbreviations
AML: acute myeloid leukemia
ASW: African ancestry in south-west US
CEU: Utah residents with northern and western European ancestry
CHB: Han Chinese in Beijing, China
CHS: Han Chinese in southern China
CLM: Colombian in Medellin, Colombia
CNV: copy number variant
FDR: false discovery rate
FIN: Finnish from Finland
GBR: British from England and Scotland
GRIP: gene retrocopy insertion polymorphism
JPT: Japanese in Tokyo, Japan
LINE: long interspersed element
LUSC: lung squamous carcinoma
LWK: Luhya in Webuye, Kenya
MXL: Mexican ancestry in Los Angeles
ORF: open reading frame
PCR: polymerase chain reaction
PUR: Puerto Rican in Puerto Rico
siRNA: small interfering RNA
TCGA: The Cancer Genome Atlas
TSI: Toscani in Italy
YRI: Yoruba in Ibadan, Nigeria
Mammalian genomes contain thousands of pseudogenes: stretches of DNA sequence with homology to functional genes. As an example, pseudogene.org documents 17,061 human pseudogenes in build 65 and 19,119 mouse pseudogenes in build 60 [1, 2, 3]. A recent, more stringent survey identified 14,112 pseudogenes in the human genome. Pseudogenes originate through a variety of mechanisms, including retrotransposition of processed mRNAs (processed pseudogenes), segmental duplication, and inactivating mutations. Processed pseudogenes are derived from spliced transcripts and lack the intron-exon structure of their source gene.
Retrotransposition refers to the insertion of DNA sequences mediated by an RNA intermediate. In humans, this process is carried out chiefly through the reverse-transcriptase and endonuclease functions of the LINE-1 ORF2 protein, with assistance from the ORF1 protein, which binds RNA and functions as a chaperone. In addition to mobilizing its own transcripts, LINE-1 mobilizes other transcripts including, but probably not limited to, Alu, SINE-VNTR-Alu, and processed pseudogenes. The specific reverse transcriptase responsible for processed pseudogene formation varies among species depending on the retroelement content of the genome. For example, in S. cerevisiae, processed pseudogenes are mobilized by Ty1 elements. In this study we refer to retroposed gene transcripts as gene retrocopies to avoid confusion with the functional connotations of terms such as 'pseudogene' and 'retrogene'. When used, 'pseudogene' (or retropseudogene) refers to a non-functional gene retrocopy, while 'retrogene' refers to a gene retrocopy with intact activity.
A growing number of contemporary studies highlight the extent to which individuals differ in terms of inserted retrotransposon sequences. However, there has been little study of how mammalian genomes in a given population differ from one another and from the reference assembly due to gene retrocopy insertions, although detection of the phenomenon has been discussed briefly [17, 18]. Retrogene insertion polymorphisms have been described in a study of 37 Drosophila melanogaster inbred lines, based on the detection of intron presence/absence polymorphisms.
Pseudogenes affect genome function in several important ways. Although most gene retrocopies lack the 5' promoter and regulatory regions present at the site of origin, mobilization to another genomic location can put the retrocopy in a novel regulatory context that may allow it to be transcribed [20, 21, 22]. Transcription of certain gene retrocopies can be widespread, specific to a tissue or cell type, or specific to particular tumors. Transcribed gene retrocopies can regulate the source transcript through an antisense mechanism, are a source of siRNAs [25, 26, 27, 28], can affect the stability of the source transcript, and can affect expression of the source gene by acting as a molecular sponge that competes with the source transcript for miRNA binding owing to their sequence similarity. Retrocopies and retrogenes can exert direct effects if the nearby genomic architecture promotes their expression, as is the case for a novel insertion of the FGF4 transcript in the domestic dog, which leads to the chondrodysplastic phenotype that typifies many dog breeds. On an evolutionary timescale, gene duplication through retrotransposition of processed transcripts constitutes a major mechanism for new gene formation, typified by examples such as the jingwei gene in Drosophila.
Here, we refer to a processed gene transcript that is present as a retrotransposed insertion in one or more individuals but absent from the reference genome as a gene retrocopy insertion polymorphism (GRIP). Insertions that are not polymorphic and not transmissible (somatic insertions) are not referred to as GRIPs. We present evidence that the interspersed insertion of processed mRNAs into the genome is an ongoing mechanism of mutation in humans, mice, and chimps, and can occur in tumors. Additionally, the availability of our application for detecting these events will enable all large-scale genome sequencing projects to include gene retrocopy insertions in their analysis of genomic variation.
Results and discussion
A catalog of non-reference human gene retrocopy insertions
In addition to samples sequenced by the 1,000 Genomes Project, we took advantage of the many samples sequenced to high depth by The Cancer Genome Atlas (TCGA). One aim of TCGA is to study the whole genomes of tumor and normal samples obtained from the same patient. We analyzed 85 paired genomes sequenced to high coverage depth (Table S3 in Additional file 1) and found 26 distinct GRIPs (Table S4 in Additional file 1). This dataset also provided us with the opportunity to search for cancer-specific somatic gene retrocopy insertions.
Most of these insertions bear the hallmarks of processed transcript insertions generated by retrotransposition. The insertion side of the 3' junctions terminates in poly-A sequences; we detected target site duplications in all instances where both junctions are detectable, and we could obtain exon-exon junctions from 39 out of 48 of the inserted sequences through a local sequence assembly approach (see Materials and methods and Additional file 2 for junctions derived from 1,000 Genomes samples). Predicted endonuclease cleavage sites agree with the consensus TTTT/AA reported in prior studies (Figure S5 in Additional file 3) [8, 36].
In order to estimate the sensitivity and specificity of our detection scheme, we created a total of 2,000 simulated processed gene retrocopy insertions from 200 source genes and spiked them into the BAM file for sample TCGA-60-2711-11 and detected them by running GRIPper (see Materials and methods, Tables S10 and S11 in Additional file 1). The overall precision was 100% at the effective minimum read depth for paired TCGA samples (60× from the combined contribution of a tumor genome sequenced to 30× and the matched normal genome sequenced to 30× depth) and recall was 75.1% (Table S10 in Additional file 1). We found that recall varies depending on the identity of the source gene (Table S11 in Additional file 1, Figure S4 in Additional file 3).
Functional characteristics of source genes
GO term enrichment for human GRIP progenitor genes
GO:0006414: translational elongation | 4.78 × 10^-9 | 6.24 × 10^-6
[GO term missing] | 8.63 × 10^-8 | 1.13 × 10^-4
GO:0003735: structural constituent of ribosome | 2.66 × 10^-7 | 3.06 × 10^-4
GO:0033279: ribosomal subunit | 1.37 × 10^-6 | 1.53 × 10^-3
[GO term missing] | 1.91 × 10^-6 | 2.14 × 10^-3
GO:0022626: cytosolic ribosome | 2.92 × 10^-6 | 3.28 × 10^-3
GO:0005198: structural molecule activity | 3.36 × 10^-5 | 3.87 × 10^-2
GO:0015934: large ribosomal subunit | 3.58 × 10^-5 | 4.02 × 10^-2
GO:0044445: cytosolic part | 6.23 × 10^-5 | 7.01 × 10^-2
GO:0030529: ribonucleoprotein complex | 7.34 × 10^-5 | 8.24 × 10^-2
Detection of cancer-specific gene retrocopy insertions
The genomes sequenced by TCGA include DNA derived from both normal tissue and a tumor sample taken from the same individual, enabling discovery of putative cancer-specific somatic variants. We analyzed pairs of tumor and normal genomes from seven different types of cancer: 24 pairs from acute myeloid leukemia (AML) patients, 12 breast cancer (BRCA), 5 colorectal adenocarcinoma (COAD), 15 glioblastoma multiforme (GBM), 6 lung adenocarcinoma (LUAD), 13 lung squamous carcinoma (LUSC), and 10 ovarian carcinoma (OV). In screening these 85 pairs of tumor and normal genomes by combining the calls as described in the Materials and methods section, we discovered three novel somatic gene retrocopy insertions from two lung tumors, with no corresponding read pairs in the matched normal samples and no supporting read pairs in any other sample in this study. The three source genes are selenoprotein T precursor (SELT), smooth muscle myosin heavy chain 11 (MYH11), and a spliced non-coding RNA known as growth arrest-specific 5 (GAS5). While the presence of these insertions is not enough to make any causative link with carcinogenesis in these patients, it does strongly suggest that somatic insertions of spliced mRNAs derived from protein-coding genes may occur, at least in the context of cancer. We note that MYH11 rearrangements involving CBFβ are implicated in cancers including acute myeloid leukemia and sarcomas of the small bowel [40, 41], and GAS5 depletion has been noted in breast cancer. We also note that the MYH11 insertion site occurs in a region that is sometimes deleted as a segregating variant cataloged in the Database of Genomic Variants, but the sample LUSC-2722, which has the novel MYH11 retrocopy in the tumor genome, does not have this deletion (Table S12 in Additional file 1).
Somatic LINE-1 mediated retrotransposition events have been observed in lung, colon, ovarian, and prostate tumors for transposable element transcripts [44, 45, 46], but the mobilization of gene-derived transcripts is novel, and may be a means for the amplification of oncogene copy number in some tumors. The discordant read mappings leading to these three calls are shown in Figures S1 to S3 in Additional file 3.
Novel gene retrocopy insertions in inbred mouse strains
We sought to extend our catalog of GRIPs to mice, as the deeply sequenced genomes of 17 different inbred mouse strains are now available and a small set of GRIPs has been described. We applied the same method as described for detecting human GRIPs, substituting mouse genome annotations, and identified a total of 755 insertions from 610 distinct source genes (Table S6 in Additional file 1). We found that 63 loci overlap with structural variants obtained from a mouse of the DBA inbred strain using HYDRA-SV. Since the mouse reference (mm9/NCBI m37) is assembled from sequences derived from the C57BL/6J strain, it is not surprising that we detected only one novel GRIP in that strain, which could have occurred in the generations between the last common ancestor of the mouse sequenced by the Mouse Genome Sequencing Consortium and the more recently sequenced individual. Of the 755 insertions identified in our analysis, 201 (26.62%) occurred in annotated genes. This is a significant depletion compared to the 40.38% of the genome covered by the UCSC Genes annotation set (see Materials and methods, P = 1.76 × 10^-14, proportions test).
Novel gene retrocopy insertions in chimpanzees
In addition to whole-genome sequence data available for humans and mice, genome sequences for ten individual chimpanzees are available through the PanMap project. These genomes were sequenced to approximately 10× average depth and are available in BAM format aligned to the Chimp Genome Sequencing Consortium 2.1/panTro2 reference assembly. We downloaded these and used the same pooling strategy used for the low-coverage data from the 1,000 Genomes Project to identify novel gene retrocopy insertions present in one or more of the ten individual chimps but absent from Clint, the reference chimp. In total, we identified 19 novel GRIPs, 9 of them in introns (Table S9 in Additional file 1).
Distribution of GRIPs in human populations
Estimating the rate of gene retroposition in humans
The rate at which new gene retrocopies are formed by retrotransposition may be related to the rate of new gene formation. Our population-level data in humans allow a straightforward estimate, similar in method to a previous estimate of the retrotransposition rate for LINE-1 elements. Watterson's equation estimates the mutation rate μ, which in this context refers to the per-generation rate of processed gene transcript retroposition (see Materials and methods). Using the 48 non-reference human GRIPs identified from the 1,024 human genomes analyzed in this study, we estimate that 1 in every 6,256 individuals has a novel, heritable gene retrocopy. Since this ignores any segregating retrocopies in the reference genome, we sought to estimate the number of reference retrocopies by cross-referencing the deletion calls from the 1,000 Genomes Project with annotated pseudogenes in the human reference genome. We found evidence for 10 GRIPs in the reference (see Materials and methods), yielding a total of 58 segregating insertions for 1,025 individuals (when the reference genome is included as one individual). This increases our estimate of μ to 1 new gene retrocopy insertion per 5,178 individuals per generation (see Materials and methods). In order to apply Watterson's formulae without bias, the chosen markers must be selectively neutral. A Tajima's D test yields a value of -0.99, indicating that while there may be some tendency toward purifying selection, the detected human GRIPs are, when considered on the whole, evolving approximately neutrally, which supports this method of estimation. Performing the same estimation for chimpanzees using an effective population size of 11,413, which was calculated from the same whole-genome sequence data, we arrive at an estimate of 1 new insertion per 6,804 chimps. This is quite comparable to the human estimate, with the small discrepancy most likely due to a lack of information concerning pseudogene deletions relative to the chimp reference assembly.
A large fraction of the human genome is covered by copy number variants (CNVs), including regions containing genes, and a number of recent publications have highlighted the extent of variability in gene copy number due to CNVs between individual humans. Starting from a large-scale set of deletions detected in human populations, Schrider and Hahn calculate that any two humans differ by over 100 gene-containing CNVs. Approximately 9% of human genes appear to vary in copy number, mostly between 0 and 5 copies, likely through segmental duplication. The data we have presented here add to what is known about gene copy number variation by highlighting another mechanism separate from the large duplications that cause copy number variability of intron-containing gene loci. Through retrotransposition, GRIPs occur as interspersed insertions of processed transcripts. Whereas segmentally duplicated genes are likely to share the same regulatory regime, gene retrocopy insertions often mobilize copies into novel regulatory contexts, where they tend to experience an increased likelihood of adaptive evolution. Many of these new gene retrocopy insertions will be inactive due to missing promoters, frameshifts, and truncation. That said, GRIPs are by definition still segregating, having been neither lost nor fixed through genetic drift, so they are likely to be relatively recent insertions that have suffered fewer inactivating mutations to the open reading frame and any intact regulatory elements.
It is clear that processed gene transcripts are retrotransposed in the germline, and by extension one might imagine that this also occurs in somatic tissues. Transgenic mice with a LINE-1 cassette facilitating detection of insertion events show extensive variation in transposition frequency across tissues, with particularly high activity in neural progenitor cells in the brain. There is evidence for somatic retrotransposition during early development in Drosophila and in humans. Somatic retrotransposition of retroelements may also occur in human cancers [44, 45] and contributes to a variety of human diseases. We have demonstrated that insertions of retrotransposed processed transcripts can contribute to somatic variation in tumor tissue. Given this observation, studies of somatic retrotransposition of processed mRNAs in a variety of somatic tissues, including the brain, may yield novel retrocopy insertions, given evidence for elevated retrotransposition in some specific neural tissues from quantitative PCR and targeted ascertainment of insertion sites. That said, a recent study indicates that some neural tissues do not appear to support a high level of retrotransposition.
Each new insertion of a gene retrocopy presents a new opportunity for the evolution of a new gene or the modification of an existing function at the site of insertion. There are a number of examples where inserted gene retrocopies have acquired new functions. One notable example is the insertion of cyclophilin A (PPIA) into TRIM5α in the owl monkey, leading to a novel gene fusion that confers resistance to HIV-1 infection [72, 73]. A similar mutation involving the insertion of a cyclophilin A retrocopy into TRIM5α also occurred independently in rhesus macaques, leading to resistance to HIV-2 and feline immunodeficiency virus infection [74, 75]. In total, we report 22 human, 201 mouse, and 9 chimp GRIPs in introns or exons that could lead to novel gene fusions with modified functions. While human GRIPs occur in annotated genes about as often as would be expected by chance, we identified a marked depletion of mouse GRIPs in genes. This may indicate purifying selection due to deleterious effects on the genes hosting the GRIPs. In any case, the ability to detect this form of genomic variation opens new questions about the biological consequences of gene retrocopy insertion, and this study provides a foundation for future investigation into the functional consequences of gene retrocopy insertion polymorphisms.
Materials and methods
Gene retrocopy insertion detection from mapped paired end reads
Paired end reads consist of two DNA sequences flanking an internal unsequenced region. Given the average insert size of a sequencing library and the locations where each end of a paired end fragment maps relative to a reference genome, a pair of mappings is termed concordant if the sequenced ends are mapped to the reference genome at an interval and orientation compatible with the library construction. Conversely, a pair of mappings is termed discordant if the paired ends are mapped too far apart or in the wrong orientation relative to the reference genome to which they are mapped. Given sufficient read depth and agreement between multiple paired reads, discordant read pairings can contain information about genome rearrangements relative to the reference if the rearrangements bring two pieces of the genome into proximity that are distant from one another in the reference genome. Here, we use discordant read mappings to detect GRIPs by finding multiple discordant mappings that connect exonic sequences to a consistent location distant from the exons. We refer to the genome or genomes from which a sequencing library was generated and analyzed as the query genome. For some region of a chromosome, if the sequence of the query genome matches the sequence of the reference genome, read pairs mapped to that region will be concordant, as shown in the normal mapping of Figure 1. Alternatively, if a region in the query genome contains a structural variant (insertion, deletion, inversion, and so on) relative to the reference, some or all of the read pairs mapping to that location may be discordant. Figure 1 also demonstrates the pattern of discordant mappings indicative of a gene retrocopy insertion in the query genome. In order to confidently predict the presence of a gene retrocopy in a query genome or genomes, we require at least eight distinct mappings between the source gene and its insertion location, with at least two mappings spanning each junction.
Illumina sequencing chemistry yields paired reads where the first read in the pair is sequenced on the top strand and the second read is sequenced on the bottom strand, such that the first read maps to the top (+) strand of the reference genome and the second read maps to the bottom (-) strand of the reference genome. Given this property, the reads mapping to the 5' side of the predicted insertion site must be on the top strand and the reads on the 3' side of the site must be on the bottom strand. Likewise, the mappings of the discordant reads themselves must be consistent with this pattern. We also require that the reads mapping to the source gene correspond to at least two distinct exons. Additionally, we filter out putative insertion sites where the site is in a region of the genome that contains an annotated or unannotated pseudogene. Unannotated pseudogenes are ascertained by comparing the insertion site ±500 bp to the rest of the reference genome using BLAT. This method (GRIPper) was implemented in Python using pysam and is available from GitHub. An archival version of the software is also available as Additional file 4; however, we suggest using the most up-to-date version via GitHub.
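The support criteria described above can be illustrated with a minimal sketch. This is not the GRIPper implementation: the `Pair` record, the `supports_insertion` function, and the reading of "two mappings spanning each junction" as "two strand-consistent anchors on each side of the site" are assumptions made for illustration only.

```python
from collections import namedtuple

# Hypothetical record for one discordant pair: one read anchored near the
# candidate insertion site, its mate mapped to an exon of the source gene.
Pair = namedtuple("Pair", "anchor_pos anchor_strand exon_id")

MIN_PAIRS = 8          # at least eight distinct supporting mappings
MIN_PER_JUNCTION = 2   # assumed: at least two mappings on each side of the site
MIN_EXONS = 2          # mates must hit at least two distinct exons


def supports_insertion(pairs, site):
    """Return True if the discordant pairs satisfy the support criteria."""
    distinct = set(pairs)  # count each distinct mapping only once
    # Anchors 5' of the site must be on the top strand,
    # anchors 3' of the site must be on the bottom strand.
    left = [p for p in distinct
            if p.anchor_pos <= site and p.anchor_strand == "+"]
    right = [p for p in distinct
             if p.anchor_pos > site and p.anchor_strand == "-"]
    exons = {p.exon_id for p in left + right}
    return (len(left) + len(right) >= MIN_PAIRS
            and len(left) >= MIN_PER_JUNCTION
            and len(right) >= MIN_PER_JUNCTION
            and len(exons) >= MIN_EXONS)
```

In practice the pairs would be extracted from a BAM file and the pseudogene filter would be applied afterwards; both steps are omitted here.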
Breakpoint ascertainment from soft-clipped reads
Many of the human samples analyzed in this study were mapped using bwa, which allows for part of a read to align as long as the seed sequence meets the minimum mismatch criteria. The unaligned portion of these mappings is marked as soft-clipped. This provides a convenient means to check for breakpoints by looking for consistent break ends corresponding to the 5' and 3' junctions of the inserted gene retrocopy. Target site duplications are ascertained by searching for correspondence between the sequences on either side of the breakpoint.
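As an illustration of how soft-clipped alignments point to break ends, the sketch below parses CIGAR strings and counts reads that agree on a clip position. The function names, the 0-based coordinate convention, and the `min_reads` threshold are assumptions for this example, not values taken from the study.

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # operations that advance along the reference


def clip_breakpoints(pos, cigar):
    """Return break ends implied by soft clips on one alignment.

    pos is the 0-based leftmost aligned reference position.
    """
    # Hard clips carry no sequence, so drop them before checking the ends.
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar) if op != "H"]
    ends = []
    if ops and ops[0][1] == "S":
        ends.append((pos, "left"))
    if len(ops) > 1 and ops[-1][1] == "S":
        ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
        ends.append((pos + ref_len, "right"))
    return ends


def consensus_breaks(reads, min_reads=2):
    """Keep break ends supported by at least min_reads (assumed threshold)."""
    counts = Counter(b for pos, cigar in reads
                     for b in clip_breakpoints(pos, cigar))
    return {b: c for b, c in counts.items() if c >= min_reads}
```

When a right-clipped cluster ends a few bases downstream of where a left-clipped cluster begins, the overlapping interval is a candidate target site duplication.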
Local sequence assembly to identify exon-exon junctions
In order to identify exon-exon junctions that are present in inserted processed gene retrocopy sequences, we employed a two-stage local assembly strategy. First, read pairs that map within 500 bp of a predicted insertion site and that are discordant, one-end-anchored (reads where the mate is unmapped), or have at least one soft-clipped read in the pair are used as input to a short read assembler. For a first attempt at assembly, we use Velvet with a k-mer size of 31, the shortPaired option to indicate that the reads are paired, and an insert length of 300. The resulting contigs are aligned back to the reference genome using BLAT to identify contigs that map to exonic sequences corresponding to the source gene without aligning to the intervening introns (spliced alignments). The majority of junctions are ascertained in this first step using Velvet, which utilizes de Bruijn graphs to guide assembly. Second, for the remaining insertions for which an exon-exon junction was not apparent, the discordant, one-end-anchored, and soft-clipped reads are assembled using PRICE, which utilizes a seed-and-extend assembly strategy, and aligned back to the reference to identify spliced junctions. We ran PRICE for 20 cycles using the anchored read pairs (those which map uniquely near the gene retrocopy insertion site) as the seed sequences.
Simulation of novel gene retrocopy insertions
Retrogene insertions were simulated by adding insertions of spliced, polyadenylated mRNA transcripts to sample TCGA-60-2711-11 (LUSC-2711 Normal) using bamsurgeon. Bamsurgeon can add structural variants (including insertions) to existing BAM files through local assembly followed by modification of the assembled contig, simulation of paired read coverage (100 paired end base pairs with 300 unsequenced insert base pairs), realignment, and replacement into the original BAM. We added a total of 2,000 insertions from 200 different processed mRNAs (Table S11 in Additional file 1) to LUSC-2711, and downsampled the resultant BAM from 60× average coverage to 40×, 30×, 20×, 10×, and 5× using DownsampleSam, part of the Picard suite of utilities. We used GRIPper to detect the spiked-in processed mRNAs to evaluate the detection characteristics. At 60× coverage we obtained perfect precision and a recall of 0.751 (1,501 true positives and 499 false negatives with no false positives). As expected, recall decreases with decreasing coverage (Table S10 in Additional file 1). In general, false negatives are due to single-exon genes (for example, OR7G2) at high coverage and mainly due to insufficient read support at low coverage. Since we combined reads from both tumor and normal genomes for all TCGA samples in this study, which have coverage of 30× or greater, detection of germline insertions was done on samples with an effective coverage of 60× or greater.
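For reference, the precision and recall figures quoted above follow directly from the reported counts. The helper below is a small sketch with a hypothetical function name; only the counts (1,501 true positives, 499 false negatives, no false positives) come from the text.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return precision, recall


# Counts reported for the 60x simulation.
precision, recall = precision_recall(tp=1501, fp=0, fn=499)
```

With these counts the recall is 1,501/2,000, which agrees with the reported value of roughly 0.75.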
Identifying gene retrocopy insertions included in the reference genome assembly
GRIPs in the reference genome that are not present in other individuals will appear as deletions relative to the reference. To detect these, we cross-referenced the deletion data from the 1,000 Genomes Project [34, 57] with pseudogene annotations from GENCODE/ENCODE and Yale. Deletions were obtained in variant call format from the 1,000 Genomes Project FTP server, and pseudogene annotations were obtained from the UCSC Genome Browser and from pseudogene.org human build 65. To allow for repetitive sequences in gene UTRs, we allowed the deletion to span a region up to three times larger than the pseudogene annotation it encloses. We also required homology between the deleted sequence and the source gene of the annotated pseudogene. A list of the GRIPs ascertained in this way is included as Table S7 in Additional file 1; two of them correspond to the two processed pseudogene deletion polymorphisms (pseudocopies of GCSH and ITGB1) mentioned in a previous study.
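The size criterion for matching deletions to pseudogene annotations can be sketched as follows. The tuple layout `(chrom, start, end)` and the function name are assumptions for illustration, and the homology check against the source gene is deliberately left out.

```python
def deletion_matches_pseudogene(deletion, pseudogene, max_ratio=3.0):
    """Check that a deletion encloses a pseudogene and is not too large.

    Both arguments are (chrom, start, end) tuples. The deletion must fully
    span the annotation and be at most max_ratio times its length.
    """
    d_chrom, d_start, d_end = deletion
    p_chrom, p_start, p_end = pseudogene
    if d_chrom != p_chrom:
        return False
    encloses = d_start <= p_start and d_end >= p_end
    small_enough = (d_end - d_start) <= max_ratio * (p_end - p_start)
    return encloses and small_enough
```

A deletion passing this test would still need to show homology to the source gene before being counted as a GRIP present in the reference.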
Strategy for low-pass genome sequence data and tumor/normal pairs
In order to ascertain insertion sites from a large collection of genomes sequenced at low (2× to 5×) coverage, or to ensure maximum sensitivity in ascertaining cancer-specific insertions, we combine data across multiple samples. This is accomplished simply by extracting discordant reads where one end maps to an exon and the other end elsewhere in the reference genome from each genome of interest, and analyzing the merged set of discordant reads en masse while keeping track of the sample identifier associated with each discordant pair of mapped reads. When insertions are called, all genomes contributing reads to a call are considered to have the insertion.
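A minimal sketch of this pooling idea is shown below. It assumes each discordant pair has already been reduced to a `(sample_id, gene, chrom, pos)` tuple and clusters pairs with fixed-width bins; the binning, the bin size, and the function name are simplifications chosen for the example rather than the clustering used by GRIPper.

```python
from collections import defaultdict


def pool_calls(discordant_pairs, bin_size=500, min_pairs=8):
    """Merge discordant pairs across samples and call supported sites.

    discordant_pairs: iterable of (sample_id, gene, chrom, pos), where pos
    is the position of the read mapping away from the source gene exon.
    Returns {(gene, chrom, bin_index): sorted list of contributing samples}.
    """
    clusters = defaultdict(list)
    for sample_id, gene, chrom, pos in discordant_pairs:
        clusters[(gene, chrom, pos // bin_size)].append(sample_id)
    # Every sample contributing a read to a called cluster is
    # considered to carry the insertion.
    return {key: sorted(set(samples))
            for key, samples in clusters.items()
            if len(samples) >= min_pairs}
```

For example, three low-coverage genomes that each contribute three supporting pairs at the same site produce a call that none of them would support alone.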
Calculating coverage of gene annotations
In order to test for enrichment or depletion of gene retrocopy insertions relative to gene annotations, we must have an accurate figure for how much of the reference genome assembly is covered by the set of annotations used. For both human and mouse, we used the UCSC Genes annotation set: human version 5 and mouse version 5. From BED formatted versions of these annotation tracks, the bedCoverage tool from the Kent source utilities was used to calculate the fraction of the genome covered. To calculate enrichment, we performed a one-sample proportions test with continuity correction using the prop.test function in R.
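The study used R's prop.test; the sketch below re-expresses the same kind of test (a one-sample chi-square test of a proportion with Yates continuity correction) in Python using only the standard library. The function name is hypothetical, and the inputs are the mouse figures quoted in the Results (201 of 755 insertions in genes, against 40.38% genome coverage).

```python
import math


def one_sample_prop_test(successes, n, p0):
    """Two-sided one-sample proportions test with continuity correction.

    Returns (chi_square, p_value) for the null hypothesis that the true
    proportion equals p0, using a chi-square statistic with 1 degree
    of freedom.
    """
    expected = n * p0
    deviation = abs(successes - expected)
    # Yates correction, capped so it never exceeds the deviation itself.
    correction = min(0.5, deviation)
    chi_square = (deviation - correction) ** 2 / (n * p0 * (1 - p0))
    # For 1 degree of freedom, P(X > c) = erfc(sqrt(c / 2)).
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return chi_square, p_value


chi_square, p_value = one_sample_prop_test(successes=201, n=755, p0=0.4038)
```

With these inputs the P value is on the order of 10^-14, in line with the depletion reported for mouse GRIPs in annotated genes.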
Calculating distance between GRIP profiles
where A and B are sets of gene retrocopy insertions for two genomes.
Estimating the rate of gene retrocopy insertion
Watterson's estimator of the population mutation parameter θ is

\[ \hat{\theta} = \frac{S}{a_n}, \qquad a_n = \sum_{i=1}^{n-1} \frac{1}{i}, \qquad \theta = 4 N_e \mu, \]

where S is the number of segregating sites and n is the number of individuals. Since we have n = 1,024 and S = 48, \(a_n = 7.508\) and \(\hat{\theta} = S / a_n \approx 6.394\). If we assume an effective population size of \(N_e = 10,000\), μ = 6.394/40,000 ≈ 1/6,256 GRIPs per individual per generation. Including the 10 pseudogenes present in the reference but deleted in one or more individuals in the 1,000 Genomes Project data (Table S7 in Additional file 1), which likely indicate GRIPs that are included in the reference, our estimate becomes \(\hat{\theta} \approx 7.725\) (S = 58), yielding a rate of μ = 7.725/40,000 ≈ 1/5,178 gene retrocopy insertions per individual per generation.
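The arithmetic of this estimate can be checked with a short sketch. It assumes the standard form of Watterson's estimator, θ = S / a_n with a_n the (n − 1)th harmonic number and θ = 4·N_e·μ, which is consistent with the figures quoted in the text; the function names are hypothetical.

```python
def harmonic_number(k):
    """Sum of 1/i for i = 1..k."""
    return sum(1.0 / i for i in range(1, k + 1))


def watterson_rate(segregating_sites, n, effective_pop_size):
    """Return (a_n, theta, mu) from Watterson's estimator.

    theta = S / a_n with a_n = sum_{i=1}^{n-1} 1/i, and mu = theta / (4 * Ne).
    """
    a_n = harmonic_number(n - 1)
    theta = segregating_sites / a_n
    mu = theta / (4 * effective_pop_size)
    return a_n, theta, mu


# Non-reference human GRIPs: S = 48, n = 1,024, assumed Ne = 10,000.
a_n, theta, mu = watterson_rate(48, 1024, 10_000)
```

Computed at full precision, 1/μ comes out within a few individuals of the reported 1 in 6,256; the small difference reflects rounding of the intermediate values quoted in the text.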
Data from the 1,000 Genomes Project are available from the project website; Table S1 in Additional file 1 contains a list of individual genomes downloaded for analysis as part of this study. Data from The Cancer Genome Atlas are available to authorized users through the Cancer Genomics Hub; a list of tumor/normal pairs used in this analysis is included as Table S3 in Additional file 1. The genomes of 17 inbred mouse strains are available through the Wellcome Trust Sanger Institute Mouse Genomes Project. The genomes of ten individual chimpanzees are available through the PanMap project.
We thank members of the genome reconstruction and cancer groups at the UCSC Center for Biomolecular Science and Engineering, members of the TCGA publication committee, and others for reviewing our work. We also thank TCGA genome sequencing centers for data generation and sequence alignment. We acknowledge funding from the Howard Hughes Medical Institute to DH and the following National Institutes of Health grants to DH: 5P41HG002371-12 (NHGRI) and 5U24CA143858-03 (NCI).
- 23. Kalyana-Sundaram S, Kumar-Sinha C, Shankar S, Robinson D, Wu YM, Cao X, Asangani I, Kothari V, Prensner J, Lonigro R, Iyer M, Barrette T, Shanmugam A, Dhanasekaran S, Palanisamy N, Chinnaiyan A: Expressed pseudogenes in the transcriptional landscape of human cancers. Cell. 2012, 149: 1622-34. 10.1016/j.cell.2012.04.041.
- 25. Watanabe T, Totoki Y, Toyoda A, Kaneda M, Kuramochi-Miyagawa S, Obata Y, Chiba H, Kohara Y, Kono T, Nakano T, Surani MA, Sakaki Y, Sasaki H: Endogenous siRNAs from naturally formed dsRNAs regulate transcripts in mouse oocytes. Nature. 2008, 453: 539-43. 10.1038/nature06908.
- 28. Wen YZ, Zheng LL, Liao JY, Wang MH, Wei Y, Guo XM, Qu LH, Ayala FJ, Lun ZR: Pseudogene-derived small interference RNAs regulate gene expression in African Trypanosoma brucei. Proceedings of the National Academy of Sciences of the United States of America. 2011, 108: 8345-50. 10.1073/pnas.1103894108.
- 31. Parker HG, VonHoldt BM, Quignon P, Margulies EH, Shao S, Mosher DS, Spady TC, Elkahloun A, Cargill M, Jones PG, Maslen CL, Acland GM, Sutter NB, Kuroki K, Bustamante CD, Wayne RK, Ostrander EA: An expressed fgf4 retrogene is associated with breed-defining chondrodysplasia in domestic dogs. Science. 2009, 325: 995-8. 10.1126/science.1173275.
- 35. The Cancer Genome Atlas. [http://cancergenome.nih.gov]
- 40. McKenna M, Arnold C, Catherwood MA, Humphreys MW, Cuthbert RJG, Bueso-Ramos C, McManus DT: Myeloid sarcoma of the small bowel associated with a CBFbeta/MYH11 fusion and inv(16)(p13q22): a case report. Journal of Clinical Pathology. 2009, 62: 757-9. 10.1136/jcp.2008.063669.
- 41. Alvarez P, Navascués CA, Ordieres C, Pipa M, Vega IF, Granero P, Alvarez JA, Rodríguez M: Granulocytic sarcoma of the small bowel, greater omentum and peritoneum associated with a CBFβ/MYH11 fusion and inv(16) (p13q22): a case report. International Archives of Medicine. 2011, 4: 3. 10.1186/1755-7682-4-3.
- 45. Lee E, Iskow R, Yang L, Gokcumen O, Haseley P, Luquette LJ, Lohr JG, Harris CC, Ding L, Wilson RK, Wheeler DA, Gibbs RA, Kucherlapati R, Lee C, Kharchenko PV, Park PJ: Landscape of somatic retrotransposition in human cancers. Science. 2012, 337: 967-71. 10.1126/science.1222077.
- 46. Solyom S, Ewing AD, Rahrmann EP, Doucet TT, Nelson HH, Burns MB, Harris RS, Sigmon DF, Casella A, Erlanger B, Wheelan S, Upton KR, Shukla R, Faulkner GJ, Largaespada DA, Kazazian HH: Extensive somatic L1 retrotransposition in colorectal tumors. Genome Research. 2012, 22: 2328-38. 10.1101/gr.145235.112.
- 47. Keane TM, Goodstadt L, Danecek P, White MA, Wong K, Yalcin B, Heger A, Agam A, Slater G, Goodson M, Furlotte NA, Eskin E, Nellåker C, Whitley H, Cleak J, Janowitz D, Hernandez-Pliego P, Edwards A, Belgard TG, Oliver PL, McIntyre RE, Bhomra A, Nicod J, Gan X, Yuan W, van der Weyden L, Steward CA, Bala S, Stalker J, Mott R, et al: Mouse genomic variation and its effect on phenotypes and gene regulation. Nature. 2011, 477: 289-94. 10.1038/nature10413.
- 49. Waterston RH, Lindblad-Toh K, Birney E, Rogers J, Abril JF, Agarwal P, Agarwala R, Ainscough R, Alexandersson M, An P, Antonarakis SE, Attwood J, Baertsch R, Bailey J, Barlow K, Beck S, Berry E, Birren B, Bloom T, Bork P, Botcherby M, Bray N, Brent MR, Brown DG, Brown SD, Bult C, Burton J, Butler J, Campbell RD, Carninci P, et al: Initial sequencing and comparative analysis of the mouse genome. Nature. 2002, 420: 520-62. 10.1038/nature01262.
- 52. Jaccard P: Nouvelles recherches sur la distribution florale. Bulletin de la Société vaudoise des sciences naturelles. 1908, 44: 223-
- 53. Mouse Genome Informatics. [http://informatics.jax.org]
- 54. Auton A, Fledel-Alon A, Pfeifer S, Venn O, Ségurel L, Street T, Leffler EM, Bowden R, Aneas I, Broxholme J, Humburg P, Iqbal Z, Lunter G, Maller J, Hernandez RD, Melton C, Venkat A, Nobrega MA, Bontrop R, Myers S, Donnelly P, Przeworski M, McVean G: A fine-scale chimpanzee genetic map from population sequencing. Science. 2012, 336: 193-8. 10.1126/science.1216872.
- 57. Mills RE, Walter K, Stewart C, Handsaker RE, Chen K, Alkan C, Abyzov A, Yoon SC, Ye K, Cheetham RK, Chinwalla A, Conrad DF, Fu Y, Grubert F, Hajirasouliha I, Hormozdiari F, Iakoucheva LM, Iqbal Z, Kang S, Kidd JM, Konkel MK, Korn J, Khurana E, Kural D, Lam HYK, Leng J, Li R, Li Y, Lin CY, Luo R, et al: Mapping copy number variation by population-scale genome sequencing. Nature. 2011, 470: 59-65. 10.1038/nature09708.
- 59. Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD, Fiegler H, Shapero MH, Carson AR, Chen W, Cho EK, Dallaire S, Freeman JL, González JR, Gratacòs M, Huang J, Kalaitzopoulos D, Komura D, MacDonald JR, Marshall CR, Mei R, Montgomery L, Nishimura K, Okamura K, Shen F, Somerville MJ, Tchinda J, Valsesia A, Woodwark C, Yang F, et al: Global variation in copy number in the human genome. Nature. 2006, 444: 444-54. 10.1038/nature05329.
- 67. van den Hurk JAJM, Meij IC, Seleme MDC, Kano H, Nikopoulos K, Hoefsloot LH, Sistermans EA, de Wijs IJ, Mukhopadhyay A, Plomp AS, de Jong PTVM, Kazazian HH, Cremers FPM: L1 retrotransposition can occur early in human embryonic development. Human Molecular Genetics. 2007, 16: 1587-92. 10.1093/hmg/ddm108.
- 70. Baillie JK, Barnett MW, Upton KR, Gerhardt DJ, Richmond TA, De Sapio F, Brennan P, Rizzu P, Smith S, Fell M, Talbot RT, Gustincich S, Freeman TC, Mattick JS, Hume DA, Heutink P, Carninci P, Jeddeloh JA, Faulkner GJ: Somatic retrotransposition alters the genetic landscape of the human brain. Nature. 2011, 479: 534-7. 10.1038/nature10531.
- 71. Evrony G, Cai X, Lee E, Hills L, Elhosary PC, Lehmann H, Parker J, Atabay K, Gilmore E, Poduri A, Park P, Walsh C: Single-neuron sequencing analysis of L1 retrotransposition and somatic mutation in the human brain. Cell. 2012, 151: 483-96. 10.1016/j.cell.2012.09.035.
- 74. Virgen CA, Kratovac Z, Bieniasz PD, Hatziioannou T: Independent genesis of chimeric TRIM5-cyclophilin proteins in two primate species. Proceedings of the National Academy of Sciences of the United States of America. 2008, 105: 3563-8. 10.1073/pnas.0709258105.
- 75. Wilson SJ, Webb BLJ, Ylinen LMJ, Verschoor E, Heeney JL, Towers GJ: Independent evolution of an antiviral TRIMCyp in rhesus macaques. Proceedings of the National Academy of Sciences of the United States of America. 2008, 105: 3557-62. 10.1073/pnas.0709003105.
- 77. pysam. [http://code.google.com/p/pysam/]
- 78. GRIPper. [https://github.com/adamewing/GRIPper]
- 81. PRICE Genome Assembler. [http://derisilab.ucsf.edu/software/price/index.html]
- 82. Bamsurgeon. [https://github.com/adamewing/bamsurgeon]
- 85. Fujita PA, Rhead B, Zweig AS, Hinrichs AS, Karolchik D, Cline MS, Goldman M, Barber GP, Clawson H, Coelho A, Diekhans M, Dreszer TR, Giardine BM, Harte RA, Hillman-Jackson J, Hsu F, Kirkup V, Kuhn RM, Learned K, Li CH, Meyer LR, Pohl A, Raney BJ, Rosenbloom KR, Smith KE, Haussler D, Kent WJ: The UCSC Genome Browser database: update 2011. Nucleic Acids Research. 2011, 39: D876-82. 10.1093/nar/gkq963.
- 87. R Development Core Team: R: A Language and Environment for Statistical Computing. 2011, Vienna, Austria: R Foundation for Statistical Computing
- 88. 1,000 Genomes Project. [http://www.1000genomes.org]
- 89. Cancer Genomics Hub. [https://cghub.ucsc.edu]
- 90. Wellcome Trust Sanger Institute Mouse Genomes Project. [http://www.sanger.ac.uk/resources/mouse/genomes/]
- 91. PanMap. [http://panmap.uchicago.edu]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.