Human endogenous retroviruses

Human endogenous retroviruses (ERVs) are remnants of past infections by exogenous retroviruses. Proviruses formed by numerous distinct exogenous retroviruses in the germline genome could be inherited by subsequent generations. About 8% of the human genome consists of sequences that are potentially of retroviral origin [1] and are distributed across about 700,000 different loci. In addition to proviruses, these sequences include solitary long terminal repeats (LTRs), nonretroviral sequences flanked by LTRs that may not be directly derived from infectious retroviruses, and sequences similar to LTRs. ERVs and related sequences are thus part of the repetitive portion of the human genome, which comprises about 45% of the genome and includes mobile DNA such as L1, Alu and SVA elements.

Detailed analysis of the human genome sequence by wet-lab and bioinformatics approaches resulted in the definition of ERV groups, with the number depending on the methods used for defining groups: 31 groups were defined by Sperber et al. [2] and Blomberg et al. [3], 42 groups were defined by Mager and Medstrand [4], 30 groups were defined by Gifford and Tristem [5] and several hundred human ERV and LTR families were defined by Repbase [6].

Almost all human ERV loci no longer encode former retroviral proteins because of their ancient incorporation into the host genome and thus accumulation of nonsense mutations. Many loci are missing large proviral portions, and most loci have been reduced to so-called solitary LTRs by homologous recombination between proviral LTRs. For more detailed information on human ERVs, we refer interested readers to recent reviews on the topic and the references therein [7–10].

While protein-coding capacity is very limited, many human ERV loci are still transcribed, and their transcription is usually initiated by promoter sequences within the proviral LTRs. Evidently, mutations have not yet rendered all LTRs in the human genome defective. In principle, promoters in flanking, non-ERV sequences may also contribute to transcription of those loci. Probably every human tissue and cell type, diseased or not, contains ERV transcripts [11, 12]. More than a single ERV group is usually found transcribed, and patterns of transcribed ERV groups differ between tissue and cell types. Transcription of ERV loci is thus regulated in some way. While expression of ERV sequences has been associated with a number of human diseases, such as germ cell tumours, melanoma and multiple sclerosis, the involvement of ERVs in human diseases remains to be elucidated. On the other hand, some ERV loci very likely provide important biological functions, such as the syncytin [13] and syncytin 2 loci [14], referred to herein as ERVW-1 and ERVFRD-1, respectively. Other loci harbouring only partial open reading frames, such as a recently characterized HERV-W locus on chromosome Xq22.3 [15] (ERVW-2), may likewise produce partial retroviral proteins with potential biological functions. It is therefore of particular interest which ERV loci actually contribute to the human transcriptome.

Recent studies have identified transcribed ERV loci in normal and diseased human cells and tissues by assigning ERV cDNA sequences to individual loci in the human reference genome sequence, using characteristic nucleotide differences between the individual loci of a given ERV group. Many more transcribed ERV loci are likely to be identified in future studies. It is therefore necessary to introduce a nomenclature for transcribed human ERV sequences.

Previous nomenclature used in the literature

The lack of an established nomenclature for transcribed ERV elements has led to confusion within the literature. These problems were previously reviewed in detail [16]. ERVs have been classified into groups (formerly known as "families", which is heresy to virologists because "family" refers to Retroviridae), although different classification systems have been used. For instance, some groups were initially defined by molecular genetic methods, others by sequence similarity and others by primer binding site sequences. As more sequence information became available, it also became clear that some ERV group designations needed to be revised. Different names have been used for the same ERV group. Likewise, individual loci have been referred to using a variety of different symbols (for example, see the aliases listed in Table 1 for the ERVK-6 locus). The use of different symbols for the same locus makes it difficult to retrieve all information on that particular locus.

Table 1 Nomenclature for transcribed human endogenous retrovirus loci

Previous ERV nomenclature and the Human Genome Organisation Gene Nomenclature Committee

The Human Genome Organisation (HUGO) Gene Nomenclature Committee (HGNC) works under the auspices of HUGO and is the only worldwide authority that assigns standardised nomenclature to human genes [17]. The HGNC has previously focused on approving nomenclature for protein-coding genes, pseudogenes, phenotypes and noncoding RNA. In the past, the committee has approved symbols for specific human ERVs only at the request of individual researchers. The symbols did not follow a systematic nomenclature: some symbols were of a simple format (for example, ERV1), some provided information on the group to which the ERV belonged (for example, ERVK2) and others included information on proteins encoded by the ERV (for example, ERVWE1 (endogenous retroviral family W, env(C7), member 1)). On reviewing the literature, it was clear that (1) many of the most frequently published loci were not represented by HGNC symbols, (2) by following more than one system, HGNC symbols were not serving the community, and (3) the nomenclature needed both updating and expansion.

HGNC editors curate relevant information for each gene that has approved nomenclature. In addition to approving a gene symbol and name for each transcribed human ERV, the HGNC records all known symbol aliases so that information on each gene can be retrieved using any known symbol. HGNC entries also include the chromosomal location of the ERV locus, links to GenBank, European Molecular Biology Laboratory (EMBL) and DNA Databank of Japan (DDBJ) sequence records and links to at least one PubMed reference. Where appropriate, links are also provided to annotation projects at both the genomic and proteomic levels. HGNC names are propagated to other major biological databases, such as Ensembl, UniProt and Entrez Gene. Therefore, this new nomenclature will provide a useful resource that is currently unavailable to the ERV community and other researchers concerned with ERVs.

A gene-based nomenclature

The primary definition of a gene used by the HGNC is "a DNA segment that contributes to phenotype/function" [18]. It is beyond the scope of this nomenclature effort to standardise the nomenclature of ERVs in general or to attempt to name every ERV element in the genome. As discussed above, there is evidence that some human ERVs encode functional proteins and that some encode transcripts and/or proteins which may be associated with disease, so the transcriptionally active loci come under the remit of the HGNC for naming. This category of ERVs represents most of the individual loci that have been published with individual names, so it is worth developing a standardised nomenclature for this subset. The three criteria for being accepted as a transcriptionally active ERV are as follows: (1) the ERV must be represented by an mRNA sequence in a public database, (2) the reported cDNA sequence must map unambiguously to the reference genome to allow identification and (3) the sequence must represent a viral gene rather than solely a solitary LTR. We acknowledge that there are sources of uncertainty. Many ERVs may be expressed at a low level [19], a "leakage" which can be hard to distinguish from perhaps more significant expression. Groups of recently integrated ERVs may be highly expressed, but their transcripts may be identical or almost identical and could be hard to map unambiguously. However, these difficulties should not prevent the naming of ERV loci which fulfil the criteria mentioned above. One symbol is approved per ERV locus, regardless of how many viral genes the ERV may encode.
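
The three acceptance criteria can be read as a simple conjunction of checks. The following Python sketch is only an illustration of that logic and is not an HGNC tool: the record type CandidateLocus and its field names are hypothetical, and in practice each flag would come from manual curation of database and mapping evidence.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CandidateLocus:
    """Hypothetical summary of the curated evidence for one ERV locus."""
    has_public_mrna: bool      # criterion 1: mRNA sequence in a public database
    maps_unambiguously: bool   # criterion 2: cDNA maps to a single reference locus
    solitary_ltr_only: bool    # True if the sequence is solely a solitary LTR


def unmet_criteria(locus: CandidateLocus) -> List[int]:
    """Return the numbers (1 to 3) of the criteria the locus fails."""
    unmet = []
    if not locus.has_public_mrna:
        unmet.append(1)
    if not locus.maps_unambiguously:
        unmet.append(2)
    if locus.solitary_ltr_only:
        unmet.append(3)
    return unmet


def qualifies_for_symbol(locus: CandidateLocus) -> bool:
    """A locus qualifies only if all three criteria are met."""
    return not unmet_criteria(locus)
```

Returning the list of unmet criteria, rather than a bare yes or no, reflects the sources of uncertainty noted above: a highly expressed but recently integrated locus would typically fail only criterion 2.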

A systematic ERV nomenclature scheme

The nomenclature scheme described in this paper aims to be concise so that it is user-friendly. It also aims to be informative to researchers, including those who are less familiar with the field. To be informative, the nomenclature scheme is hierarchical, with each symbol beginning with the root symbol "ERV" so that the symbols are instantly recognisable and can be grouped together in searches. Note that many researchers have published papers using symbols beginning with "HERV", but it is against the guidelines of the HGNC ever to use H for "human" in symbols, mainly because this precludes the possibility of the nomenclature scheme's being extended to other species. Each ERV symbol, then, includes an identifier that represents the group to which the ERV belongs.

In order for the nomenclature scheme to be systematic, one method of sorting ERVs into groups needed to be selected. The Repbase system [6] is a widely known, comprehensive database of repetitive elements that groups ERVs together on the basis of sequence similarity. RepeatMasker annotations using Repbase designations are available on the University of California, Santa Cruz (UCSC) [20], and Ensembl [21] genome browsers, making these ERV groups highly accessible and recognisable to researchers in the field. Therefore, the nomenclature system uses the Repbase classification system for sorting the ERVs into groups. Repbase group names, however, do not follow a systematic nomenclature and often contain an unallowable "H" for "human". When deciding on the group identifier to be included in each symbol, we compared Repbase symbols with those that have appeared frequently in the literature. In cases where there was a well-supported nomenclature present in the literature, we used this symbol in place of the Repbase symbol; for example, we used ERVW instead of the Repbase group designation HERV17, as we felt that such symbols would be more likely to be used by the ERV community. For a comparison of the group symbols used in the new nomenclature scheme with Repbase designations, see Table 2.

Table 2 Comparison of Repbase group symbols with group symbols used in the nomenclature scheme presented herein

Finally, each ERV within a particular group is uniquely identified by a number, for example, ERVK-1. Numbers are assigned consecutively within each group to make the nomenclature system expandable. The number is used to make each symbol unique and has no intrinsic meaning. ERVK-2 has merely been assigned the next number following ERVK-1, but this provides no information on the position of the ERVs within the genome or the order in which an ERV may have been published. The use of numerical identifiers keeps the symbols as short as possible to encourage widespread use by researchers. Newly identified transcribed loci will take the next available consecutive number for their particular group; for example, if a new transcribed ERVK locus is identified, it will take the symbol ERVK-26. Each symbol is accompanied by an expanded gene name which clearly and succinctly explains the derivation of the nomenclature; for example, the full name of ERVFRD-1 is "endogenous retrovirus group FRD, member 1".
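
The symbol structure described above (root "ERV", group identifier, hyphen, consecutive number) is regular enough to be handled programmatically. The Python sketch below is an illustration under stated assumptions, not an HGNC implementation: the function names and the regular expression are our own, group identifiers are assumed to consist of uppercase letters and digits, and "next available" is interpreted as one more than the highest number already assigned in the group.

```python
import re
from typing import Iterable, Tuple

# Assumed symbol shape: root "ERV", group identifier, hyphen, member number.
SYMBOL_RE = re.compile(r"^ERV([A-Z0-9]+)-(\d+)$")


def parse_symbol(symbol: str) -> Tuple[str, int]:
    """Split a symbol such as 'ERVFRD-1' into ('FRD', 1)."""
    match = SYMBOL_RE.match(symbol)
    if match is None:
        raise ValueError(f"not a valid ERV locus symbol: {symbol!r}")
    return match.group(1), int(match.group(2))


def gene_name(symbol: str) -> str:
    """Expand a symbol into its full gene name."""
    group, number = parse_symbol(symbol)
    return f"endogenous retrovirus group {group}, member {number}"


def next_symbol(group: str, approved: Iterable[str]) -> str:
    """Return the symbol a newly identified locus in `group` would take."""
    numbers = [n for g, n in map(parse_symbol, approved) if g == group]
    return f"ERV{group}-{max(numbers, default=0) + 1}"


# Hypothetical registry: ERVK-1 to ERVK-25 assigned, as the example in the text implies.
approved = [f"ERVK-{i}" for i in range(1, 26)] + ["ERVW-1", "ERVW-2", "ERVFRD-1"]
print(next_symbol("K", approved))  # ERVK-26
print(gene_name("ERVFRD-1"))       # endogenous retrovirus group FRD, member 1
```

Because the number carries no intrinsic meaning, the sketch derives nothing from it beyond uniqueness; genomic position and publication order would have to be looked up in the HGNC entry itself.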

We are aware that the proposed nomenclature scheme cannot encompass all conceivable (and sometimes known) unusual structures of ERV loci, such as hybrid loci consisting of different ERV groups and ERV insertions into existing ERV loci [22]. HGNC, after conferring with researchers who submit newly identified transcribed loci, will decide whether or how to name such unique loci on a case-by-case basis. For example, the scheme will not incorporate ERV locus transcripts that are part of another gene's transcript, as these elements will not be considered separate loci.

Table 1 lists transcribed human ERVs that have been named according to the new nomenclature system. All ERVs in the table either have been published or have been annotated by the RefSeq project. An initial list was sent to a number of researchers in the field for their comments. The list was expanded as these researchers suggested more loci. Where no transcript sequence was available, authors were asked to submit representative sequences to the GenBank, EMBL and DDBJ databases. We encourage researchers to contact the HGNC if they know of further ERVs that can be included in the scheme.

Finally, although only human gene nomenclature is under the remit of the HGNC, we wish to mention that the naming system introduced here for transcribed ERVs could, in principle, also be applied to other, non-ERV repetitive sequences in the human genome, as well as to repetitive DNA in nonhuman species. Future research will probably reveal numerous transcribed repetitive DNA sequences in various species. Judging just from the ERV designations used in different species, a standardised naming system for transcribed repeat loci may be highly beneficial to avoid future confusion.