The Spin/Ssty repeat: a new motif identified in proteins involved in vertebrate development from gamete to embryo
The homologous genes Spin (spindlin) and Ssty were first identified as genes involved in gametogenesis and seem to occur in multiple copies in vertebrate genomes. The mouse spindlin (Spin) protein was reported to interact with the spindle apparatus during oogenesis and to be a target for cell-cycle-dependent phosphorylation. The transcript of the mouse Ssty gene is specific to sperm cells. In the chicken, spindlin was found to co-localize with SUMO-1 to nuclear dots during interphase in fibroblasts, but to co-localize with chromosomes during mitosis. Thus, Spin/Ssty genes might be important in the transition from sperm cells and oocytes to the early embryo, as well as in mitosis.
Here we report the discovery of a new protein motif of around 50 amino acids in length, the Spin/Ssty repeat, in proteins of the Spin/Ssty (spindlin) family. We found that in one member of this family, the human SPIN gene, each repeat resides in its own exon, supporting our view that Spin/Ssty repeats are independent functional units. On the basis of different secondary-structure prediction methods, we propose a four-stranded β-structure for the Spin/Ssty repeat.
The discovery of the Spin/Ssty repeat might contribute to the further elucidation of the structure and function of spindlin-family proteins. We predict that the tertiary structure of spindlin-like proteins is composed of three modules of Spin/Ssty repeats.
Keywords: Hidden Markov Model, Human Spin, Pairwise Sequence Identity, Full Code Region, Hidden Markov Model Training
During early oocyte development, the transcription of maternal genes ceases with the onset of meiosis. After fertilization and zygote formation, transcription of the embryonic genome starts at the two-cell stage or later, depending on the organism [1,2,3]. Thus, the amount of maternal mRNAs must be sufficient to drive the gamete through meiosis, fertilization and the first zygotic cell division, a time span of almost 2 days in mice. During this period the activation of translation from many different deadenylated, and thus dormant, mRNAs is controlled by their cytoplasmic polyadenylation [1,4].
In these early phases of mouse development, one of the most frequent transcripts regulated in this manner is that of the spindlin (Spin) gene [1,5]. The protein encoded by Spin is a meiotic-spindle-associated protein specific to the oocyte [1,5] that is phosphorylated during meiosis [6,7]. Oh et al. showed that phosphorylation modulates the ability of the Spin protein to interact with the spindle apparatus during oogenesis. Phosphorylation is dependent on the Mos/MAP kinase pathway, which is controlled by the meiotic-checkpoint proteins cyclin B and Cdc2 in Xenopus laevis oocytes [6,8]. Sequence similarity and mRNA expression suggest that a complementary role in sperm development is fulfilled by the gene Ssty (Y-linked spermiogenesis specific transcript), a multicopy testis-specific spermatogenesis gene on the long arm of the mouse Y chromosome. In contrast to the oocyte-specific expression of Spin, the Ssty mRNA is specifically expressed in sperm cells. Dosage reduction by partial deletion of Ssty genes was suggested to cause deformed sperm heads and infertility [10,11]. However, reports on Ssty expression at the protein level are still lacking. Recently, two Spin-type genes from the chicken, Gallus gallus, have been cloned: Spin-W and Spin-Z, located on the W and Z sex chromosomes, respectively. They are nearly identical to each other in their coding regions, and both were reported to be expressed in early embryos, but Spin-Z is also expressed in various adult tissues. Transfection of fibroblasts with DNA expressing fluorescent protein-tagged chSpin-W and the small ubiquitin-related modifier SUMO-1 showed the co-localization of these proteins in nuclear dots during interphase. Localization was shown to depend on the carboxy-terminal 30 amino acids of chSpin-W, especially on the presence of two phenylalanines in positions 244 and 247. However, SUMO-1 and chSpin-W could not be shown to interact directly.
In contrast to its interphase localization, the red fluorescent protein-chSpin-W fusion associated with chromosomes during mitosis. Although experimental results indicate that the spindlin protein family includes important players in meiosis and early embryogenesis, as well as in mitosis, their biochemical function is largely unknown.
Results and discussion
Repeat identification and analysis
At the beginning of our analysis, pairwise sequence similarity among proteins of the spindlin family was already known, with a reported average sequence identity between members of approximately 70% (entry PF02513 (Spin/Ssty protein family) in the Pfam 6.2 protein database). When we tried to identify additional members of this protein family by scanning the NCBI nonredundant protein database (nr) using BLASTP and the human Spin protein sequence (GenBank RefSeq identifier NP_006708) as a query, we noticed a second high-scoring segment pair in the hit of the human Spin sequence with itself. Therefore we scanned the human Spin sequence for internal repeats with the program dotter and found a triple repeat spanning nearly the complete protein sequence. We aligned the repeats using CLUSTALX and corrected the alignment manually for subsequent construction of a hidden Markov model (HMM). By scanning the nr database with this model we identified the repeat in open reading frames (ORFs) of other known members of the Spin/Ssty gene family with expectation (E) values below 1e-9. Among these, we detected three repeats with a typical length of 53 amino acids in the ORF of mouse Ssty, encompassing the two smaller 71 base-pair (bp) repeats that were previously noticed at the cDNA level. Spindlin-family protein sequences in the nr database are from human, mouse and chicken. Among the human and mouse sequences, many were hypothetical protein sequences translated from genomic or cDNA sequences. These sequences were too similar at the protein level to conclude that they derive from different genes.
To determine the number of Spin/Ssty-like genes for Mus musculus and Homo sapiens, we decided to isolate an initial redundant set of possible transcripts on the basis of the human and mouse RefSeq and UniGene databases and the database of confirmed peptides of the Ensembl human genome annotation project (Version 1.1.3), and finally to reduce the redundancy of identified transcripts by thorough sequence comparison. We identified the initial set of Spin/Ssty-like transcripts in these databases by TBLASTN searches using known spindlin-family protein sequences as queries.
For H. sapiens, we detected four different genes of the Spin/Ssty family. According to Ensembl, the chromosomal region Xp11.1 contains two SPIN-like genes: one coding for a spindlin-like transcript (Ensembl: ENST00000218159; RefSeq: NM_019003.1; UniGene: Hs.2294334; GenBankClone: Z82211) and a second in close proximity, which was named spindlin-like 2 (Ensembl: ENST00000252781; GenBankClone: Z82211). These transcripts are 99.7% identical to each other at the nucleotide level in their protein-coding regions and were first described by Laval et al. as members of the human X-linked DXF34 sequence family. Another SPIN-family gene resides on chromosome Xq12 (Ensembl: ENST00000253399). The best-characterized family member, the human SPIN gene (Ensembl: ENST00000223559; RefSeq: NM_006717.1; UniGene: Hs.3335321; GenBankClone: AL353748), is located on chromosome 9q22.2 and comprises three exons.
For M. musculus, scanning the RefSeq and UniGene resources revealed three Spin/Ssty-like transcripts with complete coding regions. The known Spin gene (RefSeq: NM_011462.1; UniGene: Mm.S939555) and the Ssty gene (also called Smy; RefSeq: NM_009220.1; UniGene: Mm.S936711) are around 70% identical on the protein level. A novel 1,056 bp cDNA (RefSeq: NM_023546.1; UniGene: Mm.S1997937) seems to encode a complete spindlin family protein with around 80% protein sequence identity to Ssty. Other mouse transcripts that could potentially encode complete proteins of the spindlin family seem to exist, as there are 11 additional independent cDNA assemblies in UniGene (Mm.S1975038, Mm.S1922195, Mm.S499811, Mm.S227336, Mm.S1973836, Mm.S707442, Mm.S781768, Mm.S502745, Mm.S782972, Mm.S778767, Mm.S787945). Their ORFs are interrupted or incomplete, however. Increased expressed sequence tag (EST) coverage and quality of these assemblies might reveal more functional spindlin family members. The high number of SPIN-like transcripts in mice is in agreement with previous reports [11,13] that presented evidence for the existence of a multi-copy Ssty-like gene family on the mouse Y chromosome. As three of four human Spin/Ssty-like genes consist of a single exon, and alternative transcripts of the human triple-exon gene SPIN have not yet been reported, alternative splicing is unlikely to contribute to the diversity of Spin/Ssty transcripts in mouse.
The subsequent analysis of Spin/Ssty repeats is exclusively based on repeats from known proteins or complete ORFs, in order to exclude low-quality sequences from the analysis. To include Spin/Ssty repeats from a fish protein, an exception is made for the O. latipes EST AU169984, which contains an incomplete ORF comprising two complete Spin/Ssty repeats without interruption by frameshift errors.
Using our initial HMM we identified three repeats per protein (two for the incomplete O. latipes protein) with E values below 1e-15. We aligned the repeats (Figure 1) and constructed three HMMs: two from seed alignments restricted to repeats with less than 75% and less than 90% pairwise sequence identity, respectively, and a third from a seed alignment containing all repeats. All HMMs re-identified the repeats with E values below 1e-22. However, scanning the nr database with these new models did not identify further Spin/Ssty repeats. We submitted a description and an alignment of the Spin/Ssty repeat to Pfam (Pfam 6.6: PF02513), which replaced the previous Spin/Ssty protein family entry.
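The redundancy thresholds used for the seed alignments can be illustrated with a minimal Python sketch. It assumes that identity is computed over alignment columns in which both sequences carry a residue and that redundant repeats are removed greedily in input order; the source does not state either detail, and the sequences and function names below are hypothetical toy examples, not Spin/Ssty repeats.

```python
def pairwise_identity(a: str, b: str) -> float:
    """Fraction of identical residues over columns where both aligned sequences have a residue."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = identical = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # skip columns with a gap in either sequence (assumption)
        compared += 1
        if x == y:
            identical += 1
    return identical / compared if compared else 0.0


def filter_by_identity(aligned: dict, max_identity: float) -> dict:
    """Greedy filter: keep a sequence only if it is below max_identity to every sequence kept so far."""
    kept = {}
    for name, seq in aligned.items():
        if all(pairwise_identity(seq, other) < max_identity for other in kept.values()):
            kept[name] = seq
    return kept


# Hypothetical aligned toy sequences (not real repeats).
repeats = {
    "r1": "ACDEFGHIKL",
    "r2": "ACDEFGHIRV",  # 80% identical to r1
    "r3": "ACDEFGYYYY",  # 60% identical to r1 and r2
    "r4": "MNPQ-STVWY",
}

seed_75 = filter_by_identity(repeats, 0.75)  # r2 is dropped as redundant with r1
seed_90 = filter_by_identity(repeats, 0.90)  # all four sequences are retained
print(sorted(seed_75), sorted(seed_90))
```

A greedy filter depends on input order; other strategies, such as clustering followed by the choice of one representative per cluster, would serve the same purpose.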
For some pairs of Spin/Ssty repeats, the pairwise sequence identity drops below 15%. To test the significance of the similarity among the repeat subtypes (amino-terminal, central, carboxy-terminal) and to exclude HMM training artifacts, we carried out a cross-validation test. We constructed an HMM for each repeat subtype and tried to detect the repeats of the remaining subtypes with it. For this approach we used five nonredundant proteins (gg_SPINZ, bt_SPINH, hs_SPINX2, mm_SSTY, mm_SPINL; Figure 1). We could identify the complete set of repeats from the five proteins with E values below 5e-3 and thus confirmed that the subgroups are evolutionarily related.
We made secondary-structure predictions using several programs via the Jpred2 server with the alignment of the whole family and the alignments of each of the amino-terminal, central and carboxy-terminal repeat subfamilies as a query. The consensus prediction for the whole alignment suggests four β strands for the Spin/Ssty repeat. Although the isolated central Spin/Ssty repeat is predicted to form an α helix in exchange for the second β strand, the single predictions for the amino- and carboxy-terminal repeat subtypes are in agreement with the prediction based on the whole family. Because in most cases the accuracy of secondary-structure predictions is higher when alignments of more diverse protein-family members are used, we believe that the predictions based on the whole family are the most reliable, and we suggest an all-β structure with four β strands for all Spin/Ssty repeats. Attempts to assign a known protein fold to the Spin/Ssty repeat using different fold-prediction methods via the Structure Prediction Meta Server did not lead to significant predictions.
Our findings might serve as a basis for future work on this new class of repeats. The Spin/Ssty repeat alignment will assist in detecting further family members in other species and in the search for an evolutionary origin of the spindlin family of proteins. The detection of Spin/Ssty repeats in proteins with other domain architectures might provide a clue to the function of the spindlin family. Knowledge of the repeat structure of spindlin-like proteins can support further experimental work. Once interaction partners or biochemical functions are identified for the spindlin-like proteins, hypotheses based on the repeat architecture can be generated for further experiments. Site-directed mutagenesis targeted at conserved residues is most likely to disrupt the structure or destroy the function of a protein, and attempts to delete certain regions of spindlin-family proteins, or to swap regions between family members in order to explore their function, can now be guided by the repeat architecture to choose more reasonable borders. Finally, we hope that our findings will support the exploration of the tertiary structure of spindlin-like proteins, as the Spin/Ssty sequence repeat is probably reflected by a repeated structural element with four β strands, which currently cannot be assigned to a known type of protein fold.
Materials and methods
Searching sequence databases
We scanned several databases to identify ESTs, ORFs, known protein sequences or gene structures of the Spin/Ssty gene family. We used the following databases, which can all be downloaded from the NCBI ftp server [14] (database filenames are given in brackets) or the ENSEMBL ftp server [15]: the Non-redundant Protein Sequence Database (nr.Z), dbEST (est.Z), the mouse and human RefSeq mRNA and peptide sequences (hs.fna.gz, hs.faa.gz, mouse.fna.gz, mouse.faa.gz), the mouse and human UniGene databases (Hs.seq.uniq, Build #141; Mm.seq.uniq, Build #95) and the ENSEMBL set of confirmed human peptides and corresponding transcripts (ensembl.pep.gz, ensembl.cdna.gz, Ver.1.1.3). Pairwise sequence-similarity searches in these databases were carried out using the gapped versions of the programs of the BLAST program package version 2.1.2 with default scoring schemes.
The aim of the program dotter is to visualize local sequence similarity between two sequences by allowing the user to view the dot matrix of the sequence comparisons and the alignment of the sequences in parallel. Here, dotter was used to compare sequences with themselves to examine them for repeats. Finally, it was used to refine the borders of repeat regions before their selection for the alignment.
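The idea behind comparing a sequence with itself can be sketched in a few lines of Python. This is a simplified illustration and not dotter's algorithm: it assumes a plain identity count within a sliding window, whereas dotter scores windows with a substitution matrix and renders the result graphically. The sequence, the window size and the threshold are hypothetical.

```python
def self_dot_matrix(seq: str, window: int = 5, min_matches: int = 4):
    """Return (i, j) pairs, i < j, whose windows share at least min_matches identical residues."""
    hits = []
    n = len(seq) - window + 1
    for i in range(n):
        for j in range(i + 1, n):
            matches = sum(seq[i + k] == seq[j + k] for k in range(window))
            if matches >= min_matches:
                hits.append((i, j))
    return hits


def diagonal_offsets(hits):
    """Count hits per off-diagonal offset (j - i); a repeat appears as a well-populated offset."""
    counts = {}
    for i, j in hits:
        counts[j - i] = counts.get(j - i, 0) + 1
    return counts


# Hypothetical toy sequence: a 10-residue unit occurring three times,
# separated by 3-residue spacers, with one substitution in the third copy.
toy = "MKTAYWGHLE" + "PPS" + "MKTAYWGHLE" + "QQD" + "MKTAYWGHIE"

counts = diagonal_offsets(self_dot_matrix(toy))
for offset in sorted(counts):
    print(offset, counts[offset])
```

In a dot matrix these populated offsets correspond to lines running parallel to the main diagonal, and the start and end of each line indicate approximate repeat borders.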
Multiple alignment and phylogenetic tree construction
Multiple alignments were carried out with CLUSTALX version 1.8.1 using the BLOSUM62 substitution matrix. The neighbor-joining algorithm of CLUSTALX was used to build phylogenetic trees after gaps were removed from the alignments. The drawtree program of the PHYLIP package version 3.5 was used to visualize the tree [20].
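Removing gaps before tree building can be illustrated as follows. The sketch assumes that every alignment column containing at least one gap is discarded and that distances are uncorrected fractions of differing residues; it is not the CLUSTALX implementation, which may differ in detail. The function names and the toy alignment are hypothetical.

```python
def remove_gap_columns(aligned):
    """Drop every alignment column in which at least one sequence has a gap."""
    if not aligned:
        return []
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must have equal length")
    keep = [c for c in range(len(aligned[0])) if all(s[c] != "-" for s in aligned)]
    return ["".join(s[c] for c in keep) for s in aligned]


def p_distance(a: str, b: str) -> float:
    """Uncorrected distance: fraction of positions at which two gap-free sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_matrix(seqs):
    """Symmetric matrix of pairwise p-distances, the usual input to neighbor joining."""
    return [[p_distance(a, b) for b in seqs] for a in seqs]


# Hypothetical toy alignment with gaps in columns 1 and 3.
alignment = ["ACD-EFG", "ACDKEYG", "A-DKEFG"]
ungapped = remove_gap_columns(alignment)
matrix = distance_matrix(ungapped)
print(ungapped)
print(matrix)
```

Restricting the comparison to gap-free columns means that all pairwise distances are computed over the same set of positions, which keeps them comparable across the matrix.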
Protein-sequence profile searches
For sensitive detection of repeats we built profile HMMs from the diverse alignments using the HMMER program suite with default options for model building with hmmbuild (hmmls/domain alignment) and calibration with hmmcalibrate (sampled sequences: 5,000; mean length 350). Protein database searches with these HMMs were carried out using the hmmsearch program.
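How a calibrated model turns a score into an E value can be sketched under one stated assumption: that calibration fits an extreme value (Gumbel) distribution with location μ and scale λ to the scores of the sampled random sequences, and that the E value is the tail probability multiplied by the number of sequences searched. The parameter values and the database size below are hypothetical and are not taken from our models.

```python
import math


def evd_pvalue(score: float, mu: float, lam: float) -> float:
    """P(S >= score) under a Gumbel distribution: 1 - exp(-exp(-lam * (score - mu)))."""
    return -math.expm1(-math.exp(-lam * (score - mu)))


def evalue(score: float, mu: float, lam: float, n_sequences: int) -> float:
    """Expected number of random hits scoring at least `score` in a database of n_sequences."""
    return n_sequences * evd_pvalue(score, mu, lam)


# Hypothetical calibration parameters and database size.
MU, LAM, N = -20.0, 0.25, 700_000

for s in (0.0, 40.0, 80.0):
    print(s, evalue(s, MU, LAM, N))
```

Because the tail probability falls off roughly exponentially with the score, a modest gain in score lowers the E value by orders of magnitude, and the same score yields a larger E value in a larger database.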
Having identified ESTs of a putative novel SPIN-family gene, we used the program Gap version 4.4 for their assembly to derive a consensus representation of the complete mRNA sequence.
Secondary-structure predictions were performed with the consensus method of the Jpred2 server. This method is built on several other well-known secondary-structure prediction algorithms such as DSC, Jnet, NNSSP, PHD and Zpred. According to the authors, the Jpred2 consensus method reaches a level of 75% accuracy in secondary-structure prediction and outperforms the single methods.
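The principle of a consensus prediction can be illustrated with a per-residue majority vote. This is a deliberately simplified sketch: the actual Jpred2 consensus procedure may weight or combine the individual methods differently, and the three prediction strings below are hypothetical (E, β strand; H, α helix; -, other).

```python
from collections import Counter


def consensus_structure(predictions, tie_state: str = "-") -> str:
    """Per-residue majority vote over equally long prediction strings; ties yield tie_state."""
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("need at least one prediction, all of equal length")
    consensus = []
    for column in zip(*predictions):
        ranked = Counter(column).most_common()
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            consensus.append(tie_state)  # no clear majority at this residue
        else:
            consensus.append(ranked[0][0])
    return "".join(consensus)


# Hypothetical outputs of three prediction methods for the same 11 residues.
method_outputs = [
    "-EEEE--HHH-",
    "-EEE---EEE-",
    "--EEE--EEE-",
]

consensus = consensus_structure(method_outputs)
print(consensus)  # -EEEE--EEE-
```

In this toy case the helix proposed by a single method is outvoted by the two methods that predict a strand, which mirrors how a consensus can smooth over the disagreement of an individual predictor.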
- 14. National Center for Biotechnology Information ftp server. [ftp://ncbi.nlm.nih.gov]
- 15. ENSEMBL ftp server. [ftp://ftp.ensembl.org]
- 20. Felsenstein J: PHYLIP - Phylogeny Inference Package (Version 3.2). Cladistics. 1989, 5: 164-166.