Abstract
Reliable automatic protein function annotation requires methods for detecting orthologs with known function from closely related species. While current approaches are restricted to finding ortholog clusters from complete proteomes, most annotation problems arise in the context of partially sequenced genomes. We use a combinatorial optimization method for extracting candidate ortholog clusters robustly from incomplete genomes. The proposed algorithm focuses exclusively on sequence relationships across genomes and finds a subset of sequences from multiple genomes where every sequence is highly similar to other sequences in the subset. We then use an optimization criterion similar to the one for finding ortholog clusters to annotate the target sequences.
We report on a candidate annotation for proteins in the rice genome using ortholog clusters constructed from four partially complete cereal genomes – barley, maize, sorghum, wheat and the complete genome of Arabidopsis.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
References
Abascal, F., Valencia, A.: Automatic annotation of protein function based on family identification. Proteins 53, 683–692 (2003)
Tatusov, R., Koonin, E., Lipmann, D.: A genomic perspective on protein families. Science 278, 631–637 (1997)
Enright, A.J., Van Dongen, S., Ouzonis, C.: An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res 30, 1575–1584 (2002)
Petryszak, R., Kretschmann, E., Wieser, D., Apweiler, R.: The predictive power of the CluSTr database. Bioinformatics 21, 3604–3609 (2005)
Wu, C.H., Huang, H., Yeh, L.S.L., Barker, W.C.: Protein family classification and functional annotation. Comput. Biol. Chem. 27, 37–47 (2003)
Bru, C., Courcelle, E., Carrre, S., Beausse, Y., Dalmar, S., Kahn, D.: The ProDom database of protein domain families: more emphasis on 3D. Nucleic Acids Res 33, D212–215 (2005)
Bateman, A., Coin, L., Durbin, R., Finn, R.D., Hollich, V., Griffiths-Jones, S., Khanna, A., Marshall, M., Moxon, S., Sonnhammer, E.L.L., Studholme, D.J., Yeats, C., Eddy, S.R.: The Pfam protein families database. Nucleic Acids Res 32, 138–141 (2004)
Andreeva, A., Howorth, D., Brenner, S.E., Hubbard, T.J.P., Chothia, C., Murzin, A.G.: SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic Acids Res 32, 226–229 (2004)
Fleishmann, W., Moller, S., Gateau, A., Apweiler, R.: A novel method for automatic functional annotation of proteins. Bioinformatics 15, 228–233 (1999)
Curwen, V., Wyras, E., Andrews, T.D., Clarke, L., Mongin, E., Searle, S.M., Clamp, M.: The Ensembl automatic gene annotation system. Genome Res 14, 942–950 (2004)
Eisen, J., Wu, M.: Phylogenetic analysis and gene functional predictions: phylogenomics in action. Theor. Popul. Biol. 61, 481–487 (2002)
Galperin, M.Y., Koonin, E.V.: Who’s your neighbor? new computational approaches for functional genomics. Nat. Biotechnol. 18, 609–613 (2000)
Altschul, S., Madden, T., Schaffer, A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25, 3389–3402 (1997)
Koski, L.B., Golding, G.B.: The closest BLAST hit is often not the nearest neighbor. J. Mol. Biol. 52, 540–542 (2001)
Remm, M., Strom, C.E., Sonnhammer, E.L.: Automatics clustering of orthologs and in-paralogs from pairwise species comparisons. J. Mol. Biol. 314, 1041–1052 (2001)
Li, L., Stoeckert, C.K., Roos, D.S.: OrthoMCL: Identification of ortholog groups for eukaryotic genomes. Genome Res 13, 2178–2189 (2003)
Tatusov, R., Fedorova, N., Jackson, J., Jacobs, A., Kiryutin, B., Koonin, E., Krylov, D., Mazumdes, R., Mekhedov, S., Nikolskaya, A., Rao, B., Smirnov, S., Sverdlov, A., Vasudevan, S., Wolf, Y., Yin, J., Natale, D.: The COG database: an updated version includes eukaryotes. BioMed Central Bioinformatics (2003)
Abascal, F., Valencia, A.: Clustering of proximal sequence space for identification of protein families. Bioinformatics 18, 908–921 (2002)
Vashist, A., Kulikowski, C., Muchnik, I.: Ortholog clustering on a multipartite graph. In: Workshop on Algorithms in Bioinformatics, pp. 328–340 (2005)
Kamvysselis, M., Patterson, N., Birren, B., Berger, B., Lander, E.: Whole-genome comparative annotation and regulatory motif discovery in multiple yeast species. In: RECOMB, pp. 157–166 (2003)
Huynen, M.A., Bork, P.: Measuring genome evolution. Proc. Natl. Acad. Sci. USA 95, 5849–5856 (1998)
Fujibuchi, W., Ogata, H., Matsuda, H., Kanehisa, M.: Automatic detection of conserved gene clusters in multiple genomes by graph comparison and P-quasi grouping. Nucleic Acids Res 28, 4036–4096 (2002)
Overbeek, R., Fonstein, M., D’Souza, M., Pusch, G.D., Maltsev, N.: The use of gene clusters to infer functional coupling. Proc. Natl. Acad. Sci. USA 96, 2896–2901 (1999)
He, X., Goldwasser, M.H.: Identifying conserved gene clusters in the presence of orthologous groups. In: RECOMB, pp. 272–280 (2004)
Dandekar, T., Snel, B., Huynen, M., Bork, P.: Conservation of gene order: a fingerprint of proteins that physically interact. Trends Biochem. Sci. 23, 324–328 (1998)
Heber, S., Stoye, J.: Algorithms for finding gene clusters. In: Workshop on Algorithms in Bioinformatics, pp. 252–263 (2001)
Cannon, S.B., Young, N.D.: OrthoParaMap: distinguishing orthologs from paralogs by integrating comparative genome data and gene phylogenies. BMC Bioinformatics 4 (2003)
Dong, Q., Schlueter, D., Brendel, V.: PlantGDB, plant genome database and analysis tools. Nucleic Acids Res 32, D354–D359 (2004)
Schoof, H., Zaccaria, P., Gundlach, H., Lemcke, K., Rudd, S., Kolesov, G., Mewes, R.A.H., Mayer, K.: MIPS arabidopsis thaliana database (MAtDB): an integrated biological knowledge resource based on the first complete plant genome. Nucleic Acids Res 30, 91–93 (2002)
Kellogg, E.A.: Relationships of cereal crops and other grasses. Proc. Natl. Acad. Sci. USA 95, 2005–2010 (1998)
Darlingto, H., Rouster, J., Hoffmann, L., Halford, N., Shewry, P., Simpson, D.: Identification and molecular characterisation of hordoindolines from barley grain. Plant Mol. Biol. 47, 785–794 (2001)
Castleden, C.K., Aoki, N., Gillespie, V.J., MacRae, E.A., Quick, W.P., Buchner, P., Foyer, C.H., Furbank, R.T., Lunn, J.E.: Evolution and function of the sucrose-phosphate synthase gene families in wheat and other grasses. Plant Physiology 135, 1753–1764 (2004)
Song, R., Llaca, V., Linton, E., Messing, J.: Sequence, regulation, and evolution of the maize 22-kD alpha zein gene family. Genome Res. 11, 1817–1825 (2001)
Cormen, T., Leiserson, C., Rivest, R., Stein, C.: Introduction to Algorithms, 2nd edn. The MIT Press, Cambridge (2001)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Vashist, A., Kulikowski, C., Muchnik, I. (2006). Protein Function Annotation Based on Ortholog Clusters Extracted from Incomplete Genomes Using Combinatorial Optimization. In: Apostolico, A., Guerra, C., Istrail, S., Pevzner, P.A., Waterman, M. (eds) Research in Computational Molecular Biology. RECOMB 2006. Lecture Notes in Computer Science(), vol 3909. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11732990_10
Download citation
DOI: https://doi.org/10.1007/11732990_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-33295-4
Online ISBN: 978-3-540-33296-1
eBook Packages: Computer ScienceComputer Science (R0)