Abstract
We propose a novel hub-finding algorithm that relies on dipeptide composition and amino acid sequence likeness. To extract the most prominent features for hub identification, we used two feature selection techniques that are widely applied in data preprocessing for machine learning problems: the fast correlation-based feature selection (FCBFS) and correlation-based feature selection (CFS) algorithms. The performance of two classifiers, a random forest classifier (RFC) and a radial basis function (RBF) network, was evaluated with these filter approaches. The proposed model predicted hub proteins from primary structure with 92.52% and 91.28% accuracy for the RFC and RBF network, respectively, with FCBFS, and with 90.92% and 93.76% accuracy for the RFC and RBF network, respectively, with CFS.
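As an illustration of the kind of sequence-derived feature the abstract refers to, the following minimal sketch computes a plain dipeptide composition vector: the relative frequency of each of the 400 ordered amino acid pairs in a sequence. It assumes the standard 20-letter amino acid alphabet and overlapping pairs; the function name `dipeptide_composition` and the example sequence are hypothetical, and the chapter's exact feature definition (including any compositional skew term) is not given in this abstract, so this should not be read as the authors' implementation.

```python
from itertools import product

# Standard 20 amino acids (one-letter codes); an assumption of this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered dipeptides, in a fixed order so feature vectors are comparable.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return a 400-element list of dipeptide frequencies for a protein sequence.

    Overlapping pairs are counted; pairs containing a non-standard residue
    (for example 'X') are skipped. Frequencies sum to 1 when at least one
    valid pair exists, otherwise the vector is all zeros.
    """
    seq = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]


# Hypothetical example sequence, not taken from the chapter's data set.
features = dipeptide_composition("MKVLAAGIVK")
```

A vector like `features` could then be passed, one row per protein, to a feature selection filter and a classifier of the reader's choice.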
© 2012 Springer India
About this paper
Cite this paper
Aswathi, B.L., Goli, B., Govindarajan, R., Nair, A.S. (2012). A Novel Algorithm for Prediction of Hub Proteins from Primary Structure in Eukaryotic Proteome Using Dipeptide Compositional Skew Information and Amino Acid Sequence Likeness. In: Sabu, A., Augustine, A. (eds) Prospects in Bioscience: Addressing the Issues. Springer, India. https://doi.org/10.1007/978-81-322-0810-5_4
DOI: https://doi.org/10.1007/978-81-322-0810-5_4
Publisher Name: Springer, India
Print ISBN: 978-81-322-0809-9
Online ISBN: 978-81-322-0810-5
eBook Packages: Biomedical and Life Sciences (R0)