Abstract
We generalize chaos game representation (CGR) to higher-dimensional spaces while preserving its bijectivity, so that the method remains sufficiently representative and mathematically rigorous compared with previous attempts. We first state and prove the asymptotic property of CGR and of our generalized chaos game representation (GCGR) method. It follows that the dissimilarity between sequences that share an identical subsequence at different positions decreases exponentially with the length of that subsequence; this effect has been at work without researchers being aware of it. By highlighting it, we show that the effect fundamentally supports (G)CGR as a similarity measure or feature extraction technique. We develop two feature extraction techniques: GCGR-Centroid and GCGR-Variance. We use GCGR-Centroid to analyze the similarity between protein sequences on three datasets: 9 ND5, 24 TF and 50 beta-globin proteins. Our results are consistent with those of previous studies, which supports the validity of the approach. Finally, using support vector machines, we train anticancer peptide prediction models with both GCGR-Centroid and GCGR-Variance features, and achieve significantly higher prediction performance on three well-studied anticancer peptide datasets.
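The exponential effect described above can be illustrated with the classical two-dimensional CGR on a four-letter alphabet. The following is a minimal sketch, not the authors' GCGR: the corner layout, the starting point, and the names `CORNERS`, `cgr_points` and `final_distance` are assumptions made for illustration, and only the final CGR points are compared rather than any dissimilarity measure defined in the paper. Because each step halves the distance to a corner, two sequences ending in the same subsequence of length k have final points whose distance is the earlier distance divided by 2^k.

```python
import math

# Assumed corner layout on the unit square (a common convention, not taken from the paper).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq, start=(0.5, 0.5)):
    """Return the CGR trajectory of seq.

    Each point is the midpoint between the previous point and the
    corner assigned to the current symbol.
    """
    x, y = start
    points = []
    for symbol in seq:
        cx, cy = CORNERS[symbol]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


def final_distance(seq_a, seq_b):
    """Euclidean distance between the last CGR points of two sequences."""
    ax, ay = cgr_points(seq_a)[-1]
    bx, by = cgr_points(seq_b)[-1]
    return math.hypot(ax - bx, ay - by)


# Two different prefixes of different lengths, so the shared part sits at
# distinct positions in the two sequences.
shared = "GATTACA"
for k in range(1, len(shared) + 1):
    suffix = shared[-k:]
    d = final_distance("AAAA" + suffix, "GGGGGG" + suffix)
    # Any two points in the unit square are at most sqrt(2) apart, and each
    # shared symbol halves the distance, giving the bound sqrt(2) / 2**k.
    print(k, d, d <= math.sqrt(2) * 2.0 ** -k)
```

Each additional shared symbol halves the printed distance, which is the two-dimensional analogue of the exponential decrease stated in the abstract.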
Acknowledgements
We gratefully acknowledge the anonymous reviewers who read our paper and gave constructive comments. This work is supported by the National Natural Science Foundation of China (No. 61877064). Matthias Dehmer thanks the Austrian Science Funds for supporting this work (project P26142).
Additional information
The authors sincerely thank the referees for their valuable and highly constructive comments that have significantly improved this paper. This study was supported by the Shandong Natural Science Foundation (Grant No. ZR2015AM017). Matthias Dehmer thanks the Austrian Science Funds for supporting this work (Project P26142).
About this article
Cite this article
Ge, L., Liu, J., Zhang, Y. et al. Identifying anticancer peptides by using a generalized chaos game representation. J. Math. Biol. 78, 441–463 (2019). https://doi.org/10.1007/s00285-018-1279-x
DOI: https://doi.org/10.1007/s00285-018-1279-x