Journal of Mathematical Biology, Volume 78, Issue 1–2, pp 441–463

Identifying anticancer peptides by using a generalized chaos game representation

  • Li Ge
  • Jiaguo Liu
  • Yusen Zhang
  • Matthias Dehmer


We generalize chaos game representation (CGR) to higher-dimensional spaces while preserving its bijectivity, keeping the method sufficiently representative and mathematically rigorous compared with previous attempts. We first state and prove an asymptotic property of CGR and of our generalized chaos game representation (GCGR) method. It follows that the dissimilarity between sequences that share an identical subsequence at different positions decreases exponentially with the length of that subsequence; this effect has been present all along but had gone unnoticed by researchers. By making it explicit, we show that the effect fundamentally supports (G)CGR as a similarity measure or feature extraction technique. We develop two feature extraction techniques, GCGR-Centroid and GCGR-Variance. We use GCGR-Centroid to analyze the similarity between protein sequences on three datasets: 9 ND5, 24 TF and 50 beta-globin proteins. The results are consistent with previous studies, which supports the validity of the method. Finally, using support vector machines, we train anticancer peptide prediction models on both GCGR-Centroid and GCGR-Variance features, and achieve significantly higher prediction performance on three well-studied anticancer peptide datasets.
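The exponential effect described above can be illustrated with the classic two-dimensional CGR for DNA (Jeffrey 1990), in which each point is the midpoint between the previous point and the corner assigned to the current symbol. The following is a minimal sketch under stated assumptions, not the GCGR construction of this paper: the corner assignment, the starting point and the names `CORNERS`, `cgr_points` and `final_distance` are chosen here purely for illustration.

```python
import math

# Classic 2-D CGR for DNA, NOT the paper's generalized GCGR.
# The corner assignment below is one common convention (an assumption).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq, start=(0.5, 0.5)):
    """Return the CGR trajectory of seq.

    Each point is the midpoint between the previous point and the
    corner of the current nucleotide.
    """
    x, y = start
    points = []
    for base in seq:
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


def final_distance(seq_a, seq_b):
    """Euclidean distance between the last CGR points of two sequences."""
    ax, ay = cgr_points(seq_a)[-1]
    bx, by = cgr_points(seq_b)[-1]
    return math.hypot(ax - bx, ay - by)


# Two unrelated prefixes of different lengths, followed by the same
# subsequence, so the shared part sits at different positions.
shared = "GATTACA"
base_gap = final_distance("AAAA", "TTTTTT")
for k in range(1, len(shared) + 1):
    gap = final_distance("AAAA" + shared[:k], "TTTTTT" + shared[:k])
    # Every shared symbol halves the gap between the two end points.
    print(k, gap, base_gap / 2 ** k)
```

In this sketch each shared symbol applies the same contraction to both trajectories, so the gap between the end points is halved at every step regardless of where the shared subsequence starts in either sequence.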


Keywords: Chaos game representation; Similarity analysis; Anticancer peptides; Support vector machine

Mathematics Subject Classification




We gratefully acknowledge the anonymous reviewers who read our paper and gave constructive comments. This work is supported by the National Natural Science Foundation of China (No. 61877064). Matthias Dehmer thanks the Austrian Science Funds for supporting this work (project P26142).



Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  1. School of Mathematics and Statistics, Shandong University at Weihai, Weihai, China
  2. Department of Mechatronics and Biomedical Computer Science, UMIT, Hall in Tyrol, Austria
