Abstract
Computational intelligence and pattern recognition techniques are attracting growing attention as the main computing tools in bioinformatics applications. This is because biology, by definition, deals with complex systems, and computational intelligence can be considered an effective approach to the general problem of complex systems modelling. Moreover, most data available in shared databases are represented by sequences and graphs, thus demanding the definition of meaningful dissimilarity measures between patterns, which are often non-metric in nature. Especially in such cases, evolutionary and fully automatic machine learning systems are mandatory for dealing with parametric dissimilarity measures and/or for performing suitable feature selection. Alongside other approaches, such as kernel methods and embedding in dissimilarity spaces, granular computing is a very promising framework, not only for designing effective data-driven modelling systems able to determine automatically the correct representation (abstraction) level, but also for giving field experts (biologists) the possibility to investigate the information granules (frequent substructures) that the machine learning system has discovered as the most relevant for the problem at hand. We expect that many important discoveries in biology and medicine in the near future will be driven by an increasingly strong integration between the ongoing research efforts of the natural sciences and modern inductive modelling tools based on computational intelligence, pattern recognition and granular computing techniques.
Notes
- 1.
For example, let us consider a classification/clustering algorithm driven by the Euclidean distance. A common problem with the Euclidean distance is that features spanning a wider range of values have more influence on the resulting distance. Normalising all attributes to the same range (usually \([0, 1]\) or \([-1,+1]\)) therefore ensures a fair contribution from all attributes, regardless of their original range.
- 2.
In statistics, outliers are “anomalous data” that, for a given dissimilarity measure, lie far away from most observations.
- 3.
Non sunt multiplicanda entia sine necessitate (Entities are not to be multiplied without necessity), commonly known as the “Ockham’s Razor” criterion (William of Ockham, circa 1287–1347). This criterion states that, among a set of predictive models with the same performance, the simplest one (i.e. the one with the simplest decision surfaces) should be preferred. It is certainly one of the fundamental axioms for thoughtful and practical data-driven modelling.
- 4.
Also known as hyperparameters in the Machine Learning terminology.
- 5.
That is why evolutionary optimisation metaheuristics fall within the derivative-free methods.
- 6.
A common choice for a genetic algorithm fitness function takes into account both the model performance and its structural complexity. Specifically, whilst the former should be maximised, the latter should be minimised in order to avoid overfitting (cf. the Ockham’s Razor Criterion).
- 7.
That is why in most of the Chapter, unless explicitly specified, the generic term (dis)similarity will be used.
- 8.
Indeed, the anatomical structure changes in the order of months/years depending on the age of subjects.
- 9.
A finite set of points equipped with a notion of distance in a finite multidimensional space.
- 10.
According to which the distance between two strings of equal length is given by the number of mismatches.
- 11.
Also known as the Gram Matrix, after Danish mathematician Jørgen Pedersen Gram.
- 12.
If the similarity measure at hand is not symmetric, patterns’ distance vectors as taken by rows or columns will be different. In order to overcome this problem, one can ‘force’ a similarity measure to be symmetric by considering \(\mathbf {S}:=(\mathbf {S}+\mathbf {S}^T)/2\) (e.g. [14]).
- 13.
Also known as the Krebs cycle.
- 14.
Protein molecules driving the folding of other protein systems.
- 15.
Indeed, the absolute magnitude of the metabolic rate can vary for many reasons, ranging from anatomical differences among patients to their current nutritional state.
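Notes 1, 10 and 12 each describe a small numerical operation: rescaling attributes to a common range, counting mismatches between equal-length strings, and forcing a similarity matrix to be symmetric. The Python sketch below illustrates the three of them. It is only a minimal illustration under stated assumptions: the function names are hypothetical and do not come from the chapter, the target range is fixed to \([0, 1]\), and mapping a constant attribute to 0 is an arbitrary choice made here to avoid a division by zero.

```python
from typing import List, Sequence


def min_max_normalise(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Rescale every attribute (column) to [0, 1] (Note 1).

    Assumption: a constant column is mapped to 0.0.
    """
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    spans = [max(col) - min(col) for col in columns]
    return [
        [(x - lo) / span if span > 0 else 0.0
         for x, lo, span in zip(row, lows, spans)]
        for row in rows
    ]


def hamming_distance(a: str, b: str) -> int:
    """Number of mismatching positions between equal-length strings (Note 10)."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


def symmetrise(s: Sequence[Sequence[float]]) -> List[List[float]]:
    """Return (S + S^T) / 2 for a square similarity matrix S (Note 12)."""
    n = len(s)
    if any(len(row) != n for row in s):
        raise ValueError("matrix must be square")
    return [[(s[i][j] + s[j][i]) / 2 for j in range(n)] for i in range(n)]
```

After `min_max_normalise`, an attribute measured in the hundreds and one measured in single units contribute on the same scale to a Euclidean distance. After `symmetrise`, the rows and columns of the matrix coincide, so a pattern's vector of similarities no longer depends on whether it is read by row or by column.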
References
S. Alelyani, J. Tang, H. Liu, Feature selection for clustering: a review. Data Clust. Algorithms Appl. 29, 110–121 (2013)
C. Anderson, The end of theory: the data deluge makes the scientific method obsolete. Wired mag. 16(7), 16–07 (2008)
M. Ankerst, M.M. Breunig, H.P. Kriegel, J. Sander, OPTICS: ordering points to identify the clustering structure. ACM SIGMOD Rec. 28, 49–60 (1999)
A. Bargiela, W. Pedrycz, Granular Computing: An Introduction (Kluwer Academic Publishers, Boston, 2003)
V. Beckers, L.M. Dersch, K. Lotz, G. Melzer, O.E. Bläsing, R. Fuchs, T. Ehrhardt, C. Wittmann, In silico metabolic network analysis of Arabidopsis leaves. BMC Syst. Biol. 10(1), 102 (2016)
J. Bergstra, Y. Bengio, Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13, 281–305 (2012)
H.M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. Bhat, H. Weissig, I.N. Shindyalov, P.E. Bourne, The protein data bank. Nucleic Acids Res. 28(1), 235–242 (2000)
F.M. Bianchi, L. Livi, A. Rizzi, A. Sadeghian, A granular computing approach to the design of optimized graph classification systems. Soft Comput. 18(2), 393–412 (2014)
F.M. Bianchi, S. Scardapane, A. Rizzi, A. Uncini, A. Sadeghian, Granular computing techniques for classification and semantic characterization of structured data. Cogn. Comput. 8(3), 442–461 (2016)
P.S. Bradley, O.L. Mangasarian, W.N. Street, Clustering via concave minimization, in Advances in Neural Information Processing Systems (1997), pp. 368–374
E. Bullmore, O. Sporns, Complex brain networks: graph theoretical analysis of structural and functional systems. Nat. Rev. Neurosci. 10(3), 186–198 (2009)
G. Carlsson, Topology and data. Bull. Am. Math. Soc. 46(2), 255–308 (2009)
C. Cellucci, Rethinking Logic: Logic in Relation to Mathematics, Evolution, and Method (Springer Science & Business Media, 2013)
Y. Chen, E.K. Garcia, M.R. Gupta, A. Rahimi, L. Cazzanti, Similarity-based classification: concepts and algorithms. J. Mach. Learn. Res. 10, 747–776 (2009)
Y. Chen, M.R. Gupta, B. Recht, Learning kernels from indefinite similarities, in Proceedings of the 26th Annual International Conference on Machine Learning (ACM, 2009), pp. 145–152
A. Colorni, M. Dorigo, V. Maniezzo, Distributed optimization by ant colonies, in Toward a Practice of Autonomous Systems: Proceedings of the First European Conference on Artificial Life (MIT Press, 1992), p. 134
D. Counsell, A review of bioinformatics education in the UK. Brief. Bioinform. 4(1), 7–21 (2003)
J. Damoiseaux, S. Rombouts, F. Barkhof, P. Scheltens, C. Stam, S.M. Smith, C. Beckmann, Consistent resting-state networks across healthy subjects. Proc. Natl. Acad. Sci. 103(37), 13848–13853 (2006)
D.L. Davies, D.W. Bouldin, A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. 2, 224–227 (1979)
G. Del Vescovo, L. Livi, F.M. Frattale Mascioli, A. Rizzi, On the problem of modeling structured data with the MinSOD representative. Int. J. Comput. Theory Eng. 6(1), 9 (2014)
A. Di Noia, P. Montanari, A. Rizzi, Occupational diseases risk prediction by cluster analysis and genetic optimization, in Proceedings of the International Joint Conference on Computational Intelligence (SCITEPRESS-Science and Technology Publications, Lda, 2014), pp. 68–75
A. Di Noia, P. Montanari, A. Rizzi, Occupational diseases risk prediction by genetic optimization: Towards a non-exclusive classification approach, in Computational Intelligence (Springer, Berlin, 2016), pp. 63–77
L. Di Paola, M. De Ruvo, P. Paci, D. Santoni, A. Giuliani, Protein contact networks: an emerging paradigm in chemistry. Chem. Rev. 113(3), 1598–1613 (2012)
M. Ester, H.P. Kriegel, J. Sander, X. Xu et al., A density-based algorithm for discovering clusters in large spatial databases with noise. KDD 96, 226–231 (1996)
M.D. Fox, M.E. Raichle, Spontaneous fluctuations in brain activity observed with functional magnetic resonance imaging. Nat. Rev. Neurosci. 8(9), 700–711 (2007)
K.J. Friston, C.D. Frith, R.S. Frackowiak, R. Turner, Characterizing dynamic brain responses with fMRI: a multivariate approach. Neuroimage 2(2), 166–172 (1995)
J. Gao, B. Barzel, A.L. Barabási, Universal resilience patterns in complex networks. Nature 530(7590), 307–312 (2016)
A. Giuliani, S. Filippi, M. Bertolaso, Why network approach can promote a new way of thinking in biology. Front. Genet. 5 (2014)
A. Giuliani, A. Krishnan, J.P. Zbilut, M. Tomita, Proteins as networks: usefulness of graph theory in protein science. Curr. Protein Peptide Sci. 9(1), 28–38 (2008)
D.E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning (Addison-Wesley, USA, 1989)
M.D. Greicius, B. Krasnow, A.L. Reiss, V. Menon, Functional connectivity in the resting brain: a network analysis of the default mode hypothesis. Proc. Natl. Acad. Sci. 100(1), 253–258 (2003)
S. Guha, R. Rastogi, K. Shim, CURE: an efficient clustering algorithm for large databases. ACM SIGMOD Rec. 27, 73–84 (1998)
R.W. Hamming, Error detecting and error correcting codes. Bell Labs Tech. J. 29(2), 147–160 (1950)
B. He, K. Wang, Y. Liu, B. Xue, V.N. Uversky, A.K. Dunker, Predicting intrinsic disorder in proteins: an overview. Cell Res. 19(8), 929–949 (2009)
D.R. Hofstadter, I Am a Strange Loop (Basic Books, 2007)
J. Horgan, From complexity to perplexity. Sci. Am. 272(6), 104–109 (1995)
A.K. Jain, M.N. Murty, P.J. Flynn, Data clustering: a review. ACM Comput. Surv. (CSUR) 31(3), 264–323 (1999)
G. Jurman, R. Visintainer, C. Furlanello, An introduction to spectral distances in networks. Front. Artif. Intell. Appl. 226, 227–234 (2011)
L. Kaufman, P. Rousseeuw, Clustering by means of medoids. Stat. Data Anal. Based L1-Norm Relat. Methods, 405–416 (1987)
J. Kennedy, R. Eberhart, Particle swarm optimization, in Proceedings of the IEEE International Conference on Neural Networks, vol. 4 (IEEE, 1995), pp. 1942–1948
S. Kirkpatrick, C.D. Gelatt, M.P. Vecchi, Optimization by simulated annealing. Science 220(4598), 671–680 (1983)
V.I. Levenshtein, Binary codes capable of correcting deletions, insertions, and reversals. Soviet physics doklady. 10, 707–710 (1966)
A.W.C. Liew, H. Yan, M. Yang, Pattern recognition techniques for the emerging field of bioinformatics: A review. Pattern Recognition 38(11), 2055–2073 (2005)
L. Livi, A. Giuliani, A. Rizzi, Toward a multilevel representation of protein molecules: comparative approaches to the aggregation/folding propensity problem. Inf. Sci. 326, 134–145 (2016)
L. Livi, A. Giuliani, A. Sadeghian, Characterization of graphs for protein structure modeling and recognition of solubility. Curr. Bioinform. 11(1), 106–114 (2016)
L. Livi, E. Maiorino, A. Giuliani, A. Rizzi, A. Sadeghian, A generative model for protein contact networks. J. Biomol. Struct. Dyn. 34(7), 1441–1454 (2016)
L. Livi, A. Rizzi, The graph matching problem. Pattern Anal. Appl. 16(3), 253–283 (2013)
L. Livi, A. Rizzi, A. Sadeghian, Optimized dissimilarity space embedding for labeled graphs. Inf. Sci. 266, 47–64 (2014)
L. Livi, A. Rizzi, A. Sadeghian, Granular modeling and computing approaches for intelligent analysis of non-geometric data. Appl. Soft Comput. 27, 567–574 (2015)
L. Livi, A. Sadeghian, Granular computing, computational intelligence, and the analysis of non-geometric input spaces. Granul. Comput. 1(1), 13–20 (2016)
S. Lloyd, Least squares quantization in PCM. IEEE Trans. Inf. Theory 28(2), 129–137 (1982)
J. MacQueen, Some methods for classification and analysis of multivariate observations, in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1 (Oakland, USA, 1967), pp. 281–297
H.A. Maghawry, M.C. Mostafa, M.H. Abdul-Aziz, T.E. Gharib, A modified cutoff scanning matrix protein representation for enhancing protein function prediction, in 9th International Conference on Informatics and Systems (INFOS) (IEEE, 2014), pp. DEKM–40
E. Maiorino, A. Rizzi, A. Sadeghian, A. Giuliani, Spectral reconstruction of protein contact networks. Phys. A: Stat. Mech. Appl. 471, 804–817 (2017)
A. Martino, E. Maiorino, A. Giuliani, M. Giampieri, A. Rizzi, Supervised approaches for function prediction of proteins contact networks from topological structure information, in Scandinavian Conference on Image Analysis (Springer, Berlin, 2017), pp. 285–296
A. Martino, A. Rizzi, F.M. Frattale Mascioli, Efficient approaches for solving the large-scale k-medoids problem, in Proceedings of the 9th International Joint Conference on Computational Intelligence. IJCCI, vol. 1 (INSTICC, 2017), pp. 338–347
J. Mercer, Functions of positive and negative type, and their connection with the theory of integral equations, in Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, vol. 209 (1909), pp. 415–446
D.C. Mikulecky, Network thermodynamics and complexity: a transition to relational systems theory. Comput. Chem. 25(4), 369–391 (2001)
T.M. Mitchell, Machine Learning (McGraw-Hill, Boston, MA, 1997)
M. Neuhaus, H. Bunke, Bridging the Gap Between Graph Edit Distance and Kernel Machines, vol. 68 (World Scientific, 2007)
M. Pagani, A. Giuliani, J. Öberg, A. Chincarini, S. Morbelli, A. Brugnolo, D. Arnaldi, A. Picco, M. Bauckneht, A. Buschiazzo et al., Predicting the transition from normal aging to Alzheimer’s disease: a statistical mechanistic evaluation of FDG-PET data. NeuroImage 141, 282–290 (2016)
M. Pagani, A. Giuliani, J. Öberg, F. De Carli, S. Morbelli, N. Girtler, F. Bongioanni, D. Arnaldi, J. Accardo, M. Bauckneht et al., Progressive disgregation of brain networking from normal aging to Alzheimer’s disease. Independent component analysis on FDG-PET data. J. Nucl. Med. jnumed–116 (2017)
E. Parzen, On estimation of a probability density function and mode. Ann. Math. Stat. 33(3), 1065–1076 (1962)
M. Pascual, S.A. Levin, From individuals to population densities: searching for the intermediate scale of nontrivial determinism. Ecology 80(7), 2225–2236 (1999)
K. Pearson, Mathematical contributions to the theory of evolution. III. Regression, heredity, and panmixia, in Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, vol. 187 (1896), pp. 253–318
E. Pękalska, R.P. Duin, The Dissimilarity Representation for Pattern Recognition: Foundations and Applications (World Scientific, 2005)
K. Peng, P. Radivojac, S. Vucetic, A.K. Dunker, Z. Obradovic, Length-dependent prediction of protein intrinsic disorder. BMC Bioinform. 7(1), 208 (2006)
J.B. Pereira, M. Mijalkov, E. Kakaei, P. Mecocci, B. Vellas, M. Tsolaki, I. Kłoszewska, H. Soininen, C. Spenger, S. Lovestone et al., Disrupted network topology in patients with stable and progressive mild cognitive impairment and Alzheimer’s disease. Cereb. Cortex 26(8), 3476–3493 (2016)
F. Possemato, A. Rizzi, Automatic text categorization by a granular computing approach: facing unbalanced data sets, in The International Joint Conference on Neural Networks (IJCNN) (IEEE, 2013), pp. 1–8
J.S. Richardson, The anatomy and taxonomy of protein structure. Adv. Protein Chem. 34, 167–339 (1981)
D. de Ridder, J. de Ridder, M.J. Reinders, Pattern recognition in bioinformatics. Brief. Bioinform. 14(5), 633–647 (2013)
A. Rizzi, F. Possemato, L. Livi, A. Sebastiani, A. Giuliani, F.M. Frattale Mascioli, A dissimilarity-based classifier for generalized sequences by a granular computing approach, in The International Joint Conference on Neural Networks (IJCNN) (IEEE, 2013), pp. 1–8
P. Romero, Z. Obradovic, X. Li, E.C. Garner, C.J. Brown, A.K. Dunker, Sequence complexity of disordered protein. Proteins Struct. Funct. Bioinform. 42(1), 38–48 (2001)
P.J. Rousseeuw, Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987)
M. Rubinov, O. Sporns, Complex network measures of brain connectivity: uses and interpretations. Neuroimage 52(3), 1059–1069 (2010)
H. Sakoe, S. Chiba, Dynamic programming algorithm optimization for spoken word recognition. IEEE Trans. Acoust. Speech Signal Process. 26(1), 43–49 (1978)
B. Schölkopf, A.J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond (MIT Press, 2002)
J. Shawe-Taylor, N. Cristianini, Kernel Methods for Pattern Analysis (Cambridge University Press, Cambridge, 2004)
D.H. Silverman, G.W. Small, C.Y. Chang, C.S. Lu, M.A.K. de Aburto, W. Chen, J. Czernin, S.I. Rapoport, P. Pietrini, G.E. Alexander et al., Positron emission tomography in evaluation of dementia: regional brain metabolism and long-term outcome. JAMA 286(17), 2120–2127 (2001)
G.P. Singh, M. Ganapathi, D. Dash, Role of intrinsic disorder in transient interactions of hub proteins. Proteins Struct. Funct. Bioinform. 66(4), 761–765 (2007)
J. Smucny, K.P. Wylie, J.R. Tregellas, Functional magnetic resonance imaging of intrinsic brain networks for translational drug discovery. Trends Pharmacol. Sci. 35(8), 397–403 (2014)
C. Soguero-Ruiz, K. Hindberg, J.L. Rojo-Álvarez, S.O. Skrøvseth, F. Godtliebsen, K. Mortensen, A. Revhaug, R.O. Lindsetmo, K.M. Augestad, R. Jenssen, Support vector feature selection for early detection of anastomosis leakage from bag-of-words in electronic health records. IEEE J. Biomed. Health Inf. 20(5), 1404–1415 (2016)
P.G. Spetsieris, J.H. Ko, C.C. Tang, A. Nazem, W. Sako, S. Peng, Y. Ma, V. Dhawan, D. Eidelberg, Metabolic resting-state brain networks in health and disease. Proc. Natl. Acad. Sci. 112(8), 2563–2568 (2015)
J.M. Stanton, Galton, Pearson, and the peas: a brief history of linear regression for statistics instructors. J. Stat. Education 9(3), 1–16 (2001)
S. Theodoridis, K. Koutroumbas, Pattern Recognition, 4th edn. (Academic Press, 2008)
M.K. Transtrum, B.B. Machta, K.S. Brown, B.C. Daniels, C.R. Myers, J.P. Sethna, Perspective: sloppiness and emergent theories in physics, biology, and beyond. J. Chem. Phys. 143(1), 07B201_1 (2015)
V.N. Uversky, Natively unfolded proteins: a point where biology waits for physics. Protein Sci. 11(4), 739–756 (2002)
B.C. Van Wijk, C.J. Stam, A. Daffertshofer, Comparing brain networks of different size and connectivity density using graph theory. PLoS ONE 5(10), e13701 (2010)
J.P. Vert, K. Tsuda, B. Schölkopf, Kernel Methods in Computational Biology, A primer on kernel methods (2004), pp. 35–70
Y.C. Wang, Y. Wang, Z.X. Yang, N.Y. Deng, Support vector machine prediction of enzyme function with conjoint triad feature and hierarchical context. BMC Syst. Biol. 5(1), S6 (2011)
L. Wasserman, Topological data analysis. Ann. Rev. Stat. Appl. 5(1) (2018)
W. Weaver, Science and complexity. Am. Sci. 36(4), 536 (1948)
A. Wright, A.B. McCoy, S. Henkin, A. Kale, D.F. Sittig, Use of a support vector machine for categorizing free-text notes: assessment of accuracy across two institutions. J. Am. Med. Inf. Assoc. 20(5), 887–890 (2013)
Y. Yang, L. Han, Y. Yuan, J. Li, N. Hei, H. Liang, Gene co-expression network analysis reveals common system-level properties of prognostic genes across cancer types. Nat. Commun. 5, 3231 (2014)
F. Yates, K. Mather, Ronald Aylmer Fisher, 1890–1962. Biogr. Mem. Fellows R. Soc. 9, 91–129 (1963)
L.A. Zadeh, Soft computing and fuzzy logic. IEEE Softw. 11(6), 48–56 (1994)
T. Zhang, R. Ramakrishnan, M. Livny, BIRCH: an efficient data clustering method for very large databases. ACM SIGMOD Rec. 25, 103–114 (1996)
© 2018 Springer International Publishing AG, part of Springer Nature
About this chapter
Cite this chapter
Martino, A., Giuliani, A., Rizzi, A. (2018). Granular Computing Techniques for Bioinformatics Pattern Recognition Problems in Non-metric Spaces. In: Pedrycz, W., Chen, SM. (eds) Computational Intelligence for Pattern Recognition. Studies in Computational Intelligence, vol 777. Springer, Cham. https://doi.org/10.1007/978-3-319-89629-8_3
DOI: https://doi.org/10.1007/978-3-319-89629-8_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-89628-1
Online ISBN: 978-3-319-89629-8
eBook Packages: Engineering, Engineering (R0)