Abstract
Data analysis methods and techniques are revisited in the context of biological data sets, with particular emphasis on clustering and data mining. Clustering remains a subject of active research in several fields, including statistics, pattern recognition, and machine learning. Data mining adds to clustering the complications of very large data sets with many attributes of different types, a situation that is typical in biology. Some case studies are also described.
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Di Gesù, V. (2007). Data Analysis and Bioinformatics. In: Ghosh, A., De, R.K., Pal, S.K. (eds) Pattern Recognition and Machine Intelligence. PReMI 2007. Lecture Notes in Computer Science, vol 4815. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-77046-6_47
DOI: https://doi.org/10.1007/978-3-540-77046-6_47
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-77045-9
Online ISBN: 978-3-540-77046-6
eBook Packages: Computer Science (R0)