Summary
With the growing amount of genetic data available to scientists, there is a pressing need to characterise the functions of genes. Such knowledge will enable us to understand organisms better at the molecular level and to elucidate the mechanisms by which diseases disrupt biological processes. With the advent of whole-genome expression technologies such as DNA microarrays and proteomics, scientists can at last determine how genes and proteins change their rates of expression under specific experimental conditions. The data sets generated by such studies are large and require sophisticated tools for proper analysis. In this chapter we review several techniques used to cluster data sets of this type. Clustering can often reveal broad patterns indicating that certain genes or proteins perform common functions, which offers a useful way to attribute functions to newly discovered genes. A wide variety of clustering algorithms exists; we consider several of the most promising and examine how they perform when tested on different types of data from gene expression and protein expression experiments.
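To make the idea of grouping genes by shared expression patterns concrete, the following is a minimal sketch of one simple approach: average-linkage agglomerative clustering with a correlation-based distance. It is an illustration only and is not taken from the chapter; the gene names, the expression values, and the function names `pearson` and `average_linkage` are all hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles.

    Assumes neither profile is constant (a constant profile would give
    a zero denominator).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_linkage(profiles, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Distance between clusters is the mean of (1 - correlation) over all
    pairs of members, so genes whose profiles rise and fall together
    are treated as close.
    """
    clusters = [[name] for name in profiles]

    def dist(c1, c2):
        pairs = [(a, b) for a in c1 for b in c2]
        total = sum(1 - pearson(profiles[a], profiles[b]) for a, b in pairs)
        return total / len(pairs)

    while len(clusters) > n_clusters:
        # combinations yields i < j, so deleting index j leaves i valid.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical expression levels measured at four time points.
profiles = {
    "geneA": [0.1, 0.9, 2.0, 3.1],  # rising
    "geneB": [0.0, 1.1, 1.9, 3.0],  # rising
    "geneC": [3.0, 2.1, 1.0, 0.2],  # falling
    "geneD": [2.9, 1.8, 1.1, 0.0],  # falling
}

clusters = average_linkage(profiles, 2)
print(clusters)
```

In this toy data the two rising profiles end up in one cluster and the two falling profiles in the other. Real expression data sets contain thousands of genes, and this exhaustive pairwise search would be too slow for them without an optimised library.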
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Patel, K., Cartwright, H.M. (2003). Clustering of Large Data Sets in the Life Sciences. In: Cartwright, H.M., Sztandera, L.M. (eds) Soft Computing Approaches in Chemistry. Studies in Fuzziness and Soft Computing, vol 120. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-36213-5_2
DOI: https://doi.org/10.1007/978-3-540-36213-5_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-53507-9
Online ISBN: 978-3-540-36213-5
eBook Packages: Springer Book Archive