Novel Biclustering Methods for Re-ordering Data Matrices

  • Peter A. DiMaggio Jr.
  • Ashwin Subramani
  • Christodoulos A. Floudas
Chapter
Part of the Fields Institute Communications book series (FIC, volume 63)

Abstract

Clustering of large-scale data sets is an important technique used for analysis in a variety of fields. However, many clustering methods rely on heuristics to identify the best arrangement of data points. In this chapter, we present rigorous clustering methods based on the iterative optimal re-ordering of data matrices. Distinct mixed-integer linear programming (MILP) models have been implemented to cluster dense data matrices (such as gene expression data) and sparse data matrices (such as those arising in drug discovery and toxicology). We demonstrate the capability of the optimal re-ordering methods on a wide array of data sets from systems biology, molecular discovery and toxicology.
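
To make the idea of re-ordering a data matrix concrete, the following is a minimal sketch and not the MILP models of the chapter. It assumes a cost equal to the sum of squared differences between neighbouring rows and finds the best row order by exhaustive search, which is only feasible for very small matrices. The function names `adjacent_cost` and `best_row_order` and the toy matrix `data` are hypothetical and chosen for illustration.

```python
from itertools import permutations


def adjacent_cost(matrix, order):
    """Sum of squared differences between each pair of neighbouring rows.

    This cost is an assumption made for illustration; it is not taken
    from the chapter.
    """
    return sum(
        sum((a - b) ** 2 for a, b in zip(matrix[i], matrix[j]))
        for i, j in zip(order, order[1:])
    )


def best_row_order(matrix):
    """Try every row permutation and keep the cheapest (tiny inputs only)."""
    rows = range(len(matrix))
    return min(permutations(rows), key=lambda order: adjacent_cost(matrix, order))


# Toy dense matrix: rows 0 and 2 are similar, rows 1 and 3 are similar.
data = [
    [0.0, 0.1, 5.0],
    [4.9, 5.2, 0.1],
    [0.2, 0.0, 5.1],
    [5.0, 5.1, 0.0],
]

order = best_row_order(data)
reordered = [data[i] for i in order]
print(order)
```

In the re-ordered matrix, similar rows sit next to each other, so cluster boundaries can be read off where neighbouring rows differ sharply. Exhaustive search grows factorially with the number of rows, which is why exact formulations such as MILP or traveling-salesman models are of interest for realistic data sets.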

Keywords

Travel Salesman Problem; Carbon Starvation; Cluster Boundary; Original Data Matrix
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Notes

Acknowledgements

CAF gratefully acknowledges financial support from the National Science Foundation, the National Institutes of Health (R01 GM52032; R24 GM069736) and the U.S. Environmental Protection Agency (EPA; GAD R 832721-010).

Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  • Peter A. DiMaggio Jr. (1)
  • Ashwin Subramani (2)
  • Christodoulos A. Floudas (2, corresponding author)
  1. Department of Molecular Biology, Princeton University, Princeton, USA
  2. Department of Chemical and Biological Engineering, Princeton University, Princeton, USA
