Clustering Nodes in Large-Scale Biological Networks Using External Memory Algorithms

  • Ahmed Shamsul Arefin
  • Mario Inostroza-Ponta
  • Luke Mathieson
  • Regina Berretta
  • Pablo Moscato
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7017)


Novel analytical techniques have dramatically enhanced our understanding of many application domains, including biological networks inferred from gene expression studies. However, there are clear computational challenges associated with the large datasets generated by these studies. The algorithmic solution of some NP-hard combinatorial optimization problems that naturally arise in the analysis of large networks is difficult without specialized computer facilities (i.e., supercomputers). In this work, we address the data clustering problem for large-scale biological networks with a polynomial-time algorithm that uses reasonable computing resources and is limited by the available memory. We have adapted and improved the MSTkNN graph partitioning algorithm and redesigned it to take advantage of external memory (EM) algorithms. We evaluate the scalability and performance of our proposed algorithm on a well-known breast cancer microarray study and its associated dataset.
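
The abstract names MSTkNN without describing it. As a rough in-memory illustration of the general idea suggested by the name (keep only those minimum spanning tree edges that also appear in a k-nearest-neighbour graph, then read clusters off the connected components), a minimal sketch might look like the following. The function names, the Euclidean distance, the fixed k, and the rule that an edge counts as a kNN edge when either endpoint lists the other among its k nearest neighbours are all assumptions made for illustration. This is not the authors' implementation, and it does not use external memory.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mst_edges(points):
    """Prim's algorithm on the complete graph; returns edges as (i, j), i < j."""
    n = len(points)
    if n == 0:
        return set()
    # best[v] = (distance to the tree, tree vertex achieving it)
    best = {j: (euclidean(points[0], points[j]), 0) for j in range(1, n)}
    edges = set()
    while best:
        j = min(best, key=lambda v: best[v][0])
        _, i = best.pop(j)
        edges.add((min(i, j), max(i, j)))
        for v in best:
            d = euclidean(points[j], points[v])
            if d < best[v][0]:
                best[v] = (d, j)
    return edges


def knn_edges(points, k):
    """Undirected kNN graph: (i, j) is kept if either is a k-nearest neighbour of the other."""
    n = len(points)
    edges = set()
    for i in range(n):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: euclidean(points[i], points[j]),
        )
        for j in others[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges


def components(n, edges):
    """Connected components via union-find; returns sorted lists of vertex indices."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in edges:
        parent[find(i)] = find(j)
    groups = {}
    for v in range(n):
        groups.setdefault(find(v), []).append(v)
    return sorted(groups.values())


def mst_knn_clusters(points, k):
    """Clusters = components of the graph formed by MST edges that are also kNN edges."""
    kept = mst_edges(points) & knn_edges(points, k)
    return components(len(points), kept)


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(mst_knn_clusters(pts, k=2))
```

This sketch holds all pairwise distances implicitly and recomputes them on demand, so it needs quadratic time and is only meant for small inputs. An external memory redesign of the kind the abstract describes would instead stream edges from disk.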


Keywords: Data clustering, external memory algorithms, graph algorithms, gene expression data analysis





Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Ahmed Shamsul Arefin (1)
  • Mario Inostroza-Ponta (2)
  • Luke Mathieson (3)
  • Regina Berretta (1, 4)
  • Pablo Moscato (1, 4, 5)

  1. Centre for Bioinformatics, Biomarker Discovery and Information-Based Medicine, The University of Newcastle, Callaghan, Australia
  2. Departamento de Ingeniería Informática, Universidad de Santiago de Chile, Chile
  3. Department of Computing, Faculty of Science, Macquarie University, Sydney, Australia
  4. Hunter Medical Research Institute, Information Based Medicine Program, Australia
  5. ARC Centre of Excellence in Bioinformatics, Callaghan, Australia
