A Domain Driven Mining Algorithm on Gene Sequence Clustering

  • Yun Xiong
  • Ming Chen
  • Yangyong Zhu

Recent biological experiments indicate that gene sequences that are similar in their arrangement of nucleotides do not necessarily share functional similarity. As a result, state-of-the-art clustering algorithms that annotate genes of similar function based solely on sequence composition may fail. Gene clustering techniques that incorporate prior knowledge of the biological domain have therefore become an essential research subject in data mining, specifically in mining biological sequences. It is now commonly accepted that co-expressed genes generally belong to the same functional category. This paper proposes a new similarity metric for gene sequence clustering based on features of such co-expressed genes, namely ‘Tendency Similarity on N-Same-Dimensions’. Using this metric, a domain driven algorithm, ‘DD-Cluster’, is designed to group gene sequences into ‘Similar Tendency Clusters on N-Same-Dimensions’, i.e., co-expressed gene clusters. Compared with earlier clustering methods that consider the composition of gene sequences alone, the resulting ‘Similar Tendency Clusters on N-Same-Dimensions’ proved more reliable for assisting biologists in gene function annotation. The algorithm has been tested on real data sets, where it showed high performance and produced effective clustering results.
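The abstract names the metric but does not define it, so the following is only a minimal sketch of one plausible reading, not the authors' DD-Cluster. It assumes that two expression profiles have a "similar tendency" on a set of dimensions (experimental conditions) when their values rise in the same order across those dimensions, and that they count as similar when at least N such dimensions exist. The function names, the tie-breaking rule, and the use of a longest common subsequence are all assumptions made for illustration.

```python
from typing import List, Sequence


def dimension_order(profile: Sequence[float]) -> List[int]:
    """Dimension indices sorted by ascending expression value.

    Ties are broken by index (an assumption of this sketch).
    """
    return sorted(range(len(profile)), key=lambda i: (profile[i], i))


def longest_common_subsequence(x: List[int], y: List[int]) -> List[int]:
    """Classic dynamic program; returns one longest common subsequence."""
    m, n = len(x), len(y)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if x[i] == y[j]:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])
    # Walk back through the table to recover the subsequence itself.
    result: List[int] = []
    i, j = m, n
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            result.append(x[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return result[::-1]


def shared_tendency_dims(a: Sequence[float], b: Sequence[float]) -> List[int]:
    """Largest set of dimensions, in rising order, on which a and b agree."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same dimensions")
    return longest_common_subsequence(dimension_order(a), dimension_order(b))


def similar_on_n_dims(a: Sequence[float], b: Sequence[float], n: int) -> bool:
    """True if a and b share the same tendency on at least n dimensions."""
    return len(shared_tendency_dims(a, b)) >= n


# Hypothetical expression profiles over four conditions.
gene_a = [1.0, 5.0, 3.0, 9.0]
gene_b = [10.0, 50.0, 30.0, 90.0]   # scaled copy of gene_a: same tendency
gene_c = [9.0, 3.0, 5.0, 1.0]       # opposite tendency
```

Under these assumptions, `similar_on_n_dims(gene_a, gene_b, 4)` holds even though the absolute values differ by a factor of ten, while `gene_a` and `gene_c` agree on no pair of dimensions. A distance on raw values would not capture this, which is the motivation for clustering by tendency.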


Gene Expression Data, Pattern Mining, Subspace Cluster, Dimension Support, Sequential Pattern Mining
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  1. Department of Computing and Information Technology, Fudan University, Shanghai, China
