A Domain Driven Mining Algorithm on Gene Sequence Clustering

  • Chapter
Data Mining for Business Applications

Recent biological experiments indicate that gene sequences with similar nucleotide arrangements do not necessarily share similar functions. As a result, state-of-the-art clustering algorithms that annotate gene function solely on the basis of sequence composition may fail. Gene clustering techniques that incorporate prior knowledge of the biological domain have therefore become an essential research subject in data mining for biological sequences. It is now commonly accepted that co-expressed genes generally belong to the same functional category. This paper proposes a new similarity metric for gene sequence clustering based on features of such co-expressed genes, named ‘Tendency Similarity on N-Same-Dimensions’. Based on this metric, a domain driven algorithm, ‘DD-Cluster’, is designed to group gene sequences into ‘Similar Tendency Clusters on N-Same-Dimensions’, i.e., co-expressed gene clusters. Compared with earlier clustering methods that consider sequence composition alone, the resulting clusters proved more reliable for assisting biologists in gene function annotation. The algorithm has been tested on real data sets, where it showed high performance and produced effective clustering results.
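The abstract names the metric but does not define it, so the following is only an illustrative sketch under a stated assumption: two genes are taken to show a similar tendency on a set of dimensions (experimental conditions) when both rank those dimensions in the same ascending order of expression value. The function names `rank_order`, `shared_tendency_size`, and `tendency_similar` are hypothetical, and this is not the chapter's DD-Cluster algorithm.

```python
from bisect import bisect_left


def rank_order(values):
    """Return dimension indices sorted by ascending expression value.

    Ties are broken by index order (an arbitrary choice for this sketch).
    """
    return sorted(range(len(values)), key=lambda i: values[i])


def shared_tendency_size(gene_a, gene_b):
    """Size of the largest set of dimensions that both genes order identically.

    Assumed reading of "N same dimensions": the longest common subsequence
    of the two genes' dimension orderings. Because both orderings are
    permutations of the same indices, this reduces to a longest increasing
    subsequence problem.
    """
    if len(gene_a) != len(gene_b):
        raise ValueError("expression profiles must have the same length")

    order_a = rank_order(gene_a)
    # Rank that each dimension holds in gene_b's ordering.
    rank_in_b = {dim: rank for rank, dim in enumerate(rank_order(gene_b))}
    sequence = [rank_in_b[dim] for dim in order_a]

    # Longest increasing subsequence via patience sorting, O(d log d).
    tails = []
    for rank in sequence:
        slot = bisect_left(tails, rank)
        if slot == len(tails):
            tails.append(rank)
        else:
            tails[slot] = rank
    return len(tails)


def tendency_similar(gene_a, gene_b, n):
    """True if the genes share an identical ordering on at least n dimensions."""
    return shared_tendency_size(gene_a, gene_b) >= n


if __name__ == "__main__":
    # Made-up expression profiles over four conditions, for illustration only.
    a = [1.0, 3.0, 2.0, 5.0]
    b = [10.0, 40.0, 20.0, 30.0]
    print(shared_tendency_size(a, b))
    print(tendency_similar(a, b, 3))
```

Under this assumption, similarity depends only on the relative order of expression values, not their magnitudes, so two genes measured on different scales can still be grouped together.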

Author information

Corresponding author

Correspondence to Yun Xiong.

Copyright information

© 2009 Springer Science+Business Media, LLC

About this chapter

Cite this chapter

Xiong, Y., Chen, M., Zhu, Y. (2009). A Domain Driven Mining Algorithm on Gene Sequence Clustering. In: Cao, L., Yu, P.S., Zhang, C., Zhang, H. (eds) Data Mining for Business Applications. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-79420-4_8

  • DOI: https://doi.org/10.1007/978-0-387-79420-4_8

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-0-387-79419-8

  • Online ISBN: 978-0-387-79420-4

  • eBook Packages: Computer Science, Computer Science (R0)
