A Domain Driven Mining Algorithm on Gene Sequence Clustering

Xiong, Yun; Chen, Ming; Zhu, Yangyong

doi:10.1007/978-0-387-79420-4_8

Recent biological experiments indicate that gene sequences judged similar by the arrangement of their nucleotides do not necessarily share functional similarity. As a result, state-of-the-art clustering algorithms that annotate genes with similar function based solely on sequence composition may fail. Gene clustering techniques that incorporate prior knowledge of the biological domain have therefore become an essential research subject in data mining, particularly for biological sequences. It is now commonly accepted that co-expressed genes generally belong to the same functional category. In this paper, a new similarity metric for gene sequence clustering based on features of such co-expressed genes is proposed, namely ‘Tendency Similarity on N-Same-Dimensions’. Based on this metric, a domain driven algorithm, ‘DD-Cluster’, is designed to group gene sequences into ‘Similar Tendency Clusters on N-Same-Dimensions’, i.e., co-expressed gene clusters. Compared with earlier clustering methods that consider the composition of gene sequences alone, the resulting ‘Similar Tendency Clusters on N-Same-Dimensions’ proved more reliable for assisting biologists in gene function annotation. The algorithm has been tested on real data sets, where it performed efficiently and produced effective clustering results.
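The abstract does not define ‘Tendency Similarity on N-Same-Dimensions’, so the following is only a minimal sketch of one possible reading, not the chapter's actual metric or the DD-Cluster algorithm. It assumes that each gene is represented by an expression profile across the same ordered conditions, that the "tendency" on a dimension is whether expression rises, falls, or stays flat between consecutive conditions, and that two genes count as similar when their tendencies agree on at least N dimensions. The function names and the sample profiles are hypothetical.

```python
from typing import List, Sequence


def tendency(profile: Sequence[float]) -> List[int]:
    """Direction of change between consecutive conditions: +1 up, -1 down, 0 flat.

    Assumption: 'tendency' is read here as the rise/fall pattern of an
    expression profile; the chapter may define it differently.
    """
    return [(b > a) - (b < a) for a, b in zip(profile, profile[1:])]


def shared_tendency_count(p: Sequence[float], q: Sequence[float]) -> int:
    """Number of dimensions on which two profiles move in the same direction."""
    if len(p) != len(q):
        raise ValueError("profiles must cover the same conditions")
    return sum(1 for s, t in zip(tendency(p), tendency(q)) if s == t)


def similar_tendency(p: Sequence[float], q: Sequence[float], n: int) -> bool:
    """True if the profiles share the same tendency on at least n dimensions."""
    return shared_tendency_count(p, q) >= n


# Hypothetical expression profiles over five conditions.
gene_a = [1.0, 2.0, 4.0, 3.0, 5.0]
gene_b = [10.0, 30.0, 50.0, 20.0, 25.0]  # different scale, same rise/fall pattern
gene_c = [5.0, 4.0, 3.0, 4.0, 2.0]       # opposite pattern to gene_a

print(shared_tendency_count(gene_a, gene_b))
print(similar_tendency(gene_a, gene_c, 1))
```

Under this reading, gene_a and gene_b are grouped together despite very different absolute values, which illustrates why a tendency-based measure can capture co-expression that a distance on raw values would miss.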

Author information

Authors and Affiliations

Department of Computing and Information Technology, Fudan University, Shanghai, 200433, China
Yun Xiong, Ming Chen & Yangyong Zhu

Corresponding author

Correspondence to Yun Xiong.

Editor information

Editors and Affiliations

School of Software Faculty of Engineering and Information Technology, University of Technology, PO Box 123, Sydney, Broadway, NSW 2007, Australia
Longbing Cao & Huaifeng Zhang
Department of Computer Science, University of Illinois at Chicago, 851 S. Morgan St., Chicago, IL, 60607
Philip S. Yu
Centre for Quantum Computation and Intelligent Systems Faculty of Engineering and Information Technology, University of Technology, PO Box 123, Sydney, Broadway, NSW 2007, Australia
Chengqi Zhang

About this chapter

Cite this chapter

Xiong, Y., Chen, M., Zhu, Y. (2009). A Domain Driven Mining Algorithm on Gene Sequence Clustering. In: Cao, L., Yu, P.S., Zhang, C., Zhang, H. (eds) Data Mining for Business Applications. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-79420-4_8

DOI: https://doi.org/10.1007/978-0-387-79420-4_8
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-79419-8
Online ISBN: 978-0-387-79420-4
eBook Packages: Computer Science, Computer Science (R0)
