Recent biological experiments indicate that gene sequences that are similar in their arrangement of nucleotides do not necessarily share similar function. As a result, state-of-the-art clustering algorithms that annotate genes with similar function based solely on sequence composition may fail. Gene clustering techniques that incorporate prior knowledge of the biological domain are therefore regarded as an essential research subject in data mining, specifically in the mining of biological sequences. It is now commonly accepted that co-expressed genes generally belong to the same functional category. In this paper, a new similarity metric for gene sequence clustering, ‘Tendency Similarity on N-Same-Dimensions’, is proposed based on features of such co-expressed genes. Using this metric, a domain driven algorithm, ‘DD-Cluster’, is designed to group gene sequences into ‘Similar Tendency Clusters on N-Same-Dimensions’, i.e., co-expressed gene clusters. Compared with earlier clustering methods that consider the composition of gene sequences alone, the resulting ‘Similar Tendency Clusters on N-Same-Dimensions’ proved more reliable for assisting biologists in gene function annotation. The algorithm has been tested on real data sets, where it showed high performance and produced effective clustering results.
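The abstract names the metric but does not define it. As a rough illustration only, the Python sketch below shows one possible reading of a tendency-based similarity: two expression profiles are treated as similar when they rise or fall in the same direction on at least N condition-to-condition steps. The function names, the `tol` parameter, and this interpretation of "N-Same-Dimensions" are all assumptions made for the example; they are not taken from the chapter and should not be read as the DD-Cluster definition.

```python
from typing import List, Sequence


def tendencies(profile: Sequence[float], tol: float = 0.0) -> List[int]:
    """Direction of change between consecutive conditions.

    Returns +1 for a rise, -1 for a fall, 0 for a change within `tol`.
    (Hypothetical helper; not from the chapter.)
    """
    signs = []
    for prev, curr in zip(profile, profile[1:]):
        diff = curr - prev
        if diff > tol:
            signs.append(1)
        elif diff < -tol:
            signs.append(-1)
        else:
            signs.append(0)
    return signs


def shared_tendency_count(a: Sequence[float], b: Sequence[float],
                          tol: float = 0.0) -> int:
    """Count the steps on which both profiles move in the same direction."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same conditions")
    return sum(1 for x, y in zip(tendencies(a, tol), tendencies(b, tol)) if x == y)


def similar_on_n(a: Sequence[float], b: Sequence[float], n: int,
                 tol: float = 0.0) -> bool:
    """Assumed criterion: similar if at least `n` steps share a direction."""
    return shared_tendency_count(a, b, tol) >= n


# Made-up expression values across four conditions, for illustration only.
gene_a = [1.0, 2.0, 1.5, 3.0]
gene_b = [10.0, 40.0, 20.0, 25.0]   # different scale, same up/down pattern as gene_a
gene_c = [5.0, 4.0, 6.0, 2.0]       # opposite pattern to gene_a

print(similar_on_n(gene_a, gene_b, n=3))
print(similar_on_n(gene_a, gene_c, n=1))
```

Under this reading, `gene_a` and `gene_b` count as similar despite very different magnitudes, which is the kind of co-expression pattern that a purely distance-based or composition-based measure can miss.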
Copyright information
© 2009 Springer Science+Business Media, LLC
About this chapter
Cite this chapter
Xiong, Y., Chen, M., Zhu, Y. (2009). A Domain Driven Mining Algorithm on Gene Sequence Clustering. In: Cao, L., Yu, P.S., Zhang, C., Zhang, H. (eds) Data Mining for Business Applications. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-79420-4_8
DOI: https://doi.org/10.1007/978-0-387-79420-4_8
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-79419-8
Online ISBN: 978-0-387-79420-4
eBook Packages: Computer Science (R0)