Abstract
Decomposing a DNA sequence into k-mers and counting their frequencies defines a mapping of the sequence into a numerical space, where it is represented by a feature vector of fixed length. This simple process allows sequences to be compared in an alignment-free way, using common similarity and distance functions on the numerical codomain of the mapping. The most commonly used decomposition takes all substrings of a fixed length k, so the dimension of the codomain grows exponentially with k (4^k for the four-letter DNA alphabet). This can affect the time complexity of the similarity computation and, more generally, of the machine learning algorithm used for sequence analysis. Moreover, noisy features can reduce classification accuracy. In this paper we propose a feature selection method that selects the most informative k-mers associated with a set of DNA sequences. The selection is based on the Motif Independent Measure (MIM), an unbiased quantitative measure of DNA sequence specificity that we recently introduced in the literature. Results computed on public datasets show the effectiveness of the proposed feature selection method.
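As a rough illustration of the k-mer mapping described in the abstract, the following Python sketch builds a fixed-length frequency vector over all 4^k k-mers and compares two sequences with a Euclidean distance. It is not the authors' implementation: the function names (`kmer_index`, `kmer_frequencies`, `euclidean`), the normalisation to relative frequencies, and the choice to skip windows containing symbols outside A, C, G, T are all assumptions made for this example. It does not cover the MIM-based selection step, which the abstract does not specify.

```python
from itertools import product
from math import sqrt

ALPHABET = "ACGT"  # assumed alphabet; ambiguous symbols such as N are skipped


def kmer_index(k):
    """Assign each of the 4**k possible k-mers a fixed vector position."""
    return {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}


def kmer_frequencies(seq, k):
    """Return the relative frequency of every k-mer in seq as a list of length 4**k."""
    index = kmer_index(k)
    counts = [0] * len(index)
    seq = seq.upper()
    # Slide a window of length k over the sequence (overlapping k-mers).
    for i in range(len(seq) - k + 1):
        pos = index.get(seq[i:i + k])
        if pos is not None:  # ignore windows with symbols outside ACGT
            counts[pos] += 1
    total = sum(counts)
    # Normalising by the number of counted windows is an assumption of this sketch.
    return [c / total for c in counts] if total else [0.0] * len(counts)


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


if __name__ == "__main__":
    a = kmer_frequencies("ACGTACGTAC", 2)
    b = kmer_frequencies("AAAACCCCGG", 2)
    print(len(a), euclidean(a, b))
```

The vector length depends only on k, not on the sequence length, which is what makes sequences of different lengths directly comparable. It also shows why the dimension becomes a concern: k = 8 already yields 65,536 features.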
The original version of this chapter was revised: the given names and family names of the authors have been corrected. The erratum to this chapter is available at DOI: 10.1007/978-3-319-24462-4_26
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Lo Bosco, G., Pinello, L. (2015). A New Feature Selection Methodology for K-mers Representation of DNA Sequences. In: Di Serio, C., Liò, P., Nonis, A., Tagliaferri, R. (eds) Computational Intelligence Methods for Bioinformatics and Biostatistics. CIBB 2014. Lecture Notes in Computer Science, vol 8623. Springer, Cham. https://doi.org/10.1007/978-3-319-24462-4_9
DOI: https://doi.org/10.1007/978-3-319-24462-4_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-24461-7
Online ISBN: 978-3-319-24462-4
eBook Packages: Computer Science, Computer Science (R0)