
\(S^2FS\): Single Score Feature Selection Applied to the Problem of Distinguishing Long Non-coding RNAs from Protein Coding Transcripts

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 11228)

Abstract

The task of distinguishing long non-coding RNAs (lncRNAs) from protein coding transcripts (PCTs) has previously been addressed with machine learning (ML) algorithms that use hundreds of features. However, a large number of features can degrade the predictive performance of these algorithms: because of the phenomenon known as the curse of dimensionality, it can lead to problems such as overfitting. Dimensionality reduction techniques, among them feature selection, have been proposed to deal with these problems. This work proposes and experimentally evaluates a simple and fast feature selection technique called Single Score Feature Selection (\(S^2FS\)).
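As a generic illustration of filter-style feature selection, the sketch below ranks each feature by one univariate score and keeps the highest-ranked ones. It is not the \(S^2FS\) procedure, which the abstract does not specify. The score used here (gap between class means divided by the summed class standard deviations) and the names `single_feature_scores` and `select_top` are assumptions made for this example only.

```python
from statistics import mean, pstdev


def single_feature_scores(rows, labels):
    """Return one score per feature column.

    Assumed score (illustrative, not the paper's): absolute gap between
    the class means divided by the sum of the class standard deviations.
    """
    pos = [r for r, y in zip(rows, labels) if y == 1]
    neg = [r for r, y in zip(rows, labels) if y == 0]
    scores = []
    for j in range(len(rows[0])):
        a = [r[j] for r in pos]
        b = [r[j] for r in neg]
        gap = abs(mean(a) - mean(b))
        spread = pstdev(a) + pstdev(b)
        if spread:
            scores.append(gap / spread)
        else:
            # Constant within both classes: perfectly separating or useless.
            scores.append(float("inf") if gap else 0.0)
    return scores


def select_top(rows, labels, n):
    """Return the (sorted) column indices of the n highest-scoring features."""
    scores = single_feature_scores(rows, labels)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:n])


# Toy data: only column 0 differs between the two classes.
rows = [[0.9, 0.5, 0.1], [0.8, 0.4, 0.2], [0.1, 0.5, 0.1], [0.2, 0.4, 0.2]]
labels = [1, 1, 0, 0]
print(select_top(rows, labels, 1))
```

Because every feature is scored independently, the cost grows linearly with the number of features, which is the usual reason filter methods are cheaper than wrapper or search-based selection.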

To this end, frequencies of 2-mers, 3-mers and 4-mers were first extracted from public databases of Homo sapiens PCTs and lncRNAs, yielding a dataset with two groups of RNA sequences (one of PCTs, the other of lncRNAs) and a large number of features. \(S^2FS\) was then applied to the dataset to reduce the number of features. Experimental results showed that relevant features were selected and predictive accuracy was maintained, at a lower processing cost than some existing feature selection techniques.
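The k-mer frequency features mentioned above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `kmer_frequencies`, the per-k normalisation, the mapping of U to T, and the skipping of windows with ambiguous bases are all assumptions. With a four-letter alphabet, 2-mers, 3-mers and 4-mers give 16 + 64 + 256 = 336 features per sequence.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def kmer_frequencies(sequence, ks=(2, 3, 4)):
    """Map every possible k-mer (for each k in ks) to its relative frequency.

    Frequencies are normalised separately for each k, using overlapping
    windows. Windows containing symbols outside ACGT are ignored
    (an assumption made for this sketch).
    """
    seq = sequence.upper().replace("U", "T")
    valid = set(ALPHABET)
    features = {}
    for k in ks:
        windows = (seq[i:i + k] for i in range(len(seq) - k + 1))
        counts = Counter(w for w in windows if set(w) <= valid)
        total = sum(counts.values())
        for kmer in map("".join, product(ALPHABET, repeat=k)):
            features[kmer] = counts[kmer] / total if total else 0.0
    return features


features = kmer_frequencies("ATGGCGTACGTTAGC")
print(len(features))  # 336
```

Emitting a value for every possible k-mer, including those absent from the sequence, keeps the feature vector the same length for all transcripts, so the vectors can be stacked into one table before feature selection.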




Correspondence to Bruno C. Kümmel.


© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Kümmel, B.C., de Carvalho, A.C.P.L.F., Brigido, M.M., Ralha, C.G., Walter, M.E.M.T. (2018). \(S^2FS\): Single Score Feature Selection Applied to the Problem of Distinguishing Long Non-coding RNAs from Protein Coding Transcripts. In: Alves, R. (ed.) Advances in Bioinformatics and Computational Biology. BSB 2018. Lecture Notes in Computer Science, vol 11228. Springer, Cham. https://doi.org/10.1007/978-3-030-01722-4_10

  • DOI: https://doi.org/10.1007/978-3-030-01722-4_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-01721-7

  • Online ISBN: 978-3-030-01722-4

  • eBook Packages: Computer Science, Computer Science (R0)
