Unsupervised Dimension Reduction Methods for Protein Sequence Classification

Conference paper in: Data Analysis, Machine Learning and Knowledge Discovery

Abstract

Feature extraction methods are widely used to reduce the dimensionality of data before classification, which lowers the risk of fitting noise. Principal Component Analysis (PCA) is a popular linear method for transforming high-dimensional data into a low-dimensional representation. Non-linear and non-parametric dimension reduction methods, such as Isomap, t-distributed Stochastic Neighbor Embedding (t-SNE) and Interpol, are also used. In this study, we compare PCA, Isomap, t-SNE and Interpol as preprocessing steps for the classification of protein sequences. Using random forests as the classifier, we evaluated classification performance on two artificial and eighteen real-world protein data sets, covering HIV drug resistance, HIV-1 co-receptor usage and protein functional class prediction. We observed significant differences between the feature extraction methods. The prediction performance of Interpol converges towards a stable value that is significantly higher than that of PCA, Isomap and t-SNE. This is probably due to the nature of protein sequences, in which amino acids often depend on and affect one another, for instance to achieve conformational stability. However, data reduced with Interpol are rather unintuitive to visualize compared with the other methods. We conclude that Interpol is superior to PCA, Isomap and t-SNE for feature extraction prior to classification, but is of limited use for visualization.
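
To make the Interpol-style preprocessing more concrete, the following is a minimal sketch in Python rather than the Interpol R package itself. It assumes that each sequence is first encoded numerically with an amino acid descriptor (here the Kyte-Doolittle hydropathy index) and then resampled to a fixed length by piecewise-linear interpolation. The function names `encode`, `interpolate_linear` and `featurize` are hypothetical, and the descriptors and interpolation methods used in the study may differ.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode(seq):
    """Map a one-letter amino acid sequence to its hydropathy values."""
    return [KD[aa] for aa in seq.upper()]


def interpolate_linear(values, target_len):
    """Resample `values` to `target_len` points by linear interpolation.

    The first and last input values are always preserved.
    """
    n = len(values)
    if n < 2 or target_len < 2:
        raise ValueError("need at least two input and two output points")
    out = []
    for j in range(target_len):
        # Position of output point j on the input index axis [0, n - 1].
        pos = j * (n - 1) / (target_len - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append(values[lo] * (1 - frac) + values[hi] * frac)
    return out


def featurize(seqs, target_len):
    """Turn variable-length sequences into fixed-length feature vectors."""
    return [interpolate_linear(encode(s), target_len) for s in seqs]


# Hypothetical toy sequences of different lengths, reduced to 5 features each.
features = featurize(["AILVG", "MKTAYIAKQR", "GGSW"], target_len=5)
```

Each row of `features` has the same length regardless of the original sequence length, so the rows can be passed directly to a classifier such as a random forest. Because interpolation keeps neighbouring positions together, the local context along the sequence is retained, which fits the abstract's remark that amino acids depend on one another.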

Author information

Corresponding author

Correspondence to Dominik Heider.

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Heider, D., Bartenhagen, C., Dybowski, J.N., Hauke, S., Pyka, M., Hoffmann, D. (2014). Unsupervised Dimension Reduction Methods for Protein Sequence Classification. In: Spiliopoulou, M., Schmidt-Thieme, L., Janning, R. (eds) Data Analysis, Machine Learning and Knowledge Discovery. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Cham. https://doi.org/10.1007/978-3-319-01595-8_32
