A Visual Analytics Approach for Protein Disorder Prediction

Abstract

In this chapter, we present a case study that applies visual analytics to the protein disorder prediction problem. Protein disorder is one of the most important characteristics for understanding many biological functions and interactions. Because lab experiments are costly, machine learning algorithms such as neural networks and support vector machines have been used to identify it. Rather than applying these generic methods directly, we show in this chapter that visual analytics yields additional insight. Visualizations using linear discriminant analysis reveal that the disordered residues within each protein are usually linearly well separated from the ordered ones. However, when various proteins are pooled together, no clear linear separation rule exists in general. Based on this observation, we visualize the linear discriminant vector of each protein and confirm that the proteins are clearly clustered into several groups. Inspired by these findings, we apply k-means clustering to the proteins and construct a separate classifier for each group, which significantly improves disorder prediction performance. Moreover, within the identified protein subgroups, the separation accuracy exceeded 99%, a clear indication that these subgroups merit further biological investigation.
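
The pipeline described in the abstract (a linear discriminant vector per protein, k-means clustering of those vectors, then one classifier per group) can be illustrated with a minimal sketch. This is not the authors' implementation: the data are synthetic 2-D toy "proteins" rather than the real residue features, the helper names (`fit_lda`, `kmeans`, `make_protein`, `run_demo`) are hypothetical, a plain Fisher discriminant stands in for whatever classifier the chapter uses, and accuracy is measured on the training data only.

```python
import random

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def mean(rows):
    return [sum(r[0] for r in rows) / len(rows), sum(r[1] for r in rows) / len(rows)]

def fit_lda(X, y):
    """Fisher discriminant for 2-D features: unit vector w and a midpoint threshold."""
    X0 = [x for x, t in zip(X, y) if t == 0]  # ordered residues
    X1 = [x for x, t in zip(X, y) if t == 1]  # disordered residues
    m0, m1 = mean(X0), mean(X1)
    a = d = 1e-6  # small ridge keeps the 2x2 within-class scatter invertible
    b = 0.0
    for rows, m in ((X0, m0), (X1, m1)):
        for x in rows:
            u, v = x[0] - m[0], x[1] - m[1]
            a, b, d = a + u * u, b + u * v, d + v * v
    det = a * d - b * b
    dx, dy = m1[0] - m0[0], m1[1] - m0[1]
    w = [(d * dx - b * dy) / det, (a * dy - b * dx) / det]  # Sw^{-1} (m1 - m0)
    norm = dot(w, w) ** 0.5 or 1.0
    w = [w[0] / norm, w[1] / norm]
    return w, 0.5 * (dot(w, m0) + dot(w, m1))

def accuracy(model, X, y):
    w, thr = model
    return sum((dot(w, x) > thr) == (t == 1) for x, t in zip(X, y)) / len(y)

def kmeans(points, k, iters=20):
    """Lloyd's algorithm with a deterministic farthest-point initialisation."""
    def dist2(p, q):
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist2(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = mean(members)
    return labels

def make_protein(rng, sign, n=20):
    """Hypothetical toy protein: 2-D residue features, label 1 = disordered."""
    X, y = [], []
    for label in (0, 1):
        cx = sign * (1.5 if label else -1.5)  # the two protein types separate in opposite directions
        for _ in range(n):
            X.append([rng.gauss(cx, 0.5), rng.gauss(0.0, 0.5)])
            y.append(label)
    return X, y

def run_demo(seed=0, per_type=6):
    rng = random.Random(seed)
    proteins = [make_protein(rng, s) for s in (+1, -1) for _ in range(per_type)]
    # Step 1: one linear discriminant vector per protein.
    vectors = [fit_lda(X, y)[0] for X, y in proteins]
    # Step 2: cluster the proteins by their discriminant vectors.
    groups = kmeans(vectors, k=2)
    # Step 3: compare one global classifier with one classifier per group.
    def pooled(idx):
        X = [x for i in idx for x in proteins[i][0]]
        y = [t for i in idx for t in proteins[i][1]]
        return X, y
    X_all, y_all = pooled(range(len(proteins)))
    global_acc = accuracy(fit_lda(X_all, y_all), X_all, y_all)
    correct = 0.0
    for g in set(groups):
        Xg, yg = pooled([i for i, l in enumerate(groups) if l == g])
        correct += accuracy(fit_lda(Xg, yg), Xg, yg) * len(yg)
    return groups, global_acc, correct / len(y_all)

groups, global_acc, grouped_acc = run_demo()
print("cluster labels:", groups)
print("global classifier accuracy:   ", round(global_acc, 3))
print("per-group classifier accuracy:", round(grouped_acc, 3))
```

The toy data are deliberately constructed so that each protein is linearly separable on its own while the pooled set is not, which mirrors the observation in the abstract; real proteins would not be expected to split this cleanly, and a held-out evaluation would be needed to make any claim about prediction performance.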

Acknowledgements

This research is partially supported by NSF grant CCF-0808863. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

Corresponding author

Correspondence to Jaegul Choo.

Copyright information

© 2012 Springer-Verlag London Limited

About this chapter

Cite this chapter

Choo, J., Li, F., Joo, K., Park, H. (2012). A Visual Analytics Approach for Protein Disorder Prediction. In: Dill, J., Earnshaw, R., Kasik, D., Vince, J., Wong, P. (eds) Expanding the Frontiers of Visual Analytics and Visualization. Springer, London. https://doi.org/10.1007/978-1-4471-2804-5_10

  • DOI: https://doi.org/10.1007/978-1-4471-2804-5_10

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-4471-2803-8

  • Online ISBN: 978-1-4471-2804-5

  • eBook Packages: Computer Science, Computer Science (R0)
