A Visual Analytics Approach for Protein Disorder Prediction

Choo, Jaegul; Li, Fuxin; Joo, Keehyoung; Park, Haesun

doi:10.1007/978-1-4471-2804-5_10


Abstract

In this chapter, we present a case study that applies visual analytics to the protein disorder prediction problem. Protein disorder is one of the most important characteristics for understanding many biological functions and interactions. Because lab experiments are costly, machine learning algorithms such as neural networks and support vector machines have been used to identify it. Rather than applying these generic methods directly, we show in this chapter that visual analytics can yield more insight. Visualizations using linear discriminant analysis reveal that the disordered and ordered regions within each protein are usually well separated linearly. However, when various proteins are pooled together, no clear linear separation rule exists in general. Based on this observation, we visualize the linear discriminant vector of each protein and confirm that the proteins cluster clearly into several groups. Inspired by these findings, we apply k-means clustering to the proteins and construct a separate classifier for each group, which significantly improves disorder prediction performance. Moreover, within the identified protein subgroups, the separation accuracy exceeded 99%, a clear indicator that these subgroups merit further biological investigation.
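The pipeline summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the data are synthetic (two hypothetical groups of proteins whose disordered residues shift in opposite directions), the helper names (`make_protein`, `fisher_direction`, `fit_lda`, `kmeans`, `accuracy`) are made up for illustration, a plain two-class Fisher discriminant with a midpoint threshold stands in for whatever classifier the chapter actually uses, and accuracy is measured on the training data only.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_protein(shift, n=200):
    """Synthetic 'protein': n residues, 2-D features, label 1 = disordered."""
    y = (rng.random(n) < 0.4).astype(int)
    X = rng.normal(size=(n, 2))
    X[y == 1] += shift  # disordered residues are displaced by `shift`
    return X, y

def fisher_direction(X, y, reg=1e-3):
    """Unit-norm two-class Fisher discriminant vector."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter, lightly regularized so it stays invertible.
    Sw = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
    w = np.linalg.solve(Sw + reg * np.eye(X.shape[1]), mu1 - mu0)
    return w / np.linalg.norm(w)

def fit_lda(X, y):
    """Linear classifier: Fisher direction plus a midpoint threshold."""
    w = fisher_direction(X, y)
    b = 0.5 * (X[y == 0].mean(axis=0) + X[y == 1].mean(axis=0)) @ w
    return w, b

def accuracy(model, X, y):
    w, b = model
    return np.mean((X @ w > b).astype(int) == y)

def kmeans(V, k, iters=20):
    """Plain Lloyd's k-means with a deterministic farthest-point init."""
    centers = [V[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(V - c, axis=1) for c in centers], axis=0)
        centers.append(V[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(iters):
        dist = np.linalg.norm(V[:, None] - centers[None], axis=2)
        labels = np.argmin(dist, axis=1)
        for j in range(k):
            if np.any(labels == j):  # keep the old center if a cluster empties
                centers[j] = V[labels == j].mean(axis=0)
    return labels

# Ten synthetic proteins in two hidden groups with opposite disorder shifts.
shifts = [np.array([4.0, 0.0])] * 5 + [np.array([-4.0, 0.0])] * 5
proteins = [make_protein(s) for s in shifts]

# Step 1: one discriminant vector per protein. Step 2: cluster the vectors.
V = np.array([fisher_direction(X, y) for X, y in proteins])
labels = kmeans(V, k=2)

# Baseline: a single linear classifier fitted on all proteins pooled together.
X_all = np.vstack([X for X, _ in proteins])
y_all = np.concatenate([y for _, y in proteins])
global_acc = accuracy(fit_lda(X_all, y_all), X_all, y_all)

# Step 3: one linear classifier per cluster of proteins.
correct = 0.0
for j in range(2):
    members = np.flatnonzero(labels == j)
    Xj = np.vstack([proteins[i][0] for i in members])
    yj = np.concatenate([proteins[i][1] for i in members])
    correct += accuracy(fit_lda(Xj, yj), Xj, yj) * len(yj)
group_acc = correct / len(y_all)

print("cluster labels:", labels)
print("pooled classifier accuracy:", round(float(global_acc), 3))
print("per-cluster classifier accuracy:", round(float(group_acc), 3))
```

On this toy data each protein is linearly separable on its own, but the pooled set is not, so the per-cluster classifiers should clearly outperform the single pooled one. How an unlabeled protein would be assigned to a cluster at prediction time is not specified in the abstract and is left out of the sketch.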

Acknowledgements

This research is partially supported by NSF grant CCF-0808863. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

Author information

Authors and Affiliations

Computational Science and Engineering, Georgia Institute of Technology, Atlanta, USA
Jaegul Choo, Fuxin Li & Haesun Park
Center for Advanced Computation, Korea Institute for Advanced Study, Seoul, Korea
Keehyoung Joo

Corresponding author

Correspondence to Jaegul Choo.

Editor information

Editors and Affiliations

School of Interactive Arts & Technology, Simon Fraser University, 102 Ave 250-13450, Surrey, V3T 0A3, British Columbia, Canada
John Dill
School of Computing, Informatics & Media, Centre for Visual Computing, University of Bradford, Richmond Road, Bradford, BD7 1DP, United Kingdom
Rae Earnshaw
The Boeing Company, South Trenton Street, Seattle, 98124, Washington, USA
David Kasik
Bournemouth Media School, National Centre for Computer Animation, Bournemouth University, Talbot Campus, Poole, BH12 5BB, Dorset, United Kingdom
John Vince
Pacific Northwest National Laboratory, Battelle Boulevard 902, Richland, 99352, Washington, USA
Pak Chung Wong

About this chapter

Cite this chapter

Choo, J., Li, F., Joo, K., Park, H. (2012). A Visual Analytics Approach for Protein Disorder Prediction. In: Dill, J., Earnshaw, R., Kasik, D., Vince, J., Wong, P. (eds) Expanding the Frontiers of Visual Analytics and Visualization. Springer, London. https://doi.org/10.1007/978-1-4471-2804-5_10

DOI: https://doi.org/10.1007/978-1-4471-2804-5_10
Publisher Name: Springer, London
Print ISBN: 978-1-4471-2803-8
Online ISBN: 978-1-4471-2804-5
eBook Packages: Computer Science, Computer Science (R0)
