Abstract
In this chapter, we present a case study that applies visual analytics to the protein disorder prediction problem. Protein disorder is one of the most important characteristics for understanding many biological functions and interactions. Because lab experiments are costly, machine learning algorithms such as neural networks and support vector machines have been used to identify it. Rather than applying these generic methods directly, we show in this chapter that visual analytics can yield further insight. Visualizations based on linear discriminant analysis reveal that the disordered regions within each protein are usually linearly well separated from the rest. However, when data from various proteins are pooled, no clear linear separation rule exists in general. Based on this observation, we visualize the linear discriminant vector of each protein and confirm that the proteins are clearly clustered into several groups. Motivated by these findings, we apply k-means clustering to the proteins and construct a separate classifier for each group, which significantly improves disorder prediction performance. Moreover, within the identified protein subgroups, the separation accuracy exceeded 99 %, a clear indication that these subgroups merit further biological investigation.
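The pipeline summarized in the abstract (a linear discriminant vector per protein, k-means on those vectors, then one classifier per group) can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the chapter's implementation: the toy "proteins", the feature dimension, the two hypothetical group directions, the small regularization term, the farthest-point k-means initialization, and all function names are assumptions made for the example. Accuracy is measured on the training data only.

```python
import numpy as np

def fisher_lda(X, y, reg=1e-6):
    """Two-class Fisher discriminant: returns (unit vector w, threshold)."""
    mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    centered = np.vstack([X[y == 0] - mu0, X[y == 1] - mu1])
    # Within-class scatter; `reg` is an assumed ridge term for stability.
    Sw = centered.T @ centered + reg * np.eye(X.shape[1])
    w = np.linalg.solve(Sw, mu1 - mu0)
    w /= np.linalg.norm(w)
    # Threshold at the midpoint of the projected class means.
    return w, 0.5 * (w @ mu0 + w @ mu1)

def accuracy(X, y, w, thr):
    return float(np.mean((X @ w > thr).astype(int) == y))

def kmeans(V, k, iters=20):
    """Plain Lloyd iterations with a deterministic farthest-point start."""
    centers = [V[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(V - c, axis=1) for c in centers], axis=0)
        centers.append(V[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(iters):
        dists = np.linalg.norm(V[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = V[labels == j].mean(axis=0)
    return labels

# Synthetic stand-in for per-residue features: 12 toy "proteins" drawn from
# two hypothetical groups whose disorder/order separation directions differ.
rng = np.random.default_rng(0)
d, n = 5, 200
directions = np.zeros((2, d))
directions[0, 0] = 1.0
directions[1, :2] = [-0.6, 0.8]
proteins = []
for p in range(12):
    y = np.arange(n) % 2  # 1 = disordered, 0 = ordered (toy labels)
    X = rng.normal(size=(n, d)) + 3.0 * np.outer(2 * y - 1, directions[p % 2])
    proteins.append((X, y))

# Step 1: one discriminant vector per protein.
vectors = np.array([fisher_lda(X, y)[0] for X, y in proteins])

# Step 2: cluster the proteins by their discriminant vectors.
groups = kmeans(vectors, k=2)

# Baseline: a single linear classifier on all proteins pooled together.
X_all = np.vstack([X for X, _ in proteins])
y_all = np.concatenate([y for _, y in proteins])
pooled_acc = accuracy(X_all, y_all, *fisher_lda(X_all, y_all))

# Step 3: a separate linear classifier for each protein group.
correct = 0.0
for g in range(2):
    members = [proteins[i] for i in np.flatnonzero(groups == g)]
    Xg = np.vstack([X for X, _ in members])
    yg = np.concatenate([y for _, y in members])
    correct += accuracy(Xg, yg, *fisher_lda(Xg, yg)) * len(yg)
grouped_acc = correct / len(y_all)

print(f"pooled accuracy:  {pooled_acc:.3f}")
print(f"grouped accuracy: {grouped_acc:.3f}")
```

On this toy data the per-group classifiers should separate the classes better than the pooled one, mirroring the qualitative observation in the abstract. The numbers themselves say nothing about real protein data.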
Acknowledgements
This research is partially supported by NSF grant CCF-0808863. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
Copyright information
© 2012 Springer-Verlag London Limited
About this chapter
Cite this chapter
Choo, J., Li, F., Joo, K., Park, H. (2012). A Visual Analytics Approach for Protein Disorder Prediction. In: Dill, J., Earnshaw, R., Kasik, D., Vince, J., Wong, P. (eds) Expanding the Frontiers of Visual Analytics and Visualization. Springer, London. https://doi.org/10.1007/978-1-4471-2804-5_10
DOI: https://doi.org/10.1007/978-1-4471-2804-5_10
Publisher Name: Springer, London
Print ISBN: 978-1-4471-2803-8
Online ISBN: 978-1-4471-2804-5
eBook Packages: Computer Science, Computer Science (R0)