3D visualization and cluster analysis of unstructured protein sequences using ARCSA with a file conversion approach

  • U. Vignesh
  • R. Parvathi


This work describes the synthesis of protein structures using clustering, an unsupervised learning method. Protein structure prediction was performed for different crab and egg datasets, with inputs collected from the Protein Data Bank (PDB IDs: 3LIG, 2W3Z, 3ZVQ, 2KLR and 2YIZ). The three-dimensional protein structures were merged using MergeSets, an instance-filtering technique built into the data mining tool. The proposed methodology, referred to as attribute-related cluster sequence analysis (ARCSA), aims to identify a well-performing algorithm for clustering protein structures by comparing four existing algorithms: k-means, expectation maximization, farthest first and COBWEB. Experiments were conducted with the BioWeka data mining tool, Modeler 9.15 and the PyMOL tool, with scripts written in the Python programming language. The results show that expectation maximization performs best of the four algorithms for clustering structured proteins, a finding that may also help in identifying better algorithms for supervised learning methods.
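Of the four algorithms compared, farthest first is the simplest to state: choose one starting center, then repeatedly add the point that lies farthest from all centers chosen so far, and finally assign every point to its nearest center. The following is a minimal sketch of that idea in plain Python, not the BioWeka implementation used in the paper. The function names and the two-dimensional `toy_points` are hypothetical and stand in for whatever numeric attributes are extracted from the protein structures.

```python
import math


def farthest_first_centers(points, k, start=0):
    """Return indices of k centers chosen by farthest-first traversal.

    Begins at points[start], then repeatedly adds the point whose distance
    to its nearest already-chosen center is largest.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    centers = [start]
    # dist_to_nearest[i] is the distance from point i to its nearest center.
    dist_to_nearest = [math.dist(p, points[start]) for p in points]
    while len(centers) < k:
        nxt = max(range(len(points)), key=dist_to_nearest.__getitem__)
        centers.append(nxt)
        # Refresh nearest-center distances against the newly added center.
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < dist_to_nearest[i]:
                dist_to_nearest[i] = d
    return centers


def assign_clusters(points, centers):
    """Label each point with the position (in centers) of its nearest center."""
    return [
        min(range(len(centers)), key=lambda c: math.dist(p, points[centers[c]]))
        for p in points
    ]


# Hypothetical toy data, not taken from the PDB entries named in the abstract.
toy_points = [(0.0, 0.0), (0.5, 0.2), (10.0, 10.0), (10.3, 9.8), (0.1, 9.9)]
centers = farthest_first_centers(toy_points, k=3)
labels = assign_clusters(toy_points, centers)
```

The sketch makes a single pass per added center and never moves a center afterwards, in contrast to k-means and expectation maximization, which refine their estimates iteratively.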


Protein clustering; Biological data mining; Drug discovery



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. School of Computing Science and Engineering, VIT University - Chennai Campus, Chennai, India
