Predicting Crystallisability of Organic Molecules Using Statistical Modelling Techniques

  • Rajni M. Bhardwaj
Part of the Springer Theses book series (Springer Theses)


Two statistical modelling tools, Random Forest and Principal Component Analysis (PCA), were applied to predict the crystallisability (crystals vs. no crystals) of a set of organic molecules. The predictive models are based on calculated 2-D and 3-D molecular descriptors and on published experimental crystallisation propensities of these molecules. The Random Forest classification method produced a better model than PCA and, for the first time, enabled the crystallisability of organic molecules to be predicted with ~70 % accuracy. It also identified the descriptors that contribute most to the differences in crystallisation behaviour: torsion energy, van der Waals/steric energy, structure connectivity, conformation and the number of rotatable bonds in the molecule.
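As a rough illustration of the kind of workflow the abstract describes (fit an ensemble classifier on molecular descriptors, then rank the descriptors by importance), the sketch below builds a deliberately simplified forest of one-split trees in plain Python. It is not the chapter's model: the data are synthetic, the descriptor names (`descriptor_a`, `descriptor_b`, `noise`) and all function names are hypothetical, each "tree" is only a decision stump, and importance is estimated by permuting a descriptor on a held-out set rather than on out-of-bag samples.

```python
import random


def fit_stump(X, y, features):
    """Best one-split tree (feature, threshold, left label, right label) by error count."""
    best = None
    for f in features:
        values = sorted({row[f] for row in X})
        for lo, hi in zip(values, values[1:]):  # midpoints between distinct values
            t = (lo + hi) / 2
            for left, right in ((0, 1), (1, 0)):
                errors = sum((left if row[f] <= t else right) != label
                             for row, label in zip(X, y))
                if best is None or errors < best[0]:
                    best = (errors, f, t, left, right)
    return best[1:]


def fit_forest(X, y, n_trees=101, mtry=2, seed=0):
    """Each stump sees a bootstrap sample and a random subset of mtry descriptors."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample (with replacement)
        feats = rng.sample(range(p), mtry)           # random descriptor subset
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest


def predict(forest, row):
    """Majority vote over all stumps; returns 1 or 0."""
    votes = sum(left if row[f] <= t else right for f, t, left, right in forest)
    return int(votes * 2 > len(forest))


def accuracy(forest, X, y):
    return sum(predict(forest, r) == label for r, label in zip(X, y)) / len(y)


def permutation_importance(forest, X, y, seed=1):
    """Drop in accuracy when one descriptor column is shuffled."""
    rng = random.Random(seed)
    base = accuracy(forest, X, y)
    drops = []
    for f in range(len(X[0])):
        col = [r[f] for r in X]
        rng.shuffle(col)
        X_perm = [r[:f] + [v] + r[f + 1:] for r, v in zip(X, col)]
        drops.append(base - accuracy(forest, X_perm, y))
    return drops


# Synthetic data (assumption, not from the chapter): 1 = "crystals", 0 = "no crystals".
data_rng = random.Random(42)
names = ["descriptor_a", "descriptor_b", "noise"]
X, y = [], []
for i in range(60):
    label = i % 2
    X.append([label + data_rng.uniform(0, 0.5),    # informative descriptor
              -label + data_rng.uniform(0, 0.5),   # informative descriptor (inverse trend)
              data_rng.uniform(0, 1)])             # uninformative descriptor
    y.append(label)

X_train, y_train, X_test, y_test = X[:40], y[:40], X[40:], y[40:]
forest = fit_forest(X_train, y_train)
test_acc = accuracy(forest, X_test, y_test)
importance = permutation_importance(forest, X_test, y_test)

for name, drop in zip(names, importance):
    print(f"{name}: accuracy drop when permuted = {drop:.2f}")
print(f"held-out accuracy = {test_acc:.2f}")
```

The synthetic classes are cleanly separable, so the numbers printed here say nothing about real crystallisation data. The point is only the shape of the procedure: bootstrap sampling, random descriptor subsets, majority voting, and a permutation-based ranking in which the uninformative descriptor scores zero.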


Keywords: Random Forest; Molecular Descriptor; Random Forest Model; Quantitative Structure Property Relationship; Physicochemical Descriptor


Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. Lilly Corporate Center, Eli Lilly and Company, Indianapolis, USA