AutoWeka: Toward an Automated Data Mining Software for QSAR and QSPR Studies

Nantasenamat, Chanin; Worachartcheewan, Apilak; Jamsak, Saksiri; Preeyanon, Likit; Shoombuatong, Watshara; Simeon, Saw; Mandi, Prasit; Isarankura-Na-Ayudhya, Chartchalerm; Prachayasittikul, Virapong

doi:10.1007/978-1-4939-2239-0_8

Part of the book series: Methods in Molecular Biology (MIMB, volume 1260)

Abstract

In biology and chemistry, a key goal is to discover novel compounds affording potent biological activity or chemical properties. This could be achieved through a chemical intuition-driven trial-and-error process or via data-driven predictive modeling. The latter is based on the concept of quantitative structure-activity/property relationship (QSAR/QSPR) when applied in modeling the biological activity and chemical properties, respectively, of compounds. Data mining is a powerful technology underlying QSAR/QSPR as it harnesses knowledge from large volumes of high-dimensional data via multivariate analysis. Although extremely useful, the technicalities of data mining may overwhelm potential users, especially those in the life sciences. Herein, we aim to lower the barriers to access and utilization of data mining software for QSAR/QSPR studies. AutoWeka is an automated data mining software tool that is powered by the widely used machine learning package Weka. The software provides a user-friendly graphical interface along with an automated parameter search capability. It employs two robust and popular machine learning methods: artificial neural networks and support vector machines. This chapter describes the practical usage of AutoWeka and relevant tools in the development of predictive QSAR/QSPR models. Availability: The software is freely available at http://www.mt.mahidol.ac.th/autoweka.
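The abstract does not spell out how the automated parameter search works, so the following is only a minimal sketch of one common approach: an exhaustive grid search scored by k-fold cross-validation. Everything in it is an assumption made for illustration. The toy RBF-weighted learner `kernel_regression` stands in for the neural network and support vector machine learners, the `gamma` grid and the synthetic data are invented, and none of the function names come from AutoWeka or Weka.

```python
import itertools
import math


def kernel_regression(train_x, train_y, test_x, gamma):
    """Toy stand-in learner (not Weka): RBF-weighted mean of training targets."""
    preds = []
    for q in test_x:
        weights = [
            math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(q, row)))
            for row in train_x
        ]
        total = sum(weights)
        if total == 0.0:
            # All weights underflowed; fall back to the plain training mean.
            preds.append(sum(train_y) / len(train_y))
        else:
            preds.append(sum(w * t for w, t in zip(weights, train_y)) / total)
    return preds


def cross_validated_rmse(x, y, params, folds=5):
    """Root mean squared error pooled over interleaved k-fold test splits."""
    squared_errors = []
    for f in range(folds):
        test_idx = [i for i in range(len(x)) if i % folds == f]
        train_idx = [i for i in range(len(x)) if i % folds != f]
        preds = kernel_regression(
            [x[i] for i in train_idx],
            [y[i] for i in train_idx],
            [x[i] for i in test_idx],
            **params,
        )
        squared_errors += [(p - y[i]) ** 2 for p, i in zip(preds, test_idx)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def grid_search(x, y, grid, folds=5):
    """Try every parameter combination; keep the one with the lowest CV error."""
    names = sorted(grid)
    best_params, best_score = None, float("inf")
    for values in itertools.product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        score = cross_validated_rmse(x, y, params, folds)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Synthetic one-descriptor data set (hypothetical, for demonstration only).
x = [[i / 10.0] for i in range(40)]
y = [math.sin(row[0]) for row in x]

best_params, best_rmse = grid_search(x, y, {"gamma": [0.01, 0.1, 1.0, 10.0, 100.0]})
print(best_params, round(best_rmse, 4))
```

Scoring each candidate on held-out folds rather than on the training data is what keeps such a search from simply rewarding the most flexible setting. Adding a second key to the `grid` dictionary would extend the search to more parameters without changing `grid_search`, provided the learner accepts that keyword.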

Acknowledgments

This work was supported by Mahidol University via the Goal-Oriented Research Grant to C.N.; postdoctoral fellowship to W.S.; research assistantships to P.M., S.J., and L.P.; and partial financial support to S.S.

Author information

Authors and Affiliations

Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, 10700, Thailand
Chanin Nantasenamat, Apilak Worachartcheewan, Saksiri Jamsak, Likit Preeyanon, Watshara Shoombuatong, Saw Simeon & Prasit Mandi
Department of Clinical Microbiology and Applied Technology, Faculty of Medical Technology, Mahidol University, Bangkok, 10700, Thailand
Chartchalerm Isarankura-Na-Ayudhya & Virapong Prachayasittikul

Corresponding author

Correspondence to Chanin Nantasenamat.

Editor information

Editors and Affiliations

Chemistry Department, Oxford University, UK, and University of Victoria, Victoria, British Columbia, Canada
Hugh Cartwright

About this protocol

Cite this protocol

Nantasenamat, C. et al. (2015). AutoWeka: Toward an Automated Data Mining Software for QSAR and QSPR Studies. In: Cartwright, H. (ed) Artificial Neural Networks. Methods in Molecular Biology, vol 1260. Springer, New York, NY. https://doi.org/10.1007/978-1-4939-2239-0_8

DOI: https://doi.org/10.1007/978-1-4939-2239-0_8
Published: 21 November 2014
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4939-2238-3
Online ISBN: 978-1-4939-2239-0
eBook Packages: Springer Protocols
