Construction of precise support vector machine based models for predicting promoter strength
Predicting the strength of a prokaryotic promoter from its sequence is of great importance both for fundamental research in the life sciences and for applications in synthetic biology. Much progress has been made in building quantitative models for strength prediction; in particular, the introduction of machine learning methods such as artificial neural networks (ANN) has significantly improved prediction accuracy. The support vector machine (SVM), one of the most important machine learning methods, is particularly good at learning from small datasets and is therefore expected to perform well on this problem.
To confirm this, we constructed SVM-based models to quantitatively predict promoter strength. A library of 100 promoter sequences and their strength values was randomly divided into two datasets: a training set (⩾10 sequences) for model training and a test set (⩾10 sequences) for model testing.
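The random division described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' procedure: the function name `split_library`, the `seed` and `min_size` parameters, and the placeholder library entries are all assumptions introduced here.

```python
import random


def split_library(library, n_train, seed=0, min_size=10):
    """Randomly split (sequence, strength) pairs into a training and a test set.

    Hypothetical helper: min_size mirrors the ">= 10 sequences" constraint
    stated for both sets; seed only makes the shuffle reproducible.
    """
    n_test = len(library) - n_train
    if n_train < min_size or n_test < min_size:
        raise ValueError("both sets must contain at least min_size sequences")
    shuffled = list(library)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder library: synthetic names and strengths, not data from the study.
library = [(f"promoter_{i:03d}", float(i)) for i in range(100)]
train_set, test_set = split_library(library, n_train=90)
print(len(train_set), len(test_set))  # 90 10
```

Varying `n_train` from 10 to 90 reproduces the range of training-set sizes allowed by the stated constraint for a library of 100 sequences.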
The results indicate that prediction performance increases with the size of the training set, and the best performance was achieved with a training set of 90 sequences. After optimization of the model parameters, a high-performance model was obtained, with a high squared correlation coefficient for both the training set (R² > 0.99) and the test set (R² > 0.98), both better than those of the ANN model obtained in our previous work.
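The text does not spell out how the squared correlation coefficient is computed. The sketch below assumes it is the squared Pearson correlation between measured and predicted strengths; the function name `squared_correlation` and the example values are hypothetical and are not results from the study.

```python
def squared_correlation(measured, predicted):
    """Squared Pearson correlation between two equal-length sequences.

    Assumption: this is the quantity reported as R2 for the training and
    test sets; the source does not give the formula.
    """
    n = len(measured)
    if n != len(predicted) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_m = sum(measured) / n
    mean_p = sum(predicted) / n
    cov = sum((m - mean_m) * (p - mean_p) for m, p in zip(measured, predicted))
    var_m = sum((m - mean_m) ** 2 for m in measured)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_m == 0 or var_p == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov * cov / (var_m * var_p)


# Made-up values for illustration only.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
print(squared_correlation(measured, predicted))
```

Under this definition a value close to 1 means the predictions track the measurements almost linearly, which is how thresholds such as 0.99 and 0.98 would be read.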
Our results demonstrate that SVM-based models can be employed for the quantitative prediction of promoter strength.
Keywords: support vector machine, model, quantitative prediction, promoter strength, machine learning
This work was financially supported by NSFC (Nos. 31471270, 31301017, 31670056 and 31300686), 973 Program (No. 2014CB745202), 863 Program (No. SS2015AA020936), the Guangdong Natural Science Funds for Distinguished Young Scholar (No. S2013050016987), the Science and Technology Planning Project of Guangdong Province (Nos. 2014B020201001 and 2014A030304008), Natural Science Foundation of Guangdong Province (No. 2015A030310317), the Guangzhou Science and Technology Scheme (Nos. 201508020091 and 201508020092), and Shenzhen grants (Nos. KQTD2015033117210153, JCYJ20140610152828703, KQJSCX20160301144623, CXZZ20140901004122088, JCYJ20150521144321007 and JCYJ20140901003939019).