A Combined Enhancing and Feature Extraction Algorithm to Improve Learning Accuracy for Gene Expression Classification

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11814)

Abstract

In recent years, gene expression data combined with machine learning methods have revolutionized cancer classification, which had previously been based solely on morphological appearance. However, gene expression data are very high-dimensional with small sample sizes, which leads classification algorithms to over-fit. We propose a novel gene expression classification model that combines multiple classifiers with the synthetic minority oversampling technique (SMOTE), applied to features extracted by a deep convolutional neural network (DCNN). In our approach, the DCNN extracts latent features of the gene expression data, and the SMOTE algorithm then generates new samples from these features. The resulting representations are used in conjunction with classifiers that efficiently classify gene expression data. Numerical test results on fifty very-high-dimensional, small-sample-size gene expression datasets from the Kent Ridge Biomedical and ArrayExpress repositories show that the proposed algorithm is more accurate than state-of-the-art classification models and improves the accuracy of classifiers including non-linear support vector machines (SVM), linear SVM, k-nearest neighbors and random forests.
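
The pipeline described above has three stages: train the DCNN on the expression profiles and keep its latent representation, enlarge the small training set by applying SMOTE to those latent features, and finally train a conventional classifier on the augmented feature set. The following is a minimal sketch of that flow in Python, assuming a Keras-style 1-D convolutional network together with scikit-learn and imbalanced-learn; the layer sizes, SMOTE parameters, SVM settings and the random placeholder data are illustrative assumptions, not the configuration reported in the paper.

import numpy as np
from tensorflow import keras
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def build_dcnn(n_genes, n_classes):
    # 1-D convolutional network over the gene axis; the penultimate dense
    # layer ("latent") is later reused as the feature extractor.
    inputs = keras.Input(shape=(n_genes, 1))
    x = keras.layers.Conv1D(32, kernel_size=9, activation="relu")(inputs)
    x = keras.layers.MaxPooling1D(pool_size=4)(x)
    x = keras.layers.Conv1D(64, kernel_size=9, activation="relu")(x)
    x = keras.layers.GlobalMaxPooling1D()(x)
    latent = keras.layers.Dense(128, activation="relu", name="latent")(x)
    outputs = keras.layers.Dense(n_classes, activation="softmax")(latent)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    return model

# Random placeholder data standing in for a real microarray dataset:
# X is an (n_samples, n_genes) expression matrix, y holds class labels.
X = np.random.rand(100, 2000).astype("float32")
y = np.array([0] * 70 + [1] * 30)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Stage 1: train the DCNN and use its latent layer as a feature extractor.
model = build_dcnn(n_genes=X.shape[1], n_classes=2)
model.fit(X_tr[..., np.newaxis], y_tr, epochs=5, batch_size=8, verbose=0)
encoder = keras.Model(model.input, model.get_layer("latent").output)
F_tr = encoder.predict(X_tr[..., np.newaxis], verbose=0)
F_te = encoder.predict(X_te[..., np.newaxis], verbose=0)

# Stage 2: generate synthetic training samples with SMOTE in the latent space.
F_res, y_res = SMOTE(k_neighbors=3, random_state=0).fit_resample(F_tr, y_tr)

# Stage 3: fit a conventional classifier (non-linear SVM here) on the
# SMOTE-augmented features and evaluate it on the held-out samples.
clf = SVC(kernel="rbf", C=10.0, gamma="scale").fit(F_res, y_res)
print("test accuracy:", clf.score(F_te, y_te))

In practice the network depth, the latent dimensionality and the downstream classifier (linear SVM, k-nearest neighbors or random forests) would be chosen per dataset, for example by cross-validation.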

Author information

Corresponding author

Correspondence to Phuoc-Hai Huynh.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Huynh, P.-H., Nguyen, V.-H., Do, T.-N. (2019). A Combined Enhancing and Feature Extraction Algorithm to Improve Learning Accuracy for Gene Expression Classification. In: Dang, T., Küng, J., Takizawa, M., Bui, S. (eds.) Future Data and Security Engineering. FDSE 2019. Lecture Notes in Computer Science, vol. 11814. Springer, Cham. https://doi.org/10.1007/978-3-030-35653-8_17

  • DOI: https://doi.org/10.1007/978-3-030-35653-8_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-35652-1

  • Online ISBN: 978-3-030-35653-8

  • eBook Packages: Computer Science, Computer Science (R0)
