Improving classification accuracy of cancer types using parallel hybrid feature selection on microarray gene expression data
Data mining techniques extract previously unknown knowledge from large data sets. Microarray gene expression (MGE) data plays a major role in predicting the type of cancer. Because MGE data is large in volume, applying traditional data mining approaches is time-consuming. Parallel programming frameworks such as Hadoop, Spark and Mahout are therefore needed to ease the computational burden.
Not all gene expressions are needed for prediction, so it is essential to select the important genes to improve classification accuracy. Feature selection algorithms are therefore parallelized and executed on the Spark framework to eliminate unnecessary genes and identify only the predictive genes in much less time, without affecting prediction accuracy.
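The idea of scoring genes independently, which is what makes this step easy to parallelize, can be sketched on a single machine. This is an illustration only, not the authors' implementation. A thread pool stands in for Spark's distributed execution, and absolute Pearson correlation with a binary class label stands in for the actual scoring method. The gene names, expression values and the 0.5 threshold are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean, pstdev


def gene_score(values, labels):
    """Absolute Pearson correlation between one gene and a 0/1 class label."""
    sx, sy = pstdev(values), pstdev(labels)
    if sx == 0 or sy == 0:
        return 0.0  # a constant gene carries no class information
    mx, my = mean(values), mean(labels)
    cov = mean((x - mx) * (y - my) for x, y in zip(values, labels))
    return abs(cov / (sx * sy))


def score_genes_parallel(genes, labels, workers=4):
    """Score every gene independently; genes maps gene name -> expression values."""
    names = list(genes)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(lambda n: gene_score(genes[n], labels), names))
    return dict(zip(names, scores))


# Hypothetical toy data: six samples, three genes.
labels = [0, 0, 0, 1, 1, 1]
genes = {
    "g1": [1.0, 1.2, 0.9, 3.1, 3.0, 2.8],  # tracks the class
    "g2": [2.0, 2.1, 1.9, 2.0, 2.1, 1.9],  # unrelated to the class
    "g3": [5.0, 5.0, 5.0, 5.0, 5.0, 5.0],  # constant
}
scores = score_genes_parallel(genes, labels)
kept = [g for g, s in scores.items() if s >= 0.5]  # assumed threshold
print(kept)
```

Because each call to `gene_score` touches only one gene, the same function could be mapped over a distributed collection of genes instead of a local thread pool.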
A parallelized hybrid feature selection (HFS) method is proposed for this purpose. The method applies parallelized correlation feature subset selection followed by rank-based feature selection methods. The selected subset of genes is evaluated using parallel classification algorithms. The resulting accuracy values are compared with those of the existing rank-weight feature selection and parallelized recursive feature selection methods, and with the values obtained by executing parallelized HFS on DistributedWekaSpark.
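The two-stage structure described above can be sketched as a small serial example. It is a simplified illustration under stated assumptions, not the proposed parallel method. Stage 1 uses a greedy forward search over Hall's correlation-based merit, k·r_cf / sqrt(k + k(k−1)·r_ff), with Pearson correlation as the correlation measure. Stage 2 ranks the surviving genes by absolute class correlation as a stand-in for the rank-based scorers. The gene names and the discretized expression values (0 = low, 1 = high) are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev


def pearson(a, b):
    """Pearson correlation; 0.0 if either vector is constant."""
    sa, sb = pstdev(a), pstdev(b)
    if sa == 0 or sb == 0:
        return 0.0
    ma, mb = mean(a), mean(b)
    return mean((x - ma) * (y - mb) for x, y in zip(a, b)) / (sa * sb)


def cfs_merit(subset, genes, labels):
    """Merit = k * r_cf / sqrt(k + k*(k-1) * r_ff) for a list of gene names."""
    k = len(subset)
    r_cf = mean(abs(pearson(genes[g], labels)) for g in subset)
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = mean(abs(pearson(genes[a], genes[b])) for a, b in pairs) if pairs else 0.0
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)


def cfs_forward(genes, labels):
    """Stage 1: add the gene that most improves merit; stop when none does."""
    selected, best = [], 0.0
    while True:
        scored = [(cfs_merit(selected + [g], genes, labels), g)
                  for g in genes if g not in selected]
        if not scored:
            return selected
        merit, gene = max(scored)
        if merit <= best:
            return selected
        selected.append(gene)
        best = merit


def rank_top(subset, genes, labels, top_k):
    """Stage 2: keep the top_k genes by absolute class correlation (assumed scorer)."""
    ranked = sorted(subset, key=lambda g: abs(pearson(genes[g], labels)), reverse=True)
    return ranked[:top_k]


# Hypothetical toy data: ten samples, four discretized genes.
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
genes = {
    "gene_a": [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],  # strongly predictive
    "gene_b": [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],  # predictive, complements gene_a
    "gene_c": [0, 0, 0, 1, 1, 1, 1, 1, 1, 1],  # largely redundant with gene_a
    "gene_d": [0, 1, 0, 1, 0, 0, 1, 0, 1, 0],  # unrelated to the class
}
subset = cfs_forward(genes, labels)            # correlation-based subset
final = rank_top(subset, genes, labels, top_k=1)
print(subset, final)
```

On this toy data the subset stage keeps `gene_a` and `gene_b` and rejects the redundant `gene_c` and the uninformative `gene_d`, and the ranking stage then orders the survivors. The point of the hybrid design is that the rank-based scorers only have to examine the much smaller subset.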
The classification accuracy obtained with the proposed parallelized HFS method is 97% for gastric cancer and 79% for childhood leukemia. The method improved classification accuracy by ~4% to ~15% compared with previous methods.
The results show that the proposed parallelized feature selection algorithm scales to growing medical data and predicts cancer sub-types in less time and with higher accuracy.
Keywords: Parallelized hybrid feature selection, Correlation feature subset selection, Rank-based methods, Parallel classification, Spark, DistributedWekaSpark
This research is part of a project funded by the Science and Engineering Research Board (SERB), Department of Science and Technology (DST), under the Young Scientist Scheme (Early Start-up Research Grant), titled “Investigation on the effect of Gene and Protein Mutants in the onset of Neuro-Degenerative Brain Disorders (Alzheimer’s and Parkinson’s disease): A Computational Study”, Reference no. SERB—YSS/2015/000737/ES.
Compliance with ethical standards
Conflict of interest
Lokeswari Venkataramana, Shomona Gracia Jacob, Rajavel Ramadoss, Dodda Saisuma, Dommaraju Haritha and Kunthipuram Manoja declare that they have no conflict of interest.
This article does not contain any studies with human participants or animals performed by any of the authors.
Informed consent is not necessary as this article does not involve human or animal participants.