Abstract
Looking back over the past twenty years, it is evident that the biological sciences have driven active analytical research on high-dimensional data. Recently, many new approaches in data science and machine learning have emerged to handle ultrahigh-dimensional genome data. The variety of cancer data types, together with the availability of related studies on similar types of cancer, adds to the complexity of the data. It is of considerable biological and clinical interest to understand what subtypes a cancer has, how patients' genomic profiles and survival rates vary among subtypes, whether a patient's survival can be predicted from his or her genomic profile, and how different genomic profiles correlate. Identifying the types of cancer mutations is of utmost importance, as they play a significant role in revealing insights into disease pathogenesis and in advancing therapy that varies from person to person. In this paper, we focus on finding cancer-causing genes and their specific mutations and on classifying the genes into the 9 classes of cancer. This helps predict which genetic mutation causes which type of cancer. We used scikit-learn and NLTK to analyze what each class means by classifying all genetic mutations into 17 major mutation types (according to the dataset). The dataset comes in two formats, CSV and text: the CSV file contains the genes and their mutations, and the text file contains descriptions of these mutations. Our approach merged the two datasets and used Random Forest, with GridSearchCV and ten-fold cross-validation, to perform a supervised classification analysis, achieving an accuracy score of 68.36%. This accuracy is limited because the genes and their variations do not follow the HGVS nomenclature, so converting the text to a numerical format lost some important features.
Our findings suggest that classes 1, 4 and 7 contribute the most to causing cancer.
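The workflow summarized in the abstract (merge the CSV and text inputs, convert the text to numerical features, then tune a Random Forest with GridSearchCV under ten-fold cross-validation) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the column names (`ID`, `Gene`, `Variation`, `Class`, `Text`), the toy rows, the TF-IDF vectorization, and the parameter grid are all assumptions, and scikit-learn's built-in English stop-word list stands in for whatever NLTK preprocessing the paper used.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Hypothetical toy stand-in for the CSV input (genes, mutations, class labels).
variants = pd.DataFrame({
    "ID": range(40),
    "Gene": ["BRCA1", "TP53", "EGFR", "KRAS"] * 10,
    "Variation": ["Truncating Mutations", "R175H", "L858R", "G12D"] * 10,
    "Class": [1, 4, 7, 2] * 10,
})

# Hypothetical toy stand-in for the text input (mutation descriptions).
texts = pd.DataFrame({
    "ID": range(40),
    "Text": [
        "truncating variant leads to loss of protein function",
        "missense substitution disrupts dna binding domain",
        "activating substitution in the kinase domain",
        "substitution impairs gtp hydrolysis and signalling",
    ] * 10,
})

# Merge the two inputs on their shared identifier.
merged = variants.merge(texts, on="ID", how="inner")

# Combine gene, variation, and description into one document per sample.
merged["document"] = (
    merged["Gene"] + " " + merged["Variation"] + " " + merged["Text"]
)

# Text-to-numeric conversion followed by a Random Forest classifier.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("forest", RandomForestClassifier(random_state=0)),
])

# Small illustrative grid; the paper does not state which values were searched.
param_grid = {
    "forest__n_estimators": [10, 25],
    "forest__max_depth": [None, 5],
}

# Ten-fold stratified cross-validation, scored by accuracy.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
search.fit(merged["document"], merged["Class"])

print("best parameters:", search.best_params_)
print("mean cross-validated accuracy:", search.best_score_)
```

On real data the two frames would be loaded from the CSV and text files instead of being built inline. The toy rows are trivially separable, so the score printed here says nothing about the 68.36% reported above.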
Acknowledgements
Varun Goel is the corresponding author. We express our sincere thanks to Prof. Venkatesh Gauri Shankar (Assistant Professor), Manipal University Jaipur, for his helpful guidance and discussions on our data analysis methods. He provided various resources to support us during the implementation of this work.
Copyright information
© 2020 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Goel, V., Jangir, V., Shankar, V.G. (2020). DataCan: Robust Approach for Genome Cancer Data Analysis. In: Sharma, N., Chakrabarti, A., Balas, V. (eds) Data Management, Analytics and Innovation. Advances in Intelligent Systems and Computing, vol 1016. Springer, Singapore. https://doi.org/10.1007/978-981-13-9364-8_12
DOI: https://doi.org/10.1007/978-981-13-9364-8_12
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-9363-1
Online ISBN: 978-981-13-9364-8
eBook Packages: Engineering, Engineering (R0)