DataCan: Robust Approach for Genome Cancer Data Analysis

Goel, Varun; Jangir, Vishal; Shankar, Venkatesh Gauri

doi:10.1007/978-981-13-9364-8_12

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1016)

Abstract

Over the past twenty years, the biological sciences have driven active analytical research on high-dimensional data. Recently, many new approaches in data science and machine learning have emerged to handle ultrahigh-dimensional genome data. The variety of cancer data types, together with the availability of pertinent studies on similar types of cancer, adds to the complexity of the data. It is of considerable biological and clinical interest to understand what subtypes a cancer has, how patients' genomic profiles and survival rates vary among subtypes, whether a patient's survival can be predicted from his or her genomic profile, and how different genomic profiles correlate. Identifying the types of cancer mutations is of utmost importance, as they offer useful insight into disease pathogenesis and help advance therapy tailored to each person. In this paper, we focus on finding cancer-causing genes and their specific mutations and on classifying the genes into 9 classes of cancer. This helps in predicting which genetic mutation causes which type of cancer. We used scikit-learn and NLTK to analyze what each class means by classifying all genetic mutations into 17 major mutation types (according to the dataset). The dataset comes in two formats, CSV and text: the CSV file contains the genes and their mutations, and the text file contains the descriptions of these mutations. Our approach merged the two datasets and used Random Forest, with GridSearchCV and ten-fold cross-validation, to perform a supervised classification analysis, yielding an accuracy score of 68.36%. The accuracy is limited because the genes and their variations do not follow the HGVS nomenclature, so converting the text to a numerical format lost some important features. Our findings suggest that classes 1, 4 and 7 contribute the most to causing cancer.
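The workflow summarized in the abstract (join the gene/mutation table with the mutation descriptions, turn the text into numbers, then tune a Random Forest with GridSearchCV under ten-fold cross-validation) can be sketched roughly as follows. This is a minimal illustration, not the authors' code. The toy records, the column names (`ID`, `Gene`, `Variation`, `Class`), the helper names, the use of TF-IDF for the text-to-numeric step, and the parameter grid are all assumptions made here; the abstract does not specify them.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline


def make_toy_data(n_per_class=10):
    """Return hypothetical stand-ins for the CSV rows and the text descriptions."""
    # Invented placeholder phrases; they carry no biological meaning.
    phrases = {1: "alpha deletion", 2: "beta insertion", 3: "gamma fusion"}
    variants, descriptions = [], {}
    for label, phrase in phrases.items():
        for i in range(n_per_class):
            row_id = len(variants)
            variants.append(
                {"ID": row_id, "Gene": f"GENE{label}", "Variation": f"VAR{i}", "Class": label}
            )
            descriptions[row_id] = f"{phrase} reported for sample {i}"
    return variants, descriptions


def merge_records(variants, descriptions):
    """Join each gene/variation row with its description on the shared ID."""
    documents, labels = [], []
    for row in variants:
        text = descriptions.get(row["ID"], "")
        documents.append(f'{row["Gene"]} {row["Variation"]} {text}')
        labels.append(row["Class"])
    return documents, labels


def run_search(documents, labels):
    """Grid-search a TF-IDF + Random Forest pipeline with ten-fold CV."""
    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer()),  # assumed text-to-numeric step
        ("forest", RandomForestClassifier(random_state=0)),
    ])
    # Hypothetical grid, kept tiny so the sketch runs quickly.
    param_grid = {
        "forest__n_estimators": [10, 25],
        "forest__max_depth": [None, 5],
    }
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
    search.fit(documents, labels)
    return search


variants, descriptions = make_toy_data()
documents, labels = merge_records(variants, descriptions)
search = run_search(documents, labels)
print(search.best_params_, round(search.best_score_, 3))
```

Wrapping the vectorizer and the classifier in one `Pipeline` means the TF-IDF vocabulary is refit inside each training fold, so the held-out fold never leaks into feature extraction. The accuracy printed for this toy data says nothing about the 68.36% reported in the abstract.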

Acknowledgements

Varun Goel is the corresponding author. We express our sincere thanks to Prof. Venkatesh Gauri Shankar (Assistant Professor), Manipal University Jaipur, for his helpful guidance and discussions on our data analysis methods. He provided various resources to support us during the implementation of this work.

Author information

Authors and Affiliations

Department of Information Technology, Manipal University Jaipur, Jaipur, India
Varun Goel, Vishal Jangir & Venkatesh Gauri Shankar

Corresponding author

Correspondence to Varun Goel.

Editor information

Editors and Affiliations

Society for Data Science, Pune, Maharashtra, India
Neha Sharma
A.K. Choudhury School of Information Technology, University of Calcutta, Kolkata, West Bengal, India
Amlan Chakrabarti
Department of Automatics and Applied Software, Aurel Vlaicu University of Arad, Arad, Romania
Valentina Emilia Balas

About this paper

Cite this paper

Goel, V., Jangir, V., Shankar, V.G. (2020). DataCan: Robust Approach for Genome Cancer Data Analysis. In: Sharma, N., Chakrabarti, A., Balas, V. (eds) Data Management, Analytics and Innovation. Advances in Intelligent Systems and Computing, vol 1016. Springer, Singapore. https://doi.org/10.1007/978-981-13-9364-8_12

DOI: https://doi.org/10.1007/978-981-13-9364-8_12
Published: 25 September 2019
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-9363-1
Online ISBN: 978-981-13-9364-8
eBook Packages: Engineering, Engineering (R0)
