DataCan: Robust Approach for Genome Cancer Data Analysis

  • Conference paper
Data Management, Analytics and Innovation

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1016)

Abstract

Over the past twenty years, the biological sciences have driven active analytical research on high-dimensional data. Recently, many new approaches in data science and machine learning have emerged to handle ultrahigh-dimensional genome data. The variety of cancer data types, together with the many ongoing studies of similar cancers, adds to the complexity of the data. It is of considerable biological and clinical interest to understand what subtypes a cancer has, how patients' genomic profiles and survival rates vary among subtypes, whether a patient's survival can be predicted from his or her genomic profile, and how different genomic profiles correlate. Identifying the types of cancer mutations is of utmost importance, as they play a significant role in providing insight into disease pathogenesis and in advancing therapy tailored to the individual patient. In this paper we focus on finding cancer-causing genes and their specific mutations and on classifying the genes into the 9 classes of cancer. This helps predict which genetic mutation causes which type of cancer. We used scikit-learn and NLTK to analyze what each class means by classifying all genetic mutations into 17 major mutation types (according to the dataset). The dataset comes in two formats, CSV and text: the CSV file contains the genes and their mutations, and the text file contains the descriptions of these mutations. Our approach merged the two datasets and used Random Forest, with GridSearchCV and ten-fold cross-validation, to perform a supervised classification analysis, achieving an accuracy score of 68.36%. This accuracy is limited because the genes and their variations do not follow the HGVS nomenclature, so converting the text to numerical format lost some important features. Our findings suggest that classes 1, 4 and 7 contribute the most to causing cancer.
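
The workflow summarized in the abstract (merge the CSV and text inputs, convert text to numbers, then tune a Random Forest with GridSearchCV under ten-fold cross-validation) could be sketched roughly as follows. This is a minimal illustration and not the authors' code. The column names (`ID`, `Gene`, `Variation`, `Class`, `Text`), the toy rows, the TF-IDF featurization, and the parameter grid are all assumptions made for the example, and the toy data uses 3 classes rather than the 9 in the paper.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Hypothetical toy stand-ins for the two inputs; the real files are not shown here.
genes = ["BRCA1", "TP53", "EGFR"]
phrases = [
    "truncating mutation leads to loss of function",
    "missense substitution alters dna binding domain",
    "amplification increases kinase signalling activity",
]
rows = []
for i in range(36):
    k = i % 3  # toy class index (3 classes here, 9 in the paper)
    rows.append({
        "ID": i,
        "Gene": genes[k],
        "Variation": f"V{i}",
        "Class": k + 1,
        "Text": f"{phrases[k]} observed in sample {i}",
    })
full = pd.DataFrame(rows)
variants = full[["ID", "Gene", "Variation", "Class"]]  # plays the role of the CSV file
texts = full[["ID", "Text"]]                           # plays the role of the text file

# Step 1: merge the two datasets on their shared key (assumed to be "ID").
merged = variants.merge(texts, on="ID", how="inner")

# Step 2: build one document per mutation; TF-IDF is an assumed text-to-numeric step.
merged["document"] = (
    merged["Gene"] + " " + merged["Variation"] + " " + merged["Text"]
)

# Step 3: Random Forest tuned by grid search with ten-fold cross-validation.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("forest", RandomForestClassifier(random_state=0)),
])
param_grid = {  # illustrative grid, not the one used in the paper
    "forest__n_estimators": [50, 100],
    "forest__max_depth": [None, 10],
}
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
search.fit(merged["document"], merged["Class"])

print("best parameters:", search.best_params_)
print("mean ten-fold accuracy:", round(search.best_score_, 4))
```

Wrapping the vectorizer and the classifier in one pipeline means the TF-IDF vocabulary is refit on the training folds only, so the cross-validated accuracy is not inflated by text from the held-out fold. Accuracy on such cleanly separated toy rows says nothing about the 68.36% reported for the real data.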

Acknowledgements

Varun Goel is the corresponding author. We express our sincere thanks to Prof. Venkatesh Gauri Shankar (Assistant Professor) of Manipal University Jaipur for his helpful guidance and discussions on our data analysis methods. He provided various resources to support us during the implementation of this work.

Corresponding author

Correspondence to Varun Goel.

Copyright information

© 2020 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Goel, V., Jangir, V., Shankar, V.G. (2020). DataCan: Robust Approach for Genome Cancer Data Analysis. In: Sharma, N., Chakrabarti, A., Balas, V. (eds) Data Management, Analytics and Innovation. Advances in Intelligent Systems and Computing, vol 1016. Springer, Singapore. https://doi.org/10.1007/978-981-13-9364-8_12

  • DOI: https://doi.org/10.1007/978-981-13-9364-8_12

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-13-9363-1

  • Online ISBN: 978-981-13-9364-8

  • eBook Packages: Engineering, Engineering (R0)
