A Topological Data Analysis Approach on Predicting Phenotypes from Gene Expression Data

  • Sayan Mandal
  • Aldo Guzmán-Sáenz
  • Niina Haiminen
  • Saugata Basu
  • Laxmi Parida
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12099)

Abstract

The goal of this study was to investigate whether gene expression measured by RNA sequencing contains enough signal to separate healthy and afflicted individuals in the context of phenotype prediction. Standard machine learning methods alone performed somewhat poorly on the disease phenotype prediction task, so we devised an approach that augments machine learning with topological data analysis.

We describe a framework for predicting phenotype values from gene expression data that are transformed into sample-specific topological signatures using feature subsampling and persistent homology. The topological data analysis approach developed in this work improved Parkinson’s disease phenotype prediction relative to standard machine learning methods.

This study confirms that gene expression can be a useful indicator of the presence or absence of a condition, and that the subtle signal contained in this high-dimensional data becomes apparent when the intricate topological connections between expressed genes are considered.
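
The abstract does not spell out how the sample-specific topological signatures are built, so the following is only a minimal sketch of the general idea (feature subsampling followed by persistent homology), not the authors' method. It assumes, hypothetically, that each random subsample of genes from one sample is grouped into a small point cloud, and it summarizes each cloud by the mean death time of its 0-dimensional persistent homology classes in the Vietoris-Rips filtration, which coincide with the edge lengths of a minimum spanning tree. The function names, the point-cloud construction, and the parameters `n_subsamples`, `n_points`, and `dim` are all illustrative assumptions.

```python
import math
import random
from itertools import combinations


def h0_death_times(points):
    """Death times of 0-dimensional persistent homology classes in the
    Vietoris-Rips filtration of a point cloud. These equal the edge
    lengths of a minimum spanning tree (Kruskal with union-find)."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Find the component representative, with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    deaths = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            # Two components merge at this scale, so one class dies.
            parent[ri] = rj
            deaths.append(length)
    return deaths


def topological_signature(expression, n_subsamples=5, n_points=8, dim=2, seed=0):
    """Hypothetical signature for one sample: for each random subsample of
    genes, form n_points points in R^dim from the expression values and
    record the mean H0 death time of the resulting point cloud."""
    rng = random.Random(seed)
    signature = []
    for _ in range(n_subsamples):
        genes = rng.sample(range(len(expression)), n_points * dim)
        points = [
            tuple(expression[g] for g in genes[k * dim:(k + 1) * dim])
            for k in range(n_points)
        ]
        deaths = h0_death_times(points)
        signature.append(sum(deaths) / len(deaths))
    return signature
```

In a setup like this, the resulting fixed-length vector for each sample could be passed to any standard classifier alongside, or instead of, the raw expression values. A fuller treatment would typically use higher-dimensional homology and a stable vectorization of the persistence diagrams.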

Keywords

  • Topological data analysis
  • Gene expression
  • Phenotype prediction
  • Parkinson’s disease

Notes

Acknowledgments

Saugata Basu was partially supported by NSF Grant DMS-1620271.

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. The Ohio State University, Columbus, USA
  2. IBM Research, T. J. Watson Research Center, Yorktown Heights, USA
  3. Purdue University, West Lafayette, USA
