Prediction in Cancer Genomics Using Topological Signatures and Machine Learning

  • Conference paper

Part of the book series: Abel Symposia (ABEL, volume 15)

Abstract

Copy number aberrations, gains and losses of genomic regions, are a hallmark of cancer and can be detected experimentally using microarray comparative genomic hybridization (aCGH). In previous work, we developed a topology-based method for analyzing aCGH data whose output is the set of genomic regions where copy number is altered in patients with a predetermined cancer phenotype. We call this method Topological Analysis of array CGH (TAaCGH). Here we combine TAaCGH with machine learning techniques to build classifiers from copy number aberrations. To illustrate the approach, we apply logistic regression to two binary phenotypes related to breast cancer. The first is over-expression of the ERBB2 gene, which is commonly regulated by a copy number gain in chromosome arm 17q. TAaCGH found the region 17q11-q22 to be associated with the phenotype; using logistic regression, we narrowed this region to 17q12-q21.31, correctly classifying 78% of the ERBB2-positive individuals (sensitivity) in a validation data set. The second phenotype is over-expression of the Estrogen Receptor (ER), also commonly observed in breast cancer patients, for which we found that the region 5p14.3-12 together with six full arms was associated with the phenotype. Our method identified 4p, 6p and 16q as the strongest predictors, correctly classifying 76% of ER positives in our validation data set. However, for this set there was a significant increase in the false positive rate (that is, a decrease in specificity). We suggest that topological and machine learning methods can be combined to predict phenotypes from genetic data.
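
The abstract reports classifier performance as sensitivity and refers to the false positive rate. As a minimal sketch of how these quantities relate for a binary phenotype, the following Python snippet computes them from true and predicted labels. The function name `binary_rates` and the toy labels are hypothetical and chosen only for illustration; they are not data or code from the paper.

```python
from typing import Dict, Sequence


def binary_rates(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Return sensitivity, specificity and false positive rate.

    Labels are 1 (phenotype positive) or 0 (phenotype negative).
    Assumes at least one positive and one negative in y_true.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    return {
        "sensitivity": tp / (tp + fn),          # fraction of positives recovered
        "specificity": tn / (tn + fp),          # fraction of negatives recovered
        "false_positive_rate": fp / (fp + tn),  # equals 1 - specificity
    }


# Toy labels (hypothetical, not from the paper): 4 positives, 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
rates = binary_rates(y_true, y_pred)
print(rates)
```

Because the false positive rate is the complement of specificity, an increase in one is a decrease in the other, which is why a classifier can keep a high sensitivity while still performing poorly on phenotype-negative patients.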

Acknowledgements

JA was partially supported by NSF DMS 1854770 and NSF CCF 1934568. RS was partially supported by the Simons Collaboration Grant 318086 in the early stages of this project and NSF DMS 1854705.

Author information

Corresponding author

Correspondence to Javier Arsuaga.

Appendix

See Tables 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, and 16.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Gonzalez, G., Ushakova, A., Sazdanovic, R., Arsuaga, J. (2020). Prediction in Cancer Genomics Using Topological Signatures and Machine Learning. In: Baas, N., Carlsson, G., Quick, G., Szymik, M., Thaule, M. (eds) Topological Data Analysis. Abel Symposia, vol 15. Springer, Cham. https://doi.org/10.1007/978-3-030-43408-3_10
