Towards Systematic Methods in an Era of Big Data: Neighborhood Wide Association Studies

  • Shannon M. Lynch
Part of the Energy Balance and Cancer book series (EBAC, volume 15)


Neighborhood studies face methodologic challenges related to variable selection. In the era of “Big Data”, this problem will only grow as neighborhood data become increasingly complex and are integrated with multilevel data. Systematic approaches to variable selection are needed to make neighborhood variables consistent and comparable across studies. Borrowing concepts from empiric methods in biology, researchers recently developed a neighborhood-wide association study (NWAS) and a neighborhood-environment wide association study (NE-WAS). This chapter introduces key concepts of the NWAS/NE-WAS designs, provides criteria for evaluating these systematic approaches, and discusses the potential impact of these empiric methods on future multilevel interventions.


Keywords: Neighborhood wide association study (NWAS); Neighborhood-environment wide association study (NE-WAS); Big data; Machine learning; Variable selection



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Fox Chase Cancer Center, Philadelphia, USA
