1 Introduction

Personalized or stratified medicine has been one of the hot topics in health care, reaching well beyond the launch of the Precision Medicine Initiative in the United States (Collins and Varmus 2010). The promise of personalized medicine is to identify individuals at risk and find optimally tailored health care solutions based on their genetic and environmental makeup (Lu et al. 2014). Although personalized medicine spans a variety of medical and biological disciplines, two subfields are particularly promising due to their growing adoption: genetics and neuroscience. Indeed, many current examples of precision medicine come from pharmacogenomics, and specifically from oncology, where cancer treatments are chosen to match the mutations found in tumours (Kummar et al. 2015; Smith 2012; Tan and Du 2012).

While this use of genetic data in health care is projected to become more central in the coming years, its success will depend on multiple factors. As for most things in health care, cost plays a huge role. But while the costs of performing a high-precision medical examination, such as a brain scan, or of sequencing a human genome continue to drop (Wetterstrand 2018), the usefulness of such data is limited both by our ability to process these large amounts of data quickly and by the lack of medically relevant scientific knowledge about individual genetic variants (Dewey et al. 2014) or complex neurobiological processes. As such, it is key that science be able to generate genetic knowledge more quickly (Kohane 2015).

Two recent trends in science, big data and artificial intelligence, appear promising not only for accelerating our genomic and neurobiological understanding but also for diagnosis within a precision medicine framework (Moon et al. 2007; Dilsizian and Siegel 2014). The idea is that artificial intelligence can be used to mine large data sets to find even small associations between genetic variants or neuromarkers and disease phenotypes, and to track disease progression or predict optimal treatments. To create such large data collections effectively, it thus becomes central to link and share individual data sets (Kohane 2015). But while the total number of base pairs sequenced per unit of time and the total number of participants included in neuroscientific studies have increased exponentially over the last years, sharing practices for such data have not kept pace (Kovalevskaya et al. 2016), despite individual efforts to enable open sharing of genetic (Mao et al. 2016; Greshake et al. 2014) or neuroscientific (Poline et al. 2012) data.

2 Sharing Genomic Data

To alleviate these shortcomings, individual academic consortia have been founded to pool data sets across institutions and individual researchers. National efforts include the UK10K (“UK10K” 2018), which aimed to sequence 10,000 participants in the United Kingdom, and the similarly structured 100,000 Genomes Project by Genomics England (“Genomics England” 2018). In the United States, the Exome Aggregation Consortium (ExAC) (“ExAC” 2018), which has collected over 60,000 exomes, and more recently the All of Us initiative (“All of Us” 2018) are collecting and aggregating more patient data for research purposes. Academic research is not alone in starting to collect large data sets for personalized medicine; commercial companies are starting to explore the field too.

Since deCODE Genetics and 23andMe released the first direct-to-consumer genetic tests back in 2007 (Vorhaus 2010), the market for commercial genetic testing has grown significantly, not only in terms of the companies that have entered the market, such as MyHeritage, FamilyTreeDNA, AncestryDNA or Veritas, but also in terms of the number of people who have obtained genetic tests through these services. Today, AncestryDNA has over five million customers, and industry veteran 23andMe has genetic data for over two million people (McAllister 2017). These sizable commercial databases are of interest to academic and commercial researchers. 23andMe has collaborated with academic researchers on numerous research papers (“23andMe Research” 2018) and has entered into for-profit collaborations with pharmaceutical companies like Pfizer and Genentech.

Who profits from such large-scale research remains an open question. In psychology, for example, the need to examine how representative study participants are has been acknowledged. After all, around 80% of all participants in psychology studies are from WEIRD (Western, Educated, Industrialized, Rich, Democratic) countries and thus do not represent human diversity (Henrich et al. 2010). As such, only WEIRD participants can fully profit from much of psychological research. To avoid the overrepresentation of WEIRD individuals found in psychology, it is key that our genetic research data resources reflect human diversity across populations. Indeed, this issue of representativeness becomes even more central in the genetic framework of genome-wide association studies (GWAS). These studies are commonly used to inform personalized medicine by identifying genetic risk factors, e.g. for cancer (Agyeman and Ofori-Asenso 2015). Unfortunately, most of these identified risk factors are mere correlations, not genes that directly cause a disease. As these correlations depend on the ancestry context in which they were found, the findings of a GWAS are not necessarily applicable outside the human population in which an association was initially found (Bush et al. 2012) and in many cases cannot be replicated (Marigorta et al. 2013).

Indeed, many data sharing efforts show such a lack of population diversity: More than 50% of the over 60,000 samples in the ExAC consortium come from a European population (“ExAC” 2018). Similarly, commercial databases like those of 23andMe suffer from ancestry and race biases (“Problems with 23andMe Ancestry Composition” 2015; Euny Hong 2016). Open genomic databases, like the Personal Genome Projects and openSNP, are not faring much better: 75% of participants in one of Harvard’s Personal Genome Project studies identified as white (Mao et al. 2016), and in a survey of over 500 openSNP participants, over 70% came from the US, UK and Canada. Additionally, over 75% of openSNP participants had at least a Bachelor’s degree, hinting at a highly skewed demographic (Haeusermann et al. 2017).

3 Sharing Neurobiological Data

Similar to genetics, neuroscience has come a long way in data sharing: While initial attempts to share data mainly focused on post-processed data, like coordinate-based results or statistical maps of magnetic resonance imaging (MRI) (Fox and Lancaster 2002), more recent initiatives enable sharing of entire functional or structural MRI datasets (Gorgolewski et al. 2015; Poldrack et al. 2013) and magneto- or electroencephalography (M/EEG) data (Niso et al. 2016).

As in the case of psychology and genomics, neuroscience research is largely based on data from individuals in WEIRD societies (Falk et al. 2013), despite a plethora of studies showing that brain development is affected by socioeconomic status, early life stress, or cultural differences (Hackman et al. 2010; Marshall et al. 2018; Chan et al. 2018; Duval et al. 2017; Liddell and Jobson 2016). Indeed, socioeconomic variables during childhood, measured within or across households, such as family income, parental education (Ellwood-Lowe et al. 2018; Weissman et al. 2018) or neighbourhood poverty levels (Marshall et al. 2018), can be traced in trajectories of brain development and result in differences in brain structure (Ellwood-Lowe et al. 2018), cognitive functions (Hackman and Farah 2018), or gene expression (Parker et al. 2017). Differences in brain networks according to socioeconomic status are also evident during adolescence (Weissman et al. 2018) and adulthood (Chan et al. 2018).

Furthermore, culture has been shown to influence neural functions (Liddell and Jobson 2016). Cultural and ethnic differences have an impact on emotion perception and expression, and on brain responses to emotional or social cues (Derntl et al. 2012). Moreover, ethnic differences have been found in physiological responses to fear or novelty (Martínez et al. 2014; Kredlow et al. 2017), which are commonly used to assess anxiety or post-traumatic stress disorders (Bach et al. 2017). This situation is aggravated by the fact that ethnicity can influence skin conductance responses (Kredlow et al. 2017), which are commonly used as laboratory measurements of fear mechanisms (Tzovara et al. 2018), potentially leading to the exclusion of ethnic groups despite their being at higher risk, e.g. for post-traumatic stress disorders (Roberts et al. 2011).

How much existing data sharing efforts in neuroscience are affected by these biases is hard to estimate at this point: Although these initiatives generally tend to support standardized data formats for data sharing (Niso et al. 2018; Gorgolewski et al. 2016), they only rarely include concrete guidelines for reporting socio-demographic variables (Madan 2017).

4 Data Sharing as a Social Movement

All of this paints a bleak picture: The populations we are using to develop personalized medicine are highly WEIRD (Henrich et al. 2010). Even worse, we might often not even be aware of this, as we are not collecting the demographic data needed to identify our biases. Depending on the field, research studies may furthermore include only small samples, making it hard to evaluate how ethnicity or social factors influence neurobiological functions and gene expression. Only by sharing diverse datasets and including rich demographic information will it be possible to make our understanding of disease progression and neurobiological functions relevant for all individuals, irrespective of their social or ethnic background.

Back in 2005, Thomas Friedman firmly believed that the next great breakthrough in bioscience could come from a 15-year-old who downloads the human genome in Egypt (Pink 2005). Today, we have to acknowledge that there is a good chance that this 15-year-old would not be able to profit from their own breakthrough. Because of this, we are still far away from truly personalized medicine, which makes our personal data political. It is up to us, the generators of data and the people sharing data, to work on changing this and to ensure that the promise of personalized medicine is equitable. Or, in Carol Hanisch’s words: “There are no personal solutions at this time. There is only collective action for a collective solution” (Hanisch 1969).