Local and Global Stratification Analysis in Whole Genome Sequencing (WGS) Studies Using LocStra

Hahn, Georg; Lutz, Sharon Marie; Hecker, Julian; Prokopenko, Dmitry; Lange, Christoph

doi:10.1007/978-3-030-46165-2_13

Conference paper
First Online: 29 April 2020

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 12029)

Abstract

We are interested in the analysis of local and global population stratification in WGS studies. We present a new R package (locStra) that uses the covariance matrix, the genomic relationship matrix, and the unweighted/weighted genetic Jaccard similarity matrix to assess population substructure. The package supports a tailored sliding window approach, for instance with user-defined window sizes and metrics, to compare local and global similarity matrices. A technique to select the window size is proposed. Population stratification analysis with locStra is efficient due to its C++ implementation, which fully exploits sparse matrix algebra. The runtime for the genome-wide computation of all local similarity matrices typically does not exceed one hour for realistic study sizes. This makes an unprecedented investigation of local stratification across the entire genome possible. We apply our package to the 1,000 Genomes Project.

References

1. Bates, D., Eddelbuettel, D.: Fast and elegant numerical linear algebra using the RcppEigen package. J. Stat. Softw. 52(5), 1–24 (2013)
2. Chang, C.C., Chow, C.C., Tellier, L.C., Vattikuti, S., Purcell, S.M., Lee, J.J.: Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience 4 (2015)
3. Devlin, B., Roeder, K.: Genomic control for association studies. Biometrics 55(4), 997–1004 (1999)
4. Laird, N.M., Lange, C.: The Fundamentals of Modern Statistical Genetics. SBH. Springer, New York (2011). https://doi.org/10.1007/978-1-4419-7338-2
5. Lee, S., Epstein, M.P., Duncan, R., Lin, X.: Sparse principal component analysis for identifying ancestry-informative markers in genome-wide association studies. Genet. Epidemiol. 36(4), 293–302 (2012)
6. Martin, E.R., et al.: Properties of global and local ancestry adjustments in genetic association tests in admixed populations. Genet. Epidemiol. 42(2), 214–229 (2018)
7. Patterson, N., Price, A.L., Reich, D.: Population structure and Eigenanalysis. PLoS Genet. 2(12), e190 (2006)
8. Price, A.L., et al.: Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38, 904–909 (2006)
9. Price, A.L., et al.: Sensitive detection of chromosomal segments of distinct ancestry in admixed populations. PLoS Genet. 5(6), e1000519 (2009)
10. Pritchard, J.K., Stephens, M., Rosenberg, N.A., Donnelly, P.: Association mapping in structured populations. Am. J. Hum. Genet. 67(1), 170–181 (2000)
11. Prokopenko, D., et al.: Utilizing the Jaccard index to reveal population stratification in sequencing data: a simulation study and an application to the 1000 Genomes Project. Bioinformatics 32(9), 1366–1372 (2016)
12. Purcell, S., Chang, C.: PLINK2 (2019)
13. Schlauch, D., Fier, H., Lange, C.: Identification of genetic outliers due to sub-structure and cryptic relationships. Bioinformatics 33(13), 1972–1979 (2017)
14. Schlauch, D.: Implementation of the stego algorithm - similarity test for estimating genetic outliers (2016)
15. The 1000 Genomes Project Consortium: A global reference for human genetic variation. Nature 526, 68–74 (2015)
16. Mises, R.V., Pollaczek-Geiringer, H.: Praktische Verfahren der Gleichungsauflösung. ZAMM Zeitschrift für Angewandte Mathematik und Mechanik 9, 152–164 (1929)
17. Wang, B., Sverdlov, S., Thompson, E.: Efficient estimation of realized kinship from single nucleotide polymorphism genotypes. Genetics 205(3), 1063–1078 (2017)
18. Zhong, Y., Perera, M.A., Gamazon, E.R.: On using local ancestry to characterize the genetic architecture of human traits: genetic regulation of gene expression in multiethnic or admixed populations. Am. J. Hum. Genet. 104(6), 1097–1115 (2019)

Acknowledgment

The project described was supported by the Cure Alzheimer’s Fund, by Award Numbers R01MH081862 and R01MH087590 from the National Institute of Mental Health, and by Award Numbers R01HL089856 and R01HL089897 from the National Heart, Lung, and Blood Institute.

Author information

Authors and Affiliations

Department of Biostatistics, T.H. Chan School of Public Health, Harvard University, Boston, MA, 02115, USA
Georg Hahn, Sharon Marie Lutz & Christoph Lange
Department of Medicine, Channing Laboratory, Brigham and Women’s Hospital, Boston, MA, 02115, USA
Julian Hecker
Department of Neurology, Massachusetts General Hospital and Harvard Medical School, Boston, MA, 02115, USA
Dmitry Prokopenko

Corresponding author

Correspondence to Georg Hahn.

Editor information

Editors and Affiliations

University of Connecticut, Storrs, CT, USA
Ion Măndoiu
Virginia Tech, Blacksburg, VA, USA
T. M. Murali
Florida International University, Miami, FL, USA
Giri Narasimhan
University of Connecticut, Storrs, CT, USA
Sanguthevar Rajasekaran
Georgia State University, Atlanta, GA, USA
Pavel Skums
Georgia State University, Atlanta, GA, USA
Alexander Zelikovsky

Appendices

A Details on the Implementation

This appendix provides two implementation details on the fully sparse matrix algebra used to compute the covariance and Jaccard matrices. Default implementations were used for the GRM [17] and the s-matrix [14]. Throughout this section, $X \in \mathbb {R}^{m \times n}$ denotes the matrix of (genomic) data, with one column of length m for each of the n individuals.

A.1 Covariance Matrix

We first look at computing the covariance matrix in dense algebra. Let the column means of X be given as the vector $v \in \mathbb {R}^n$, and denote by $Y \in \mathbb {R}^{m \times n}$ the matrix consisting of the columns of X with their respective means subtracted. Then

$$\text {cov}(X) = \frac{1}{m-1} Y^\top Y.$$

In the sparse case, it is not practical to compute $\text {cov}(X)$ as above. This is because centering X by subtracting the column means results in a dense matrix, which easily exceeds the available memory. The computation is therefore split up so that no dense $m \times n$ matrix is ever created. Letting v be the column means as above, and $w \in \mathbb {R}^n$ be the column sums,

$$\text {cov}(X) = \frac{1}{m-1} \left( X^\top X - w v^\top - v w^\top + m v v^\top \right) .$$
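
As a short check of this identity, one can write the centered matrix as $Y = X - \mathbf {1}_m v^\top$, where $\mathbf {1}_m \in \mathbb {R}^m$ is the all-ones vector (a symbol introduced here only for this derivation), so that $w = X^\top \mathbf {1}_m = m v$. Expanding the product gives

$$Y^\top Y = \left( X - \mathbf {1}_m v^\top \right) ^\top \left( X - \mathbf {1}_m v^\top \right) = X^\top X - X^\top \mathbf {1}_m v^\top - v \mathbf {1}_m^\top X + v \mathbf {1}_m^\top \mathbf {1}_m v^\top = X^\top X - w v^\top - v w^\top + m v v^\top .$$

Since $w = m v$, the three correction terms together equal $-m v v^\top$, which may be a convenient form for a quick numerical sanity check.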

We observe that computing $X^\top X$ involves only the sparse input matrix and one sparse matrix multiplication (which can be done efficiently). The other three terms are outer products of vectors resulting in dense $n \times n$ matrices, which is the (necessary) size of the output covariance matrix.
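
The following is a minimal sketch of this computation in plain Python, not the package's C++ implementation. It assumes a hypothetical sparse format in which each column is a dictionary mapping row indices to non-zero values; the function name `sparse_covariance` and this data layout are illustrative choices only.

```python
from typing import Dict, List

# Hypothetical sparse format: one dict per column (individual), mapping
# row index -> non-zero value. Zero entries are simply absent.
SparseColumn = Dict[int, float]


def sparse_covariance(cols: List[SparseColumn], m: int) -> List[List[float]]:
    """Covariance of the n columns of an m x n matrix without densifying it."""
    n = len(cols)
    w = [sum(c.values()) for c in cols]  # column sums
    v = [wi / m for wi in w]             # column means
    cov = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            a, b = cols[i], cols[j]
            if len(a) > len(b):          # iterate over the shorter column
                a, b = b, a
            # (X^T X)_{ij}: only rows that are non-zero in both columns count
            xtx = sum(val * b[k] for k, val in a.items() if k in b)
            # Outer-product corrections: -w v^T - v w^T + m v v^T
            c = (xtx - w[i] * v[j] - v[i] * w[j] + m * v[i] * v[j]) / (m - 1)
            cov[i][j] = cov[j][i] = c
    return cov
```

For example, `sparse_covariance([{0: 1.0, 2: 1.0}, {1: 2.0, 2: 1.0}], 4)` treats the input as a 4 × 2 matrix with columns (1, 0, 1, 0) and (0, 2, 1, 0). Only the stored non-zero entries are visited when forming the product term.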

A.2 Jaccard Similarity Matrix

Denoting the entry in row k of the ith column of X as $X_{ki}$, each entry (i, j) of the Jaccard matrix is given as

$$\text {jac}(X)_{ij} = \frac{\left| \left\{ k: X_{ki} \wedge X_{kj} \right\} \right| }{\left| \left\{ k: X_{ki} \vee X_{kj} \right\} \right| }.$$

For this we assume that X is binary.

Naïvely, one iterates over all entries of the Jaccard matrix and computes each of them as given above, using binary AND as well as binary OR operations. This turned out to be slow in our experiments. The following approach computes the Jaccard matrix faster in practice, even though the asymptotic runtime is unchanged.

Recall that $w \in \mathbb {R}^n$ denotes the column sums of X. Using sparse matrix multiplication, we compute $Y = X^\top X \in \mathbb {R}^{n \times n}$, which is a dense matrix. Let $Z \in \mathbb {R}^{n \times n}$ be the matrix obtained by adding w to all rows and all columns of $-Y$, that is, $Z_{ij} = w_i + w_j - Y_{ij}$. Observe that $\text {jac}(X) = Y/Z$, where the matrix division is performed entry-wise: for binary X, $Y_{ij}$ counts the positions set in both columns and $Z_{ij}$ counts the positions set in at least one of them. Since we only need one sparse matrix multiplication to compute Y (which can be done efficiently), this approach is computationally very fast. The few other operations on the matrices Y and Z are efficient since both matrices are already of the same size as the dense Jaccard output matrix.
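
The shortcut described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the package's C++ code: the binary matrix is given in a hypothetical row-wise sparse format (for each row, the list of columns carrying a 1), the function name `jaccard_matrix` is made up, and the convention of returning 0 for two all-zero columns is an assumption that the text does not address.

```python
from typing import List


def jaccard_matrix(rows: List[List[int]], n: int) -> List[List[float]]:
    """Jaccard matrix of a binary m x n matrix given row-wise in sparse form.

    rows[k] lists the distinct column indices (individuals) with a 1 in row k.
    """
    w = [0] * n                        # column sums of X
    y = [[0] * n for _ in range(n)]    # Y = X^T X, accumulated row by row
    for carriers in rows:
        for i in carriers:
            w[i] += 1
            for j in carriers:
                y[i][j] += 1           # rows set in both column i and column j
    jac = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            z = w[i] + w[j] - y[i][j]  # Z_ij: rows set in at least one column
            # Assumption: similarity 0 when both columns are entirely zero
            jac[i][j] = y[i][j] / z if z > 0 else 0.0
    return jac
```

For instance, `jaccard_matrix([[0, 1], [0], [1, 2], [0, 1, 2]], 3)` describes a 4 × 3 binary matrix. Zero entries are never touched while accumulating Y, which is the point of the sparse product.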

Table 1. Theoretical runtimes for computing the four similarity matrices. Runtimes differ between dense and sparse implementations. The parameters are: dimensions $m \in \mathbb {N}$ and $n \in \mathbb {N}$ of the input data $X \in \mathbb {R}^{m \times n}$, matrix sparsity parameter $s \in [0,1]$.

B Theoretical Runtimes

Table 1 shows the theoretical runtimes for both dense and sparse implementations. As can be seen, the theoretical runtimes for the dense computations are equal, but the runtimes for the sparse implementations differ. A detailed overview of the computations being carried out and their runtimes is given below. All runtimes are stated in terms of the dimensions $m \in \mathbb {N}$ and $n \in \mathbb {N}$ of the input data $X \in \mathbb {R}^{m \times n}$, and the matrix sparsity parameter $s \in [0,1]$ (the proportion of non-zero matrix entries).

Computing the covariance matrix involves calculating the column means of X and subtracting them from the matrix X (O(mn) for the dense case and $O(n^2)$ for the sparse case, which instead involves a correction with outer products, see Sect. A.1). Multiplying $Y^\top Y$ takes $O(mn^2)$ in dense and $O(smn^2)$ in sparse algebra.

Computing the Jaccard matrix involves calculating $Y=X^\top X$ ($O(mn^2)$ in dense and $O(smn^2)$ in sparse algebra) and adding the column sums of X (computed in O(mn) in dense and O(smn) in sparse algebra) to all rows and columns ($O(n^2)$ in both dense and sparse algebra).

Computing the weighted Jaccard matrix (or s-matrix) involves calculating the row sums of the input matrix which are used as weights (O(mn) in dense and O(smn) in sparse algebra), componentwise multiplication of all columns with the weights (O(mn) in dense and O(smn) in sparse algebra), and one matrix-matrix multiplication ($O(mn^2)$ in dense and $O(smn^2)$ in sparse algebra).

Computing the GRM matrix involves the calculation of population frequencies across rows (O(mn) in dense and O(smn) in sparse algebra), one matrix-matrix multiplication ($O(mn^2)$ in dense and $O(smn^2)$ in sparse algebra), as well as multiplying the input matrix with the population frequencies (O(mn) in dense and O(smn) in sparse algebra). Additionally, one outer vector product is required ($O(n^2)$ in both dense and sparse algebra).
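
As a rough worked example with hypothetical values that are not taken from the paper, suppose $m = 10^6$, $n = 2500$, and $s = 0.01$. Ignoring constant factors, the dominant matrix-matrix product then costs

$$mn^2 = 6.25 \times 10^{12} \ \text {(dense)}, \qquad smn^2 = 6.25 \times 10^{10} \ \text {(sparse)},$$

whereas the terms that are linear in m cost $mn = 2.5 \times 10^{9}$ (dense) or $smn = 2.5 \times 10^{7}$ (sparse), and the $O(n^2)$ corrections amount to only $n^2 = 6.25 \times 10^{6}$. Under these assumptions the sparse product is cheaper by a factor of $1/s = 100$, and the matrix-matrix product dominates the total cost in all four cases.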

Table 2. Computation of the global eigenvector (global EV) and complete stratification scan of chromosome 1 of the 1,000 Genomes Project as a function of the window size. Runtimes in seconds for locStra and PLINK2.

C Comparison of locStra to PLINK2

Table 2 shows a runtime comparison between locStra and PLINK2. As test data we use chromosome 1 of the 1,000 Genomes Project. Before running either locStra or PLINK2, we prepare the raw data from the 1,000 Genomes Project using the same parameters as given in Sect. 3. However, locStra and PLINK2 require different input files, and thus we write out the processed data once in the .bed format for PLINK2, and once as an .Rdata file containing a sparse matrix of class Matrix in R.

A local stratification scan can be performed in PLINK2. With the option --pca 1, the first eigenvector can be computed for an input .bed file. To perform a sliding window scan, we use the parameters --from and --to, each followed by an rs number, to specify a local window. PLINK2 writes all eigenvectors to an output file with the extension .eigenvec, from which we read the vectors and compute correlations in R.
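
The PLINK2 workflow above can be sketched as follows. The paper performs the correlation step in R; Python is used here purely for illustration. The helper names, the file prefixes, and the use of `--bfile` and `--out` to name input and output are assumptions, as is taking the absolute value of the correlation (motivated by the fact that an eigenvector is only determined up to its sign). The sketch only assembles the argument list and does not run PLINK2.

```python
import math
from typing import List, Sequence


def plink2_window_command(bfile_prefix: str, first_rs: str, last_rs: str,
                          out_prefix: str) -> List[str]:
    """Argument list for the first eigenvector of one local window.

    All file prefixes are hypothetical; nothing is executed here.
    """
    return ["plink2", "--bfile", bfile_prefix,
            "--from", first_rs, "--to", last_rs,
            "--pca", "1", "--out", out_prefix]


def abs_correlation(x: Sequence[float], y: Sequence[float]) -> float:
    """Absolute Pearson correlation of two equally long, non-constant vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    # Assumption: the sign is dropped because eigenvectors are sign-ambiguous
    return abs(sxy) / math.sqrt(sxx * syy)
```

In such a setup, one command per window would be generated, and the local eigenvector read from each resulting .eigenvec file would be compared with the global eigenvector via `abs_correlation`.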

In the locStra package, the local stratification scan is performed using the function fullscan as described in Sect. 2.2.

The results in Table 2 show that even for the computation of the single global eigenvector, locStra is considerably faster than PLINK2. All runtimes for both locStra and PLINK2 include the time to read the .Rdata or .bed input files. For a full scan, PLINK2 needs to (inefficiently) write the eigenvector data for each local window into a file. In comparison to PLINK2, locStra is around one order of magnitude faster, and the speed-up is more pronounced for larger window sizes.

About this paper

Cite this paper

Hahn, G., Lutz, S.M., Hecker, J., Prokopenko, D., Lange, C. (2020). Local and Global Stratification Analysis in Whole Genome Sequencing (WGS) Studies Using LocStra. In: Măndoiu, I., Murali, T., Narasimhan, G., Rajasekaran, S., Skums, P., Zelikovsky, A. (eds) Computational Advances in Bio and Medical Sciences. ICCABS 2019. Lecture Notes in Computer Science, vol 12029. Springer, Cham. https://doi.org/10.1007/978-3-030-46165-2_13

DOI: https://doi.org/10.1007/978-3-030-46165-2_13
Published: 29 April 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-46164-5
Online ISBN: 978-3-030-46165-2
eBook Packages: Computer Science, Computer Science (R0)
