Abstract
We are interested in the analysis of local and global population stratification in whole genome sequencing (WGS) studies. We present a new R package (locStra) that utilizes the covariance matrix, the genomic relationship matrix, and the unweighted/weighted genetic Jaccard similarity matrix in order to assess population substructure. The package allows one to use a tailored sliding window approach, for instance with user-defined window sizes and metrics, in order to compare local and global similarity matrices. A technique to select the window size is proposed. Population stratification analysis with locStra is efficient due to its C++ implementation, which fully exploits sparse matrix algebra. The runtime for the genome-wide computation of all local similarity matrices typically does not exceed one hour for realistic study sizes. This makes an unprecedented investigation of local stratification across the entire genome possible. We apply our package to the 1000 Genomes Project.
References
1. Bates, D., Eddelbuettel, D.: Fast and elegant numerical linear algebra using the RcppEigen package. J. Stat. Softw. 52(5), 1–24 (2013)
2. Chang, C.C., Chow, C.C., Tellier, L.C., Vattikuti, S., Purcell, S.M., Lee, J.J.: Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience 4 (2015)
3. Devlin, B., Roeder, K.: Genomic control for association studies. Biometrics 55(4), 997–1004 (1999)
4. Laird, N.M., Lange, C.: The Fundamentals of Modern Statistical Genetics. SBH. Springer, New York (2011). https://doi.org/10.1007/978-1-4419-7338-2
5. Lee, S., Epstein, M.P., Duncan, R., Lin, X.: Sparse principal component analysis for identifying ancestry-informative markers in genome-wide association studies. Genet. Epidemiol. 36(4), 293–302 (2012)
6. Martin, E.R., et al.: Properties of global and local ancestry adjustments in genetic association tests in admixed populations. Genet. Epidemiol. 42(2), 214–229 (2018)
7. Patterson, N., Price, A.L., Reich, D.: Population structure and eigenanalysis. PLoS Genet. 2(12), e190 (2006)
8. Price, A.L., et al.: Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38, 904–909 (2006)
9. Price, A.L., et al.: Sensitive detection of chromosomal segments of distinct ancestry in admixed populations. PLoS Genet. 5(6), e1000519 (2009)
10. Pritchard, J.K., Stephens, M., Rosenberg, N.A., Donnelly, P.: Association mapping in structured populations. Am. J. Hum. Genet. 67(1), 170–181 (2000)
11. Prokopenko, D., et al.: Utilizing the Jaccard index to reveal population stratification in sequencing data: a simulation study and an application to the 1000 Genomes Project. Bioinformatics 32(9), 1366–1372 (2016)
12. Purcell, S., Chang, C.: PLINK2 (2019)
13. Schlauch, D., Fier, H., Lange, C.: Identification of genetic outliers due to sub-structure and cryptic relationships. Bioinformatics 33(13), 1972–1979 (2017)
14. Schlauch, D.: Implementation of the stego algorithm - similarity test for estimating genetic outliers (2016)
15. The 1000 Genomes Project Consortium: A global reference for human genetic variation. Nature 526, 68–74 (2015)
16. von Mises, R., Pollaczek-Geiringer, H.: Praktische Verfahren der Gleichungsauflösung. ZAMM Zeitschrift für Angewandte Mathematik und Mechanik 9, 152–164 (1929)
17. Wang, B., Sverdlov, S., Thompson, E.: Efficient estimation of realized kinship from single nucleotide polymorphism genotypes. Genetics 205(3), 1063–1078 (2017)
18. Zhong, Y., Perera, M.A., Gamazon, E.R.: On using local ancestry to characterize the genetic architecture of human traits: genetic regulation of gene expression in multiethnic or admixed populations. Am. J. Hum. Genet. 104(6), 1097–1115 (2019)
Acknowledgment
The project described was supported by the Cure Alzheimer’s Fund, by Award Numbers R01MH081862 and R01MH087590 from the National Institute of Mental Health, and by Award Numbers R01HL089856 and R01HL089897 from the National Heart, Lung and Blood Institute.
Appendices
A Details on the Implementation
The appendix provides two implementation details on the fully sparse matrix algebra used in the computations of the covariance and Jaccard matrices. Default implementations were used for the GRM matrix [17] and the s-matrix [14]. We assume \(X \in \mathbb {R}^{m \times n}\) for the matrix containing (genomic) data of length m in each of the n columns (one column per individual) throughout the section.
A.1 Covariance Matrix
We first look at computing the covariance matrix in dense algebra. Let the column means of X be given as vector \(v \in \mathbb {R}^n\) and denote as \(Y \in \mathbb {R}^{m \times n}\) the matrix obtained from X by subtracting from each column its mean. Then
\[ \text {cov}(X) = \frac{1}{m-1} Y^\top Y. \]
In the sparse case, it is not practical to compute \(\text {cov}(X)\) as above. This is because centering X by subtracting the column means results in a dense matrix which easily exceeds available memory. Thus the computation is split up so that no dense \(m \times n\) matrix is ever created. Letting v be the column means as above, and \(w \in \mathbb {R}^n\) be the column sums,
\[ \text {cov}(X) = \frac{1}{m-1} \left( X^\top X - w v^\top - v w^\top + m \, v v^\top \right). \]
We observe that computing \(X^\top X\) involves only the sparse input matrix and one sparse matrix multiplication (which can be done efficiently). The other three terms are vector-vector products resulting in dense \(n \times n\) matrices, the (necessary) size of the output covariance matrix.
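The decomposition can be checked on a toy example. The following Python sketch is an illustration only and not the package's C++ code: it assumes a hypothetical row-wise storage format (one dictionary of non-zero entries per row), and the function names `sparse_gram`, `sparse_cov` and `dense_cov` are made up for this example. It forms \(X^\top X\) from the non-zero entries alone and then applies the correction \(- w v^\top - v w^\top + m \, v v^\top\), which follows from expanding \(Y^\top Y\) with each column of X centred by its mean.

```python
def sparse_gram(rows, n):
    """Return X^T X (dense n x n) for a sparse matrix given row by row.

    rows: list of m dicts, each mapping column index -> non-zero value
    (a storage format assumed for this sketch only).
    """
    gram = [[0.0] * n for _ in range(n)]
    for row in rows:
        items = list(row.items())
        # Each row contributes the outer product of its non-zero entries.
        for i, xi in items:
            for j, xj in items:
                gram[i][j] += xi * xj
    return gram


def sparse_cov(rows, n):
    """Covariance of the n columns without ever centring X explicitly."""
    m = len(rows)
    w = [0.0] * n                      # column sums
    for row in rows:
        for j, x in row.items():
            w[j] += x
    v = [wj / m for wj in w]           # column means
    gram = sparse_gram(rows, n)
    # (X^T X - w v^T - v w^T + m v v^T) / (m - 1), entry by entry
    return [[(gram[i][j] - w[i] * v[j] - v[i] * w[j] + m * v[i] * v[j]) / (m - 1)
             for j in range(n)] for i in range(n)]


def dense_cov(rows, n):
    """Reference: centre every column explicitly, then form Y^T Y / (m - 1)."""
    m = len(rows)
    x = [[row.get(j, 0.0) for j in range(n)] for row in rows]
    v = [sum(x[k][j] for k in range(m)) / m for j in range(n)]
    y = [[x[k][j] - v[j] for j in range(n)] for k in range(m)]
    return [[sum(y[k][i] * y[k][j] for k in range(m)) / (m - 1)
             for j in range(n)] for i in range(n)]


# Toy data: m = 5 rows, n = 3 columns (one column per individual).
rows = [{1: 1.0}, {0: 2.0}, {}, {0: 1.0, 1: 1.0}, {2: 3.0}]
```

Calling `sparse_cov(rows, 3)` and `dense_cov(rows, 3)` should agree up to floating-point rounding. In practice the product \(X^\top X\) would be delegated to a sparse linear algebra library rather than written as explicit loops.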
A.2 Jaccard Similarity Matrix
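Denoting the ith column of X as \(X_i\), each entry (i, j) of the Jaccard matrix is given as
\[ \text {jac}(X)_{ij} = \frac{\Vert X_i \wedge X_j \Vert _1}{\Vert X_i \vee X_j \Vert _1}, \]
where \(\wedge \) and \(\vee \) denote the entry-wise binary and and or operations, and \(\Vert \cdot \Vert _1\) counts the non-zero entries of a binary vector.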
For this we assume that X is binary.
Naïvely, we iterate over all the entries of the Jaccard matrix and compute each of them as given above, using binary AND as well as binary OR operations on the two columns involved. This turned out to be slow in our experiments. The following is a faster way to compute the Jaccard matrix in practice, even though the asymptotic runtime is unchanged.
Recall that \(w \in \mathbb {R}^n\) denotes the column sums of X. Using sparse matrix multiplication, we compute \(Y = X^\top X \in \mathbb {R}^{n \times n}\), which is a dense matrix whose entry \(Y_{ij}\) counts the positions at which both \(X_i\) and \(X_j\) are one. Let \(Z \in \mathbb {R}^{n \times n}\) be the matrix obtained by adding w to all rows and all columns of \(-Y\), that is, \(Z_{ij} = w_i + w_j - Y_{ij}\), which counts the positions at which at least one of \(X_i\) and \(X_j\) is one. Observe that \(\text {jac}(X) = Y/Z\), where the matrix division is performed entry-wise. Since we only need one sparse matrix multiplication to compute Y (which can be done efficiently), this approach is computationally very fast. The few other operations on the matrices Y and Z are efficient since both matrices are already of the same size as the dense Jaccard output matrix.
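As a minimal illustration of this construction, the Python sketch below builds Y by accumulating, row by row, the pairs of columns that share a one, and then derives Z from the column sums. It is not the package's implementation: the row-wise set representation and the name `jaccard_matrix` are assumptions made for this example, and the convention of returning 0 for two all-zero columns (the 0/0 case) is likewise an assumption.

```python
def jaccard_matrix(rows, n):
    """Jaccard similarity between the n columns of a sparse binary matrix.

    rows: list of m sets; each set holds the column indices whose entry
    is 1 in that row (a storage format assumed for this sketch only).
    """
    w = [0] * n                            # column sums
    y = [[0] * n for _ in range(n)]        # Y = X^T X, intersection counts
    for carriers in rows:
        for i in carriers:
            w[i] += 1
            for j in carriers:
                y[i][j] += 1
    jac = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            z = w[i] + w[j] - y[i][j]      # Z_ij, the union count
            # Assumption: two all-zero columns get similarity 0 (0/0 case).
            jac[i][j] = y[i][j] / z if z else 0.0
    return jac


# Toy data: m = 5 rows, n = 3 columns (one column per individual).
rows = [{0, 1}, {0}, {1, 2}, {0, 1}, set()]
jac = jaccard_matrix(rows, 3)
```

Only the accumulation of Y touches the input data. Everything afterwards operates on \(n \times n\) arrays, which is the size of the output in any case.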
B Theoretical Runtimes
Table 1 shows the theoretical runtimes for both dense and sparse implementations. As can be seen, the theoretical runtimes for the dense computations are equal, but the runtimes for the sparse implementations differ. A detailed overview of the computations being carried out and their runtimes is given below. All efforts are stated in terms of the dimensions \(m \in \mathbb {N}\) and \(n \in \mathbb {N}\) of the input data \(X \in \mathbb {R}^{m \times n}\), and the matrix sparsity parameter \(s \in [0,1]\) (the proportion of non-zero matrix entries).
Computing the covariance matrix involves calculating the column means of X and subtracting them from the matrix X (O(mn) for the dense case and \(O(n^2)\) for the sparse case involving a correction with outer products, see Sect. A.1). Multiplying \(Y^\top Y\) takes \(O(mn^2)\) in dense and \(O(smn^2)\) in sparse algebra.
Computing the Jaccard matrix involves calculating \(Y=X^\top X\) (\(O(mn^2)\) in dense and \(O(smn^2)\) in sparse algebra) and adding the column sums of X (computed in O(mn) in dense and O(smn) in sparse algebra) to all rows and columns (\(O(n^2)\) in both dense and sparse algebra).
Computing the weighted Jaccard matrix (or s-matrix) involves calculating the row sums of the input matrix which are used as weights (O(mn) in dense and O(smn) in sparse algebra), componentwise multiplication of all columns with the weights (O(mn) in dense and O(smn) in sparse algebra), and one matrix-matrix multiplication (\(O(mn^2)\) in dense and \(O(smn^2)\) in sparse algebra).
Computing the GRM matrix involves the calculation of population frequencies across rows (O(mn) in dense and O(smn) in sparse algebra), one matrix-matrix multiplication (\(O(mn^2)\) in dense and \(O(smn^2)\) in sparse algebra), as well as multiplying the input matrix with the population frequencies (O(mn) in dense and O(smn) in sparse algebra). Additionally, one outer vector product is required (\(O(n^2)\) in both dense and sparse algebra).
C Comparison of locStra to PLINK2
Table 2 shows a runtime comparison between locStra and PLINK2. As test data we use chromosome 1 of the 1000 Genomes Project. Before running either locStra or PLINK2, we prepare the raw data from the 1000 Genomes Project using the same parameters as given in Sect. 3. However, locStra and PLINK2 require different input files, and thus we write out the processed data once in the .bed format for PLINK2, and once as an .Rdata file containing a sparse matrix of class Matrix in R.
A local stratification scan can be performed in PLINK2 as follows. With the option --pca 1, the first eigenvector can be computed for an input .bed file. In order to do a sliding window scan, we use the parameters --from and --to followed by the rs numbers to specify a local window. All eigenvectors are written to an output file with extension .eigenvec by PLINK2, from which we read the vectors and compute correlations in R.
In the locStra package, the local stratification scan is performed using the function fullscan as described in Sect. 2.2.
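To make the idea of such a scan concrete, the following Python sketch compares the first eigenvector of a global similarity matrix with the first eigenvector of each local window by means of their correlation. It is a sketch under stated assumptions and not the fullscan function: the names `jaccard`, `first_eigenvector`, `correlation` and `scan` are hypothetical, the windows are taken to be disjoint, the Jaccard matrix is used as the similarity measure, and the first eigenvector is approximated with a plain power iteration.

```python
import math


def jaccard(cols):
    """Jaccard matrix for columns given as sets of row indices of 1-entries."""
    n = len(cols)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            union = len(cols[i] | cols[j])
            # Assumption: two empty columns get similarity 0.
            out[i][j] = len(cols[i] & cols[j]) / union if union else 0.0
    return out


def first_eigenvector(a, iters=200):
    """Power iteration for the dominant eigenvector of a symmetric matrix."""
    n = len(a)
    x = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        y = [sum(a[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(t * t for t in y))
        if norm == 0.0:
            break
        x = [t / norm for t in y]
    return x


def correlation(u, v):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    du = [a - mu for a in u]
    dv = [b - mv for b in v]
    denom = math.sqrt(sum(a * a for a in du) * sum(b * b for b in dv))
    return sum(a * b for a, b in zip(du, dv)) / denom if denom else 0.0


def scan(cols, m, window):
    """Correlate global and local first eigenvectors over disjoint windows."""
    global_vec = first_eigenvector(jaccard(cols))
    results = []
    for start in range(0, m, window):
        stop = min(start + window, m)
        # Restrict every column to the rows inside the current window.
        local = [{r for r in c if start <= r < stop} for c in cols]
        local_vec = first_eigenvector(jaccard(local))
        results.append((start, stop, correlation(global_vec, local_vec)))
    return results


# Toy data: m = 8 rows, n = 4 columns (one set of 1-rows per individual).
cols = [{0, 1, 2, 6}, {0, 1, 3, 7}, {4, 5, 6}, {4, 5, 7}]
windows = scan(cols, 8, 4)
```

Each entry of `windows` holds the window boundaries and the correlation for that window. Because the Jaccard matrix has non-negative entries and the iteration starts from a positive vector, the eigenvectors have a consistent sign here; for similarity matrices with negative entries, such as the covariance matrix, one would likely compare absolute correlations instead.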
The results in Table 2 show that even for the computation of the single global eigenvector, locStra is considerably faster than PLINK2. All runtimes for both locStra and PLINK2 include the time to read the .Rdata or .bed input files. For a full scan, PLINK2 needs to (inefficiently) write the eigenvector data for each local window into a file. In comparison to PLINK2, locStra is around one order of magnitude faster, where the speed-up is more pronounced for larger window sizes.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Hahn, G., Lutz, S.M., Hecker, J., Prokopenko, D., Lange, C. (2020). Local and Global Stratification Analysis in Whole Genome Sequencing (WGS) Studies Using LocStra. In: Măndoiu, I., Murali, T., Narasimhan, G., Rajasekaran, S., Skums, P., Zelikovsky, A. (eds) Computational Advances in Bio and Medical Sciences. ICCABS 2019. Lecture Notes in Computer Science, vol 12029. Springer, Cham. https://doi.org/10.1007/978-3-030-46165-2_13
DOI: https://doi.org/10.1007/978-3-030-46165-2_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-46164-5
Online ISBN: 978-3-030-46165-2
eBook Packages: Computer Science, Computer Science (R0)