Local and Global Stratification Analysis in Whole Genome Sequencing (WGS) Studies Using LocStra

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 12029)

Abstract

We are interested in the analysis of local and global population stratification in WGS studies. We present a new R package (locStra) that utilizes the covariance matrix, the genomic relationship matrix, and the unweighted/weighted genetic Jaccard similarity matrix to assess population substructure. The package provides a tailored sliding-window approach, with user-defined window sizes and metrics, to compare local and global similarity matrices; a technique to select the window size is also proposed. Population stratification analysis with locStra is efficient due to its C++ implementation, which fully exploits sparse matrix algebra. The runtime for the genome-wide computation of all local similarity matrices typically does not exceed one hour for realistic study sizes. This makes an unprecedented investigation of local stratification across the entire genome possible. We apply our package to the 1,000 Genomes Project.


References

  1. Bates, D., Eddelbuettel, D.: Fast and elegant numerical linear algebra using the RcppEigen package. J. Stat. Softw. 52(5), 1–24 (2013)

  2. Chang, C.C., Chow, C.C., Tellier, L.C., Vattikuti, S., Purcell, S.M., Lee, J.J.: Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience 4 (2015)

  3. Devlin, B., Roeder, K.: Genomic control for association studies. Biometrics 55(4), 997–1004 (1999)

  4. Laird, N.M., Lange, C.: The Fundamentals of Modern Statistical Genetics. SBH. Springer, New York (2011). https://doi.org/10.1007/978-1-4419-7338-2

  5. Lee, S., Epstein, M.P., Duncan, R., Lin, X.: Sparse principal component analysis for identifying ancestry-informative markers in genome-wide association studies. Genet. Epidemiol. 36(4), 293–302 (2012)

  6. Martin, E.R., et al.: Properties of global and local ancestry adjustments in genetic association tests in admixed populations. Genet. Epidemiol. 42(2), 214–229 (2018)

  7. Patterson, N., Price, A.L., Reich, D.: Population structure and Eigenanalysis. PLoS Genet. 2(12), e190 (2006)

  8. Price, A.L., et al.: Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38, 904–909 (2006)

  9. Price, A.L., et al.: Sensitive detection of chromosomal segments of distinct ancestry in admixed populations. PLoS Genet. 5(6), e1000519 (2009)

  10. Pritchard, J.K., Stephens, M., Rosenberg, N.A., Donnelly, P.: Association mapping in structured populations. Am. J. Hum. Genet. 67(1), 170–181 (2000)

  11. Prokopenko, D., et al.: Utilizing the Jaccard index to reveal population stratification in sequencing data: a simulation study and an application to the 1000 Genomes Project. Bioinformatics 32(9), 1366–1372 (2016)

  12. Purcell, S., Chang, C.: PLINK2 (2019)

  13. Schlauch, D., Fier, H., Lange, C.: Identification of genetic outliers due to sub-structure and cryptic relationships. Bioinformatics 33(13), 1972–1979 (2017)

  14. Schlauch, D.: Implementation of the stego algorithm - similarity test for estimating genetic outliers (2016)

  15. The 1000 Genomes Project Consortium: A global reference for human genetic variation. Nature 526, 68–74 (2015)

  16. von Mises, R., Pollaczek-Geiringer, H.: Praktische Verfahren der Gleichungsauflösung. ZAMM Zeitschrift für Angewandte Mathematik und Mechanik 9, 152–164 (1929)

  17. Wang, B., Sverdlov, S., Thompson, E.: Efficient estimation of realized kinship from single nucleotide polymorphism genotypes. Genetics 205(3), 1063–1078 (2017)

  18. Zhong, Y., Perera, M.A., Gamazon, E.R.: On using local ancestry to characterize the genetic architecture of human traits: genetic regulation of gene expression in multiethnic or admixed populations. Am. J. Hum. Genet. 104(6), 1097–1115 (2019)

Acknowledgment

The project described was supported by the Cure Alzheimer's Fund, by Award Numbers R01MH081862 and R01MH087590 from the National Institute of Mental Health, and by Award Numbers R01HL089856 and R01HL089897 from the National Heart, Lung, and Blood Institute.

Author information

Correspondence to Georg Hahn.

Appendices

A Details on the Implementation

The appendix provides two implementation details on the fully sparse matrix algebra used in the computation of the covariance and Jaccard matrices. Default implementations were used for the GRM matrix [17] and the s-matrix [14]. Throughout this section, we assume that \(X \in \mathbb {R}^{m \times n}\) is the matrix containing the (genomic) data, with one column of length m for each of the n individuals.

1.1 A.1 Covariance Matrix

We first look at computing the covariance matrix in dense algebra. Let the column means of X be given as the vector \(v \in \mathbb {R}^n\), and denote by \(Y \in \mathbb {R}^{m \times n}\) the matrix obtained by subtracting v from each row of X. Then

$$\text {cov}(X) = \frac{1}{m-1} Y^\top Y.$$

In the sparse case, it is not possible to compute \(\text {cov}(X)\) as above: centering X by subtracting the column means results in a dense matrix which easily exceeds available memory. The computation is therefore split up so as to always avoid the creation of dense \(m \times n\) matrices. Letting v be the column means as above, and \(w \in \mathbb {R}^n\) be the column sums,

$$\text {cov}(X) = \frac{1}{m-1} \left( X^\top X - w v^\top - v w^\top + m v v^\top \right) .$$

We observe that computing \(X^\top X\) involves only the sparse input matrix and one sparse matrix multiplication (which can be done efficiently). The other three terms are outer products of vectors, each resulting in a dense \(n \times n\) matrix; however, \(n \times n\) is the (unavoidable) size of the output covariance matrix in any case.
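
The split above can be verified numerically. The following sketch uses plain (dense) numpy for illustration only; the package's actual implementation performs the \(X^\top X\) step in sparse C++ algebra:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 5
X = (rng.random((m, n)) < 0.1).astype(float)  # mostly-zero 0/1 matrix, as in sparse genotype data

v = X.mean(axis=0)  # column means
w = X.sum(axis=0)   # column sums

# sparse-friendly split: only the X^T X term touches the full data matrix
cov_split = (X.T @ X - np.outer(w, v) - np.outer(v, w) + m * np.outer(v, v)) / (m - 1)

# reference: standard dense computation (columns as variables)
cov_dense = np.cov(X, rowvar=False)
print(np.allclose(cov_split, cov_dense))  # True
```

The identity follows from expanding \(Y^\top Y\) with \(Y = X - \mathbf{1} v^\top\), using \(w = X^\top \mathbf{1}\).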

1.2 A.2 Jaccard Similarity Matrix

Denoting the ith column of X as \(X_i\), each entry \((i,j)\) of the Jaccard matrix is given as

$$\text {jac}(X)_{ij} = \frac{\left| \left\{ k: X_{ik} \wedge X_{jk} \right\} \right| }{\left| \left\{ k: X_{ik} \vee X_{jk} \right\} \right| }.$$

For this we assume that X is binary.

A naïve implementation iterates over all entries of the Jaccard matrix and computes them as given above using binary AND and OR operations. This turned out to be slow in our experiments. The following approach computes the Jaccard matrix faster in practice, even though the asymptotic runtime is unchanged.

Recall that \(w \in \mathbb {R}^n\) denotes the column sums of X. Using sparse matrix multiplication, we compute \(Y = X^\top X \in \mathbb {R}^{n \times n}\), which is a dense matrix. Let \(Z \in \mathbb {R}^{n \times n}\) be the matrix obtained by adding w to all rows and all columns of \(-Y\), that is, \(Z_{ij} = w_i + w_j - Y_{ij}\). Observe that \(\text {jac}(X) = Y/Z\), where the division is performed entry-wise. Since only one sparse matrix multiplication is needed to compute Y (which can be done efficiently), this approach is computationally very fast. The few remaining operations on Y and Z are efficient since both matrices are of the same size as the dense Jaccard output matrix.
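
The correctness of the matrix formulation follows from inclusion-exclusion: for binary columns, \(Y_{ij}\) counts the intersection and \(w_i + w_j - Y_{ij}\) the union. A small numerical check (dense numpy for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 100, 4
X = (rng.random((m, n)) < 0.3).astype(int)  # binary input matrix

w = X.sum(axis=0)                # column sums
Y = X.T @ X                      # Y_ij = |{k: X_ik and X_jk}| (intersection sizes)
Z = w[:, None] + w[None, :] - Y  # Z_ij = |{k: X_ik or X_jk}| (union sizes)
jac = Y / Z                      # entry-wise division

# cross-check one entry against the set-based definition
i, j = 0, 1
inter = np.sum(X[:, i] & X[:, j])
union = np.sum(X[:, i] | X[:, j])
print(jac[i, j] == inter / union)  # True
```

On the diagonal, \(Y_{ii} = Z_{ii} = w_i\), so \(\text{jac}(X)_{ii} = 1\) whenever column i is non-zero.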

Table 1. Theoretical runtimes for computing the four similarity matrices. Runtimes differ between dense and sparse implementations. The parameters are: dimensions \(m \in \mathbb {N}\) and \(n \in \mathbb {N}\) of the input data \(X \in \mathbb {R}^{m \times n}\), matrix sparsity parameter \(s \in [0,1]\).

B Theoretical Runtimes

Table 1 shows the theoretical runtimes for both dense and sparse implementations. As can be seen, the theoretical runtimes for the dense computations are equal, whereas the runtimes for the sparse implementations differ. A detailed overview of the computations being carried out and their runtimes is given below. All runtimes are expressed in terms of the dimensions \(m \in \mathbb {N}\) and \(n \in \mathbb {N}\) of the input data \(X \in \mathbb {R}^{m \times n}\), and the matrix sparsity parameter \(s \in [0,1]\) (the proportion of non-zero matrix entries).

Computing the covariance matrix involves calculating the column means of X and subtracting them from the matrix X (O(mn) in the dense case, and \(O(n^2)\) in the sparse case, where the subtraction is replaced by a correction with outer products, see Sect. A.1). Multiplying \(Y^\top Y\) takes \(O(mn^2)\) in dense and \(O(smn^2)\) in sparse algebra.

Computing the Jaccard matrix involves calculating \(Y=X^\top X\) (\(O(mn^2)\) in dense and \(O(smn^2)\) in sparse algebra) and adding the column sums of X (computed in O(mn) in dense and O(smn) in sparse algebra) to all rows and columns (\(O(n^2)\) in both dense and sparse algebra).

Computing the weighted Jaccard matrix (or s-matrix) involves calculating the row sums of the input matrix which are used as weights (O(mn) in dense and O(smn) in sparse algebra), componentwise multiplication of all columns with the weights (O(mn) in dense and O(smn) in sparse algebra), and one matrix-matrix multiplication (\(O(mn^2)\) in dense and \(O(smn^2)\) in sparse algebra).
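
The operation pattern for the weighted Jaccard matrix (row sums, column scaling, one matrix product) can be sketched as follows. Note that the weight formula below is a placeholder chosen purely for illustration; the actual s-matrix weighting is defined in [13, 14]:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 50, 4
X = (rng.random((m, n)) < 0.3).astype(float)

# placeholder weights derived from the row sums; the actual s-matrix
# weighting follows [13, 14]
row_sums = X.sum(axis=1)                    # O(mn) dense / O(smn) sparse
weights = 1.0 / np.maximum(row_sums, 1.0)   # hypothetical choice, avoids division by zero

Xw = X * weights[:, None]  # multiply every column componentwise by the weights
S = X.T @ Xw               # the single O(mn^2) / O(smn^2) matrix-matrix product

print(S.shape)               # (4, 4)
print(np.allclose(S, S.T))   # True: S = X^T diag(weights) X is symmetric
```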

Computing the GRM matrix involves the calculation of population frequencies across rows (O(mn) in dense and O(smn) in sparse algebra), one matrix-matrix multiplication (\(O(mn^2)\) in dense and \(O(smn^2)\) in sparse algebra), as well as multiplying the input matrix with the population frequencies (O(mn) in dense and O(smn) in sparse algebra). Additionally, one outer vector product is required (\(O(n^2)\) in both dense and sparse algebra).
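
In all four cases, the dominant \(O(smn^2)\) step is a single sparse matrix-matrix product whose cost scales with the number of non-zero entries rather than with mn. A minimal sketch using scipy (assuming scipy is available; the package itself uses C++ sparse algebra via RcppEigen [1]):

```python
import numpy as np
from scipy import sparse

m, n, s = 5000, 50, 0.05
Xs = sparse.random(m, n, density=s, format="csc", random_state=42)

# the dominant step: one sparse matrix-matrix product; the n x n result is dense
Y = (Xs.T @ Xs).toarray()

# agrees with the dense computation
Xd = Xs.toarray()
print(np.allclose(Y, Xd.T @ Xd))  # True
```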

Table 2. Computation of the global eigenvector (global EV) and complete stratification scan of chromosome 1 of the 1,000 Genomes Project as a function of the window size. Runtimes in seconds for locStra and PLINK2.

C Comparison of locStra to PLINK2

Table 2 shows a runtime comparison between locStra and PLINK2. As test data, we use chromosome 1 of the 1,000 Genomes Project. Before running either locStra or PLINK2, we prepare the raw data from the 1,000 Genomes Project using the same parameters as given in Sect. 3. However, locStra and PLINK2 require different input files, and thus we write out the processed data once in the .bed format for PLINK2, and once as an .Rdata file containing a sparse matrix of class Matrix in R.

A local stratification scan can be performed in PLINK2 as follows: with the command --pca 1, the first eigenvector can be computed for an input .bed file. In order to perform a sliding window scan, we use the parameters --from and --to, followed by rs numbers, to specify a local window. PLINK2 writes all eigenvectors to an output file with extension .eigenvec, from which we read the vectors and compute correlations in R.

In the locStra package, the local stratification scan is performed using the function fullscan as described in Sect. 2.2.
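
The logic of such a scan (compute the global leading eigenvector, then correlate it with the leading eigenvector of each local window) can be sketched as below. The window and data sizes are hypothetical, the eigenvector is obtained by power iteration (cf. [16]), and the interface of the actual fullscan function differs; see Sect. 2.2:

```python
import numpy as np

def leading_ev(A, iters=500):
    # power iteration (cf. [16]) for the leading eigenvector of a symmetric matrix
    v = np.ones(A.shape[0]) / np.sqrt(A.shape[0])
    for _ in range(iters):
        v = A @ v
        v /= np.linalg.norm(v)
    return v

rng = np.random.default_rng(3)
m, n, window = 1000, 20, 100  # hypothetical sizes; real scans use genome-scale m
X = (rng.random((m, n)) < 0.2).astype(float)
Xc = X - X.mean(axis=0)

global_ev = leading_ev(Xc.T @ Xc / (m - 1))

# slide a non-overlapping window over the variants (rows) and
# correlate each local leading eigenvector with the global one
correlations = []
for start in range(0, m - window + 1, window):
    W = Xc[start:start + window]
    local_ev = leading_ev(W.T @ W / (window - 1))
    correlations.append(abs(np.corrcoef(local_ev, global_ev)[0, 1]))

print(len(correlations))  # 10 windows of size 100 across 1000 variants
```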

The results in Table 2 show that even for the computation of the single global eigenvector, locStra is considerably faster than PLINK2. All runtimes for both locStra and PLINK2 include the time to read the .Rdata or .bed input files. For a full scan, PLINK2 needs to (inefficiently) write the eigenvector data for each local window into a file. In comparison to PLINK2, locStra is around one order of magnitude faster, and the speed-up is more pronounced for larger window sizes.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Hahn, G., Lutz, S.M., Hecker, J., Prokopenko, D., Lange, C. (2020). Local and Global Stratification Analysis in Whole Genome Sequencing (WGS) Studies Using LocStra. In: Măndoiu, I., Murali, T., Narasimhan, G., Rajasekaran, S., Skums, P., Zelikovsky, A. (eds) Computational Advances in Bio and Medical Sciences. ICCABS 2019. Lecture Notes in Computer Science(), vol 12029. Springer, Cham. https://doi.org/10.1007/978-3-030-46165-2_13

  • DOI: https://doi.org/10.1007/978-3-030-46165-2_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-46164-5

  • Online ISBN: 978-3-030-46165-2

  • eBook Packages: Computer Science, Computer Science (R0)
