Simultaneous Phasing of Multiple Polyploids

  • Conference paper
Computational Intelligence Methods for Bioinformatics and Biostatistics (CIBB 2018)

Abstract

We address the problem of phasing polyploids, specifically those with ploidy larger than two. We consider the scenario where the input is the genotype of samples along a genic chromosomal segment. In this setting, instead of NGS reads of the segments of a single sample, genotype data from multiple individuals is available for simultaneous phasing. For this mathematically interesting problem, with application in plant genomics, we design and test two algorithms under a parsimony model. The first is a linear-time greedy algorithm and the second is a more carefully crafted algebraic algorithm. We show that both methods work reasonably well (with average accuracy above 80%). The former is very time-efficient and the latter improves the accuracy further.

Author information

Corresponding author

Correspondence to Laxmi Parida.

6 Supplement

6.1 On the Greedy Method

The Greedy Algorithm: Concrete Examples. Refer to Fig. 2 for the example input matrix G and the trie \(\mathcal T\) that is constructed. Each node in the trie is identified by the column identifier at the top and its position counting vertically down that column. For example, \(v_{5,3}\) refers to the third node from the top along the column marked 5. It is labeled by SNP 0 (i.e., hollow circle) and the cardinality of its label set is 4.

Fig. 2. An input genotype matrix G (70% homozygous) with \(n=10\) samples, \(m=8\) SNPs and \(p=4\), and the trie \(\mathcal T\) constructed by the PiXora algorithm. The root is the leftmost rectangular node. Each hollow circle has a SNP label of 0 and each solid circle a SNP label of 1. The columns are considered (left to right) in the order 8, 7, ..., 2, 1. To avoid clutter we only give the cardinality of the list label of each node, which is also reflected in the thickness of the incoming edge incident on the vertex. The solution has 11 distinct haplotypes, one for each of the 11 leaf nodes. The gold solution (simulated data) has 11 haplotypes: 5 occur once, 2 twice, 2 thrice, 1 four times and 1 twenty-one times in G. The PiXora solution is shown above.
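As a quick consistency check on the counts quoted in the caption (plain arithmetic on the stated numbers, not an additional result), the multiplicities of the gold-solution haplotypes add up to the total number of haplotype slots \(n \cdot p\), and the number of distinct haplotypes adds up to 11:

$$\begin{aligned} 5\cdot 1 + 2\cdot 2 + 2\cdot 3 + 1\cdot 4 + 1\cdot 21 &= 40 = n \cdot p = 10 \cdot 4,\\ 5 + 2 + 2 + 1 + 1 &= 11. \end{aligned}$$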


Greedy Min Set Cover algorithm:

  1. Associate a weight with each column: the number of rows it covers (each row multiplied by the number in that row).

  2. Sort the columns by this weight.

  3. Solution: Traverse down the sorted list in descending order until all rows are covered.
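The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `covers` (column to set of covered rows) and `row_count` (row to its multiplicity, "the number in that row") are hypothetical, ties are broken by column name, and the list is traversed literally, so a column is taken even if it adds no new row.

```python
from typing import Dict, List, Set


def greedy_cover(covers: Dict[str, Set[int]], row_count: Dict[int, int]) -> List[str]:
    """Take columns in descending weight order until every row is covered."""
    # Step 1: weight of a column = sum of the multiplicities of the rows it covers.
    weight = {c: sum(row_count[r] for r in rows) for c, rows in covers.items()}
    # Step 2: sort by descending weight (ties broken by name: an assumption).
    order = sorted(covers, key=lambda c: (-weight[c], c))
    # Step 3: walk down the sorted list until no row is left uncovered.
    uncovered = set(row_count)
    chosen: List[str] = []
    for c in order:
        if not uncovered:
            break
        chosen.append(c)
        uncovered -= covers[c]
    if uncovered:
        raise ValueError(f"rows {sorted(uncovered)} cannot be covered")
    return chosen


# Toy input (hypothetical, not taken from Fig. 2).
covers = {"c1": {0, 1}, "c2": {1, 2, 3}, "c3": {3}, "c4": {0}}
row_count = {0: 1, 1: 2, 2: 1, 3: 3}
print(greedy_cover(covers, row_count))  # ['c2', 'c1']
```

Note that the weights are computed once and never updated, as in the description; the textbook greedy set-cover heuristic would instead re-weight the columns after each pick.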

A node is untouchable if and only if all its ancestral nodes are produced by homozygosity. Every other node is touchable.
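A minimal sketch of this test, assuming a hypothetical `Node` record in which each non-root trie node stores a flag saying whether it was produced by homozygosity and a pointer to its parent (neither field name comes from the paper):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    label: int                       # SNP label: 0 or 1
    homozygous: bool                 # hypothetical flag: produced by homozygosity
    parent: Optional["Node"] = None  # None when the node has no non-root ancestor


def is_untouchable(node: Node) -> bool:
    """True iff every ancestor of `node` was produced by homozygosity."""
    ancestor = node.parent
    while ancestor is not None:
        if not ancestor.homozygous:
            return False
        ancestor = ancestor.parent
    return True


a = Node(0, True)
b = Node(1, False, a)   # only ancestor is a (homozygous)
c = Node(0, True, b)    # has the non-homozygous ancestor b
print(is_untouchable(b), is_untouchable(c))  # True False
```

Under this reading a node without ancestors is vacuously untouchable; whether the root itself counts as an ancestor is left open here.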

Priority to fix the matrix, after the 1’s have been assigned:

  1. Eliminate singletons.

    (a) Mark a singleton column as 1 if you need to remove the 1 and −1 if you need to remove a 0.

    (b) Then, search for pairs of −1 marked columns and 1 marked columns where the exchange can happen, i.e., there exists at least one row in the matrix such that both are not marked X for these 2 columns. Make the exchange.

    (c) If columns are still marked, then check if they can be moved without generating new singletons.

    (d) If all fails then leave the singletons as they are.

  2. Reduce gaps between siblings.

    (a) Mark a column as 1 if you need to remove the 1 to get a balance. Mark it as −1 if you need to move the 0 to balance it. A balanced column is marked 0.

    (b) Then, search for pairs of −1 marked columns and 1 marked columns where the exchange can happen, i.e., there exists at least one row in the matrix such that both are not marked X for these 2 columns. Make the exchange.

The heuristics proposed are:

  • Heuristic I: using set cover (explained earlier).

  • Heuristic II: controlling the MAF allele (1’s) along each path of the tree, i.e., each haplotype.

  • Heuristic III (backtracking): collapsing two isomorphic sub-trees, subject to two conditions:

    1. they must have the same values, i.e., 0 or 1, and

    2. there must exist a column in the matrix where they can be aligned, i.e., collapsed.

  • Heuristic IV (trie-shaking): see Fig. 3.

Fig. 3. Trie-shake algorithm: an illustrative example. Branches can be exchanged only if their leaf labels come from the same individual (for instance 1d with 1c and 2b with 2a). In addition, the marked internal nodes must be at the same depth and carry opposite labels, i.e., one hollow and the other solid.

6.2 On the Algebraic Method

Notation and Basic Definitions. Each element of the matrix is a genotype, say X.

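Definition 1 (coded genotype X and \(x_1,x_0,x_l,x_L, X_p, \{X\}\) of X). Each genotype is equivalent to a 3-tuple (triple) \((x_1,x_0,x_l)\). As the examples below indicate, \(x_1\) and \(x_0\) count the alleles fixed to 1 and 0, \(x_L\) is the set of unresolved variables with \(x_l = |x_L|\), \(X_p = x_1 + x_0 + x_l\) is the ploidy, and \(\{X\}\) is the set of concrete genotypes that X can resolve to.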

The implementation tracks the states of the variables as v or \(\bar{v}\) where \(v \in x_L\). For brevity, we skip the details. Some concrete illustrative examples of genotypes:

$$\begin{array}{cccccc}
X & \hbox{3-tuple} & \{X\} & |X| & X_p & x_L\\ \hline
\fbox{11100} & (3,2,0) & \left\{ \fbox{11100}\right\} & 1 & 5 & \emptyset \\
\fbox{1110q} & (3,1,1) & \left\{ \fbox{11100}, \fbox{11110}\right\} & 2 & 5 & \{q\}\\
\fbox{11qr} & (2,0,2) & \left\{ \fbox{1100}, \fbox{1110}, \fbox{1111} \right\} & 3 & 4 & \{q,r\}
\end{array}$$
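The rows of this table can be reproduced with a small Python sketch. It assumes, as the table suggests, that a genotype is an unordered collection of alleles, that every character other than 0 or 1 is a distinct unresolved variable, and that each variable can independently resolve to 0 or 1. The names `Coded`, `encode` and `expand` are illustrative, not the authors' implementation.

```python
from typing import List, NamedTuple


class Coded(NamedTuple):
    x1: int  # number of alleles fixed to 1
    x0: int  # number of alleles fixed to 0
    xl: int  # number of unresolved variables, |x_L|

    @property
    def ploidy(self) -> int:
        # X_p = x_1 + x_0 + x_l
        return self.x1 + self.x0 + self.xl

    def expand(self) -> List[str]:
        """The set {X}: t of the x_l variables become 1, for t = 0..x_l."""
        return [
            "1" * (self.x1 + t) + "0" * (self.x0 + self.xl - t)
            for t in range(self.xl + 1)
        ]


def encode(genotype: str) -> Coded:
    """Count 1's, 0's and variables (any other character) in a genotype string."""
    ones = genotype.count("1")
    zeros = genotype.count("0")
    return Coded(ones, zeros, len(genotype) - ones - zeros)


for g in ("11100", "1110q", "11qr"):
    x = encode(g)
    print(g, tuple(x), x.expand(), len(x.expand()), x.ploidy)
```

Each printed line matches the corresponding table row: the 3-tuple, the set \(\{X\}\), its size \(|X|\) and the ploidy \(X_p\).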

Definition 2 (VOID, empty genotype). A coded genotype X is empty when \(\{X\} = \emptyset \). X is VOID when \(x_1 < 0\) or \(x_0 <0\) holds.

Definition 3 (X \(\le \) Y). Let X and Y be genotypes. \(X \le Y \Leftrightarrow x_1\le y_1, x_0 \le y_0, x_l \le y_l\).

Lemma 3

For a genotype \(X \equiv (x_1,x_0,x_l)\) the following hold:

  1. \(|X| = x_l + 1.\)

  2. \(X \subseteq Y \Leftrightarrow (X_p=Y_p) \hbox { AND } y_1 \le x_1 \le x_1 + x_l \le y_1 + y_l.\)

Sketch of Proof: 2. Since \(X_p=Y_p\), it is adequate to base the arguments only on the number of 1’s in X and Y. The possible number of 1’s in X is in the interval \(\left[ x_1,x_1+x_l\right] \) and similarly in Y. So if \(X\subseteq Y\), then \(\left[ x_1,x_1+x_l\right] \) is contained in \(\left[ y_1,y_1+y_l\right] \) and vice-versa, leading to the above.    \(\Box \)
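The interval argument in the proof sketch can be checked mechanically. The sketch below represents a coded genotype as a plain tuple `(x1, x0, xl)` and compares the inequality of Lemma 3 with a direct containment test on the sets of possible 1-counts, for all tuples of ploidy 1 to 4. It is a sanity check under the assumption that, for equal ploidy, a concrete genotype is determined by its number of 1's; the function names are hypothetical.

```python
def ploidy(g):
    # X_p = x_1 + x_0 + x_l
    return sum(g)


def ones_range(g):
    """Possible numbers of 1's: the integers in [x1, x1 + xl]."""
    x1, _, xl = g
    return set(range(x1, x1 + xl + 1))


def subset_by_lemma(x, y):
    # Lemma 3(2): X_p = Y_p and y1 <= x1 <= x1 + xl <= y1 + yl.
    return ploidy(x) == ploidy(y) and y[0] <= x[0] and x[0] + x[2] <= y[0] + y[2]


def subset_by_enumeration(x, y):
    # Direct check: every 1-count reachable from X is reachable from Y.
    return ploidy(x) == ploidy(y) and ones_range(x) <= ones_range(y)


def triples(p):
    """All (x1, x0, xl) with non-negative entries summing to p."""
    return [(a, b, p - a - b) for a in range(p + 1) for b in range(p + 1 - a)]


agree = all(
    subset_by_lemma(x, y) == subset_by_enumeration(x, y)
    for p in range(1, 5)
    for x in triples(p)
    for y in triples(p)
)
print(agree)  # True
```

The same representation also gives part 1 of the lemma, since `len(ones_range(g))` equals `xl + 1`.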

Definition 4 (\(\langle X \rangle \), ploidy \(\langle X \rangle _p\)). \(\left\langle X \right\rangle \) is defined to be an ordered finite list of coded genotypes \(X^1, X^2, \ldots , X^j, \ldots \) with the same ploidy k. Then k is defined to be \(\left\langle X \right\rangle _p\), the ploidy of \(\langle X \rangle \).

Algebra of Genotypes

Resolving Variables. Two randomized procedures, variable-to-constant (v2c) and variable-to-variable (v2v), are defined below, together with resVar(), a composition of these two primitive operations that acts on two coded genotypes.

figure f

Primitive Genotype Operations. When X and Y are two given coded genotypes, Z is produced based on the operations as follows:

figure g

VOID/Empty Genotypes. When a primitive operation fails, i.e., either results in an empty genotype \(Z \equiv (0,0,\emptyset )\) or at least one of \(z_0\), \(z_1\) is negative, we resolve some of the variables, either by assigning them an explicit 1 or 0 (v2c) or by assigning them to other variables (v2v). Note that if \(z_L\) is empty, then there is no variable to resolve and this failure cannot be rescued (unless the model admits possible errors in the input). However, when \(z_L\) is non-empty, the failure can possibly be rescued, and in the following operations we minimize the number of resolved variables needed to do so:

figure h
Fig. 4. The bounded plane of plausible solutions of variable resolution in the \(t_1, t_0, t_v\) space. In the \(\cap _f\) operation: a = gap\(_1\) or buff\(_1\); b = gap\(_0\) or buff\(_0\); c = min\((x_l,y_l)\). In the \(\setminus _f\) operation: a = buff\(_1\); b = buff\(_0\); c = \(x_l'\) - (gap\(_1\) + gap\(_0\)).

Lemma 4

1. If \(Z = X \cap _k Y\), then \(Z_p = k\).

2. If \(Z = X \setminus Y\), then \(Z_p = X_p - Y_p\).

Sketch of Proof: 1. Note that the union is over genotypes that each have a ploidy of k. Since the union operation maintains the ploidy, the result must hold.

2. If the operation is not a failure, then

$$\begin{array}{lll}
Z_p &= &z_1 + z_0 + z_l \\
&= &(x_1 - y_1 - y_l') + (x_0 - y_0 - y_l') + (x_l' + y_l')\\
&= &(x_1 + x_0 + x_l') - (y_1 + y_0 + y_l')\\
&= &(x_1 + x_0 + x_l' + |x_L \cap y_L|) - (y_1 + y_0 + y_l'+ |x_L \cap y_L|)\\
&= &X_p - Y_p.
\end{array}$$

   \(\Box \)
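As a small numeric illustration of the derivation above, take \(X \equiv (3,2,1)\) with \(x_L = \{q\}\) and \(Y \equiv (1,1,0)\) with \(y_L = \emptyset \). This instance is chosen here for illustration only, and it assumes, as the fourth line of the derivation suggests, that the primed quantities count the variables not shared between X and Y, so that \(x_l' = 1\), \(y_l' = 0\) and \(|x_L \cap y_L| = 0\):

$$\begin{array}{lll}
Z &\equiv &(x_1 - y_1 - y_l',\; x_0 - y_0 - y_l',\; x_l' + y_l') = (3-1-0,\; 2-1-0,\; 1+0) = (2,1,1),\\
Z_p &= &2 + 1 + 1 = 4 = 6 - 2 = X_p - Y_p.
\end{array}$$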

Primitive Operation Illustrative Examples

figure i

Any negative value of the tuple is flagged as VOID. The relaxed intersection \(X \cap _2 Y\) is carried out as follows.

figure j

6.2.1 Operations on Row \(\langle X\rangle \)

Definition 5 (\(\langle X\rangle \cap \langle Y\rangle ,\langle X\rangle \setminus \langle Y\rangle \)). If \(\left\langle X \right\rangle \) and \(\left\langle Y \right\rangle \) are two rows, then the intersection and difference operations on \(\langle X\rangle \) and \(\langle Y\rangle \) are defined as:

$$\begin{aligned} \left\langle X \right\rangle \cap \left\langle Y \right\rangle &= \left\langle X\cap _k Y\right\rangle , \hbox { where } k = \min _j \left\{ (X^j\cap Y^j)_p \right\} , \end{aligned} \tag{5}$$
$$\begin{aligned} \left\langle X \right\rangle \setminus \left\langle Y \right\rangle &= \left\langle X\setminus Y\right\rangle . \end{aligned} \tag{6}$$

Executing the Row-Row Operation. Let \(S_x\) be the sample haplotypes associated with the ith row, say \(\left\langle X \right\rangle \), and \(S_y\) be the sample haplotypes associated with the \(i'\)th row, say \(\left\langle Y \right\rangle \). Note that the set S tracks multiplicities as well, i.e., multiple haplotypes of the same sample. In other words, \(S = \{a (2), b\}\) is interpreted as two haplotypes of sample a and one haplotype of sample b.

The row-row operation on \(\langle X\rangle \) and \(\langle Y\rangle \) is defined as follows.

  • Case I \(X_p >1\), \(Y_p >1\): The intersection or overlap operation between the two rows results in the following three new rows (that replace the ith and \(i'\)th rows):

    1. \(\left\langle Z \right\rangle \leftarrow \left\langle X \right\rangle \cap \left\langle Y \right\rangle \) with \(S_z = S_x \cup S_y\) and \(\left\langle Z \right\rangle _p = k\), where k is defined in Eq. 5.

    2. \(\left\langle V \right\rangle \leftarrow \left\langle X \right\rangle \setminus \left\langle Z \right\rangle \) with \(S_v = S_x\) and \(\langle V \rangle _p = \langle X \rangle _p - k\).

    3. \(\left\langle W \right\rangle \leftarrow \left\langle Y \right\rangle \setminus \left\langle Z \right\rangle \) with \(S_w = S_y\) and \(\langle W \rangle _p = \langle Y \rangle _p - k\).

  • Case II \(X_p >1\), \(Y_p =1\): The intersection or overlap operation between the two rows results in the following new row (that replaces the ith row) and an update of \(S_y\):

    1. \(\left\langle V \right\rangle \leftarrow \left\langle X \right\rangle \setminus \left\langle Y \right\rangle \) with \(S_v = S_x\) and \(\langle V \rangle _p = \langle X \rangle _p - 1\).

    2. \(S_y \leftarrow S_y \cup S_x\).
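The ploidy bookkeeping of the two cases can be sketched as follows. The function only tracks ploidies; it does not compute the genotype intersection or difference themselves, nor the sample sets S, and it takes the overlap ploidy k of Eq. 5 as a given input. The function name and the returned dictionary layout are hypothetical, and ploidy combinations not covered by Case I or Case II are rejected rather than guessed.

```python
from typing import Dict


def row_row_ploidies(xp: int, yp: int, k: int = 0) -> Dict[str, int]:
    """Ploidies of the rows that result from a row-row operation."""
    if xp > 1 and yp > 1:
        # Case I: rows X and Y are replaced by Z (overlap), V and W.
        if not 1 <= k <= min(xp, yp):
            raise ValueError("Case I needs 1 <= k <= min(X_p, Y_p)")
        return {"Z": k, "V": xp - k, "W": yp - k}
    if xp > 1 and yp == 1:
        # Case II: row X is replaced by V; row Y keeps ploidy 1.
        return {"V": xp - 1, "Y": 1}
    raise ValueError("ploidy combination not covered by Case I or Case II")


print(row_row_ploidies(4, 4, k=2))  # {'Z': 2, 'V': 2, 'W': 2}
print(row_row_ploidies(4, 1))       # {'V': 3, 'Y': 1}
```

In Case I the ploidies of the resulting rows sum to \(X_p + Y_p - k\), and in Case II to \(X_p\), so each successful operation lowers the total ploidy held in the rows.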

Row-Row Operation FAILURE. Let X and Y be two genotypes. Then \(X \cap Y\) is successful, if and only if the following hold.

  • Case I \(X_p >1\), \(Y_p >1\): None of the following result in EMPTY/VOID: (1) \(Z = X \cap Y\) (2) \(X \setminus Z\) and (3) \(Y \setminus Z\).

  • Case II \(X_p >1\), \(Y_p =1\): \(X \setminus Y\) is not VOID.

Use “\(\cap _f\)” instead of “\(\cap \)” and “\(\setminus _f\)” instead of “\(\setminus \)” for the genotype pair when there is EMPTY or VOID result.

Empirical Lemmas. Let n be the number of samples and m the number of SNPs.

Lemma 5

Accuracy of the algorithm improves with increase in n and m.

Lemma 6

For a given fraction of heterozygous alleles and m, the value of n can be estimated where the accuracy of reconstruction saturates.

Copyright information

© 2020 Springer Nature Switzerland AG

Cite this paper

Parida, L., Utro, F. (2020). Simultaneous Phasing of Multiple Polyploids. In: Raposo, M., Ribeiro, P., Sério, S., Staiano, A., Ciaramella, A. (eds) Computational Intelligence Methods for Bioinformatics and Biostatistics. CIBB 2018. Lecture Notes in Computer Science, vol 11925. Springer, Cham. https://doi.org/10.1007/978-3-030-34585-3_5

  • DOI: https://doi.org/10.1007/978-3-030-34585-3_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-34584-6

  • Online ISBN: 978-3-030-34585-3
