# hoDCA: higher order direct-coupling analysis

## Abstract

### Background

Direct-coupling analysis (DCA) is a method for protein contact prediction from sequence information alone. Its underlying principle is parameter estimation for a Hamiltonian interaction function stemming from a maximum entropy model with one- and two-point interactions. Rapidly growing sequence databases enable the construction of large multiple sequence alignments (MSAs), so enough data now exists to include higher-order terms, such as three-body correlations.

### Results

We present an implementation of hoDCA, which extends DCA by including three-body interactions in the inverse Ising problem posed by parameter estimation. In a previous study, these three-body interactions improved contact prediction accuracy for the PSICOV benchmark dataset. Our implementation can be executed in parallel, which results in fast runtimes and makes it suitable for large-scale application.

### Conclusion

Our hoDCA software, written in the Julia language, allows improved contact prediction and leverages the power of multi-core machines in an automated fashion.

## Keywords

Contact prediction, Proteins, DCA

## Abbreviations

- APC: Average product correction
- DCA: Direct-coupling analysis
- MSA: Multiple sequence alignment

## Background

Thanks to rapidly growing sequence databases, the prediction of protein contacts from sequence information has become a promising route for computational structural biophysics [1, 2, 3, 4]. The so-called direct-coupling analysis (DCA) uses a multiple sequence alignment (MSA) to predict residue contacts in a maximum entropy approach. Its high accuracy has been shown in various studies [5, 6, 7, 8, 9, 10, 11] and has also made it suitable for protein structure prediction software [12, 13, 14].

*N* denotes the length of the sequences. \(Z={\sum \nolimits }_{\vec {\sigma }\in \mathcal {A}^{N}} P(\vec {\sigma })\) is the partition function as the sum over all sequences where each position is chosen from the alphabet \(\mathcal {A}\). After estimation of the parameters {*h*_{i}, *J*_{ij}} from empirical sequences \(\vec {\sigma }^{(b)}\), a contact prediction score for residues *i* and *j* can be obtained by taking the *l*_{2}-norm ∥*J*_{ij}∥_{2}. In a recent study [15], an improved prediction accuracy was shown by incorporating three-body interactions *V*_{ijk}(*σ*_{i}, *σ*_{j}, *σ*_{k}) into *H*, obtaining a three-body Hamiltonian.
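For orientation, a three-body Hamiltonian of this kind is commonly written as below. This is a sketch of the usual form only: the overall sign convention and the restriction of the sums to ordered index tuples are assumptions and are not taken from this text.

\[
H(\vec{\sigma}) = -\sum_{i=1}^{N} h_{i}(\sigma_{i}) \;-\; \sum_{1\le i<j\le N} J_{ij}(\sigma_{i},\sigma_{j}) \;-\; \sum_{1\le i<j<k\le N} V_{ijk}(\sigma_{i},\sigma_{j},\sigma_{k})
\]

Dropping the last sum recovers the pairwise model of standard DCA, with sequence probabilities obtained by normalizing \(\exp [-H(\vec{\sigma})]\) with the partition function.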

Here, we present an implementation of this method, which we call hoDCA.

## Implementation

hoDCA is implemented in the Julia language (0.6.2) [16]. It depends directly on a) the ArgParse [17] module for parsing command-line arguments and b) the GaussDCA [18] module for preprocessing the MSA, as well as on the implicit dependencies of those packages. A typical command-line call is `julia hoDCA.jl Example.fasta Example.csv --No_A_Map=1 --Path_Map=A_Map.csv --MaxGap=0.9 --theta=0.2 --Pseudocount=4.0 --No_Threads=2 --Ign_Last=0` with input Example.fasta and output Example.csv. The latter lists all two-body contact scores *J*_{ij} for residue pairs separated by at least one residue along the backbone. The meaning of the remaining (optional) parameters will become clear in the following.

*General notes.* For inference of the parameters {*h*_{i}, *J*_{ij}, *V*_{ijk}}, we use the mean-field approximation as described in [15] with a reduced alphabet for three-body couplings. This is accomplished by a mapping \(\mu : \{1,\dots ,q\} \rightarrow \{1,\dots ,q_{\text {red}}\}\), with *q* being the size of the full alphabet of the MSA and *q*_{red}≤*q*. On the one hand, this accounts for the so-called curse of dimensionality [19], which occurs if the MSA is too small to observe all possible *q*^{3} combinations for each *V*_{ijk}. On the other hand, it significantly reduces memory usage and allows for a faster computation of contact prediction scores. The mapping *μ* can be specified by Path_Map, which is a csv file in which every row represents one mapping. No_A_Map selects which row to use. Because the calculation of three-body couplings remains the bottleneck, it can be run on parallel threads by setting the No_Threads flag.
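The effect of the alphabet reduction can be illustrated with a small Python sketch. The grouping of amino acids used here is a hypothetical example chosen for illustration and is not a mapping shipped with hoDCA; the function names are likewise assumptions.

```python
# Hypothetical grouping of amino acids by rough physico-chemical class.
# These classes are illustrative assumptions, not a mapping taken from hoDCA.
FULL_ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # q = 21 (20 amino acids + gap)
GROUPS = ["AVLIMC", "FWYH", "STNQ", "KR", "DE", "GP", "-"]


def build_mapping(full_alphabet, groups):
    """Return mu as a dict: full-alphabet index -> reduced-alphabet index (0-based)."""
    # Every symbol of the full alphabet must appear in exactly one group.
    if sorted("".join(groups)) != sorted(full_alphabet):
        raise ValueError("groups must partition the full alphabet")
    mu = {}
    for reduced_index, group in enumerate(groups):
        for symbol in group:
            mu[full_alphabet.index(symbol)] = reduced_index
    return mu


def reduce_sequence(encoded_sequence, mu):
    """Map an integer-encoded sequence from the full to the reduced alphabet."""
    return [mu[state] for state in encoded_sequence]


mu = build_mapping(FULL_ALPHABET, GROUPS)
q, q_red = len(FULL_ALPHABET), len(GROUPS)
# Number of state combinations that must be observed per site triple.
print(q ** 3, q_red ** 3)
```

With this example grouping, the number of state combinations per triple of sites drops from 9261 to 343, which is the reduction in both sampling demand and memory that the text refers to.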

In traditional DCA, the last amino acid *q* usually represents the gap character and is not taken into account for score computation within the *l*_{2}-norm. In hoDCA, each two-body coupling state *l*≤*q* contains contributions from {*n*≤*q*|*μ*(*n*)=*μ*(*l*)} due to the reduced alphabet. We therefore take gap contributions into account by default, which can be changed by the Ign_Last flag.

*MSA preprocessing.* The MSA is read in by the GaussDCA module, ignoring sequences with a higher fraction of gaps than MaxGap, and subsequently converted into an array of integers. However, in contrast to GaussDCA, we check for the actual number of amino acid types contained in the given MSA. We then reduce the alphabet from *q*=21 to the number of characters (amino acid types) present. Afterwards, the reweighting for every sequence \(\vec {\sigma }^{(b)}\) is obtained by the GaussDCA module via \(w_{b}=1/|\{a \in \{1,...,B \}: \text {difference}\left (\vec {\sigma }^{(a)},\vec {\sigma }^{(b)}\right) \leq \texttt {theta} \}|\), where the difference is computed by the percentage Hamming distance [6]. The aim of reweighting is to reduce potential phylogenetic bias.
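The reweighting formula can be sketched in a few lines of Python. This is a direct, quadratic-time transcription of the expression for \(w_{b}\) given above, not the optimized routine of the GaussDCA module; the function names are assumptions made for this example.

```python
def hamming_fraction(seq_a, seq_b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences in an MSA must have equal length")
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return mismatches / len(seq_a)


def sequence_weights(msa, theta):
    """w_b = 1 / (number of sequences within distance theta of sequence b)."""
    weights = []
    for seq_b in msa:
        # Each sequence counts itself (distance 0), so the count is at least 1.
        neighbours = sum(
            1 for seq_a in msa if hamming_fraction(seq_a, seq_b) <= theta
        )
        weights.append(1.0 / neighbours)
    return weights


msa = ["AAAA", "AAAB", "CCCC"]
weights = sequence_weights(msa, theta=0.25)
b_eff = sum(weights)  # effective number of sequences
print(weights, b_eff)
```

In this toy alignment the first two sequences differ at one of four positions and therefore share their weight, while the third sequence keeps a weight of one.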

*Frequency computation.* Empirical frequency counts for the full alphabet are computed according to [6]

with *δ* being the Kronecker delta, *B* the number of sequences in the MSA, \(B_{eff}={\sum \nolimits }_{b=1}^{B}w_{b}\) and *λ*_{c}=Pseudocount·*B*_{eff}. The Pseudocount parameter shifts empirical data towards a uniform distribution. This is necessary to ensure invertibility of the empirical covariance matrix in the mean-field approach.
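The following Python sketch shows how reweighted, pseudocount-regularized frequencies of this kind can be computed. It assumes the common regularization form \(f_{i}(l) = \left (\lambda _{c}/q + \sum _{b} w_{b}\,\delta (\sigma _{i}^{(b)},l)\right )/(\lambda _{c}+B_{eff})\) and its two-point analogue with \(\lambda _{c}/q^{2}\); this form and the function names are assumptions, and the hoDCA source may differ in detail.

```python
def one_point_frequencies(column, weights, q, pseudocount):
    """Reweighted one-point frequencies for one MSA column (states 0..q-1)."""
    b_eff = sum(weights)
    lam = pseudocount * b_eff  # lambda_c = Pseudocount * B_eff
    counts = [lam / q] * q     # uniform pseudocount contribution
    for state, weight in zip(column, weights):
        counts[state] += weight
    return [c / (lam + b_eff) for c in counts]


def two_point_frequencies(column_i, column_j, weights, q, pseudocount):
    """Reweighted two-point frequencies for a pair of MSA columns."""
    b_eff = sum(weights)
    lam = pseudocount * b_eff
    counts = [[lam / q ** 2] * q for _ in range(q)]
    for state_i, state_j, weight in zip(column_i, column_j, weights):
        counts[state_i][state_j] += weight
    return [[c / (lam + b_eff) for c in row] for row in counts]


weights = [0.5, 0.5, 1.0]
f_i = one_point_frequencies([0, 0, 0], weights, q=2, pseudocount=1.0)
f_ij = two_point_frequencies([0, 0, 0], [1, 1, 1], weights, q=2, pseudocount=1.0)
print(f_i, f_ij)
```

Even though only state 0 is observed in the example column, the pseudocount assigns a non-zero frequency to the unobserved state, which is what keeps the covariance matrix invertible.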

The computation of three-point frequencies is computationally expensive and is executed on No_Threads threads. For this, we parallelized their calculation over the sequence positions, meaning that the *i*-th process computes \(f_{ijk}^{\text {red}}\) for all *k*≥*j*≥*i* and fixed *i*. Apart from the parallelization scheme, three-point frequencies are preprocessed in the same manner as one- and two-point frequencies.
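The partitioning of the work over the first site index can be sketched as follows. The example uses Python's `concurrent.futures` purely to illustrate the scheme; it omits the pseudocount, and the function names are assumptions. Because of Python's global interpreter lock, it demonstrates the decomposition rather than the speedup achieved by the Julia implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def triple_frequencies_for_site(i, msa, weights):
    """Weighted three-point frequencies for fixed i and all k >= j >= i."""
    n_sites = len(msa[0])
    b_eff = sum(weights)
    freqs = {}
    for j in range(i, n_sites):
        for k in range(j, n_sites):
            for seq, weight in zip(msa, weights):
                key = (i, j, k, seq[i], seq[j], seq[k])
                freqs[key] = freqs.get(key, 0.0) + weight / b_eff
    return freqs


def all_triple_frequencies(msa, weights, n_threads=2):
    """Distribute the first site index i over a pool of worker threads."""
    n_sites = len(msa[0])
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        parts = list(
            pool.map(
                lambda i: triple_frequencies_for_site(i, msa, weights),
                range(n_sites),
            )
        )
    merged = {}
    for part in parts:
        merged.update(part)  # keys of different i never collide
    return merged


msa = [[0, 1, 0], [0, 1, 1]]  # two integer-encoded sequences of length 3
freqs = all_triple_frequencies(msa, weights=[1.0, 1.0], n_threads=2)
print(len({key[:3] for key in freqs}))  # number of distinct site triples
```

Note that the work per task is uneven: a task with small *i* covers far more (*j*, *k*) pairs than one with large *i*, which is one reason a measured speedup can stay below the thread count.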

*Contact prediction scores.* Contact prediction scores follow directly from the two-body couplings, which are obtained within the mean-field approximation. Here, *g*_{ij}(*l*,*m*) is the inverse of the empirical two-point covariance matrix *e*_{ij}(*l*,*m*)=*f*_{ij}(*l*,*m*)−*f*_{i}(*l*)*f*_{j}(*m*), \(g_{ijk}^{\text {red}}(\alpha,\beta,\gamma)\) is given by a relation to the three-point covariance matrix over the reduced alphabet, and *g*_{ij}(*α*,*β*) is the inverse of the two-point covariance matrix over the reduced alphabet (see [15] for more details).

For the calculation of scores, the {*J*_{ij}} are transformed into the so-called zero-sum gauge, satisfying \({\sum \nolimits }_{l=1}^{q} \hat {J}_{ij}(l,.)={\sum \nolimits }_{m=1}^{q} \hat {J}_{ij}(.,m)=0\), where "." stands for an arbitrary state. Scores are then computed via the *l*_{2} norm [21], with \(\left \| \hat {\textbf {J}}_{ij} \right \|_{2} = \sqrt {{\sum \nolimits }_{l,m=1}^{q} \hat {J}_{ij}(l,m)^{2}}\).
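The scoring step can be sketched in Python as below. The zero-sum gauge transformation and the norm follow the definitions in the text; the average product correction (APC) is included in its commonly used form, which is an assumption here since the explicit expression is not reproduced in this section. All function names are chosen for this example.

```python
import math


def zero_sum_gauge(coupling):
    """Shift a q x q coupling matrix so that every row and column sums to zero."""
    q = len(coupling)
    row_mean = [sum(row) / q for row in coupling]
    col_mean = [sum(coupling[l][m] for l in range(q)) / q for m in range(q)]
    total_mean = sum(row_mean) / q
    return [
        [coupling[l][m] - row_mean[l] - col_mean[m] + total_mean for m in range(q)]
        for l in range(q)
    ]


def l2_norm(matrix):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(value * value for row in matrix for value in row))


def average_product_correction(scores):
    """Common APC form (assumed): S_ij - mean_i * mean_j / overall mean."""
    n = len(scores)
    site_mean = [
        sum(scores[i][j] for j in range(n) if j != i) / (n - 1) for i in range(n)
    ]
    overall_mean = sum(site_mean) / n
    return [
        [
            scores[i][j] - site_mean[i] * site_mean[j] / overall_mean if i != j else 0.0
            for j in range(n)
        ]
        for i in range(n)
    ]


gauged = zero_sum_gauge([[1.0, 0.0], [0.0, 1.0]])
raw_scores = [[0.0, 1.0, 2.0], [1.0, 0.0, 3.0], [2.0, 3.0, 0.0]]
corrected = average_product_correction(raw_scores)
print(gauged, l2_norm(gauged), corrected[0][1])
```

The gauge transformation removes the part of each coupling matrix that could equally be absorbed into the single-site fields, so that the norm reflects only genuine pairwise dependence.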

## Discussion

*C* is the total number of contacts and *p*_{i} is the number of true positives among the first *i* predictions. Figure 1 shows the predicted contact map of the Protein Data Bank entry 1fx2A as an exemplary case. For this particular protein, classical two-body DCA has an *A*-value of *A*≈0.5, while hoDCA shows a superior *A*≈0.72.

## Results

*N*=242, *B*=18,170) and parameters as in Eq. (2). The overall speedup is about five-fold when executed on *n*≥12 threads in comparison to a single CPU core. A fit of Amdahl’s law *T*=*T*_{0}·(1−*p*·(1−1/*n*)), with *T*_{0} being the single-threaded runtime and *n*=No_Threads, reveals the proportion of parallelized routines as *p*≈0.86. The serial runtime proportion of ≈0.14 is mainly due to the computation of two-body terms. Also note that we did not modify the standard Julia parameters, meaning, e.g., that the matrix inverse is computed in parallel by default.
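The fitted relation can be evaluated directly. The short Python sketch below uses the Amdahl form stated above with the reported *p*≈0.86; the function names are assumptions made for this example.

```python
def relative_runtime(p, n):
    """T / T_0 according to Amdahl's law: 1 - p * (1 - 1/n)."""
    return 1.0 - p * (1.0 - 1.0 / n)


def speedup(p, n):
    """Speedup T_0 / T for a parallel fraction p on n threads."""
    return 1.0 / relative_runtime(p, n)


p = 0.86
for n in (1, 2, 4, 12):
    print(n, round(speedup(p, n), 2))

# Upper bound for n -> infinity: only the serial fraction remains.
print("limit", round(1.0 / (1.0 - p), 2))
```

For *n*=12 this gives a speedup of about 4.7, in line with the roughly five-fold value reported, while the serial fraction of about 0.14 caps the achievable speedup at about 7.1 regardless of the number of threads.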

## Conclusions

Higher-order interactions have been shown to have a strong influence on contact prediction in certain proteins [15, 22, 23]. Here, we implemented hoDCA, an extension of DCA that incorporates three-body couplings into the Hamiltonian. The accessible command-line user interface and the significant speedup under parallel execution make hoDCA suitable for contact prediction in a variety of proteins, using biochemically inspired alphabet reduction schemes. We hope that this software release makes the method easily accessible to other researchers.

## Availability and requirements

- **Project name:** hoDCA
- **Project home page:** http://www.cbs.tu-darmstadt.de/hoDCA/
- **Operating systems:** Linux, Windows, macOS
- **Programming language:** Julia (0.6.2)
- **Other requirements:** Julia packages ArgParse, GaussDCA
- **License:** GNU General Public License v3, http://www.gnu.org/licenses/gpl-3.0.html
- **Any restrictions to use by non-academics:** Any commercial use is subject to a contractual agreement between involved parties.

## Notes

### Acknowledgements

N/A

### Funding

The authors thank the LOEWE project iNAPO funded by the Ministry of Higher Education, Research and the Arts (HMWK) of the Hessen state. The authors acknowledge financial support by the Deutsche Forschungsgemeinschaft via the Graduate School (GRK 1657) “Radiation Biology”, project 1A. The funding bodies had no role in the design of the study or the collection/analysis/interpretation of data or the writing of the manuscript.

### Availability of data and materials

The source code of our package and its documentation are available from the url http://www.cbs.tu-darmstadt.de/hoDCA.

### Authors’ contributions

KH conceived the study, MS wrote the software and documentation, analyzed data and prepared packaging; both authors wrote the paper. Both authors read and approved the final manuscript.

### Ethics approval and consent to participate

Not applicable.

### Consent for publication

Not applicable.

### Competing interests

The authors declare that they have no competing interests.

### Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

