1 Introduction

One of the underlying assumptions of the goal of achieving “predictive science” is the availability of a suitable metric to quantify the accuracy of the predictions. Although mean-square-error-type quantities are often well suited for this purpose when the predicted quantities are simple functions or vectors, quantifying errors in a more general graph, such as a phase diagram, in a single figure-of-merit represents a considerably more complex problem. Given this, it is perhaps not surprising that the area of research focusing on the construction of thermodynamic models [often referred to as CALculation of PHAse Diagrams (CALPHAD)][5,10,11,12,15] is currently lacking a theoretically justified and widely adopted figure-of-merit to quantify the discrepancies between two possible phase diagrams obtained via different routes. This paper intends to fill this gap by building upon earlier proposals.[17]

There are multiple challenges associated with this task, as illustrated in Fig. 1. Phase boundaries cannot be considered single-valued functions: for instance, in a binary phase diagram, a phase boundary might cross a given vertical line (see label 1 on the figure) multiple times, thus precluding the use of simple “mean-square error along one axis” criteria. The same can occur for horizontal lines (see label 2 on the figure). More generally, no matter which “axes” one uses, the phase boundaries will generally be defined on different domains (label 3). The use of perpendicular distances between two curves is also ambiguous: To which of the two curves should the distance be perpendicular (label 4)? What does “perpendicular” mean when the axes have different units so that their relative scaling is arbitrary (label 5)? For the same reason, the use of the Hausdorff metric[2] is also affected by the fact that different axes may have different units. Boundary distance-based metrics are also unable to handle the fact that, sometimes, phases are entirely absent from one phase diagram but present in another (label 6). How should a figure-of-merit quantify this situation?

Fig. 1

Problems associated with metrics based on distance between phase boundaries. Solid and dashed curves distinguish two different phase diagrams while thicker curves mark potential problems: multiple crossings of a vertical (1) or horizontal (2) axis, different domains of definition (3) of the functions defining the boundaries, ambiguities in defining the orthogonal distance between two curves (4), ambiguity of the notion of orthogonality when axes have arbitrary relative scales due to differing units (5) and potential absence of a phase (6)

Given these issues, we instead propose to quantify the differences between two phase diagrams via differences in the predicted phase fractions of the corresponding phases. Phase fractions have been put forward as powerful and fundamental descriptors of phase equilibria.[16] Phase fractions are scalar, dimensionless, defined everywhere, and simply take the value 0 when a phase is not stable. These desirable properties solve all of the aforementioned problems. We prove that a criterion based on phase fractions satisfies all the properties of the mathematical notion of a norm or of a metric, in addition to other properties directly relevant to phase stability problems. We illustrate the use of such a criterion in a study of the convergence of assessments performed on the same alloy system by different authors over time.

2 Definition and Motivation

Let \(f\left( c\right) =\left( f_{1}\left( c\right) ,\ldots ,f_{p}\left( c\right) \right) ^{T}\) denote a p-dimensional vector of the phase fractions of all possible phases in the system under conditions \(c=\left( x,T\right) \), where T is temperature and x is a vector of overall compositions (omitting one composition, to avoid redundancy). One could also include pressure in the vector c, if desired. The knowledge of this vector-valued function over some region R fully defines the phase diagram over that region.

We propose to quantify the difference between two phase diagrams \( f^{1},f^{2} \) in a region R of interest via the following figure-of-merit:

$$\begin{aligned} \left\| f^{1}-f^{2}\right\| _{R}\equiv \frac{\int _{c\in R}\sum _{\alpha =1}^{p}\left| f_{\alpha }^{1}\left( c\right) -f_{\alpha }^{2}\left( c\right) \right| dc}{\int _{c\in R}dc}, \end{aligned}$$
(1)

where the integrals are multivariate, i.e. dc is a (hyper)volume element. This definition exhibits a number of desirable properties. It has the natural interpretation of the expected total absolute error in phase fraction for a condition chosen at random in region R.

Since this definition is mathematically equivalent to a so-called weighted \( L_{1}\) norm[18] defined on a vector-valued field, it automatically inherits all the natural properties of a norm: it is zero if and only if the two phase diagrams \(f^{1}\) and \(f^{2}\) agree (see Footnote 1), it is symmetric (\( \left\| f^{1}-f^{2}\right\| _{R}=\left\| f^{2}-f^{1}\right\| _{R} \)) and it obeys the triangle inequality \(\left\| f^{1}-f^{2}\right\| _{R}\le \left\| f^{1}-f^{3}\right\| _{R}+\left\| f^{3}-f^{2}\right\| _{R}\), for any three phase diagrams \(f^{1},f^{2},f^{3}\). (A norm also satisfies \(\left\| af\right\| =\left| a\right| \left\| f\right\| \), but this property is not useful in this context, since the phase fractions must sum up to one.) Since a norm is a special case of a metric, our proposal also defines a proper metric.
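As a complement to the properties listed above, the following short derivation (our own illustration, using only the facts that phase fractions are non-negative and sum to one at every condition c) indicates the range of values that Definition (1) can take:

$$\begin{aligned} 0\le \sum _{\alpha =1}^{p}\left| f_{\alpha }^{1}\left( c\right) -f_{\alpha }^{2}\left( c\right) \right| \le \sum _{\alpha =1}^{p}f_{\alpha }^{1}\left( c\right) +\sum _{\alpha =1}^{p}f_{\alpha }^{2}\left( c\right) =2\quad \Longrightarrow \quad 0\le \left\| f^{1}-f^{2}\right\| _{R}\le 2. \end{aligned}$$

The upper bound is reached only if the two phase diagrams share no stable phase at (almost) any condition in R, so a value of, say, 0.02 can be read against a worst case of 2.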

Definition (1) provides a dimensionless quantity, which facilitates its interpretation. Another desirable property is that it naturally handles the case when one phase is simply missing in one phase diagram. This possibility is not uncommon when comparing experimental and ab initio phase diagrams. Also, in novel systems that are not yet well characterized, there may not be perfect knowledge of which phases are stable or metastable and it is useful to be able to quantify this type of discrepancy. It is not clear how missing phases could be handled with a figure-of-merit based on distances between phase boundaries. This definition also applies a less severe penalty in situations in which only one phase is in disagreement while the others agree. This makes sense, since this situation typically arises when two phases have very similar free energies and can easily be mispredicted without affecting the reliability of the phase diagram elsewhere.

Our approach also makes it simple to account for the fact that Gibbs triangles (or, more generally, Gibbs simplexes) are the most natural way to represent multiple composition axes. The fact that the axes are not orthogonal can be ignored in definition (1) because the same Jacobian terms appear in both the numerator and denominator. One can thus simply integrate over all but one composition using orthogonal axes, whether these axes are truly orthogonal or not in the phase diagram’s representation.

Definition (1) is also computationally attractive since it can easily be calculated via Monte Carlo sampling. Let \(c_{i}\), \(i=1,\ldots ,n\) denote independent random draws from a uniform distribution over the region R. Then, by the law of large numbers,

$$\begin{aligned} \frac{1}{n}\sum _{i=1}^{n}\sum _{\alpha =1}^{p}\left| f_{\alpha }^{1}\left( c_{i}\right) -f_{\alpha }^{2}\left( c_{i}\right) \right| \end{aligned}$$
(2)

converges in probability to \(\left\| f^{1}-f^{2}\right\| _{R}\) as \( n\longrightarrow \infty \). In multicomponent systems, some care must be taken to ensure that the compositions are indeed drawn uniformly. For instance, in a ternary system, picking 3 compositions uniformly on \(\left[ 0,1 \right] \) and scaling them so that their sum is one does not generate a uniform distribution. However, picking the 3 compositions from an exponential distribution and then normalizing them to sum to 1 does generate a uniform distribution (see Ref 19). This Monte Carlo algorithm scales very favorably with the number of components in the system, unlike a standard grid integration.
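The sampling scheme described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the phasenorm implementation: the function names, the convention that a phase diagram is a callable returning a dictionary of phase fractions, and the toy one-transition diagrams are all assumptions made for the example. In practice the callables would wrap a thermodynamic equilibrium solver.

```python
import random


def sample_simplex(n_components, rng):
    """Draw one composition uniformly from the simplex (fractions sum to 1)."""
    # Normalizing independent exponential draws yields a uniform distribution
    # over the simplex; normalizing uniform draws would not.
    draws = [rng.expovariate(1.0) for _ in range(n_components)]
    total = sum(draws)
    return [d / total for d in draws]


def phase_diagram_distance(f1, f2, n_components, t_min, t_max, n_samples, seed=0):
    """Monte Carlo estimate of Definition (1) using the sample average of Eq 2.

    f1 and f2 are hypothetical callables (x, T) -> {phase name: phase fraction}.
    A phase missing from a dictionary is treated as having zero phase fraction.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = sample_simplex(n_components, rng)
        temperature = rng.uniform(t_min, t_max)
        a = f1(x, temperature)
        b = f2(x, temperature)
        phases = set(a) | set(b)
        total += sum(abs(a.get(p, 0.0) - b.get(p, 0.0)) for p in phases)
    return total / n_samples


def toy_diagram(t_melt):
    """Hypothetical diagram with one transition: SOLID below t_melt, LIQUID above."""
    def fractions(x, temperature):
        return {"SOLID": 1.0} if temperature < t_melt else {"LIQUID": 1.0}
    return fractions


# Two toy diagrams that disagree only between 1000 K and 1100 K.
estimate = phase_diagram_distance(toy_diagram(1000.0), toy_diagram(1100.0),
                                  n_components=2, t_min=300.0, t_max=2000.0,
                                  n_samples=20000)
# Exact value for this toy case: 2 mismatched fractions over 100 K out of 1700 K.
exact = 2.0 * 100.0 / 1700.0
print(estimate, exact)
```

Because the two toy diagrams differ only in a 100 K window, the estimate should approach 2 × 100/1700 ≈ 0.118 as the number of samples grows, which gives a simple sanity check on the sampler.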

The definition does exhibit some limitations. Most importantly, it is dependent on the choice of the region of interest R. This could be mitigated by agreeing on standardized regions. For instance, one can use a region including the entire composition range and temperatures from absolute zero (or room temperature) to the highest phase transformation temperature of all systems considered.

As an alternative, one could also use

$$\begin{aligned} \left\| f^{1}-f^{2}\right\| _{\rho ,R}\equiv \left( \frac{\int _{c\in R}\sum _{\alpha =1}^{p}\left| f_{\alpha }^{1}\left( c\right) -f_{\alpha }^{2}\left( c\right) \right| ^{\rho }dc}{\int _{c\in R}dc}\right) ^{1/\rho } \end{aligned}$$
(3)

for some \(\rho \ge 1\) to obtain analogues of any of the familiar \(L_{\rho }\) norms.[18] For single phase regions, this substitution has no effect since \(0^{\rho }=0\) and \(1^{\rho }=1\). For multiphase equilibria, the choice of \(\rho \) does matter, but we suggest keeping the \( \rho =1\) choice due to its ease of interpretation.
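To make the role of \(\rho \) concrete, the following minimal sketch evaluates a sample-average analogue of Eq 3 on a handful of sampled conditions. The function name and the dictionary representation of phase fractions are assumptions made for this illustration, and the numbers are invented toy values rather than data from any assessment.

```python
def l_rho_distance(samples1, samples2, rho=1.0):
    """Sample-average analogue of Eq 3.

    samples1 and samples2 are equal-length lists of {phase: fraction}
    dictionaries evaluated at the same sampled conditions (an assumed layout).
    """
    acc = 0.0
    for a, b in zip(samples1, samples2):
        phases = set(a) | set(b)
        acc += sum(abs(a.get(p, 0.0) - b.get(p, 0.0)) ** rho for p in phases)
    return (acc / len(samples1)) ** (1.0 / rho)


# Single-phase toy case: the diagrams disagree at one of four sampled conditions.
single1 = [{"A": 1.0}, {"A": 1.0}, {"A": 1.0}, {"A": 1.0}]
single2 = [{"A": 1.0}, {"A": 1.0}, {"A": 1.0}, {"B": 1.0}]

# Two-phase toy case: same phases everywhere, slightly different fractions.
two1 = [{"A": 0.5, "B": 0.5}]
two2 = [{"A": 0.7, "B": 0.3}]

for rho in (1.0, 2.0):
    print(rho, l_rho_distance(single1, single2, rho), l_rho_distance(two1, two2, rho))
```

In the single-phase case the integrand is the same for every \(\rho \) (each mismatch contributes 2), and only the outer \(1/\rho \) power changes the reported value; in the two-phase case the integrand itself depends on \(\rho \).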

Another possible alternative to Definition (1), which is related to a previously proposed figure-of-merit,[17] is to only compare the presence or absence of a phase, independently of its phase fraction and define the metric:

$$\begin{aligned} d_{R}\left( f^{1},f^{2}\right) \equiv \frac{\int _{c\in R}\sum _{\alpha =1}^{p}\left| {\mathbf {1}}_{+}\left( f_{\alpha }^{1}\left( c\right) \right) -{\mathbf {1}}_{+}\left( f_{\alpha }^{2}\left( c\right) \right) \right| dc}{\int _{c\in R}dc}, \end{aligned}$$
(4)

where the function \({\mathbf {1}}_{+}\left( f\right) \) is equal to 1 if \(f>0\) and 0 otherwise. This definition has the interpretation of the expected number of mismatched phases for a point c picked at random in R. It can also be roughly interpreted as the fraction of the (hyper)volume of the phase diagram where disagreements between the nature of the stable phases exist, weighted by the number of mismatches [One could eliminate this weighting by replacing \(\sum _{\alpha =1}^{p}\) by \(\max _{\alpha \in \left\{ 1,\ldots ,p\right\} }\) in (4)]. The notation \(d_{R}\left( f^{1},f^{2}\right) \) (instead of \(\left\| f^{1}-f^{2}\right\| _{R}\)) is used because Eq 4 does not define a norm when viewed as a function of the \(f_{\alpha }^{i}\left( c\right) \), since it cannot be written as a function of \(f_{\alpha }^{1}\left( c\right) -f_{\alpha }^{2}\left( c\right) \). However, it is a norm when viewed as a function of the \({\mathbf {1}}_{+}\left( f_{\alpha }^{i}\left( c\right) \right) \). This definition therefore still defines a proper metric \(d_{R}\left( f^{1},f^{2}\right) \), since the knowledge of \({\mathbf {1}}_{+}\left( f_{\alpha }\left( c\right) \right) \) also fully characterizes the phase diagram. Definition (4) may be easier to implement if one only has access to the picture of the phase diagram (instead of its underlying thermodynamic model), because it avoids implementing the lever rule to recover the phase fractions (which becomes difficult beyond binary systems). It is also in the same spirit as recent efforts to cast inverse problems in thermodynamics modeling as constraint satisfaction problems.[8]
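The integrand of Eq 4, together with its max-based variant, can be illustrated with the short sketch below. The function names and the dictionary-based representation of the phase fractions at one condition are assumptions made for the example; averaging over sampled conditions then proceeds exactly as in Eq 2.

```python
def mismatch_count(a, b, use_max=False):
    """Integrand of Eq 4 at one condition c.

    a and b are {phase: fraction} dictionaries (an assumed layout); a phase
    counts as present when its fraction is strictly positive.
    With use_max=True the sum over phases is replaced by a max, so the result
    is 1 if any phase is mismatched and 0 otherwise.
    """
    phases = set(a) | set(b)
    flags = [int((a.get(p, 0.0) > 0.0) != (b.get(p, 0.0) > 0.0)) for p in phases]
    if not flags:
        return 0
    return max(flags) if use_max else sum(flags)


def presence_distance(samples1, samples2, use_max=False):
    """Sample average of the Eq 4 integrand over conditions drawn uniformly in R."""
    counts = [mismatch_count(a, b, use_max) for a, b in zip(samples1, samples2)]
    return sum(counts) / len(counts)


# One diagram predicts single-phase A, the other a two-phase A + B equilibrium:
print(mismatch_count({"A": 1.0}, {"A": 0.6, "B": 0.4}))
# Entirely different stable phases give two mismatches (or one with use_max):
print(mismatch_count({"A": 1.0}, {"B": 1.0}), mismatch_count({"A": 1.0}, {"B": 1.0}, use_max=True))
```

Only the presence or absence of each phase enters, so this version can be evaluated from a digitized picture of the phase diagram without recovering phase fractions.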

The presence of a miscibility gap leads to a subtle complication in Definition (1) [or (4)]. In this case, multiple phases exhibiting the same crystal structure but at different compositions could be in a multiphase equilibrium. We handle this by considering each phase (even with the same crystal structure) as distinct but, when comparing the resulting phase fractions across two phase diagrams, we always re-order the phase fractions (among phases sharing the same crystal structure) so as to minimize (1). If the number of phases with the same crystal structure is different in the two phase diagrams, we then add the appropriate number of phases with a zero phase fraction. In the limit where the differences between the two phase diagrams are small, this simple rule yields differences in phase fraction between corresponding phases.
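The re-ordering rule just described can be sketched as a brute-force search over pairings. This is an illustrative sketch under our own assumptions (the function name and the list-based input are hypothetical), and the exhaustive search over permutations is chosen for clarity; it is practical only because the number of coexisting phases sharing one crystal structure is small.

```python
from itertools import permutations


def matched_abs_difference(fracs1, fracs2):
    """Minimal sum of |f1 - f2| over all pairings of same-structure phases.

    fracs1 and fracs2 list the phase fractions of the phases sharing one
    crystal structure in each diagram, at one condition c (assumed layout).
    """
    # Pad the shorter list with zero-fraction phases so both have equal length.
    m = max(len(fracs1), len(fracs2))
    a = list(fracs1) + [0.0] * (m - len(fracs1))
    b = list(fracs2) + [0.0] * (m - len(fracs2))
    # Try every re-ordering of the second list and keep the smallest mismatch.
    return min(sum(abs(x - y) for x, y in zip(a, perm)) for perm in permutations(b))


# Same miscibility gap, but the two phases are listed in opposite order:
print(matched_abs_difference([0.3, 0.7], [0.7, 0.3]))
# One diagram has a single phase, the other a gap with fractions 0.6 and 0.4:
print(matched_abs_difference([1.0], [0.6, 0.4]))
```

In the second call the single phase is paired with the 0.6 phase and a zero-fraction phase is added to pair with the 0.4 phase, giving 0.4 + 0.4 = 0.8 rather than the 1.2 obtained from the other pairing.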

The proposed metrics have been implemented as the phasenorm command in the ATAT package[23,24,25] and this implementation relies on OpenCalphad[21,22] to compute phase equilibria. It has the following syntax:

  • phasenorm -tdb1= tdbfile1 -tdb2= tdbfile2 -e= element1,element2,... -n= nb of samples -T0= min temperature -T1= max temperature [-01]

where

  • tdbfile1 and tdbfile2 are the two thermodynamic database files (in the TDB format[1]) of the assessments to be compared;

  • element1,element2,... is a comma-separated list of the elements involved in the phase diagram of interest (which allows the user to extract a subsystem from the TDB files);

  • nb of samples specifies the number of Monte Carlo sampling steps performed;

  • min temperature and max temperature define the temperature range of the region R of interest (the full composition range is assumed);

  • the optional -01 switch instructs the use of Eq 4 instead of (1).

When using this tool, it is important to ensure that the two thermodynamic database files use the same naming conventions for the phases.

3 Application Example

The figure-of-merit proposed here enables instructive quantitative studies of the accuracy of phase diagrams. One natural question, for instance, is whether the assessments of an alloy system are actually converging, that is, becoming more accurate over time as more data becomes available and more researchers study the same system.

Since one never actually knows for sure what the “true” phase diagram is, it may not be obvious whether any convergence of our knowledge of a given system is really taking place. One needs to study convergence through an internal consistency criterion that does not require that the true phase diagram be perfectly known. Given a sequence \(s_{1},s_{2},\ldots \), how can one know if it converges without first calculating the limit \(\lim _{n\longrightarrow \infty }s_{n}\)? The way out of this circular reasoning is to check if the sequence \(\left\{ s_{n}\right\} \) forms a Cauchy sequence, i.e., whether it has the property that

$$\begin{aligned} \lim _{n\longrightarrow \infty }\max _{i,j\ge n}\left\| s_{i}-s_{j}\right\| =0. \end{aligned}$$

It can be shown that any such Cauchy sequence necessarily converges[18] (under a technical condition known as completeness of the space in which the \(s_{n}\) live, which is typically satisfied for vector spaces commonly used to represent scientific data).

In our context, this amounts to checking if the distance between all phase diagrams \(\left\{ f^{i}\right\} \) published on a given system after some time t decreases significantly as time t progresses:

$$\begin{aligned} D_{t}=\max _{i,j\,\text {such}\,\text {that }\,t_{i}\ge t\,\text { and }\,t_{j}\ge t}\left\| f^{i}-f^{j}\right\| _{R}, \end{aligned}$$
(5)

where \(t_{i}\) is the publication time of phase diagram i. Of course, by construction, the quantity \(D_{t}\) cannot increase with increasing t (the maximum is taken over a shrinking set of pairs), but the rate at which it decreases is instructive and the absolute magnitude of \(D_{t}\) is indicative of the expected remaining errors at time t.
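Given a table of pairwise distances, Eq 5 reduces to a maximum over the assessments published at or after each time t. The sketch below illustrates this; the function name, the input layout and the numerical values are all hypothetical and are not taken from the Fe-Ti or Al-Cu data.

```python
def cauchy_envelope(times, dist):
    """Evaluate D_t of Eq 5 at every distinct publication time.

    times[i] is the publication time of phase diagram i and dist[i][j] is the
    distance between diagrams i and j (a symmetric matrix with zero diagonal).
    Returns a list of (t, D_t) pairs sorted by t.
    """
    envelope = []
    for t in sorted(set(times)):
        later = [i for i, ti in enumerate(times) if ti >= t]
        # Largest distance among pairs published at or after t; when a single
        # diagram remains, only the i = j term is left and D_t is zero.
        d_t = max((dist[i][j] for i in later for j in later if i < j), default=0.0)
        envelope.append((t, d_t))
    return envelope


# Invented example: three assessments and their pairwise distances.
times = [1990, 2000, 2010]
dist = [[0.00, 0.30, 0.25],
        [0.30, 0.00, 0.05],
        [0.25, 0.05, 0.00]]
envelope = cauchy_envelope(times, dist)
print(envelope)
```

In this invented example the envelope drops from 0.30 to 0.05 once the earliest assessment is excluded, which is the kind of behavior one would read as convergence.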

Fig. 2

Convergence of assessments of the Fe-Ti system. Each cross represents the distance between a phase diagram published at time t and a future (or concurrent) phase diagram. The line represents \(D_t\), the largest distance between pairs of phase diagrams published at or after time t

Fig. 3

Convergence of assessments of the Al-Cu system. Each cross represents the distance between a phase diagram published at time t and a future (or concurrent) phase diagram. The line represents \(D_t\), the largest distance between pairs of phase diagrams published at or after time t

Using the recently developed Thermodynamic DataBase DataBase (TDBDB),[26] one can easily identify popular alloy systems that have been repeatedly assessed, so that the Cauchy property can be tested. We have selected the many assessments of the Fe-Ti system[3,4,7,9,13,27] and of the Al-Cu system[6,14,20,22,28] that are available in the TDBDB and that can be parsed by OpenCalphad. We have used the Monte Carlo algorithm (Eq 2) with 2000 draws and a region bounded by 300 and 2000 K that covers the entire composition range.

Fig. 4

Identification of “clusters” of assessments of the Fe-Ti system. The graph reports all distances between a given assessment and all other assessments. Assessments numbered 1 through 6 correspond to Ref 3, 4, 7, 9, 13 and 27, respectively. Assessments 4, 5 and 6 are found to lie close to each other, within the 1% similarity threshold indicated by a dashed line. A 2% threshold (dotted line) would also include assessment 3 in the cluster

Fig. 5

Identification of “clusters” of assessments of the Al-Cu system. The graph reports all distances between a given assessment and all other assessments. Assessments numbered 1 through 6 correspond to Ref 6, 14, 20, 28 and 22, respectively. Assessments 1, 4 and 5 are found to lie close to each other, within the 1% similarity threshold indicated by a dashed line

In Fig. 2, it can be seen that the distances between assessments decrease sharply over time, indicating that a consensus regarding the Fe-Ti phase diagram is steadily emerging. In contrast, in the Al-Cu system (shown in Fig. 3), it appears that disagreements have persisted for many years, although the two latest assessments reported in 2016[20,22] do seem to show good mutual agreement. This analysis implicitly assumes that even if a recent study re-uses older assessments, its authors consider it as the current state-of-the-art, so that this data set inherits the “time stamp” of its most recent (re)use.

We can also use our metric to identify clusters of work that report mutually consistent results. Figures 4 and 5 report all pairwise distances between the assessments. For a given similarity threshold (here 1 or 2%), one can find groups of assessments that lie close to each other, within that threshold. Encouragingly, these clusters seem to primarily consist of recent publications, thus again suggesting an emerging consensus.
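One possible way to automate this grouping is sketched below. The text does not specify a clustering algorithm, so the choice made here (linking any two assessments whose distance is within the threshold and reporting the connected groups) is our own assumption, as are the function name and the invented distance values.

```python
def threshold_clusters(dist, threshold):
    """Group assessments linked by pairwise distances within the threshold.

    dist is a symmetric matrix of pairwise distances. Two assessments end up in
    the same group if they are connected by a chain of pairs whose distances do
    not exceed the threshold (a single-linkage choice, assumed for this sketch).
    """
    n = len(dist)
    seen = set()
    clusters = []
    for start in range(n):
        if start in seen:
            continue
        # Depth-first search over the "within threshold" links.
        stack = [start]
        seen.add(start)
        members = []
        while stack:
            i = stack.pop()
            members.append(i)
            for j in range(n):
                if j not in seen and dist[i][j] <= threshold:
                    seen.add(j)
                    stack.append(j)
        clusters.append(sorted(members))
    return clusters


# Invented distances between four assessments (indices 0 to 3).
dist = [[0.000, 0.200, 0.300, 0.250],
        [0.200, 0.000, 0.005, 0.015],
        [0.300, 0.005, 0.000, 0.008],
        [0.250, 0.015, 0.008, 0.000]]
print(threshold_clusters(dist, 0.01))
print(threshold_clusters(dist, 0.006))
```

A stricter reading, requiring every pair within a group to satisfy the threshold, would instead call for a clique search; the chained version above is simply the shortest to state.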

4 Conclusion

We have described a formal methodology to quantify, in a single figure-of-merit, the level of agreement between two phase diagrams. Our proposal not only satisfies the mathematical requirements of a norm or a metric, but also has a sound physical basis, is invariant to scaling of the graph axes, and is easy to compute via Monte Carlo sampling, with or without access to the thermodynamic model underlying each phase diagram. We illustrate its usefulness in a meta-analysis of a set of thermodynamic assessments in popular alloy systems, in an effort to determine whether the most current assessments have reached a consensus. Our metric may find applications in other areas as well, for instance, to report how well phase diagrams generated purely via ab initio methods agree with the corresponding experiment-based thermodynamic assessments.