1 Water Chemistry Data as Compositional Data

When geochemical data are analysed with statistical methods, concentrations can be expressed in several units; a first discussion of their compositional nature is reported in Buccianti and Pawlowsky-Glahn (2005). The usual units of measurement include milligrams per liter (mg/L), parts per million by weight (ppm), parts per billion by weight (ppb), millimoles per liter (mmol/L), and milliequivalents per liter (meq/L). The ppm and mg/L units are numerically equal if the density of the water sample is 1 g/cm³, as in pure water. Samples can be converted from mg/L to ppm by dividing each component by the density of the water sample (in kg/L). A concentration in mmol/L gives the number of ions or molecules per liter of water when multiplied by 10⁻³ and by Avogadro’s number (the number of entities in a mole of material, 6.022 × 10²³). The measure mg/L is converted to mmol/L by dividing by the atomic or molecular weight. To express concentration in meq/L (so that electrical charges are considered), mmol/L is multiplied by the charge of the ion. In each case the calculation starts from the content of some chemical species referred to a given weight or volume, which is then rescaled by a constant (atomic or molecular weight, electrical charge).
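The conversions described above can be sketched in a few lines of Python. This is only an illustration: the molar masses are standard rounded values, the function names are hypothetical, and the sample concentrations are invented for the example rather than taken from the study.

```python
# Standard molar masses (g/mol, rounded) and absolute ionic charges.
# These constants and the sample below are illustrative assumptions.
MOLAR_MASS = {"Ca": 40.078, "Na": 22.990, "Cl": 35.453, "SO4": 96.06}
CHARGE = {"Ca": 2, "Na": 1, "Cl": 1, "SO4": 2}


def mg_per_l_to_mmol_per_l(species, mg_l):
    """Divide by the atomic or molecular weight."""
    return mg_l / MOLAR_MASS[species]


def mmol_per_l_to_meq_per_l(species, mmol_l):
    """Multiply by the charge of the ion."""
    return mmol_l * CHARGE[species]


def mg_per_l_to_ppm(mg_l, density_kg_per_l=1.0):
    """mg/L divided by kg/L gives mg/kg, i.e. ppm by weight."""
    return mg_l / density_kg_per_l


# Hypothetical water sample in mg/L.
sample_mg_l = {"Ca": 80.0, "Na": 23.0, "Cl": 35.5, "SO4": 48.0}

for species, mg_l in sample_mg_l.items():
    mmol = mg_per_l_to_mmol_per_l(species, mg_l)
    meq = mmol_per_l_to_meq_per_l(species, mmol)
    print(f"{species}: {mg_l:.1f} mg/L = {mmol:.3f} mmol/L = {meq:.3f} meq/L")
```

Every conversion is a multiplication or division by a constant, so ratios between parts expressed in the same unit change only by a fixed factor.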

These types of data describe parts of some whole and, even if proportions are expressed as real numbers, they cannot be interpreted, or even analysed, as unconstrained real data. It is well known that doing so can lead to paradoxes and/or misinterpretations (e.g. intervals covering negative proportions, spurious correlations), a problem already discussed more than a century ago (Pearson 1897) but mostly forgotten and neglected over the years (Chayes 1960).

Relative units are the only way to compare samples from dissimilar sites and times, as is usually required. Thus the compositional nature of the experimental data is an intrinsic property related to their origin (e.g. instrument calibration) and to the need to make comparisons in order to investigate the genesis of environmental variability. Like directional (circular) observations (Fisher 1995), compositional data lie in a constrained sample space called the simplex (Aitchison 1986):

$$ S^{D} = \left\{ {\mathbf{x}} = \left[ {x_{1} ,x_{2} , \ldots ,x_{D} } \right] \;\middle|\; x_{i} > 0,\; i = 1,2, \ldots ,D;\; \sum\limits_{i = 1}^{D} {x_{i} } = \kappa \right\} $$
(16.1)

where the D components of the vector \( {\mathbf{x}} \) are called parts (variables) of the composition. The value of κ depends on the units of measurement or on the rescaling procedure, and usual values are 1 (proportions), 100 (%), 10⁶ (ppm) or similar. Note that it is not necessary to have \( \sum\nolimits_{i = 1}^{D} {x_{i} = \kappa } \) (closed data) to obtain compositional observations. In fact, a (row) vector \( {\mathbf{x}} = \left[ {x_{1} ,x_{2} , \ldots ,x_{D} } \right] \) is a D-part composition when all its components are strictly positive real numbers and carry only relative information. This means that the information about what is occurring is mainly contained in the ratios between the parts, since the numerical value of each variable by itself is not relevant. A recent thorough analysis of the “compositional problem” can be found in Pawlowsky-Glahn and Buccianti (2011) and Pawlowsky-Glahn et al. (2015). Interesting applications to water chemistry can also be found in the literature (e.g. Engle and Rowan 2013, 2014; Engle and Blondes 2014; Buccianti and Zuo 2016; Owen et al. 2016; Buccianti et al. 2018; Shelton et al. 2018), where the potential of the family of log-ratio transformations is exploited in different ways, placing the relative nature of the values and a multivariate perspective at the centre of the analysis. The cited papers are not exhaustive but have been chosen because they successfully focus on the use of the isometric log-ratio transformation as a way to describe the dynamics of geochemical processes.
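The statement that a composition carries only relative information can be checked numerically. The sketch below, with a hypothetical `closure` helper and invented values, rescales the same vector to different constants κ and shows that the ratios between parts do not change.

```python
def closure(x, kappa=1.0):
    """Rescale a strictly positive vector so that its parts sum to kappa."""
    if any(v <= 0 for v in x):
        raise ValueError("all parts must be strictly positive")
    total = sum(x)
    return [kappa * v / total for v in x]


# Hypothetical three-part composition in arbitrary units.
raw = [40.0, 10.0, 5.0]

as_proportions = closure(raw, kappa=1.0)    # parts sum to 1
as_percent = closure(raw, kappa=100.0)      # parts sum to 100
as_ppm = closure(raw, kappa=1e6)            # parts sum to 10^6

# The ratio between the first two parts is the same in every representation.
for name, comp in [("raw", raw), ("proportions", as_proportions),
                   ("percent", as_percent), ("ppm", as_ppm)]:
    print(f"{name:12s} x1/x2 = {comp[0] / comp[1]:.6f}")
```

Because the ratios are unaffected by the choice of κ, any analysis built on log-ratios gives the same result whether or not the data are closed.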

2 Isometric Log-Ratio Transformation: Is This the Key to Deciphering the Dynamics of Geochemical Systems?

2.1 Coordinates as Balances

Water present below the land surface and running above it tells the history of the environment with which it has been in contact. Rainfall and snowmelt interact with the rocks of the Earth's surface and percolate through the soil zone, where chemical reactions with gases, minerals and organic compounds take place. Chemical reactions occur because the composition of the water is not in equilibrium with the solid phases or the gaseous component (Kleidon 2010). Thus disequilibrium drives the reactions, and solutes in the water are derived from the dissolution or leaching of the solid phases, from the dissolution of gases from the air, or from the oxidation of organic matter. Most natural systems are open and, according to Nicolis and Prigogine (1989), they are characterized by dissipative structures and by the presence of irreversible processes. Dissipative structures contain subsystems that fluctuate permanently until a fluctuation becomes so strong that it breaks the original system and generates a new condition, more complex and characterized by a higher level of order. The dynamics of systems far from equilibrium requires continuous self-organization; to maintain this condition the energy flux from the environment must be higher than that required for the initial state, and irreversible processes can be a source of order rather than chaos. Most geological systems are open and dynamic, are characterized by a great number of components, and develop in a nonlinear way far from equilibrium (Shvartsev 2009). Particularly interesting from this point of view is the water-rock system, where synergetic properties can also be found, in contrast to thermodynamic equilibrium, where elements (molecules) behave independently of one another (Shvartsev 2013).

The use of isometric log-ratio coordinates (Egozcue et al. 2003) not only allows us to manage compositional data with classical statistical tools, but could also offer a powerful way to probe the level of self-organization of a geochemical system as a whole. When coordinates are obtained by using the sequential binary partition method (Egozcue and Pawlowsky-Glahn 2005), guided by a geochemical criterion, the analysis of their frequency distribution may represent an interesting way to understand the laws governing randomness and variability. With this in mind, an improvement of the balance dendrogram (Pawlowsky-Glahn and Egozcue 2001) is presented here with the aim of investigating the behavior of aqueous systems.

The sample space of D-part compositional data, the simplex, is a subset of the real space \( {\mathbb{R}}^{D} \) and has a real Euclidean vector space structure (Billheimer et al. 2001; Pawlowsky-Glahn and Egozcue 2001; Buccianti and Magli 2011). This structure allows the representation of data in coordinates with respect to an orthonormal basis, obtained for example by the Gram-Schmidt orthonormalization process or by a Singular Value Decomposition (Egozcue et al. 2003). Since these methods often yield coordinates that are not easy to interpret, balances, a specific type of orthonormal coordinates associated with groups of parts, have been proposed (Egozcue and Pawlowsky-Glahn 2005). This method is based on a sequential binary partition of a D-part composition into non-overlapping groups and, when the procedure is geochemically guided, it leads to coordinates that are easy to interpret. Moreover, it shows how the total variance is decomposed into marginal variances, thus pointing out the relationship between the intra-group and inter-group variability of the compositional parts. For the i-th order of partition, the balance is

$$ b_{i} = \sqrt {\frac{{r_{i} \cdot s_{i} }}{{r_{i} + s_{i} }}} \,\ln \frac{{\left( {\prod\nolimits_{{x_{j} \in G_{i1} }} {x_{j} } } \right)^{{1/r_{i} }} }}{{\left( {\prod\nolimits_{{x_{l} \in G_{i2} }} {x_{l} } } \right)^{{1/s_{i} }} }} $$
(16.2)

where \( r_{i} \) and \( s_{i} \) are the numbers of parts in the numerator group (\( G_{i1} \)) and in the denominator group (\( G_{i2} \)), respectively. As we can see, the balance is defined as the natural logarithm of the ratio of the geometric means of the parts in each group, normalized by the coefficient needed to obtain basis vectors of unit length.
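A single balance as defined in Eq. (16.2) can be computed directly from the two groups of parts. The following Python sketch is a minimal illustration; the function names and the concentration values are hypothetical.

```python
import math


def geometric_mean(values):
    """Geometric mean of strictly positive values, computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def balance(numerator_parts, denominator_parts):
    """Balance between two non-overlapping groups of parts (Eq. 16.2)."""
    r = len(numerator_parts)      # number of parts in the numerator group
    s = len(denominator_parts)    # number of parts in the denominator group
    coeff = math.sqrt(r * s / (r + s))
    ratio = geometric_mean(numerator_parts) / geometric_mean(denominator_parts)
    return coeff * math.log(ratio)


# Hypothetical concentrations (any common unit): bivalent vs monovalent cations.
ca, mg, na, k = 2.0e-3, 0.8e-3, 1.1e-3, 0.05e-3
b = balance([ca, mg], [na, k])
print(f"balance (Ca, Mg) vs (Na, K) = {b:.4f}")

# Multiplying every part by the same constant leaves the balance unchanged.
b_scaled = balance([1000 * ca, 1000 * mg], [1000 * na, 1000 * k])
print(f"after rescaling            = {b_scaled:.4f}")
```

A positive value indicates that the numerator group dominates on average, a negative value that the denominator group does, and zero that the two geometric means are equal.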

2.2 Behavior of Self-organizing Systems and CoDA Philosophy

A general characteristic of self-organizing systems is robustness and resilience (Dakos et al. 2014; Dai et al. 2015). This means that they are relatively insensitive to perturbations or errors, and can show a strong capacity to restore themselves after changes (Scheffer et al. 2009, 2012). One reason for this fault-tolerance is their redundant, distributed organization, so that the non-damaged regions can usually make up for the damaged ones. Another reason for the intrinsic robustness is that, within certain limits, self-organization is facilitated by randomness, fluctuations or “noise”, while the stabilizing effect of feedback loops guarantees resilience. The presence of feedback mechanisms generates systems that can be responsible for their own maintenance, and are thus largely independent of the environment. Although in general there will still be exchange of matter and energy between a system and its surroundings, the organization is determined purely internally. Thus the system is thermodynamically open, but organizationally closed. Organizational closure turns a collection of interacting elements into an individual, coherent whole. This whole has properties that arise out of its organization and that can be described by the probability laws governing the relative behaviour of its elements (van Rooij 2013). From this point of view CoDA theory appears to capture the philosophy of this condition, and the analysis of the shape of the frequency distribution of isometric coordinates should be an adequate tool (Allegre and Lewin 1995; Seely et al. 2012; Holden and Rajaraman 2012; Buccianti and Zuo 2016).

As reported in Scheffer et al. (2012), the probability density distribution of some variables describing the state of a system can be used to estimate how the potential landscape reflects its stability properties. The shape of the probability density function indicates where the data are more aggregated and which laws govern the variability, giving us fundamental information about the genesis of randomness (Agterberg 2014). In our case it is the shape of the frequency distribution of the isometric log-ratio coordinates representing some geochemical process that will inform us about the dynamic properties of the system. In Fig. 16.1 some examples of non-equilibrium dynamics are reported (Scheffer et al. 2009). Conditions represented in (a) are far from a bifurcation point. The pothole in the potential line corresponds to an area where data tend to aggregate in the probability density function. Here resilience is large, since the basin of attraction is wide and the rate of recovery from perturbations is relatively high. If the system is stochastically forced, the resulting dynamics will be characterised by low correlation between states at subsequent time intervals. In (b) the system is closer to the transition point and resilience decreases owing to the shrinking of the attraction basin and the low rate of recovery from small perturbations. Here the slight depression could be related to bimodality, indicating the presence of alternative states. In this case the system in a stochastic environment will have a long memory for perturbations, and its dynamics will be governed by high variance and stronger correlations between subsequent states.

Fig. 16.1 Example of non-equilibrium dynamics (from Scheffer et al. 2009, modified). The pothole in the potential line of diagram a corresponds to an area where data tend to aggregate in the probability density function. The slight depression in b could be related to bimodality, indicating the presence of alternative states (Scheffer et al. 2012)

3 Improving CoDA-Dendrogram: Checking for Variability, Resilience and Stability

The chemical composition of groundwaters from the Arezzo basin aquifer (Tuscany, central Italy) was analysed, as an application example, to obtain information about the dynamics of the aqueous geochemical system. The Arezzo Basin (Fig. 16.2), formed since the Upper Pliocene, is a structural depression bordered to the North and to the East by the Pratomagno and Chianti belts, respectively, and to the South and to the East by two tectonic lineaments (Val d’Arbia-Val Marecchia transversal and Chitignano normal faults). Along these tectonic discontinuities CO₂-rich manifestations either seep out or are exploited by private companies down to a depth of 1000 m. Three main aquifers are recognized: (i) a relatively deep aquifer hosted in Tertiary sandstone formations; (ii) an intermediate aquifer hosted in Quaternary fluvio-lacustrine sediments and (iii) a shallow aquifer in recent alluvial sediments. The available geochemical database consists of about 500 samples that were collected in different dry and rainy seasons in recent years from 80 wells distributed over the whole basin area. The sampling depth is, unfortunately, not always known, and few differences can be related to seasonal changes. Physical parameters (temperature and electrical conductivity), major, minor and trace dissolved species (pH, Ca, Mg, Na, K, NH₄, HCO₃, SO₄, NO₃, NO₂, Cl, Br, F and heavy metals), oxygen and hydrogen isotopes in the water molecules and dissolved gases (including ¹³C-CO₂) were analyzed. On the basis of Total Dissolved Solids (TDS) the waters from the Arezzo aquifer can be considered mainly oligomineral and medium-mineral, whereas mineral waters are almost exclusively associated with CO₂-rich wells. From a classification point of view, Ca(Mg)-HCO₃ is by far the most representative geochemical facies, followed by Na(K)-HCO₃, Ca(Mg)-SO₄ and Na(K)-Cl types. It is worth noting that the Na(K)-HCO₃ waters, whose origin is related to the presence of CO₂-rich waters that favor cation exchange processes with clay minerals contained in the sedimentary formations, are aligned along the Val d’Arbia-Val Marecchia transversal tectonic system.

Fig. 16.2 The hydrographic system of the Arezzo basin (Tuscany, central Italy) (http://sit.comune.arezzo.it/normativa/index.php?normativa=_ps&mappa=ps_b11a)

In Table 16.1 the sequential binary partition process used to construct the isometric log-ratio coordinates is reported. The first coordinate could represent the balance between the most important chemical reactions involving carbonate and silicate rocks (Ca²⁺, Mg²⁺, Na⁺, K⁺, HCO₃⁻ and H⁺) versus elements and chemical species whose sources could be different, including pollution (Cl⁻, SO₄²⁻, NO₃⁻). The second coordinate looks inside the carbonate and silicate cycle, balancing cations and anions. The third compares the behaviour of the bivalent versus the monovalent elements involved, while the fourth and the fifth compare the elements within each of these two groups. The sixth coordinate analyses the anions, giving us information about the pH conditions of the water. Finally, the remaining coordinates investigate the behaviour of variables whose source may be related to pollution. In the absence of atmospheric cyclic salts and evaporites, about 30% of Cl⁻ is related to pollution, and 54% in the case of SO₄²⁻, while for nitrate the most important anthropogenic sources are septic tanks, application of nitrogen-rich fertilizers to turf grass, and intensive agricultural processes (Berner and Berner 1996; Liu et al. 2011; Menció et al. 2016).

Table 16.1 Sequential binary partition process for the groundwater chemistry of the Arezzo basin (Tuscany, central Italy). Units of the chemical concentrations are mol/L while the variance expresses the contribution of each balance in explaining the total variability (% in parenthesis)

As we can see, the variance is higher for the first balance, comparing natural and anthropic processes, and for the last one, comparing SO₄²⁻ and NO₃⁻, whose ratio variability is further evidence of the presence of numerous sources/fluctuations. A first result is that when elements are more closely related to natural weathering processes their balance variability appears to be reduced, probably indicating that the same processes have been working through time in a similar way. In the light of the previous discussion about the dynamics of geochemical systems, more information should be obtained from the analysis of the frequency distribution of the balances.
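The construction of the balances from a sequential binary partition, and the share of total variance carried by each of them, can be sketched as follows. The sign matrix below is an assumption reconstructed from the description in the text: the seventh row (Cl versus SO₄ and NO₃) and the orientation of every balance are guesses, since Table 16.1 is not reproduced here. The data are synthetic random values, not the Arezzo samples, so the printed percentages have no geochemical meaning.

```python
import math
import random

PARTS = ["Ca", "Mg", "Na", "K", "HCO3", "H", "Cl", "SO4", "NO3"]

# +1: numerator group, -1: denominator group, 0: part not involved.
# Row 7 and all sign orientations are assumptions, not taken from Table 16.1.
SBP = [
    [+1, +1, +1, +1, +1, +1, -1, -1, -1],  # weathering vs other sources
    [+1, +1, +1, +1, -1, -1,  0,  0,  0],  # (Ca, Mg, Na, K) vs (HCO3, H)
    [+1, +1, -1, -1,  0,  0,  0,  0,  0],  # bivalent vs monovalent
    [+1, -1,  0,  0,  0,  0,  0,  0,  0],  # Ca vs Mg
    [ 0,  0, +1, -1,  0,  0,  0,  0,  0],  # Na vs K
    [ 0,  0,  0,  0, +1, -1,  0,  0,  0],  # HCO3 vs H
    [ 0,  0,  0,  0,  0,  0, +1, -1, -1],  # Cl vs (SO4, NO3), assumed
    [ 0,  0,  0,  0,  0,  0,  0, +1, -1],  # SO4 vs NO3
]


def contrast_row(signs):
    """Coefficients of ln(x) that reproduce the balance of Eq. (16.2)."""
    r, s = signs.count(+1), signs.count(-1)
    pos = math.sqrt(s / (r * (r + s)))
    neg = -math.sqrt(r / (s * (r + s)))
    return [pos if g > 0 else neg if g < 0 else 0.0 for g in signs]


BASIS = [contrast_row(row) for row in SBP]


def balances(x):
    """All D-1 balances of one composition x (strictly positive parts)."""
    logs = [math.log(v) for v in x]
    return [sum(c * l for c, l in zip(row, logs)) for row in BASIS]


def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


# Synthetic compositions, for illustration only.
random.seed(0)
data = [[random.lognormvariate(0.0, 0.5) for _ in PARTS] for _ in range(200)]
coords = [balances(x) for x in data]
var_per_balance = [variance([c[k] for c in coords]) for k in range(len(SBP))]
total = sum(var_per_balance)
shares = [100.0 * v / total for v in var_per_balance]

for k, (v, pct) in enumerate(zip(var_per_balance, shares), start=1):
    print(f"balance {k}: variance = {v:.3f} ({pct:.1f}%)")
```

Because the rows of `BASIS` are orthonormal, the variances of the balances add up to the total variance of the composition, which is what makes the percentage contribution of each balance meaningful.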

To this end, Fig. 16.3 reports an improved version of the balance dendrogram in which the original boxplots (Pawlowsky-Glahn and Egozcue 2011) are associated with the frequency distributions of the coordinates. The histograms share the same horizontal and vertical scales, so they are comparable. The red line corresponds to the Gaussian distribution, the black dashed line to the kernel density estimate.

Fig. 16.3 Balance dendrogram (Thió-Henestrosa et al. 2008) with associated histograms. Red line corresponds to the Gaussian model, black dashed line to the kernel density estimate. The length of the vertical bar represents the proportion of the sample total variance

Several normality tests indicate that in no case can the normal distribution be accepted as a model for the log-ratio coordinates; consequently, the log-normal model cannot be used to describe ratios between parts or groups of parts. In most cases the departure from normality appears to be due to some bimodality or to a heavy right-hand tail, which may point to power-law behaviour. The presence of power laws is associated with complex systems composed of processes that interact to self-organize their behavior across multiple temporal and/or spatial scales. Both fractals and multifractals are commonly associated with local self-similarity or scale-independence, generally leading to power-law relations (Agterberg 2014). On the other hand, the lognormal shape represents a special condition in which the interdependencies among processes are minimized or absent and repeated fragmentation (or dilution) dominates. As we can see in Fig. 16.3, heavy tails characterize the coordinate that balances weathering of silicates and carbonates (K⁺, Na⁺, Mg²⁺, Ca²⁺, H⁺, HCO₃⁻) versus other environmental processes (NO₃⁻, SO₄²⁻, Cl⁻). Moreover, considering the internal partition of the previous balances, the K⁺/Na⁺, Mg²⁺/Ca²⁺ and, in particular, NO₃⁻/SO₄²⁻ ratios repeat this type of behavior.
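The text does not say which normality tests were applied. As one possible illustration, the sketch below implements the Jarque-Bera statistic, which combines sample skewness and kurtosis and is sensitive to the heavy tails discussed above. The choice of test, the function name and the synthetic samples are all assumptions made for the example.

```python
import math
import random


def jarque_bera(values):
    """Jarque-Bera statistic with its asymptotic p-value.

    Under normality the statistic is asymptotically chi-square with 2 degrees
    of freedom, whose survival function is exp(-x / 2).
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2          # equals 3 for a normal distribution
    jb = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    p_value = math.exp(-jb / 2.0)
    return jb, skewness, kurtosis, p_value


# Synthetic coordinates, for illustration only.
random.seed(1)
gaussian_like = [random.gauss(0.0, 1.0) for _ in range(500)]
right_tailed = [random.expovariate(1.0) for _ in range(500)]

for name, sample in [("gaussian-like", gaussian_like),
                     ("right-tailed", right_tailed)]:
    jb, skew, kurt, p = jarque_bera(sample)
    print(f"{name:14s} JB = {jb:9.2f}  skew = {skew:6.2f}  "
          f"kurtosis = {kurt:6.2f}  p = {p:.3g}")
```

A small p-value leads to rejecting normality for that coordinate; the sign of the skewness then indicates on which side the heavy tail lies.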

The use of the complementary distribution function reveals the presence of power laws more clearly. In this plot, reported in Fig. 16.4, if X has a power law distribution the behavior of Prob[X ≥ x] on log-log axes will be a straight line (Mitzenmacher 2004). As we can see, linear models describe several portions of the curves well for all the coordinates. This condition suggests multifractality, perhaps associated with the space-time heterogeneity of the aquifer structure. Here a sudden change in the number of data with given concentration values is expected, particularly for pollution processes (Agterberg 2014). The fractal dimension of the phenomena, related to the slope of the straight lines, indicates how much more often small differences between the data occur than large ones.
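The straight-line criterion can be illustrated with a short Python sketch that builds the empirical complementary distribution function and fits a least-squares line in log-log space. It assumes strictly positive values; how negative coordinates were handled for Fig. 16.4 is not stated in the text. The sample is a synthetic Pareto draw, and a simple regression slope is only a rough diagnostic, not a rigorous estimator of the exponent.

```python
import math
import random


def empirical_ccdf(values):
    """Pairs (x, Prob[X >= x]) evaluated at the sorted sample values."""
    xs = sorted(values)
    n = len(xs)
    return [(x, (n - i) / n) for i, x in enumerate(xs)]


def loglog_slope(points):
    """Least-squares slope of log10(p) against log10(x)."""
    lx = [math.log10(x) for x, _ in points]
    ly = [math.log10(p) for _, p in points]
    n = len(points)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    sxx = sum((a - mean_x) ** 2 for a in lx)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    return sxy / sxx


# Synthetic sample with Prob[X >= x] = x**(-2) for x >= 1 (illustration only).
random.seed(2)
alpha = 2.0
sample = [random.paretovariate(alpha) for _ in range(2000)]

ccdf = empirical_ccdf(sample)
# Drop the sparsely populated extreme tail, where the estimate is noisy.
trimmed = [(x, p) for x, p in ccdf if p >= 0.01]
print(f"log-log slope = {loglog_slope(trimmed):.2f} (expected near {-alpha})")
```

When only portions of the curve are linear, as described above, the same fit can be repeated on separate ranges of x to compare the slopes of the different segments.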

Fig. 16.4 Complementary distribution function used to reveal the presence of power laws. If X has a power law distribution the behavior of Prob[X ≥ x] on log-log axes will be a straight line (Mitzenmacher 2004)

On the whole, the aquifer system appears to be governed by interaction-dominant dynamics, but it does not present a clear multimodality (or bimodality) that could be associated with different states. Considering Fig. 16.1 and the information deduced from the shape of the frequency distributions (Figs. 16.3 and 16.4), the aquifer could be associated with a state of sufficient resilience and recovery (Scheffer et al. 2009, 2012). Notably, the most important contribution to variability appears to be related to chemical species such as NO₃⁻ and SO₄²⁻, suggesting the weight and the intermittency of the anthropic pressure. The multifractality revealed in Fig. 16.4 could indicate that in the dynamical system energy dissipation cannot be neglected, and that extended areas (intervals) of low fluctuations alternating with small areas of extremely large fluctuations are to be expected. Moreover, the system as a whole is undergoing non-linear dissipation, with energy interchange on different scales.

4 Conclusions

Starting from Garrels and Christ (1965), equilibrium in the water-rock system has usually been analysed through the application of thermodynamic methods. In this context the statistical analysis of water concentrations, suitably transformed into isometric log-ratio coordinates, could be an effective approach to understand where the randomness in nature comes from (Agterberg 2014) and whether equilibrium conditions are really encountered.

The frequency distribution of the ratios of the compositional parts of the Arezzo aquifer chemistry exhibits an overlap between log-normal and power-law probability distributions when silicate and carbonate weathering (K⁺, Na⁺, Mg²⁺, Ca²⁺, H⁺, HCO₃⁻) is balanced against other environmental processes (NO₃⁻, SO₄²⁻, Cl⁻). Similar results are obtained when the partition used to generate new balances is applied within the previous groups of parts (NO₃⁻ versus SO₄²⁻, K⁺ versus Na⁺ or Mg²⁺ versus Ca²⁺). The result indicates a system subjected to nonlinear compositional changes due to the presence of feedback effects, attributable in a porous medium to changes in porosity that cause a remarkable change in permeability, in the pore-fluid flow and in the chemical-species concentration (Zhao 2014). Since thermodynamic equilibrium represents a homogeneous distribution of the parts, the results obtained indicate that the system is able to create and maintain a given amount of gradient, generating heterogeneity. However, no clear multimodality is present, and for the time span analysed here different steady states (basins of attraction for concentration values) have not yet clearly emerged. Thus, from a compositional point of view, the system could be characterised by sufficient resilience and rate of recovery from disturbances, since the dissipative behaviour appears to be able to absorb fluctuations. New progress could be made in this direction by exploiting the capacity of CoDA to capture the interdependence of concentration values, thus describing the water system and its surroundings as a whole, as they are in reality.