Machine Learning of Mineralization-Related Geochemical Anomalies: A Review of Potential Methods
Research on processing geochemical data and identifying geochemical anomalies has made important progress in recent decades. Fractal/multi-fractal models, compositional data analysis, and machine learning (ML) are three widely used techniques in the field of geochemical data processing. In recent years, ML has been applied to model the complex and unknown multivariate geochemical distribution and extract meaningful elemental associations related to mineralization or environmental pollution. It is expected that ML will have a more significant role in geochemical mapping with the development of big data science and artificial intelligence in the near future. In this study, state-of-the-art applications of ML in identifying geochemical anomalies were reviewed, and the advantages and disadvantages of ML for geochemical prospecting were investigated. More applications are needed to demonstrate the advantage of ML in solving complex problems in the geosciences.
Keywords: Geochemical prospecting; Geochemical anomalies; Fractal model; Compositional data analysis; Machine learning
In recent years, increasing attention has been paid to geochemical data processing and to the identification of geochemical anomalies related to mineralization or environmental pollution. A literature review of papers published in recent decades in three journals (Journal of Geochemical Exploration; Geochemistry: Exploration, Environment, Analysis; and Applied Geochemistry) shows three hot topics in geochemical data processing.
The first hot topic is the application of fractal/multi-fractal models. Cheng et al. (1994) proposed the concentration–area (C–A) model, which has become a basic tool for analyzing exploration geochemical data (Carranza 2008; Zuo et al. 2016). The C–A model is also known as the Cheng–Agterberg model, after the last names of the first two authors of the seminal paper. Based on the C–A model, the spectrum–area multi-fractal model (Cheng et al. 2000) and local singularity analysis (LSA) (Cheng 2007) were developed. Fractal/multi-fractal models consider both the frequency distribution and the spatial variation of geochemical patterns and are therefore efficient for identifying geochemical anomalies. Currently, GIS-supported fractal/multi-fractal modeling of geochemical data, especially LSA, is a hot research direction in geochemical prospecting (e.g., Carranza 2010; Zuo 2011; Luz et al. 2014; Huang and Zhao 2015; Zuo and Wang 2016; Zuo et al. 2009, 2013a, 2013b, 2015, 2016; Bucciant and Zuo 2016; Chen et al. 2016; Zhao et al. 2017).
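As a rough illustration of the C–A idea, the sketch below tabulates the area enclosed by concentration values at or above each threshold and returns the pairs in log-log form. In the C–A model this relation plots as straight-line segments when the pattern is fractal, and breaks between segments are commonly taken as candidate anomaly thresholds. The function name `concentration_area`, the equal-area grid assumption, and the sample values are hypothetical and are not taken from any of the cited studies.

```python
import math

def concentration_area(values, cell_area=1.0):
    """Return (log10 threshold, log10 area) pairs for a C-A plot.

    values: concentrations of equal-area grid cells (hypothetical input).
    For each threshold v, the area is the total area of cells with value >= v.
    """
    thresholds = sorted(set(v for v in values if v > 0))
    pairs = []
    for v in thresholds:
        area = cell_area * sum(1 for x in values if x >= v)
        pairs.append((math.log10(v), math.log10(area)))
    return pairs

# Hypothetical gridded concentrations (arbitrary units)
grid = [1, 2, 2, 3, 5, 8, 13, 40, 120, 400]
ca_pairs = concentration_area(grid)
```

Fitting straight lines to segments of `ca_pairs` and locating their intersections would be the next step; that fitting is left out here to keep the sketch short.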
The second hot topic is the application of compositional data analysis. Compositional data are components of samples measured as proportions or percentages of the whole sample, and therefore each component carries only relative information (Aitchison 1986). There is growing awareness of the closure problem in the analysis of compositional data, which leads to spurious correlations among variables. Three log-ratio transformations have been proposed to address the effects of closure: the additive (alr) (Aitchison 1986), centered (clr) (Aitchison 1986), and isometric (ilr) (Egozcue et al. 2003) transformations. Among these three, the ilr is arguably the best choice because it yields a correct representation of compositional data in Euclidean space. Especially for bivariate and multivariate analyses, geochemical data for major elements (e.g., SiO2, CaO, MgO, K2O) should be preprocessed using a log-ratio transformation (e.g., Zuo et al. 2013b; Zuo 2014; Xiong and Zuo 2016b). However, the effects of the closure problem on the univariate analysis of trace elements remain unclear in practice, although it has been pointed out that log-ratio transformation of trace element data should be considered in univariate analysis (Filzmoser et al. 2009; McKinley et al. 2016). More studies on the univariate analysis of trace elements from a compositional data perspective are needed for mapping geochemical anomalies. The first two hot topics have also been addressed by the International Association for Mathematical Geosciences (IAMG) and were singled out for special consideration among a list of potential IAMG special outreach topics (https://www.iamg.org/images/File/documents/Newsletters/NewslettersHSP/NL89lo.pdf).
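The three log-ratio transformations can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: the function names are hypothetical, the additive transformation uses the last component as the divisor, and the isometric transformation uses one particular orthonormal basis (often called pivot coordinates), which is an assumption here; other bases are equally valid.

```python
import math

def alr(x):
    """Additive log-ratio: log of each part over the last part (assumed divisor)."""
    return [math.log(v / x[-1]) for v in x[:-1]]

def clr(x):
    """Centered log-ratio: log of each part over the geometric mean."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [l - mean_log for l in logs]

def ilr(x):
    """Isometric log-ratio using pivot coordinates (one possible basis)."""
    logs = [math.log(v) for v in x]
    coords = []
    for i in range(len(logs) - 1):
        rest = logs[i + 1:]          # parts after the i-th one
        k = len(rest)
        scale = math.sqrt(k / (k + 1))
        coords.append(scale * (logs[i] - sum(rest) / k))
    return coords

# Hypothetical three-part composition (proportions summing to 1)
composition = [0.6, 0.3, 0.1]
clr_values = clr(composition)
ilr_values = ilr(composition)
```

Two properties are worth checking on any such implementation: the clr coordinates sum to zero, and the ilr coordinates have the same Euclidean norm as the clr coordinates, which is what "isometric" refers to.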
The third hot topic is the application of machine learning (ML) methods to modeling of geochemical data. Although there are few publications on this topic compared to the first two topics, ML is potentially powerful for geochemical mapping. The aim of this paper is to review the state-of-the-art applications of ML in identifying geochemical anomalies and to summarize the advantages and disadvantages of applications of ML in the geosciences in general.
Artificial intelligence (AI) is the science and engineering of making intelligent machines, especially intelligent computer programs. It is concerned with the task of using computers to understand human intelligence, but AI does not have to confine itself to methods that are biologically observable. As a subfield of AI, the basic task of ML is to use algorithms to analyze and learn from data, in order to make a determination or prediction in various fields. Based on the nature of learning feedback available to learning systems, ML methods are typically classified as supervised learning, unsupervised learning, and reinforcement learning (Russell and Norvig 2010).
In supervised learning, each training sample consists of an input object and a desired output. A supervised learning method analyzes the training samples and generates a function that maps inputs to outputs. In unsupervised learning, each training sample is unlabeled. The aim of an unsupervised learning method is to infer a function that reveals the hidden structure of the unlabeled data. Reinforcement learning focuses on how the learning system ought to behave in an environment, guided by feedback in the form of rewards or punishments. Between supervised learning and unsupervised learning is semi-supervised learning, which is suitable for training sets that contain a few labeled samples and many unlabeled samples (Russell and Norvig 2010; Zhou et al. 2017).
The advantage of ML methods is that they can learn and approximate complex nonlinear mappings and exploit the information contained in a dataset without assumptions about the data distribution. Therefore, ML methods have been used in a wide range of applications such as classification, anomaly detection, pattern recognition, and dimensionality reduction.
Among the various ML methods, artificial neural networks (ANN), support vector machines (SVM), random forests (RF), logistic regression (LR), and Bayesian networks generally belong to the supervised learning group. The autoencoder network, self-organizing map (SOM), and K-means clustering belong to the unsupervised learning group.
An artificial neural network is a mathematical model of human brain neurons (Haykin 1999). The multilayer perceptron, usually trained with a backpropagation algorithm, is one of the most commonly used ANN models. A multilayer perceptron consists of an input layer, one or more hidden layers, and an output layer. The output from one layer of neurons is used as the input for the next layer. Gradient-based algorithms are used to update the weights during the training phase in order to minimize misclassification errors.
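The layer-by-layer forward pass and the gradient-based weight update can be illustrated with a very small multilayer perceptron. The sketch below assumes one hidden layer, sigmoid activations, a squared-error loss, and per-sample gradient descent; the function names, the learning rate, and the toy data are hypothetical choices made only for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Input layer -> hidden layer -> single output neuron."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    y = sigmoid(sum(w * hi for w, hi in zip(W2, h)) + b2)
    return h, y

def train(data, n_hidden=3, lr=0.5, epochs=2000, seed=0):
    """Train with backpropagation; returns the parameters and loss history."""
    rng = random.Random(seed)
    n_in = len(data[0][0])
    W1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    W2 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    b2 = 0.0
    losses = []
    for _ in range(epochs):
        loss = 0.0
        for x, t in data:
            h, y = forward(x, W1, b1, W2, b2)
            loss += 0.5 * (y - t) ** 2
            # Error signal at the output, then propagated back to the hidden layer
            delta_out = (y - t) * y * (1 - y)
            delta_hid = [delta_out * W2[j] * h[j] * (1 - h[j])
                         for j in range(n_hidden)]
            # Gradient descent updates
            for j in range(n_hidden):
                W2[j] -= lr * delta_out * h[j]
                for i in range(n_in):
                    W1[j][i] -= lr * delta_hid[j] * x[i]
                b1[j] -= lr * delta_hid[j]
            b2 -= lr * delta_out
        losses.append(loss / len(data))
    return (W1, b1, W2, b2), losses

# Hypothetical toy data: two binary inputs, target is their logical OR
toy = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
       ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
params, losses = train(toy)
```

The loss history in `losses` gives a simple check that training is reducing the error; real applications would use a held-out set rather than the training loss.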
The support vector machine, derived from statistical learning theory and the Vapnik–Chervonenkis (VC) dimension (Vapnik 1995), has been widely used for classification and regression. The method first maps the input data into a higher-dimensional feature space by means of a kernel function and then finds the optimal hyperplane in that space that maximizes the margin between the classes. The performance of an SVM model is significantly influenced by the choice of kernel function, such as the linear, polynomial, sigmoid, and radial basis function (RBF) kernels.
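The four kernel functions named above have simple closed forms, sketched below for two feature vectors. The parameter names `gamma`, `coef0`, and `degree` follow a common convention but are assumptions here, as are their default values; suitable values depend on the data and are normally tuned.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def linear_kernel(x, y):
    return dot(x, y)

def polynomial_kernel(x, y, degree=3, gamma=1.0, coef0=1.0):
    return (gamma * dot(x, y) + coef0) ** degree

def rbf_kernel(x, y, gamma=1.0):
    # Depends only on the squared Euclidean distance between x and y
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def sigmoid_kernel(x, y, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(x, y) + coef0)

# Hypothetical two-element feature vectors
u, v = [1.0, 2.0], [3.0, 4.0]
k_lin = linear_kernel(u, v)
k_rbf_self = rbf_kernel(u, u)
```

The RBF kernel of a vector with itself is always 1 and decays toward 0 as two samples move apart, which is why `gamma` controls how local the resulting decision boundary is.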
Random forest is an ensemble learning method that combines a set of decision trees for classification and prediction (Breiman 2001). Each tree is built from a bootstrap sample drawn randomly, with replacement, from the original data. About two-thirds of the original data are used to train each tree, and the remaining (out-of-bag) data are used to validate the model. The most important parameters for RF are the number of trees and the number of predictors used to split each node of the trees. Generally, the number of variables for splitting each node should be less than log2(M) + 1, where M is the total number of input variables (Breiman 2001). In addition, the RF model is capable of estimating the relative importance of each variable by observing the increase in prediction error when that variable is removed while the remaining variables are kept unchanged.
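The "about two-thirds" figure follows from sampling with replacement: the expected share of distinct samples in a bootstrap draw of size n approaches 1 - 1/e (about 63%), leaving roughly 37% out of the bag. The sketch below checks this numerically and evaluates the log2(M) + 1 bound from the text; the function names are hypothetical.

```python
import math
import random

def bootstrap_split(n_samples, rng):
    """Draw n_samples indices with replacement; return in-bag and out-of-bag indices."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    out_of_bag = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, out_of_bag

def split_variable_bound(n_inputs):
    """Upper bound log2(M) + 1 on the number of variables tried at each split."""
    return math.log2(n_inputs) + 1

rng = random.Random(42)
in_bag, out_of_bag = bootstrap_split(10000, rng)
oob_fraction = len(out_of_bag) / 10000   # expected to be close to 1/e
bound_for_8_inputs = split_variable_bound(8)
```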
Logistic regression aims at constructing a multivariate relationship between a dependent variable (e.g., deposits or non-deposits) and a set of independent variables (e.g., faults, granites, geochemical anomalies). The dependent variable is dichotomous (0 or 1), while the independent variables do not necessarily have a normal distribution and can be interval, dichotomous, or categorical (Menard 2001). The coefficients of LR are estimated using maximum likelihood estimation (Cox and Snell 1989). Subsequently, the probability of a specific event occurring can be estimated via LR.
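Maximum likelihood estimation of the LR coefficients can be sketched with plain gradient ascent on the log-likelihood. This is only one of several ways to find the estimates (statistical packages typically use Newton-type iterations), and the function names, step size, and the single-predictor toy data below are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, x):
    """Probability of the event (y = 1); w[0] is the intercept."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))

def fit_logistic(X, y, lr=0.1, n_iter=5000):
    """Gradient ascent on the mean log-likelihood."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(n_iter):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = yi - predict_proba(w, xi)   # gradient of the log-likelihood
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj + lr * g / len(X) for wj, g in zip(w, grad)]
    return w

# Hypothetical data: one predictor (e.g., an anomaly score), binary outcome.
# The classes overlap, so the maximum likelihood estimate is finite.
X = [[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]]
y = [0, 0, 1, 0, 1, 0, 1, 1]
coef = fit_logistic(X, y)
p_low = predict_proba(coef, [0.5])
p_high = predict_proba(coef, [4.0])
```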
A Bayesian network is a directed acyclic graph that consists of nodes and directed edges (Pearl 1988). The nodes represent random variables, and the edges represent conditional dependencies, which are quantified by conditional probabilities. When two nodes are not connected, the corresponding variables are conditionally independent of each other given other variables in the network (for example, each variable is independent of its non-descendants given its parents). The visualization of the interrelations and dependencies among the model parameters makes the Bayesian network model intelligible for complex problem analysis.
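A three-node chain is enough to show how the graph structure turns into a factorized joint probability. The sketch below assumes a hypothetical network Fault -> Mineralization -> Anomaly with made-up probability tables; the missing edge between Fault and Anomaly means that the anomaly is conditionally independent of the fault once mineralization is known.

```python
from itertools import product

# Hypothetical probability tables (1 = present, 0 = absent)
P_F = {1: 0.3, 0: 0.7}                                       # P(F)
P_M_GIVEN_F = {1: {1: 0.4, 0: 0.6}, 0: {1: 0.05, 0: 0.95}}   # P(M | F), indexed [f][m]
P_A_GIVEN_M = {1: {1: 0.8, 0: 0.2}, 0: {1: 0.1, 0: 0.9}}     # P(A | M), indexed [m][a]

def joint(f, m, a):
    """Joint probability from the chain factorization P(F) P(M|F) P(A|M)."""
    return P_F[f] * P_M_GIVEN_F[f][m] * P_A_GIVEN_M[m][a]

def posterior_m_given_a(a=1):
    """P(M = 1 | A = a), obtained by summing the joint over the fault variable."""
    num = sum(joint(f, 1, a) for f in (0, 1))
    den = sum(joint(f, m, a) for f, m in product((0, 1), repeat=2))
    return num / den

def p_a_given_m_f(a, m, f):
    """P(A = a | M = m, F = f), computed from the joint."""
    return joint(f, m, a) / sum(joint(f, m, a2) for a2 in (0, 1))

prior_m = sum(joint(f, 1, a) for f, a in product((0, 1), repeat=2))
post_m = posterior_m_given_a(1)
```

With these made-up numbers, observing an anomaly raises the probability of mineralization from `prior_m` to `post_m`, and `p_a_given_m_f` returns the same value for either fault state, which is the conditional independence encoded by the missing edge.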
The autoencoder network, derived from ANN, is an unsupervised learning method for learning a representation or encoding of the data (Bengio 2009). The aim of the autoencoder network is to minimize the difference between its input and its output. Various variants of backpropagation are used to train an autoencoder. However, when the network has many hidden layers, the backpropagated error signals become insignificant by the time they reach the first layers, so those layers learn very slowly. To address this, Hinton and Salakhutdinov (2006) proposed a deep autoencoder in which the network is pre-trained using restricted Boltzmann machines and then fine-tuned using backpropagation.
The self-organizing map is another type of ANN that is trained using unsupervised learning. It is an effective tool for visualizing high-dimensional data and mapping the data into a low-dimensional feature space (Kohonen 1998). The map consists of several neurons, each of which is associated with a weight vector that has the same dimension as the input samples and with a position in the map. The purpose of SOM is to detect correlations in the input and to recognize groups of similar input vectors.
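The roles of the weight vectors and map positions can be seen in a single SOM update step. The sketch below assumes a one-dimensional map, a Gaussian neighborhood function, and linearly decaying learning rate and neighborhood width; these choices and all function names are hypothetical simplifications.

```python
import math
import random

def best_matching_unit(weights, x):
    """Index of the neuron whose weight vector is closest to x."""
    dists = [sum((w - v) ** 2 for w, v in zip(wv, x)) for wv in weights]
    return dists.index(min(dists))

def som_update(weights, x, lr, sigma):
    """Pull the best-matching neuron and its map neighbors toward x (in place)."""
    bmu = best_matching_unit(weights, x)
    for j, wv in enumerate(weights):
        # Neighborhood strength depends on the distance between map positions
        h = math.exp(-((j - bmu) ** 2) / (2.0 * sigma ** 2))
        for d in range(len(x)):
            wv[d] += lr * h * (x[d] - wv[d])
    return bmu

def train_som(data, n_neurons=4, epochs=50, lr0=0.5, sigma0=1.0, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_neurons)]
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs            # decays from 1 toward 0
        for x in data:
            som_update(weights, x, lr0 * frac, max(sigma0 * frac, 0.1))
    return weights

# Hypothetical two-variable samples forming two loose groups
samples = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
           [0.9, 1.0], [1.0, 0.9], [0.95, 0.95]]
som_weights = train_som(samples)
```

After training, calling `best_matching_unit(som_weights, x)` for each sample gives its position on the map, which is the basis for the visualization and grouping described above.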
K-means clustering is used to solve the problem of data clustering in data mining (Hartigan and Wong 1979). The method aims to partition n observations into k clusters such that each sample belongs to the cluster with the nearest centroid. It first selects k points as the initial centroids, then assigns each sample to the cluster of its nearest centroid, and finally updates the cluster centroids; the assignment and update steps are repeated until the centroids no longer change. An appropriate selection of k is essential to the performance of the method; thus, diagnostic checks are necessary to determine the optimum k when performing K-means.
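The iteration described above can be sketched as follows. This is a minimal version of the standard assign-and-update loop, not the specific algorithm of Hartigan and Wong (1979); the function names, the optional `init` argument, and the sample points are hypothetical.

```python
import random

def squared_distance(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Return (centroids, labels). init optionally fixes the starting centroids."""
    rng = random.Random(seed)
    start = init if init is not None else rng.sample(points, k)
    centroids = [list(p) for p in start]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members
        new_centroids = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centroids.append([sum(col) / len(members)
                                      for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:              # converged
            break
        centroids = new_centroids
    return centroids, labels

# Hypothetical two-variable samples forming two well-separated groups
pts = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centers, cluster_labels = kmeans(pts, k=2, init=[pts[0], pts[3]])
```

Because the result depends on the starting centroids and on k, it is common to rerun the procedure with several initializations and to compare candidate values of k with a diagnostic such as the within-cluster sum of squares.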
Applications in Geosciences
ML methods have been applied in the geosciences. One of the most commonly used ML methods, ANN, has proved to be a powerful tool for the classification and identification of the minerals (Thompson et al. 2001; Baykan and Yılmaz 2010). Cracknell and Reading (2014) compared five ML algorithms for geological mapping, and demonstrated that RF is a good first choice for classifying lithology. Recently, ML has been a popular research topic in the field of mineral prospectivity mapping (MPM). Some supervised classification methods, such as ANN (Singer and Kouda 1996, 1999; Harris and Pan 1999; Brown et al. 2000, 2003a, 2003b; Bougrain et al. 2003; Rigol-Sanchez et al. 2003; Harris et al. 2003; Porwal et al. 2003; Skabar 2005, 2007; Behnia 2007; Nykänen 2008; Oh and Lee 2010), SVM (Zuo and Carranza 2011; Abedi et al. 2012), RF (Rodriguez-Galiano et al. 2014; Carranza and Laborte 2015a, 2015b, 2016; Zhang et al. 2016; Gao et al. 2016), Bayesian classifiers (Porwal et al. 2006; Porwal and Carranza 2008), LR (Agterberg 1992; Sahoo and Pandalai 1999; Harris et al. 2001; Raines and Mihalasky 2002; Daneshfar et al. 2006; Nykänen and Ojala 2007; Mejía-Herrera et al. 2015), restricted Boltzmann machine (Chen 2015), extreme learning machine (Chen and Wu 2017a), decision trees (DT) (Reddy and Bonham-Carter 1991; Chen et al. 2014a) and cost-sensitive learning (Xiong and Zuo 2017) have been applied to the mapping of mineral prospectivity. Rodriguez-Galiano et al. (2015) compared ANN, DT, RF and SVM based on four criteria, and demonstrated that RF outperformed the other three ML methods for MPM.
In addition, owing to the complexity of geological settings, the distribution of geochemical data is usually unknown, and so some ML-based anomaly detection techniques have been introduced to the field of geochemical anomaly detection. These methods have the advantage that they require no assumptions regarding the data distribution and can handle nonlinear relations among geochemical data well. For example, Twarakavi et al. (2006) applied SVM and robust least-square SVM to map arsenic concentrations using the gold concentration distribution present within the sediments in Alaska. Beucher et al. (2013) applied ANN for soil mapping in the Sirppujoki River catchment area, southwestern Finland. Chen et al. (2014b) used a continuous restricted Boltzmann machine to identify multivariate geochemical anomalies in southern Jilin Province, China. Gonbadi et al. (2015) applied supervised ML methods to separate porphyry Cu-related geochemical anomalies from background in the Kerman Province of Iran. O’Brien et al. (2015) applied RF to identify gahnite compositions as an exploration guide to Broken Hill-type Pb–Zn–Ag deposits in the Broken Hill domain of Australia. Zhao et al. (2016) applied an ANN to extract geochemical anomalies related to Cu–Au mineralization in Shaanxi Province, China. Xiong and Zuo (2016a) used a deep autoencoder network to integrate multi-elements associated with Fe mineralization in Fujian Province, China. Kirkwood et al. (2016) used quantile regression forests to map regional soil geochemistry in southwest England using geophysical, elevation, and remote sensing data; the resultant maps are more useful than their spatially interpolated univariate equivalents, providing increased detail, accuracy, and interpretability. Zaremotlagh and Hezarkhani (2017) applied DT and ANN to recognize the geochemical distribution patterns of LREE in the Choghart deposit, Central Iran. Chen and Wu (2017b) applied one-class SVM to identify multivariate geochemical anomalies from stream sediment survey data of the Lalingzaohuo District in Qinghai Province, China. These case studies have demonstrated that ML models are useful tools for identifying multivariate geochemical anomalies.
Discussion and Conclusions
ML methods have the potential to model complex and nonlinear systems, and they are expected to model unknown multivariate geochemical distributions that capture complex and multistage geological events. Several successful case studies have demonstrated that ML can integrate multi-element geochemical data and extract geochemical anomalies related to mineralization (e.g., Chen et al. 2014b; Xiong and Zuo 2016a). For example, anomalous patterns obtained from an ML model have a strong spatial correlation with known mineralization/occurrences (Xiong and Zuo 2016a). However, some ML methods are ‘black box’ techniques, and the inner structure of the multi-element associations is usually unknown to geoscientists (Fig. 1), leading to a lack of information for geological interpretation. In addition, most ML models involve complex mathematical equations and computer programs that are not easily understood by many geoscientists. Therefore, the application of ML in the geosciences requires a good background in mathematics and computing. Meanwhile, the wide variety of available ML models makes it difficult for users to select the ML method that is suitable for a particular research problem.
ML is a large family of analytical tools. Until now, only a few ML methods have been applied in the geosciences. More applications are needed to demonstrate the advantages of ML in solving complex geological, geochemical, and geophysical problems. Each ML model has different parameters, and optimizing these parameters to obtain the best performance is another challenge in the geosciences. In addition, geological and geochemical patterns are linked to spatial variability, heterogeneity, and anisotropy. The question of how to integrate these spatial characteristics into ML models to improve the performance of ML in the geosciences should be further explored. Meanwhile, educating geoscientists on the benefits of applying ML to geological problems is an important outreach objective.
The author thanks Dr. John Carranza and two anonymous reviewers for their edits and comments that improved the manuscript. The author also thanks Yihui Xiong from the China University of Geosciences for preparing the figure and checking the references. This research benefited from the joint financial support from the National Natural Science Foundation of China (No. 41522206) and MOST Special Fund from the State Key Laboratory of Geological Processes and Mineral Resources, China University of Geosciences (MSFGPMR03-3).