Natural Resources Research, Volume 26, Issue 4, pp 489–507

Random Forest-Based Prospectivity Modelling of Greenfield Terrains Using Sparse Deposit Data: An Example from the Tanami Region, Western Australia

  • Siddharth Hariharan
  • Siddhesh Tirodkar
  • Alok Porwal
  • Avik Bhattacharya
  • Aurore Joly
Original Paper


Data-driven prospectivity modelling of greenfields terrains is challenging because very few deposits are available and the training data are overwhelmingly dominated by non-deposit samples. This could lead to biased estimates of model parameters. In the present study involving Random Forest (RF)-based gold prospectivity modelling of the Tanami region, a greenfields terrain in Western Australia, we apply the Synthetic Minority Over-sampling Technique to modify the initial dataset and bring the deposit-to-non-deposit ratio closer to 50:50. An optimal threshold range is determined objectively using statistical measures such as the data sensitivity, specificity, kappa and per cent correctly classified. The RF regression modelling with the modified dataset of close to 50:50 sample ratio of deposit to non-deposit delineates 4.67% of the study area as high prospectivity areas as compared to only 1.06% by the original dataset, implying that the original “sparse” dataset underestimates prospectivity.


Random forest Mineral prospectivity Threshold Mapping Modelling 


The common methods for mineral prospectivity mapping include: probabilistic methods, artificial intelligence-/machine learning-based methods and regression-based methods (Porwal and Carranza 2015). In this paper, we apply Random Forest (RF) (Breiman 2001), a regression-based method. Regression-based methods estimate a best fit function that correlates the target mineral deposit to the input geological predictor maps. A regression problem is defined as follows:
$$\begin{aligned} y = f(x_i) \end{aligned}$$
where y is mineral deposit occurrence and \(x_i\) are predictor maps.

The idea here is: for a specific deposit type, a well-explored area is used to develop a regression model relating the occurrence of the targeted deposit type (dependent variable) to a series of predictor maps (independent variables). Once the regression model is developed and validated, it can be applied to geologically similar areas (Harris et al. 2001). The dependent variable is a binary variable indicating the presence or absence of a mineral deposit of the type sought. The locations where known mineral deposits occur and those locations which do not contain mineral deposits are termed “deposit” and “non-deposit” locations, respectively, and are used as training data for building the RF regression model.

Linear and logistic regressions are the most commonly used regression techniques. Linear regression is a parametric method that needs explicit modelling of nonlinearities in the data and interactions between the parameters (Grömping 2009). Logistic regression assumes a particular underlying data distribution and also assumes independence between the predictor variables (Carranza and Laborte 2015a). For mineral prospectivity mapping, these two assumptions rarely hold, since data distributions are usually not known a priori and the geological processes involved in mineral deposit formation are not always independent of each other. The RF does not require that the underlying distribution of the data be known a priori and does not assume independence amongst the geological predictor maps. The RF is a nonparametric method and, unlike linear regression, it does not require nonlinearities and parameter interactions to be explicitly modelled since these can be learned from the data themselves.

In recent years, RF has been used widely in applications related to earth observations. Some of them include land cover classification (Gislason et al. 2006), urban classification (Hariharan et al. 2016) and crop classification (Ok et al. 2012). The algorithm is being increasingly applied to mineral potential modelling (Cracknell et al. 2014; Harris et al. 2015; Carranza and Laborte 2016; Gao et al. 2016; Zhang et al. 2016). Some recent work using RF for gold prospectivity include Rodriguez-Galiano et al. (2014), Carranza and Laborte (2015a, b), Rodriguez-Galiano et al. (2015), and McKay and Harris (2016). Rodriguez-Galiano et al. (2015) tested a series of machine learning algorithms including RF regression for prospectivity modelling and noted the superiority of RF compared to other algorithms.

One of the main issues in data-driven mineral prospectivity mapping is the huge imbalance between the numbers of deposit and non-deposit samples. Because deposits are usually sparse compared with barren locations (non-deposits), applying machine learning models to such datasets is undesirable: the performance of the classifier becomes biased towards the majority class. Carranza and Laborte (2015a) note that the numbers of non-deposit and deposit samples should be equal. King and Zeng (2001) mention that the information content from predictors starts to diminish as the number of non-deposits exceeds the number of deposits. Breslow and Cain (1988) and Schill et al. (1993) discuss the need for equal numbers of deposit and non-deposit samples (a “balance” dataset) to perform regression.

Various methods for data balancing in machine learning have been discussed in detail by Batista et al. (2004) and He and Garcia (2009). These studies show that the Synthetic Minority Over-sampling Technique (SMOTE) is a powerful balancing technique for various applications. The SMOTE technique (Chawla et al. 2002) over-samples the minority class in feature space and avoids overfitting by spreading the decision boundary of the minority class further into the majority class space. In the present study, we modify the original training dataset using SMOTE in order to bring the deposit-to-non-deposit ratio close to 50:50. The original training dataset is referred to as the “sparse” dataset and the modified training dataset as the “balance” dataset.

The outputs of RF-based regression are continuous floating-point prospectivity values that range from 0 (barren) to 1 (mineralized). These values cannot be used directly for selecting exploration targets without binary reclassification into prospective and non-prospective values. Often a value of 0.5 is used as the threshold for reclassification (Carranza et al. 2015; Carranza and Laborte 2015a; Rodriguez-Galiano et al. 2015). In principle, the value of 0.5 is justifiable as a threshold only when the numbers of deposits and non-deposits are expected to be equal. In mineral prospectivity modelling, deposit samples are typically far fewer than non-deposit samples, and hence the algorithms tend to over-learn the characteristics of non-deposit locations. As a result, the predicted prospectivity is biased towards the non-deposit class (the 0 value), and using a threshold of 0.5 for separating prospective from non-prospective cells is therefore not advisable. In this paper, we attempt to select an optimal threshold value that takes the imbalanced training data into account. We used three methods to optimize the threshold selection, namely maximization of the sum of sensitivity and specificity (Cantor et al. 1999; Manel et al. 2001); maximization of kappa (Huntley et al. 1995); and maximization of per cent correctly classified (PCC) (Guisan et al. 1998). These objective estimates can be determined from the predicted data and do not depend on the underlying machine learning algorithm. These methods of threshold determination have been reported to be useful in ecology (Jiménez-Valverde and Lobo 2007), ecography (Liu et al. 2005; Bean et al. 2012), ecology and biogeography (Lobo et al. 2008), marine environment modelling (Maloney et al. 2013), genetics (Cushman et al. 2013), and gynaecology and reproductive biology (Koskas et al. 2014).


Random Forest

A regression tree (Breiman et al. 1984) is built by recursively partitioning the root node into more and more homogeneous groups down to the terminal nodes. Each split is based on the values of one of the predictors, which is selected according to some best splitting criterion. Once a tree has been built, the response for any observation can be predicted by following the path from the root node down to the appropriate terminal node of the tree. The predicted response is simply the average of the values in that terminal node.

The RF is an ensemble learning technique developed by Breiman (2001) that involves combining a large set of decision trees generated independently so that no two trees are the same. The independence between the trees is achieved by randomly selecting one third of the predictors at each node for node splitting and by using a random bootstrap sample comprising about 67% of the training samples to build each tree of the RF. The remaining 33% samples are called out-of-bag (OOB) samples that are used to obtain an error estimate based on the bootstrap subset. At each node, the best split (Liaw and Wiener 2002) is chosen to form child nodes. The value of each child node is the average of the sample values in that node. Once a RF model is built, the entire dataset is passed through the model for making a prediction for each sample. Every data sample from the root node will be allocated to the child node whose node splitting condition it satisfies. The final prediction value of a sample is its average prediction over all the trees in the RF.
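
The bootstrap and averaging steps described above can be illustrated with a minimal sketch. It uses only the Python standard library rather than the R implementation used in the study; the function names `bootstrap_split` and `forest_prediction` and the per-tree values are hypothetical. Drawing as many samples as the training set holds, with replacement, leaves roughly a third of the samples out of bag, broadly in line with the proportions quoted above.

```python
import random


def bootstrap_split(n_samples, rng):
    """Draw one bootstrap sample (with replacement); return in-bag and out-of-bag indices."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    in_bag_set = set(in_bag)
    oob = [i for i in range(n_samples) if i not in in_bag_set]
    return in_bag, oob


def forest_prediction(tree_predictions):
    """Final RF regression value for one sample: the mean of the per-tree predictions."""
    return sum(tree_predictions) / len(tree_predictions)


rng = random.Random(0)          # fixed seed so the sketch is repeatable
n = 376                         # 33 deposits + 343 non-deposits, as in the study
in_bag, oob = bootstrap_split(n, rng)
oob_fraction = len(oob) / n
print(f"out-of-bag fraction: {oob_fraction:.2f}")

# Hypothetical outputs of four trees for one sample
print(forest_prediction([0.0, 1.0, 1.0, 0.5]))
```

In a full forest this split is repeated independently for every tree, so each sample is out of bag for a different subset of trees.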

Predictor Importance Measure

A statistic that measures the net decline in the node impurities brought about by splitting on a predictor, averaged over all trees, is used for estimating the importance of the variable. The node impurity is estimated using the residual sum of squares (RSS) (Hocking and Leslie 1967) of the root node \(({\hbox {RSS}}_\mathrm{{root}})\) and the child node \(({\hbox {RSS}}_\mathrm{{child}})\) as follows:
$$\begin{aligned} {\hbox {RSS}}_\mathrm{{root}} = \sum _{i=1}^{j}{\big (y_i-\mu \big )}^2 , \end{aligned}$$
where j is the number of samples in the root node; \(y_i\) is the value of the ith sample in the root node; and \(\mu \) is the average root node value; and
$$\begin{aligned} {\hbox {RSS}}_\mathrm{child} = \sum \limits _\mathrm{left}{\big (y_{i}-{y_{L}}^*\big )}^2 + \sum \limits _\mathrm{right}{\big (y_{i}-{y_{R}}^*\big )}^2, \end{aligned}$$
where \({y_\mathrm{L}}^*\) = mean y-value for the left node; \({y_\mathrm{R}}^*\) =  mean y-value for the right node.

A split that maximizes the decline in RSS from the root node to the child node is selected as the best split. The predictor that contributes maximum to the decline in node impurities on splitting is ranked highest, and subsequent predictors are ranked in descending order of their contribution. The coefficient of determination \((R^{2})\) (Nagelkerke 1991) is used to test the goodness of fit of the regression. It is a measure of the proportion of variance explained by the regression model. This measure shows the efficiency of the RF model and hence can be used as a performance metric.
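
As a hedged illustration of the split criterion above, the following sketch searches a single predictor for the threshold that gives the largest decline in RSS. The helper names `rss` and `best_split`, the use of midpoints between consecutive values as candidate thresholds, and the toy data are assumptions made for illustration; they are not taken from the study.

```python
def rss(values):
    """Residual sum of squares of the values about their mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, decline in RSS) for the best binary split on one predictor."""
    rss_root = rss(y)
    best = (None, 0.0)
    levels = sorted(set(x))
    for lo, hi in zip(levels, levels[1:]):
        threshold = (lo + hi) / 2  # assumption: candidate splits are midpoints
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        decline = rss_root - (rss(left) + rss(right))  # RSS_root - RSS_child
        if decline > best[1]:
            best = (threshold, decline)
    return best


# Hypothetical toy data: distance to a structure (m) and deposit indicator (1/0)
x = [100, 200, 300, 4000, 5000, 6000]
y = [1, 1, 1, 0, 0, 0]
threshold, decline = best_split(x, y)
print(threshold, decline)
```

Summing such declines for a predictor over all the nodes where it is used, and averaging over the trees, gives the node-purity importance described above.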

Synthetic Minority Over-sampling Technique

The SMOTE (Chawla et al. 2002) is used for balancing a two-class imbalanced training dataset that contains disproportionate samples of the two classes. The technique involves over-sampling of the minority class by creating several "synthetic" samples for each minority sample along the line segments joining k nearest neighbour minority samples. Synthetic samples are generated by performing vector subtraction between a minority class feature vector and its nearest neighbour minority class feature vector. This difference is then multiplied by a random number between 0 and 1 and added back to the minority class feature vector. This causes the selection of a random point along the line segment between two minority features, thus increasing the number of minority samples. SMOTE thus expands the decision region of the minority class to become more general.
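
The interpolation step can be sketched as follows. This is a minimal standard-library illustration, not the implementation used in the study; the function names, the random choice of neighbour for each synthetic sample, and the toy feature vectors are assumptions.

```python
import math
import random


def k_nearest(index, points, k):
    """Indices of the k nearest other points to points[index] (Euclidean distance)."""
    others = [i for i in range(len(points)) if i != index]
    others.sort(key=lambda i: math.dist(points[index], points[i]))
    return others[:k]


def smote(minority, n_per_sample, k, rng):
    """Create n_per_sample synthetic vectors for every minority vector."""
    synthetic = []
    for i, p in enumerate(minority):
        neighbours = k_nearest(i, minority, k)
        for _ in range(n_per_sample):
            q = minority[rng.choice(neighbours)]  # assumption: neighbour picked at random
            gap = rng.random()                    # random number in [0, 1)
            # p + gap * (q - p): a point on the segment joining p and q
            synthetic.append(tuple(pi + gap * (qi - pi) for pi, qi in zip(p, q)))
    return synthetic


# Hypothetical two-dimensional minority feature vectors
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5),
            (1.8, 2.6), (2.4, 1.9), (0.9, 1.5)]
synthetic = smote(minority, n_per_sample=5, k=5, rng=random.Random(1))
print(len(synthetic))
```

Because every synthetic vector lies on a segment between two existing minority vectors, the new samples stay inside the region already spanned by the minority class in feature space.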

Statistical Measures for Threshold Determination

The optimal threshold range for reclassification of the regression output for identifying prospective areas can be estimated using a variety of statistical measures such as sensitivity, specificity, PCC and kappa \((\kappa) \) (Liu et al. 2005) as follows:
$$\begin{aligned} {\hbox {Sensitivity}}= & {} \frac{d}{c+d}, \end{aligned}$$
$$\begin{aligned} {\hbox {Specificity}}= & {} \frac{a}{a+b}, {\hbox {and}} \end{aligned}$$
$$\begin{aligned} {\hbox {PCC}}= & {} 100\% * \frac{a+d}{n}, \end{aligned}$$

where

\(a\) = true negative (observed = negative; predicted = negative),

\(b\) = false positive (observed = negative; predicted = positive),

\(c\) = false negative (observed = positive; predicted = negative),

\(d\) = true positive (observed = positive; predicted = positive),

\(n\) = a + b + c + d; and

$$\begin{aligned} \kappa =\frac{(\text{ observed}\,{\text{ agreement }})-({\text{chance}}\,{\text{ agreement}})}{1-({\text{chance }}\,{\text{ agreement }})}, \end{aligned}$$
$$\begin{aligned} {\text{ chance }} {\text{ agreement }}= & {} \frac{\frac{(a+b) \times (a+c)}{n}+\frac{(c+d) \times (b+d)}{n}}{n}, {\hbox {and}} \end{aligned}$$
$$\begin{aligned} {\text{ observed }} {\text{ agreement }}= & {} \frac{a+d}{n}. \end{aligned}$$
Using these statistical estimates, three methods of threshold determination were applied in this study: maximizing (sensitivity + specificity), maximizing kappa \((\kappa) \) and maximizing PCC.
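
The following sketch shows one way the three criteria could be evaluated over a set of candidate thresholds, using the counts a (true negatives), b (false positives), c (false negatives) and d (true positives) from the equations above. The function names, the rule that a value at or above the threshold counts as positive, and the toy data are assumptions for illustration only.

```python
def confusion(observed, predicted, threshold):
    """Count (a, b, c, d) = (TN, FP, FN, TP) for one threshold."""
    a = b = c = d = 0
    for obs, value in zip(observed, predicted):
        positive = value >= threshold  # assumption: >= threshold is prospective
        if obs == 0:
            if positive:
                b += 1
            else:
                a += 1
        else:
            if positive:
                d += 1
            else:
                c += 1
    return a, b, c, d


def measures(a, b, c, d):
    """Sensitivity, specificity, PCC and kappa; assumes both classes are observed."""
    n = a + b + c + d
    observed_agreement = (a + d) / n
    chance_agreement = ((a + b) * (a + c) / n + (c + d) * (b + d) / n) / n
    return {
        "sensitivity": d / (c + d),
        "specificity": a / (a + b),
        "pcc": 100.0 * (a + d) / n,
        "kappa": (observed_agreement - chance_agreement) / (1 - chance_agreement),
    }


def best_thresholds(observed, predicted, candidates):
    """Threshold maximizing each criterion (the first candidate wins on ties)."""
    scored = [(t, measures(*confusion(observed, predicted, t))) for t in candidates]
    sum_key = lambda s: s[1]["sensitivity"] + s[1]["specificity"]
    return {
        "sensitivity + specificity": max(scored, key=sum_key)[0],
        "kappa": max(scored, key=lambda s: s[1]["kappa"])[0],
        "pcc": max(scored, key=lambda s: s[1]["pcc"])[0],
    }


# Hypothetical toy data: 1 = deposit, 0 = non-deposit, with made-up RF outputs
observed = [0, 0, 0, 0, 0, 0, 1, 1]
predicted = [0.05, 0.10, 0.15, 0.20, 0.30, 0.45, 0.35, 0.60]
candidates = [i / 10 for i in range(1, 10)]
best = best_thresholds(observed, predicted, candidates)
print(best)
```

The three criteria need not agree on a single value, which is why a threshold range rather than a single threshold can result.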

Study area and data

Figure 1

Geological map of Tanami region, Western Australia (adopted from Joly et al. 2012)

Table 1

Orogenic gold deposit predictor maps in the GTO

| Mineral systems component | Exploration criteria | Predictor maps | | | Primary data source |
|---|---|---|---|---|---|
| | Proximity to faults | (1) Proximity to DE fault | \(\texttt {F\_DE\_FLT\_I}\) | De structures buffered to 12,000 m | |
| | | (2) Proximity to D1 fault | \(\texttt {F\_D1\_FLT\_I}\) | D1 faults buffered to 7500 m | Interpreted structural geological map (Joly et al. 2010, 2012) |
| | | (3) Proximity to D2 fault | \(\texttt {F\_D2\_FLT\_I}\) | D2 faults buffered to 8000 m | |
| | Proximity to structure with elevated gold values | (4) Proximity to structure with elevated Au values | \(\texttt {F\_STR\_DRL\_}\) | Faults buffered to 1 km and attributed with Au values interpolated from drill hole data + surface geochem | Proprietary geochemical data of Tanami Gold NL (Geoscience Australia 2010) |
| Chemical trap | Chemical contrast at geological contacts | (5) Chemical contrast density | \(\texttt {F\_CHMDENS\_}\) | Density of geological contacts weighted by chemical contrast | |
| | | (6) Chemical contrast across contact | \(\texttt {F\_CHMCNT\_I}\) | Chemical contrast across geological contacts | Bed rock geological map of Western Australia (Geoscience Australia 2008) |
| | | (7) Dolerite density | \(\texttt {F\_DOLDENS\_}\) | Density of dolerite contacts | |
| Physical trap | Physical contrast at geological contacts | (8) Competency contrast density | \(\texttt {F\_CMPCNTDS}\) | Density of geological contacts weighted by competency contrast across the contacts | |
| | | (9) Competency contrast | \(\texttt {F\_COMPCNT\_}\) | Competency contrast across geological contacts | Bed rock geological map of Western Australia (Geoscience Australia 2008) |
| | | (10) Contact density | \(\texttt {F\_CNTCTDS\_}\) | Density of geological contacts | |
| | Proximity to intersection of faults | (11) Proximity to D2 and D1 fault intersection | \(\texttt {F\_D1XD2\_IN}\) | D1 x D2 intersections buffered to 8000 m | Interpreted structural geological map (Joly et al. 2010, 2012) |
| | Proximity to anticlinal fold axis | (12) Proximity to D2 anticlinal fold axis | \(\texttt {F\_D2ANTI\_I}\) | D2 anticlines buffered to 600 m | |
The GTO (Granites-Tanami Orogen) is one of the most important gold provinces in the Northern Australia Craton. The total resources in the province are of the order of 10 Moz (Goleby et al. 2009). Although most early deposits were discovered in the Northern Territory part of the GTO, significant mineralization has been discovered in the Bald Hill (e.g., Kookaburra and Sandpiper) and Coyote areas of the Western Australian part as shown in Figure 1. The data used have a resolution of 100 m and are projected using the Universal Transverse Mercator (UTM) projection and the Geocentric Datum of Australia (GDA).

The GTO comprises Paleoproterozoic volcano-sedimentary rocks of the Tanami Group (ca. 1864–1844 Ma) deposited on an Archaean basement (locally exposed; \(2514 \pm 3\) Ma) and intruded by ca. 1795 Ma granites. The Tanami Group comprises (a) the ca. 1864 Ma Stubbins Formation, which includes banded iron formation, iron-rich siltstone and shale, carbonaceous shale, chert, pillow basalt and contemporaneous dolerite sills, and rare rhyolite, and is overlain conformably by (b) the ca. 1864–1844 Ma Killi-Killi Formation, which is a 5-km-thick turbiditic succession having a predominantly granitic provenance. The Tanami Group is intruded by granitoids derived from partial melting of an Archaean substratum and emplaced at \(1795 \pm 3\) Ma. The intrusion event is broadly synchronous with gold mineralization and the peak of metamorphism. The Stubbins Formation hosts several deposits/small prospects in the Bald Hill area in a sequence of turbiditic mafic volcanic rocks and tholeiitic dolerite sills. The turbiditic Killi-Killi Formation hosts the Coyote deposit and several prospects in the vicinity as shown in Figure 1.

A 4D geological modelling based on forward gravity modelling constrained by seismic data along profiles 1 and 2 [Fig. 1 (Joly et al. 2010, 2012)] indicates that the spatial distribution of gold deposits in the GTO is fundamentally controlled by the pre- to syn-mineralization \(D_{GTOE}{-}D_{GTO1}{-}D_{GTO2}\) architecture. The deep-penetrating south-dipping \(D_{GTOE}\) faults possibly provided the primary structural corridors for tapping metal-bearing fluids in the deeper crustal levels (Joly et al. 2012). The 3D geometry of the \(D_{GTO2}\) structures that formed during the subsequent deformation phase is primarily controlled by the \(D_{GTOE}\) architecture, and, since the \(D_{GTO2}\) structures are broadly synchronous with the mineralization, the \(D_{GTOE}{-}D_{GTO2}\) interaction possibly forms a key spatial control on fluid transport. In particular, the large \(D_{GTO2}\) faults that connect to \(D_{GTOE}\) faults at depth are considered the most likely active structural pathways for metal-bearing solutions during the \(D_{GTO2}\) deformation event. The NNW–SSE trending compression during the \(D_{GTO2}\) event also led to extensional opening of the pre-existing N–S trending \(D_{GTO1}\) faults, which could have provided additional structural pathways.

Most of the known deposits in the orogen are associated with anticlinal fold axes. The 4D modelling, supported by mine-scale studies, shows that the \(F_{GTO2}\) anticlines are coeval with gold mineralization and occur on the hanging walls of the \(D_{GTO2}\) faults. Therefore, \(D_{GTO2}\) anticlines constitute prime physical traps for gold in the orogen. The Stubbins Formation is compositionally heterogeneous, as opposed to the seismically transparent, lithologically homogenous, Killi-Killi Formation. It is characterized by stronger rheological and reactivity contrasts and contains ubiquitous iron-rich units and dolerite, which could facilitate gold deposition from the hydrothermal solutions.

The geology and orogenic gold mineralization in the GTO are discussed in detail by Joly et al. (2010, 2012). The input predictor maps for the RF modelling were drawn from a variety of publicly available sources. The lithologies in the study area were extracted from the interpreted bedrock geological map in the West Tanami geological information package of the Geological Survey of Western Australia (GSWA) (Geoscience Australia 2008). This database (Geoscience Australia 2008) contains all available geophysical datasets and the location of deposits, occurrences and prospects. The structural data were taken from Joly et al. (2010). Geochemical data were extracted from the GSWA state geochemistry database (Geoscience Australia 2010) and Geoscience Australia Ozchem (Champion et al. 2007).

The training samples (deposit and non-deposit) are shown in Figure 1. There were 33 gold training samples and 343 non-deposit training samples. The 33 gold samples were chosen from the same locations mentioned in Joly et al. (2012). The 343 non-deposits were randomly selected from the areas that are expected to be non-prospective on geological grounds. The details of key exploration criteria for orogenic gold deposits in the GTO along with the respective predictor maps that were used as inputs to the RF modelling are summarized in Table 1. The predictor maps and the procedures used to derive them are described in detail by Joly et al. (2012). The individual predictor maps are not shown in this paper due to space constraints.


The schematic workflow of the methodology used for processing data is given in Figure 2.
Figure 2

Schematic workflow of the processing done in this study

The RF model was initialized using 1000 regression trees and 12 predictor maps with 33 deposits and 343 non-deposits as training samples. The goodness of fit was evaluated by plotting the \(R^{2}\) curve. The important predictor maps were identified using the node purity measure. The original “sparse” data were balanced to bring the deposit-to-non-deposit ratio closer to 50:50 using the SMOTE technique (Chawla et al. 2002). The training samples of the minority class (gold deposits) were over-sampled in the feature space using SMOTE. The training samples of the majority class (non-deposits) were under-sampled by randomly removing samples until the minority class samples became closer in number to the majority class. By iteratively under-sampling the majority (non-deposit) class and over-sampling the minority (deposit) class, the initial bias towards the former was reversed in favour of the latter and a balanced training dataset was derived. We then constructed a new RF model using the balanced dataset and evaluated the goodness of fit. The R software (R Core Team 2013) was used for Random Forest model generation and prediction, while ArcGIS (McCoy et al. 2001) was used for generating the input predictor maps, mapping the output regression values and creating the prospectivity maps.
Figure 3

Flowchart of optimal threshold range determination for gold prospectivity mapping

For determining the optimal threshold for reclassification of output regression values, we applied three techniques, namely maximizing the sum of specificity and sensitivity, maximizing the per cent correctly classified (PCC) and maximizing kappa. The objectively identified optimal threshold ranges were used to identify three classes, namely “low prospectivity”, “moderate prospectivity” and “high prospectivity”. In the flowchart in Figure 3, when the three conditions of optimal threshold are “TRUE”, it signifies moderate prospectivity since these conditions are used to calculate the optimal threshold range. The optimal threshold ranges for the sparse and the balanced datasets are given in Table 4. The areas that fall below this optimal threshold range are marked “low prospectivity” zones, and the areas that lie above this optimal threshold range are marked “high prospectivity” zones.
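
A minimal sketch of this three-class reclassification is given below. The function names and cell values are hypothetical, and treating values that fall exactly on a bound as moderate is an assumption; the range 0.41–0.52 is the one this paper reports for its “equal” dataset and is used here only as an example.

```python
def reclassify(value, lower, upper):
    """Map one continuous RF output to a prospectivity class.

    Assumption: values exactly on a bound are treated as moderate.
    """
    if value < lower:
        return "low prospectivity"
    if value > upper:
        return "high prospectivity"
    return "moderate prospectivity"


def class_percentages(values, lower, upper):
    """Percentage of cells falling in each prospectivity class."""
    counts = {
        "low prospectivity": 0,
        "moderate prospectivity": 0,
        "high prospectivity": 0,
    }
    for v in values:
        counts[reclassify(v, lower, upper)] += 1
    return {name: 100.0 * k / len(values) for name, k in counts.items()}


# Hypothetical RF outputs for eight grid cells
cells = [0.05, 0.10, 0.30, 0.41, 0.47, 0.52, 0.60, 0.90]
shares = class_percentages(cells, lower=0.41, upper=0.52)
print(shares)
```

Applied to every cell of the output grid, the same rule yields the percentages of the study area assigned to each class.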

Results and Discussions

The original sparse dataset of 33 deposits and 343 non-deposits was balanced using SMOTE. The minority (deposit) class was over-sampled by 500%, and the majority (non-deposit) class was iteratively under-sampled by 10–800% to derive an optimal balanced dataset. Figure 4 shows that 500% over-sampling of the minority class combined with 100% under-sampling of the majority class provides the best classification accuracy.
Figure 4

Optimal majority class under-sampling percentage for 500% over-sampling of minority class

An over-sampling by 500% involves adding five synthetic deposit feature vectors for each known deposit, one along each direction between a known deposit feature vector and its five nearest neighbours. In this way, \((33 \times 5 =)\, 165\) new deposit feature vectors were added in the feature space, thereby giving rise to a total of 198 (= 165 + 33) deposit training samples. An under-sampling of 100% for the non-deposits was done, and the final number of non-deposit samples was calculated as (over-sampling %/100) × (under-sampling %/100) × (number of minority class samples). Thus, the number of non-deposit samples in our study became \((5 \times 1 \times 33 =)\, 165\). The synthetic balanced dataset therefore contained 198 deposit and 165 non-deposit samples, raising the ratio of deposit to non-deposit samples from 8.8:91.2 in the original sparse training data to 54.55:45.45 in the derived balanced training data. The balanced data were used for constructing the RF-based regression model. The RF regression predictions and the corresponding \(R^{2}\) fit for the sparse and balanced datasets are shown in Figures 5 and 6, respectively. The coefficient of determination \(R^{2}\) is 0.85 for the balanced dataset, which means that the predicted values fit the input values closely and give a better RF model than the original sparse dataset, for which \(R^{2}\) is 0.41.
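
The sample counts above can be checked with a few lines of arithmetic. The function name `balanced_counts` is hypothetical, and the sketch assumes that the number of retained non-deposits is the under-sampling percentage applied to the number of synthetic deposits, which reproduces the figures quoted in the text.

```python
def balanced_counts(n_minority, over_pct, under_pct):
    """Class sizes after SMOTE over-sampling and majority under-sampling."""
    synthetic = n_minority * over_pct // 100      # 33 * 500 / 100 = 165 new deposits
    minority_total = n_minority + synthetic       # original plus synthetic deposits
    majority_kept = synthetic * under_pct // 100  # non-deposits retained
    return minority_total, majority_kept


deposits, non_deposits = balanced_counts(33, over_pct=500, under_pct=100)
total = deposits + non_deposits
deposit_share = round(100 * deposits / total, 2)
non_deposit_share = round(100 * non_deposits / total, 2)
print(deposits, non_deposits)
print(deposit_share, non_deposit_share)

# Ratio in the original sparse dataset, for comparison
sparse_share = round(100 * 33 / (33 + 343), 1)
print(sparse_share)
```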
Figure 5

RF regression predictions for the “sparse” (a) and “balance” (b) datasets

Figure 6

Goodness of fit for the “sparse” (a) and “balance” (b) datasets

Table 2

Predictor importance of “sparse” dataset


| | Increase node purity |
|---|---|
| \(\texttt {F\_STR\_DRL\_}\) | |
| \(\texttt {F\_CHMDENS\_}\) | |
| \(\texttt {F\_CMPCNTDS}\) | |
| \(\texttt {F\_CNTCTDS\_}\) | |
| \(\texttt {F\_DOLDENS\_}\) | |
| \(\texttt {F\_D2ANTI\_I}\) | |
| \(\texttt {F\_D2\_FLT\_I}\) | |
| \(\texttt {F\_COMPCNT\_}\) | |
| \(\texttt {F\_CHMCNT\_I}\) | |
| \(\texttt {F\_DE\_FLT\_I}\) | |
| \(\texttt {F\_D1\_FLT\_I}\) | |
| \(\texttt {F\_D1XD2\_IN}\) | |


Table 3

Predictor importance of “balance” dataset


| | Increase node purity |
|---|---|
| \(\texttt {F\_STR\_DRL\_}\) | |
| \(\texttt {F\_CHMDENS\_}\) | |
| \(\texttt {F\_CHMCNT\_I}\) | |
| \(\texttt {F\_CNTCTDS\_}\) | |
| \(\texttt {F\_D2ANTI\_I}\) | |
| \(\texttt {F\_COMPCNT\_}\) | |
| \(\texttt {F\_CMPCNTDS}\) | |
| \(\texttt {F\_DOLDENS\_}\) | |
| \(\texttt {F\_D2\_FLT\_I}\) | |
| \(\texttt {F\_DE\_FLT\_I}\) | |
| \(\texttt {F\_D1\_FLT\_I}\) | |
| \(\texttt {F\_D1XD2\_IN}\) | |


The rankings of the input geological predictor maps derived using both the sparse and the balanced datasets are shown in Tables 2 and 3, respectively. The maps showing proximity to structures with elevated gold values \((F\_STR \_DRL \_)\) and geological contact density weighted by chemical contrast \((F\_CHMDENS\_)\) were ranked as the best predictors by both models. The predictor map \(F\_STR\_DRL\_\) is a proxy for favourable pathways for gold-bearing hydrothermal fluids. This map was generated by attributing deeply penetrating crustal scale faults with gold anomalies mapped from drilling and surface geochemical data. As such, this map is one of the most reliable predictors for passage of gold-bearing hydrothermal fluids (Joly et al. 2012). The \(F\_CHMDENS\_\) map was generated by mapping the geological contact density weighted by the chemical reactivity contrast (Joly et al. 2012). It represents the processes involved in the breakdown of soluble gold complexes by the reaction of hydrothermal fluids with reactive wall rocks. This \(F\_CHMDENS\_\) map is a significant vector to gold deposit mapping since it identifies gold traps that are pointers to gold deposits.
Table 4

Optimal threshold values using objective approaches


| | Max. (Sensitivity + Specificity) | Max. Kappa | PCC (%) | Optimal threshold range |
|---|---|---|---|---|
| Sparse dataset | | | | |
| Balance dataset | | | | |





Figure 7

Gold prospectivity mapping based on the (a) “sparse” dataset and (b) “balance” dataset

Table 5

Gold deposits prospectivity after applying the optimal threshold


| | Deposit samples | Non-deposit samples | Low prospectivity (%) | Moderate prospectivity (%) | High prospectivity (%) |
|---|---|---|---|---|---|













Figure 8

Prospectivity with predictor maps for top two parameters using the “balance” dataset (a) \(F\_STR\_DRL\_\) (b) \(F\_CHMDENS\_ \)

Figure 9

Prospectivity with predictor maps for bottom two parameters using the “balance” dataset (a) \(F\_D1\_FLT\_I\) (b) \(F\_D1XD2\_IN\)

The three techniques of objective threshold determination shown in Figure 3 were applied to mapping prospective exploration targets. The optimal threshold ranges determined by the three objective approaches with their values are shown in Table 4. The prospectivity maps for the “sparse” and the “balance” datasets are shown in Figure 7. Table 5 shows the percentage of gold prospectivity identified using the sparse and the balanced training datasets. The RF model based on the balanced training data identifies 4.67% of the study area as highly prospective while that based on the original sparse training data identifies 1.06% of the study area as highly prospective. This shows that the original sparse data under-estimated zones of high prospectivity.

The predictor maps for the top two and bottom two ranked predictors for the balanced dataset were analyzed by overlaying the output gold prospectivity map on them. It was observed that the predictor map of proximity to structures with elevated gold values \((F\_STR\_DRL\_)\) and the predictor map related to the geological contact density weighted by chemical contrast \((F\_CHMDENS\_)\) (Fig. 8) demonstrate high spatial correlation with gold prospectivity, which further validates the ranking of predictor maps returned by the RF algorithm. The bottom ranked predictor maps, namely proximity to D1 Fault \((F\_D1\_FLT\_I)\) and proximity to D2 and D1 fault intersection \((F\_D1XD2\_IN)\)(Fig. 9), do not show significant spatial correlation with gold prospectivity.

A study on gold prospectivity in the Tanami region was also conducted by Joly et al. (2012), who describe in detail a combination of three approaches for gold prospectivity mapping: a manual method, a fuzzy method and the weights-of-evidence (WofE) technique. The first two are knowledge-driven techniques and the third is data-driven. As commented by Joly et al. (2012), a prospectivity map is not a treasure map; it is a decision support tool. Different approaches to modelling provide different insights into the prospectivity of an area and should be used in conjunction to make decisions regarding exploration targeting.

The WofE technique applied in Joly et al. (2012) uses conditional probability for determining spatial association between a set of predictor maps and a set of gold deposits. The final posterior probability, which is a combination of the individual conditional probabilities, assumes conditional independence amongst the predictor maps that may not be real (Joly et al. 2012). Random Forest on the other hand does not assume conditional independence amongst the predictor maps. It creates independent trees by building each tree on a random bootstrap sample and by selecting a random subset of predictors for splitting at each node of every decision tree. In this study, RF created 1000 regression trees. It was observed by Svetnik et al. (2003) that the OOB error is the least when 1000 trees are taken in the RF. Random Forest creates high independence and low correlation amongst its trees, giving an unbiased prediction. On the other hand, Joly et al. (2012) observed that the assumption of conditional independence may lead to biased prospectivity maps.

Our RF-based prospectivity map using the balanced training dataset identifies similar but narrower “high” prospectivity zones compared to those in Joly et al. (2012). In our study, we applied an optimal threshold range based on objective estimates to convert the floating-point regression output to a classification map. Joly et al. (2012) used a capture efficiency curve of the WofE model for reclassification of the continuous map into a ternary prospectivity map. This led to 2.3 and 12% of the area being classified as “very high prospective” and “high prospective”, respectively. Our RF model using the balanced dataset established 4.67% of the area as highly prospective and 1.64% as moderately prospective. Therefore, the total area deemed prospective for gold deposits by Joly et al. (2012) is close to 15% of the study area while it is close to 6% of the study area in our model. This does not necessarily imply that the RF model is superior; smaller prospective zones could mean more false negatives [Type II error (Lieberman and Cunningham 2009)]. On the other hand, larger prospective areas could imply more false positives [Type I error (Lieberman and Cunningham 2009)]. Both techniques are useful in making decisions regarding exploration targeting under conditions of uncertainty.

An experiment was performed using a training dataset comprising 33 deposit and 33 non-deposit samples. This training dataset is referred to as the “equal” training dataset. The RF regression predictions for this equal dataset are shown in Figure 10. The goodness of fit plot (Fig. 11) shows that \(R^{2}\) is 0.46 for the RF model using the “equal” training dataset. Thus, the RF model is able to explain only 46% of the variance in the training dataset. The rankings of the predictor maps for the equal training dataset are shown in Table 6. The optimal threshold range for reclassification was found to be 0.41–0.52 for the equal dataset. The resulting prospectivity map is shown in Figure 12. The RF model using the equal dataset classified 15.69% of the area as highly prospective. The high prospectivity zones occur mostly in those areas where the rank 1 predictor map, geological contact density weighted by chemical contrast \((F\_CHMDENS\_)\), identifies gold traps, as shown in Figure 13.
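A threshold range such as 0.41–0.52 can be used to reclassify continuous RF output into prospectivity classes and to report the share of the area in each class. The sketch below assumes, for illustration only, that cells above the range are “high”, cells within it are “moderate” and cells below it are “low”, and that all cells have equal area. The cell values are hypothetical.

```python
def reclassify(value, lower, upper):
    """Map one continuous RF output to a prospectivity class (assumed scheme)."""
    if value > upper:
        return "high"
    if value >= lower:
        return "moderate"
    return "low"

def area_percentages(values, lower, upper):
    """Percentage of equal-area cells falling in each class."""
    counts = {"low": 0, "moderate": 0, "high": 0}
    for value in values:
        counts[reclassify(value, lower, upper)] += 1
    total = len(values)
    return {name: 100.0 * count / total for name, count in counts.items()}

# Hypothetical RF outputs for ten grid cells
cells = [0.05, 0.10, 0.30, 0.45, 0.50, 0.60, 0.75, 0.20, 0.15, 0.35]
shares = area_percentages(cells, lower=0.41, upper=0.52)
print(shares)
```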
Figure 10. RF regression predictions for “equal” dataset

Figure 11. Goodness of fit for the “equal” dataset

Figure 12. RF prospectivity map for “equal” dataset

Table 6. Predictor importance of “equal” dataset

Increase node purity

  • \(\texttt {F\_CHMDENS\_}\)
  • \(\texttt {F\_STR\_DRL\_}\)
  • \(\texttt {F\_CMPCNTDS}\)
  • \(\texttt {F\_DOLDENS\_}\)
  • \(\texttt {F\_CNTCTDS\_}\)
  • \(\texttt {F\_CHMCNT\_I}\)
  • \(\texttt {F\_COMPCNT\_}\)
  • \(\texttt {F\_D2\_FLT\_I}\)
  • \(\texttt {F\_D2ANTI\_I}\)
  • \(\texttt {F\_DE\_FLT\_I}\)
  • \(\texttt {F\_D1\_FLT\_I}\)
  • \(\texttt {F\_D1XD2\_IN}\)


Figure 13. Prospectivity with predictor map for top parameter \((F\_CHMDENS\_)\) of the “equal” dataset

Figure 14. False positive determination for the “balanced” (a) and “equal” (b) datasets

Although the equal training dataset delineates wider zones of high prospectivity than the balanced training dataset, many of these areas contain several non-deposit points (Fig. 14b). In other words, the equal training dataset leads to an increase in false positives compared to the balanced training dataset (Fig. 14a). The RF model using the equal training dataset had a goodness of fit \((R^{2}=0.46)\) that is much lower than that of the balanced training dataset \((R^{2}=0.85)\). We observed that a low \(R^{2}\) value is associated with a prediction that either overfits the training samples, as in the case of the sparse dataset \((R^{2}=0.41)\), or contains more false positives, as in the case of the equal dataset \((R^{2}=0.46)\).
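For reference, the goodness-of-fit values compared above can be read as the fraction of variance in the observed deposit indicator that the predictions explain. The sketch below uses the standard coefficient of determination, \(R^{2} = 1 - SS_{res}/SS_{tot}\); whether the study computed \(R^{2}\) in exactly this way is an assumption, and the observed and predicted values are hypothetical.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_observed = sum(observed) / len(observed)
    ss_total = sum((y - mean_observed) ** 2 for y in observed)
    ss_residual = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return 1.0 - ss_residual / ss_total

# Hypothetical deposit indicator (1 = deposit, 0 = non-deposit) and RF outputs
observed = [1, 1, 0, 0]
predicted = [0.8, 0.6, 0.3, 0.1]
r2 = r_squared(observed, predicted)
print(round(r2, 2))
```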


This study applied RF-based regression for prospectivity modelling of gold deposits in the Tanami region, a greenfields province in Western Australia that contains very few deposits. The imbalance between the number of deposit (minority class) and non-deposit (majority class) samples causes severe problems in training the RF model, which becomes biased towards the majority class. In this study, we applied the SMOTE technique to balance the training dataset. This technique involves over-sampling of the minority class and under-sampling of the majority class in order to bring the sample ratio of the two classes close to 1:1.
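The core idea of the balancing step can be sketched as follows. This is a simplified variant for illustration, not the implementation used in this study: each synthetic minority sample is placed at a random position on the line segment between a randomly chosen minority sample and one of its nearest minority neighbours, and the majority class is thinned by random selection. The feature values, the neighbour count `k` and the sample counts are hypothetical.

```python
import math
import random

def k_nearest(sample, others, k):
    """The k samples in `others` closest to `sample` (Euclidean distance)."""
    return sorted(others, key=lambda other: math.dist(sample, other))[:k]

def oversample_minority(minority, n_synthetic, k, rng):
    """Create synthetic minority samples by interpolating between neighbours."""
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        others = [m for m in minority if m is not base]
        neighbour = rng.choice(k_nearest(base, others, k))
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

def undersample_majority(majority, n_keep, rng):
    """Keep a random subset of the majority class."""
    return rng.sample(majority, n_keep)

rng = random.Random(42)
# Hypothetical two-predictor feature vectors
deposits = [(0.10, 0.90), (0.20, 0.80), (0.30, 0.70), (0.15, 0.85)]
non_deposits = [(rng.random(), rng.random()) for _ in range(100)]

synthetic = oversample_minority(deposits, n_synthetic=8, k=2, rng=rng)
kept = undersample_majority(non_deposits, n_keep=12, rng=rng)
print(len(deposits) + len(synthetic), len(kept))
```

With these hypothetical counts, the 4 original plus 8 synthetic deposit samples match the 12 retained non-deposit samples, giving a 1:1 ratio.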

The balanced dataset identified wider high-prospectivity zones for gold deposits (4.67% of the study area) compared to the sparse dataset (1.06%). An RF model with a low \(R^{2}\) value may either overfit the training data or produce predictions with more false positives. Our model indicates that the predictor map for proximity to structures with elevated gold values and the predictor map for geological contact density weighted by chemical contrast are the most important predictors of gold deposits in this study. The top two predictor maps ranked by RF showed high spatial correlation with gold prospectivity.


The authors would like to thank the two anonymous reviewers for their insightful comments and suggestions, which we believe have improved the overall technical quality of the paper. We also thank the editors of Natural Resources Research for suggesting edits to the manuscript.

Copyright information

© International Association for Mathematical Geosciences 2017

Authors and Affiliations

  • Siddharth Hariharan (1)
  • Siddhesh Tirodkar (2)
  • Alok Porwal (1, 3)
  • Avik Bhattacharya (1)
  • Aurore Joly (4)

  1. Centre of Studies in Resources Engineering, Indian Institute of Technology Bombay, Mumbai, India
  2. Climate Studies, Indian Institute of Technology Bombay, Mumbai, India
  3. Centre for Exploration Targeting, University of Western Australia, Crawley, Australia
  4. Aurora Australis Geoconsulting, Subiaco, Australia
