Abstract
Data-driven prospectivity modelling of greenfields terrains is challenging because very few known deposits are available, so the training data are overwhelmingly dominated by non-deposit samples. This imbalance can bias the estimates of model parameters. In the present study, which involves Random Forest (RF)-based gold prospectivity modelling of the Tanami region, a greenfields terrain in Western Australia, we apply the Synthetic Minority Over-sampling Technique (SMOTE) to modify the initial dataset and bring the deposit-to-non-deposit ratio closer to 50:50. An optimal threshold range is determined objectively using statistical measures such as sensitivity, specificity, kappa and per cent correctly classified. RF regression modelling with the modified dataset, in which the deposit-to-non-deposit ratio is close to 50:50, delineates 4.67% of the study area as highly prospective, compared with only 1.06% for the original dataset, implying that the original “sparse” dataset underestimates prospectivity.
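The two steps named in the abstract, oversampling the deposit class and scoring a candidate threshold, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `smote` and `threshold_metrics`, the toy feature vectors and the scores are all hypothetical, and the oversampler is a simplified variant of SMOTE that picks its base sample at random rather than cycling through every minority sample.

```python
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Simplified SMOTE: interpolate between a minority sample and one of
    its k nearest minority neighbours to create each synthetic sample."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position on the segment from base to neighbour
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


def threshold_metrics(scores, labels, threshold):
    """Sensitivity, specificity, kappa and per cent correctly classified
    (PCC) when scores >= threshold are mapped as prospective (label 1)."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    n = tp + fp + tn + fn
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    observed = (tp + tn) / n
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "kappa": kappa,
        "pcc": 100.0 * observed,
    }


# Toy data (hypothetical): 3 deposit and 10 non-deposit feature vectors.
deposits = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
n_non_deposits = 10
extra = smote(deposits, n_non_deposits - len(deposits), k=2)
print(len(deposits) + len(extra), "deposit samples after oversampling")

# Scan candidate thresholds on hypothetical RF regression outputs.
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
for t in (0.25, 0.5, 0.75):
    print(t, threshold_metrics(scores, labels, t))
```

Scanning a range of thresholds in this way and comparing the four measures is one way to read "an optimal threshold range"; how the measures are combined into a single choice is left open here.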
Acknowledgements
The authors would like to thank the two anonymous reviewers for their insightful comments and suggestions, which we believe have improved the overall technical quality of the paper. We also thank the editors of Natural Resources Research for suggesting edits to the manuscript.
Cite this article
Hariharan, S., Tirodkar, S., Porwal, A. et al. Random Forest-Based Prospectivity Modelling of Greenfield Terrains Using Sparse Deposit Data: An Example from the Tanami Region, Western Australia. Nat Resour Res 26, 489–507 (2017). https://doi.org/10.1007/s11053-017-9335-6
DOI: https://doi.org/10.1007/s11053-017-9335-6