Using the right tool for the job: the difference between unsupervised and supervised analyses of multivariate ecological data

Abstract

Ecologists often collect data with the aim of determining which of many variables are associated with a particular cause or consequence. Unsupervised analyses (e.g., principal components analysis, PCA) summarize variation in the data, without regard to the response. Supervised analyses (e.g., partial least squares, PLS) evaluate the variables to find the combination that best explains a causal relationship. These approaches are not interchangeable, especially when the variables most responsible for a causal relationship are not the greatest source of overall variation in the data, a situation that ecologists are likely to encounter. To illustrate the differences between unsupervised and supervised techniques, we analyze a published dataset using both PCA and PLS and compare the questions and answers associated with each method. We also use simulated datasets representing situations that further illustrate differences between unsupervised and supervised analyses. For simulated data with many correlated variables that were unrelated to the response, PLS was better than PCA at identifying which variables were associated with the response. There are many applications for both unsupervised and supervised approaches in ecology. However, PCA is currently overused, at least in part because supervised approaches, such as PLS, are less familiar.
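The contrast described in the abstract can be sketched with a small simulation. This is a minimal illustration, not the authors' archived R analysis: it uses only the Python standard library, and the sample size, noise levels, variable layout, and function names are all assumptions made for the example. Five strongly correlated variables are unrelated to the response, and a sixth, independent variable drives it. The first PCA axis (leading eigenvector of the correlation matrix, found by power iteration) is compared with the first PLS weight vector for a single response (proportional to the covariance of each predictor with the response).

```python
import math
import random


def standardize(column):
    """Center a list of numbers and scale it to unit sample standard deviation."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / (n - 1))
    return [(v - mean) / sd for v in column]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    """Rescale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def first_pca_axis(columns, iterations=200):
    """Leading eigenvector of the correlation matrix, via power iteration."""
    n = len(columns[0])
    corr = [[dot(a, b) / (n - 1) for b in columns] for a in columns]
    v = unit([1.0] * len(columns))
    for _ in range(iterations):
        v = unit([dot(row, v) for row in corr])
    return v


def first_pls_weights(columns, response):
    """First PLS weight vector for one response: proportional to cov(x_j, y)."""
    return unit([dot(col, response) for col in columns])


# Hypothetical simulated data (all settings are assumptions for illustration).
random.seed(1)
n = 500
latent = [random.gauss(0, 1) for _ in range(n)]
# Five correlated variables that share a latent factor and ignore the response.
noise_block = [[z + 0.3 * random.gauss(0, 1) for z in latent] for _ in range(5)]
# One independent variable that actually drives the response.
signal = [random.gauss(0, 1) for _ in range(n)]
y = [s + 0.3 * random.gauss(0, 1) for s in signal]

X = [standardize(col) for col in noise_block + [signal]]  # signal is last
y_std = standardize(y)

pca_axis = first_pca_axis(X)
pls_weights = first_pls_weights(X, y_std)

print("PCA axis 1 loadings:", [round(v, 2) for v in pca_axis])
print("PLS axis 1 weights: ", [round(v, 2) for v in pls_weights])
```

In this setup the first PCA axis should load mainly on the correlated block, because that block is the largest source of variation, while the first PLS weight vector should load mainly on the last variable, the only one related to the response. Signs of the loadings are arbitrary, so only magnitudes are meaningful.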

Data accessibility

Reproducible R code for generating and analyzing data is archived in Zenodo (https://doi.org/10.5281/zenodo.3568392). Complete data from Muir et al. (2017b) are available on Data Dryad.

References

  1. Aguilera AM, Escabias M, Valderrama MJ (2006) Using principal components for estimating logistic regression with high-dimensional multicollinear data. Comput Stat Data Anal 50:1905–1924. https://doi.org/10.1016/J.CSDA.2005.03.011

  2. Anderson MJ (2001) A new method for non-parametric multivariate analysis of variance. Austral Ecol 26:32–46. https://doi.org/10.1111/j.1442-9993.2001.01070.pp.x

  3. Aplin P (2005) Remote sensing: ecology. Prog Phys Geogr 29:104–113. https://doi.org/10.1191/030913305pp437pr

  4. Berger B, Parent B, Tester M (2010) High-throughput shoot imaging to study drought responses. J Exp Bot 61:3519–3528. https://doi.org/10.1093/jxb/erq201

  5. Bonney R, Cooper CB, Dickinson J, Kelling S, Phillips T, Rosenberg KV, Shirk J (2009) Citizen science: a developing tool for expanding science knowledge and scientific literacy. Bioscience 59:977–984. https://doi.org/10.1525/bio.2009.59.11.9

  6. Borcard D, Gillet F, Legendre P (2018) Numerical ecology with R. Springer International Publishing, Cham

  7. Cardini A, O’Higgins P, Rohlf FJ (2019) Seeing distinct groups where there are none: spurious patterns from between-group PCA. Evol Biol 46:303–316. https://doi.org/10.1007/s11692-019-09487-5

  8. Carrascal LM, Galván I, Gordo O (2009) Partial least squares regression as an alternative to current regression methods used in ecology. Oikos 118:681–690. https://doi.org/10.1111/j.1600-0706.2008.16881.x

  9. Cooke SJ, Hinch SG, Wikelski M, Andrews RD, Kuchel LJ, Wolcott TG, Butler PJ (2004) Biotelemetry: a mechanistic approach to ecology. Trends Ecol Evol 19:334–343. https://doi.org/10.1016/J.TREE.2004.04.003

  10. Dickinson JL, Shirk J, Bonter D, Bonney R, Crain RL, Martin J, Phillips T, Purcell K (2012) The current state of citizen science as a tool for ecological research and public engagement. Front Ecol Environ 10:291–297. https://doi.org/10.1890/110236

  11. Dray S, Chessel D, Thioulouse J (2003) Co-inertia analysis and the linking of ecological data tables. Ecology 84:3078–3089. https://doi.org/10.1890/03-0178

  12. Dray S, Pélissier R, Couteron P, Fortin MJ, Legendre P, Peres-Neto PR, Bellier E, Bivand R, Blanchet FG, De Cáceres M, Dufour AB, Heegaard E, Jombart T, Munoz F, Oksanen J, Thioulouse J, Wagner HH (2012) Community ecology in the age of multivariate multiscale spatial analysis. Ecol Monogr 82:257–275. https://doi.org/10.1890/11-1183.1

  13. Eriksson L, Johansson E, Kettaneh-Wold N, Trygg J, Wikström C, Wold S (2006) Multi- and megavariate data analysis part 1: basic principles and applications. Umetrics AB, Umeå, Sweden

  14. Fahlgren N, Gehan MA, Baxter I (2015) Lights, camera, action: high-throughput plant phenotyping is ready for a close-up. Curr Opin Plant Biol 24:93–99. https://doi.org/10.1016/J.PBI.2015.02.006

  15. Fick SE, Hijmans RJ (2017) WorldClim 2: new 1-km spatial resolution climate surfaces for global land areas. Int J Climatol. https://doi.org/10.1002/joc.5086

  16. Gotelli NJ, Ellison AM (2013) A primer of ecological statistics, 2nd edn. Sinauer Associates Inc Publishers, Sunderland

  17. Hervé MR, Nicolè F, Lê Cao K-A (2018) Multivariate analysis of multiple datasets: a practical guide for chemical ecology. J Chem Ecol 44:215–234. https://doi.org/10.1007/s10886-018-0932-6

  18. Jolliffe IT (1982) A note on the use of principal components in regression. Appl Stat 31:300. https://doi.org/10.2307/2348005

  19. Kallenbach M, Oh Y, Eilers EJ, Veit D, Baldwin IT, Schuman MC (2014) A robust, simple, high-throughput technique for time-resolved plant volatile analysis in field experiments. Plant J 78:1060–1072. https://doi.org/10.1111/tpj.12523

  20. Kfoury N, Scott E, Orians C, Robbat A (2017) Direct contact sorptive extraction: a robust method for sampling plant volatiles in the field. J Agric Food Chem 65:8501–8509. https://doi.org/10.1021/acs.jafc.7b02847

  21. Kjeldahl K, Bro R (2010) Some common misunderstandings in chemometrics. J Chemom 24:558–564. https://doi.org/10.1002/cem.1346

  22. Kuhn M, Wickham H (2019) Rsample: general resampling infrastructure. https://cran.r-project.org/package=rsample

  23. Legendre P, Legendre L (1998) Numerical ecology. Elsevier, Amsterdam

  24. Muir CD, Conesa MÀ, Roldán EJ, Molins A, Galmés J (2017a) Weak coordination between leaf structure and function among closely related tomato species. New Phytol 213:1642–1653. https://doi.org/10.1111/nph.14285

  25. Muir CD, Conesa MÀ, Roldán EJ, Molins A, Galmés J (2017b) Data from: weak coordination between leaf structure and function among closely related tomato species. Dryad Digit Repos. https://doi.org/10.5061/dryad.1r8c2

  26. Orloci L (1966) Geometric models in ecology: I. The theory and application of some ordination methods. J Ecol 54:193. https://doi.org/10.2307/2257667

  27. Porter J, Arzberger P, Braun H-W, Bryant P, Gage S, Hansen T, Hanson P, Lin C-C, Lin F-P, Kratz T, Michener W, Shapiro S, Williams T (2005) Wireless sensor networks for ecology. Bioscience 55:561–572. https://doi.org/10.1641/0006-3568(2005)055[0561:WSNFE]2.0.CO;2

  28. R Core Team (2018) R: a language and environment for statistical computing

  29. Reuter JA, Spacek DV, Snyder MP (2015) High-throughput sequencing technologies. Mol Cell 58:586–597. https://doi.org/10.1016/J.MOLCEL.2015.05.004

  30. Roughgarden J, Running SW, Matson PA (1991) What does remote sensing do for ecology? Ecology 72:1918–1922. https://doi.org/10.2307/1941546

  31. Scott ER (2019a) Cupcakes vs muffins: Round 2. www.ericrscott.com/post/cupcakes-vs-muffins-round-2/. Accessed 17 Sep 2020

  32. Scott ER (2019b) Holodeck: a tidy interface for simulating multivariate data. https://cran.r-project.org/package=holodeck

  33. Silvertown J (2009) A new dawn for citizen science. Trends Ecol Evol 24:467–471. https://doi.org/10.1016/J.TREE.2009.03.017

  34. Simpson RK, McGraw KJ (2018) It’s not just what you have, but how you use it: solar-positional and behavioural effects on hummingbird colour appearance during courtship. Ecol Lett 21:1413–1422. https://doi.org/10.1111/ele.13125

  35. Thévenot EA, Roux A, Xu Y, Ezan E, Junot C (2015) Analysis of the human adult urinary metabolome variations with age, body mass index, and gender by implementing a comprehensive workflow for univariate and OPLS statistical analyses. J Proteome Res 14:3322–3335. https://doi.org/10.1021/acs.jproteome.5b00354

  36. Tiede Y, Hemp C, Schmidt A, Nauss T, Farwig N, Brandl R (2018) Beyond body size: consistent decrease of traits within orthopteran assemblages with elevation. Ecology 99:2090–2102. https://doi.org/10.1002/ecy.2436

  37. Tjur T (2009) Coefficients of determination in logistic regression models—a new proposal: the coefficient of discrimination. Am Stat 63:366–372. https://doi.org/10.1198/tast.2009.08210

  38. Valverde-Barrantes OJ, Smemo KA, Feinstein LM, Kershner MW, Blackwood CB (2018) Patterns in spatial distribution and root trait syndromes for ecto and arbuscular mycorrhizal temperate trees in a mixed broadleaf forest. Oecologia 186:731–741. https://doi.org/10.1007/s00442-017-4044-8

  39. Westerhuis JA, Hoefsloot HCJ, Smit S, Vis DJ, Smilde AK, van Velzen EJJ, van Duijnhoven JPM, van Dorsten FA (2008) Assessment of PLSDA cross validation. Metabolomics 4:81–89. https://doi.org/10.1007/s11306-007-0099-6

  40. Wiggins WD, Wilder SM (2018) Mismatch between dietary requirements for lipid by a predator and availability of lipid in prey. Oikos 127:1024–1032. https://doi.org/10.1111/oik.04766

  41. Wold H (1975) Soft modelling by latent variables: the non-linear iterative partial least squares (NIPALS) approach. J Appl Probab 12:117–142. https://doi.org/10.1017/S0021900200047604

  42. Worley B, Powers R (2016) PCA as a practical indicator of OPLS-DA model reliability. Curr Metab 4:97–103. https://doi.org/10.2174/2213235X04666160613122429

  43. Worley B, Halouska S, Powers R (2013) Utilities for quantifying separation in PCA/PLS-DA scores plots. Anal Biochem 433:102–104. https://doi.org/10.1016/J.AB.2012.10.011

  44. Wright IJ, Reich PB, Westoby M, Ackerly DD, Baruch Z, Bongers F, Cavender-Bares J, Chapin T, Cornelissen JHC, Diemer M, Flexas J, Garnier E, Groom PK, Gulias J, Hikosaka K, Lamont BB, Lee T, Lee W, Lusk C, Midgley JJ, Navas M-L, Niinemets Ü, Oleksyn J, Osada N, Poorter H, Poot P, Prior L, Pyankov VI, Roumet C, Thomas SC, Tjoelker MG, Veneklaas EJ, Villar R (2004) The worldwide leaf economics spectrum. Nature 428:821–827. https://doi.org/10.1038/nature02403

Acknowledgements

We thank Christopher Muir and Colin M. Orians for comments on a draft of this manuscript.

Author information

Contributions

ERS and EEC conceived and designed the study. ERS analyzed data and led the writing of the manuscript. Both authors contributed significantly to drafts and approve the final version for publication.

Corresponding author

Correspondence to Eric R. Scott.

Ethics declarations

Conflict of interest

The authors declare that they have no conflicts of interest.

Additional information

Supervised multivariate analyses are underutilized in ecology. These analyses give different results than unsupervised approaches (e.g., PCA), which find the main axes of variation without respect to a response. Here, we show that unsupervised and supervised approaches are not interchangeable and require different interpretation. In particular, unsupervised approaches are likely to miss significant relationships with variables that are not part of a main axis of variation, a situation that may be common in ecological datasets.

Communicated by Casey P. terHorst.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (DOCX 96 KB)

Supplementary file2 (XLSX 47 KB)

About this article

Cite this article

Scott, E.R., Crone, E.E. Using the right tool for the job: the difference between unsupervised and supervised analyses of multivariate ecological data. Oecologia (2021). https://doi.org/10.1007/s00442-020-04848-w

Keywords

  • Principal components analysis (PCA)
  • Partial least squares (PLS) regression
  • Principal components regression
  • Multivariate statistics
  • Supervised analyses