Extracting Predictive Models from Marked-Up Free-Text Documents at the Royal Botanic Gardens, Kew, London

  • Allan Tucker
  • Don Kirkup
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8819)


In this paper we explore the combination of text mining, unsupervised learning and supervised learning to extract predictive models from a corpus of digitised historical floras. These documents deal with the nomenclature, geographical distribution, ecology and comparative morphology of the species of a region. Here we exploit the fact that portions of text in the floras are marked up as different types of trait and habitat. From these marked-up texts we infer models that predict habitat types from the traits of plant species. We also integrate plant taxonomy data to assist in validating our models. We show that by clustering the text describing habitats in different floras we can identify a number of important and distinct habitats that are associated with particular families of species, along with statistical significance scores. We also show that, using these discovered habitat types as labels for supervised learning, we can predict them from a subset of traits identified using wrapper feature selection.
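
The supervised step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes binary trait indicators per species, habitat labels that have already been obtained by clustering, a Bernoulli naive Bayes classifier, leave-one-out accuracy as the wrapper score, and greedy forward search. The toy data, the trait meanings and every function name below are hypothetical.

```python
from collections import Counter, defaultdict
import math


def train_nb(rows, labels, feats):
    """Count class sizes and, per (class, feature), how many rows have the feature set."""
    class_counts = Counter(labels)
    on_counts = defaultdict(int)
    for row, y in zip(rows, labels):
        for f in feats:
            on_counts[(y, f)] += row[f]
    return class_counts, on_counts


def predict_nb(model, row, feats):
    """Return the class with the highest log-posterior under Laplace smoothing."""
    class_counts, on_counts = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for y in sorted(class_counts):  # sorted so that ties break deterministically
        n = class_counts[y]
        score = math.log(n / total)
        for f in feats:
            p_on = (on_counts[(y, f)] + 1) / (n + 2)
            score += math.log(p_on if row[f] else 1 - p_on)
        if score > best_score:
            best_label, best_score = y, score
    return best_label


def loo_accuracy(rows, labels, feats):
    """Leave-one-out accuracy of the classifier restricted to `feats`."""
    hits = 0
    for i in range(len(rows)):
        model = train_nb(rows[:i] + rows[i + 1:], labels[:i] + labels[i + 1:], feats)
        hits += predict_nb(model, rows[i], feats) == labels[i]
    return hits / len(rows)


def forward_wrapper(rows, labels, n_features):
    """Greedy forward selection: add the trait that most improves the wrapper score."""
    selected = []
    best = loo_accuracy(rows, labels, selected)  # baseline using class priors only
    while True:
        candidates = [f for f in range(n_features) if f not in selected]
        if not candidates:
            break
        # Highest accuracy wins; on equal accuracy the lowest feature index wins.
        acc, neg_f = max((loo_accuracy(rows, labels, selected + [f]), -f) for f in candidates)
        if acc <= best:  # stop when no candidate improves the score
            break
        selected.append(-neg_f)
        best = acc
    return selected, best


# Hypothetical toy data: one row per species, columns are binary traits.
# Trait 0 separates the two habitat clusters; traits 1 and 2 are uninformative.
rows = [[1, 0, 1], [1, 1, 0], [1, 0, 0], [1, 1, 1],
        [0, 0, 1], [0, 1, 0], [0, 0, 0], [0, 1, 1]]
labels = ["dry"] * 4 + ["wet"] * 4  # assumed output of the habitat clustering step

selected, accuracy = forward_wrapper(rows, labels, 3)
print(selected, accuracy)
```

A wrapper scores each candidate subset with the classifier itself rather than with a classifier-independent statistic, which is what distinguishes it from a filter approach. The greedy search here is only one of several possible search strategies.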


Habitat Type; Text Mining; Plant Trait; Sentiment Analysis; Royal Botanical Garden
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Allan Tucker (1)
  • Don Kirkup (2)
  1. Department of Computer Science, Brunel University, UK
  2. Royal Botanical Gardens at Kew, UK
