
Text-Based Annotation of Scientific Images Using Wikimedia Categories

  • Conference paper
Database and Expert Systems Applications (DEXA 2018)

Abstract

The reuse of scientific raw data is a key demand of Open Science. In the NOA project, we foster the reuse of scientific images by collecting them and uploading them to Wikimedia Commons. In this paper, we present a text-based annotation method that proposes Wikipedia categories for open access images. The assigned categories can be used for image retrieval or for uploading images to Wikimedia Commons. The annotation consists of two phases: extracting salient keywords and mapping these keywords to categories. The results are evaluated on a small set of manually annotated open access images.
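The two phases named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the frequency-based keyword ranking, the substring-style category matching, the stopword list, the category inventory, and the example caption are all assumptions chosen only to make the pipeline shape concrete.

```python
import re
from collections import Counter

# Hypothetical stopword list and category inventory, for illustration only.
STOPWORDS = {"a", "an", "and", "in", "of", "the", "to", "with", "for", "on", "is", "are"}
CATEGORIES = ["Scanning electron microscopy", "Corrosion", "Steel"]


def extract_keywords(text, top_n=5):
    """Phase 1 (simplified): rank non-stopword tokens by raw frequency."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    return [word for word, _ in counts.most_common(top_n)]


def map_to_categories(keywords, categories):
    """Phase 2 (simplified): propose categories whose name shares a word with a keyword."""
    keyword_set = set(keywords)
    proposals = []
    for category in categories:
        category_words = set(re.findall(r"[a-z]+", category.lower()))
        if category_words & keyword_set:
            proposals.append(category)
    return proposals


# Invented caption, not taken from the evaluated image set.
caption = "Corrosion of steel samples: corrosion pits observed on the steel surface."
keywords = extract_keywords(caption)
proposed = map_to_categories(keywords, CATEGORIES)
print(keywords)
print(proposed)
```

A real system would likely need a stronger salience measure than raw frequency and a lookup against the actual category graph instead of a hard-coded list; the sketch only shows how keyword extraction feeds category proposal.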

Notes

  1. For an example of an extremely long caption, see Fig. 5 in http://dx.doi.org/10.1002/ece3.2579. Some parsing errors also resulted in long captions.

  2. Our source code will be released together with all other source code developed in the NOA project.

  3. The images are available on Wikimedia Commons at https://commons.wikimedia.org/w/index.php?title=Special:ListFiles/Sohmen&ilshowall=1.

Acknowledgment

The presented work was developed within the NOA Project - Automatic Harvesting, Indexing and Provision of Open Access Figures from the Fields of Engineering and Technology Using the Infrastructure of Wikimedia Commons and Wikidata - funded by the DFG under grant number 315976924. NOA is a cooperative project of the Hochschule Hannover and the Technische Informationsbibliothek Hannover. We would like to thank the NOA project team.

Corresponding author

Correspondence to Christian Wartena.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Josi, F., Wartena, C., Charbonnier, J. (2018). Text-Based Annotation of Scientific Images Using Wikimedia Categories. In: Elloumi, M., et al. Database and Expert Systems Applications. DEXA 2018. Communications in Computer and Information Science, vol 903. Springer, Cham. https://doi.org/10.1007/978-3-319-99133-7_20

  • DOI: https://doi.org/10.1007/978-3-319-99133-7_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-99132-0

  • Online ISBN: 978-3-319-99133-7

  • eBook Packages: Computer Science, Computer Science (R0)
