Text-Based Annotation of Scientific Images Using Wikimedia Categories

Josi, Frieda; Wartena, Christian; Charbonnier, Jean

doi:10.1007/978-3-319-99133-7_20

Part of the book series: Communications in Computer and Information Science (CCIS, volume 903)

Included in the following conference series:

International Conference on Database and Expert Systems Applications

Abstract

The reuse of scientific raw data is a key demand of Open Science. In the NOA project, we foster the reuse of scientific images by collecting them and uploading them to Wikimedia Commons. In this paper, we present a text-based annotation method that proposes Wikipedia categories for open access images. The assigned categories can be used for image retrieval or for uploading images to Wikimedia Commons. The annotation consists of two phases: extracting salient keywords and mapping these keywords to categories. The results are evaluated on a small set of manually annotated open access images.
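The two phases named in the abstract can be illustrated with a minimal sketch. The abstract does not say how keywords are scored or how they are mapped, so everything below is an assumption made for illustration: TF-IDF against a handful of background captions stands in for the keyword extraction, and a hand-written lookup table stands in for the category mapping. The function names, the stopword list, the sample captions, and the category names are all hypothetical and are not taken from the NOA source code.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (assumption, not the paper's list).
STOPWORDS = {"a", "an", "and", "of", "the", "in", "on", "for", "to", "with"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def extract_keywords(caption, background, top_n=3):
    """Phase 1 (assumed scoring): rank caption terms by TF-IDF."""
    docs = [set(tokenize(d)) for d in background]
    tf = Counter(tokenize(caption))
    n_docs = len(docs) + 1  # background captions plus the caption itself
    scores = {}
    for term, freq in tf.items():
        df = 1 + sum(term in d for d in docs)  # the caption always contains the term
        scores[term] = freq * math.log(n_docs / df)
    # Sort by descending score; break ties alphabetically for a stable order.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:top_n]


def map_to_categories(keywords, category_index):
    """Phase 2 (assumed mapping): look up each keyword in a keyword-to-category table."""
    return [category_index[k] for k in keywords if k in category_index]


# Hypothetical sample data.
background = [
    "Image of the study area",
    "Image of the test setup",
    "Histogram of particle sizes",
]
caption = "Microscope image of a turbine blade, turbine in operation"
category_index = {"turbine": "Turbines", "microscope": "Microscopes"}

keywords = extract_keywords(caption, background)
print(keywords)
print(map_to_categories(keywords, category_index))
```

In this sketch the word "image" is ranked low because it occurs in most background captions, while "turbine" is ranked high because it is repeated in the caption and absent elsewhere. A real mapping step would need to handle multi-word terms, synonyms, and ambiguous keywords, which a plain dictionary lookup does not.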

Notes

1. For an example of an extremely long caption, see Fig. 5 in http://dx.doi.org/10.1002/ece3.2579. Some parsing errors also resulted in long captions.
2. Our source code will be released together with the rest of the source code developed in the NOA project.
3. The images are available on Wikimedia Commons at the following link: https://commons.wikimedia.org/w/index.php?title=Special:ListFiles/Sohmen&ilshowall=1.

References

Charbonnier, J., Sohmen, L., Rothman, J., Rohden, B., Wartena, C.: NOA: a search engine for reusable scientific images beyond the life sciences. In: Pasi, G., Piwowarski, B., Azzopardi, L., Hanbury, A. (eds.) ECIR 2018. LNCS, vol. 10772, pp. 797–800. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-76941-7_78
Mihalcea, R., Csomai, A.: Wikify!: linking documents to encyclopedic knowledge. In: Proceedings of the Sixteenth ACM Conference on Conference on Information and Knowledge Management, CIKM 2007, pp. 233–242. ACM, New York (2007). https://doi.org/10.1145/1321440.1321475
Medelyan, O., Witten, I.H., Milne, D.N.: Topic indexing with Wikipedia. AAAI Technical report WS-08-15, pp. 19–24 (2008). http://researchcommons.waikato.ac.nz/handle/10289/1776
Wartena, C., Brussee, R.: Instance-based mapping between thesauri and folksonomies. In: Sheth, A., et al. (eds.) ISWC 2008. LNCS, vol. 5318, pp. 356–370. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-88564-1_23
Frank, E., Paynter, G.W., Witten, I.H., Gutwin, C., Nevill-Manning, C.G.: Domain-specific keyphrase extraction. In: Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, IJCAI 1999, pp. 668–673. Morgan Kaufmann Publishers Inc., San Francisco (1999). http://dl.acm.org/citation.cfm?id=646307.687591
Turney, P.D.: Learning algorithms for keyphrase extraction. Inf. Retr. 2(4), 303–336 (2000). https://doi.org/10.1023/A:1009976227802
Mihalcea, R., Tarau, P.: TextRank: bringing order into text. In: Proceedings of 9th Conference on Empirical Methods in Natural Language Processing (EMNLP 2004) (2004). http://ci.nii.ac.jp/naid/20001460576/
Leong, C.W., Mihalcea, R., Hassan, S.: Text mining for automatic image tagging. In: Proceedings of the 23rd International Conference on Computational Linguistics: Posters, COLING 2010, pp. 647–655. Association for Computational Linguistics, Stroudsburg (2010). http://dl.acm.org/citation.cfm?id=1944566.1944640
Salton, G., Wong, A., Yang, C.S.: A vector space model for automatic indexing. Commun. ACM 18(11), 613–620 (1975). https://doi.org/10.1145/361219.361220
Wartena, C., Brussee, R., Slakhorst, W.: Keyword extraction using word co-occurrence. In: TIR 2010–7th International Workshop on Text-Based Information Retrieval, in Conjunction with DEXA 2010, pp. 54–58, October 2010
Wartena, C., Sommer, M.: Automatic classification of scientific records using the German Subject Heading Authority File (SWD), October 2012. https://serwiss.bib.hs-hannover.de/frontdoor/index/index/docId/328
Voss, J., et al.: Normdaten in Wikidata, May 2014. https://serwiss.bib.hs-hannover.de/frontdoor/index/index/docId/438
Wikimedia Foundation: Wikipedia:Categorization, page Version ID: 821464874, January 2018. https://en.wikipedia.org/w/index.php?title=Wikipedia:Categorization&oldid=821464874
Bird, S., Klein, E., Loper, E.: Natural Language Processing with Python, 1st edn. O’Reilly and Associates, Beijing (2009)
English Penn Treebank tagset with modifications—Sketch Engine. https://www.sketchengine.eu/english-treetagger-pipeline-2/
Charbonnier, J., Wartena, C.: Using word embeddings for unsupervised acronym disambiguation. In: Proceedings of the 27th International Conference on Computational Linguistics. Association for Computational Linguistics, Santa Fe (2018, to appear)
Gazendam, L., Wartena, C., Malaisé, V., Schreiber, G., de Jong, A., Brugman, H.: Automatic annotation suggestions for audiovisual archives: evaluation aspects. Interdiscipl. Sci. Rev. 34(2–3), 172–188 (2009). https://doi.org/10.1179/174327909X441090
Iivonen, M.: Consistency in the selection of search concepts and search terms. Inf. Process. Manag. 31(2), 173–190 (1995). http://linkinghub.elsevier.com/retrieve/pii/030645739580034Q
Schlötterer, J., Seifert, C., Granitzer, M.: Supporting web surfers in finding related material in digital library repositories. In: Fuhr, N., Kovács, L., Risse, T., Nejdl, W. (eds.) TPDL 2016. LNCS, vol. 9819, pp. 434–437. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-43997-6_38

Acknowledgment

The presented work was developed within the NOA Project - Automatic Harvesting, Indexing and Provision of Open Access Figures from the Fields of Engineering and Technology Using the Infrastructure of Wikimedia Commons and Wikidata - funded by the DFG under grant number 315976924. NOA is a cooperative project of the Hochschule Hannover and the Technische Informationsbibliothek Hannover. We would like to thank the NOA project team.

Author information

Authors and Affiliations

University of Applied Sciences and Arts Hanover, Expo Plaza 12, 30539, Hanover, Germany
Frieda Josi, Christian Wartena & Jean Charbonnier

Corresponding author

Correspondence to Christian Wartena.

Editor information

Editors and Affiliations

University of Tunis, Tunis, Tunisia
Mourad Elloumi
MiCS, Media Computer Science, University of Passau, Passau, Bayern, Germany
Michael Granitzer
IRIT, Paul Sabatier University, Toulouse, France
Abdelkader Hameurlain
University of Twente, Enschede, Overijssel, The Netherlands
Christin Seifert
Fak. Medien, Bauhaus Universität Weimar, Weimar, Thüringen, Germany
Benno Stein
Inst. für Softwaretechnik, Vienna University of Technology, Vienna, Austria
A Min Tjoa
FAW, Johannes Kepler University of Linz, Linz, Austria
Roland Wagner

About this paper

Cite this paper

Josi, F., Wartena, C., Charbonnier, J. (2018). Text-Based Annotation of Scientific Images Using Wikimedia Categories. In: Elloumi, M., et al. Database and Expert Systems Applications. DEXA 2018. Communications in Computer and Information Science, vol 903. Springer, Cham. https://doi.org/10.1007/978-3-319-99133-7_20

DOI: https://doi.org/10.1007/978-3-319-99133-7_20
Published: 07 August 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99132-0
Online ISBN: 978-3-319-99133-7
eBook Packages: Computer Science, Computer Science (R0)