Abstract
The reuse of scientific raw data is a key demand of Open Science. In the NOA project we foster the reuse of scientific images by collecting them and uploading them to Wikimedia Commons. In this paper we present a text-based annotation method that proposes Wikipedia categories for open access images. The assigned categories can be used for image retrieval or for uploading images to Wikimedia Commons. The annotation consists of two phases: extracting salient keywords and mapping these keywords to categories. The results are evaluated on a small set of manually annotated open access images.
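The two phases named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how keywords are scored or how they are matched to categories, so the sketch assumes a simple tf-idf ranking for the first phase and an exact, case-insensitive title match for the second. The stop-word list, the background captions, the category titles, and the function names `extract_keywords` and `map_to_categories` are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "at", "for", "in", "of", "on", "the", "to", "with"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def extract_keywords(caption, background, top_k=3):
    """Phase 1 (assumed scoring): rank caption terms by a smoothed tf-idf."""
    doc_freq = Counter()
    for doc in background:
        doc_freq.update(set(tokenize(doc)))
    n_docs = len(background)
    term_freq = Counter(tokenize(caption))
    scores = {
        term: count * math.log((1 + n_docs) / (1 + doc_freq[term]))
        for term, count in term_freq.items()
    }
    # Sort by descending score, then alphabetically so ties are deterministic.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:top_k]


def map_to_categories(keywords, category_titles):
    """Phase 2 (assumed matching): keep keywords that equal a category title."""
    index = {title.lower(): title for title in category_titles}
    return [index[k] for k in keywords if k in index]


# Hypothetical background captions and category titles.
background = [
    "Map of the study area",
    "Results of the experiment",
    "Overview of the measured values",
]
categories = ["Turbines", "Corrosion", "Maps"]
caption = "Corrosion on the turbines: corrosion depth measured at the turbines of the study"

keywords = extract_keywords(caption, background)
suggested = map_to_categories(keywords, categories)
print(keywords)   # ['corrosion', 'turbines', 'depth']
print(suggested)  # ['Corrosion', 'Turbines']
```

Separating the two phases keeps the keyword ranking independent of the category vocabulary, so either step could be replaced (for example by a different salience measure or a fuzzier title match) without touching the other.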
Notes
1. For an example of an extremely long caption, see Fig. 5 in http://dx.doi.org/10.1002/ece3.2579. Some parsing errors also resulted in long captions.
2. Our source code will be released together with all other source code developed in the NOA project.
3. The images are available on Wikimedia Commons at the following link: https://commons.wikimedia.org/w/index.php?title=Special:ListFiles/Sohmen&ilshowall=1.
Acknowledgment
The presented work was developed within the NOA Project - Automatic Harvesting, Indexing and Provision of Open Access Figures from the Fields of Engineering and Technology Using the Infrastructure of Wikimedia Commons and Wikidata - funded by the DFG under grant number 315976924. NOA is a cooperative project of the Hochschule Hannover and the Technische Informationsbibliothek Hannover. We would like to thank the NOA project team.
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Josi, F., Wartena, C., Charbonnier, J. (2018). Text-Based Annotation of Scientific Images Using Wikimedia Categories. In: Elloumi, M., et al. Database and Expert Systems Applications. DEXA 2018. Communications in Computer and Information Science, vol 903. Springer, Cham. https://doi.org/10.1007/978-3-319-99133-7_20
DOI: https://doi.org/10.1007/978-3-319-99133-7_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99132-0
Online ISBN: 978-3-319-99133-7
eBook Packages: Computer Science, Computer Science (R0)