The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly Articles

Salatino, Angelo A.; Osborne, Francesco; Thanapalasingam, Thiviyan; Motta, Enrico

doi:10.1007/978-3-030-30760-8_26


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11799)

Included in the following conference series:

International Conference on Theory and Practice of Digital Libraries


Abstract

Classifying research papers according to their research topics is an important task to improve their retrievability, assist the creation of smart analytics, and support a variety of approaches for analysing and making sense of the research environment. In this paper, we present the CSO Classifier, a new unsupervised approach for automatically classifying research papers according to the Computer Science Ontology (CSO), a comprehensive ontology of research areas in the field of Computer Science. The CSO Classifier takes as input the metadata associated with a research paper (title, abstract, keywords) and returns a selection of research concepts drawn from the ontology. The approach was evaluated on a gold standard of manually annotated articles, yielding a significant improvement over alternative methods.
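To make the input/output contract in the abstract concrete, here is a minimal Python sketch. It is not the authors' implementation: the toy topic set, the `classify` function, and the exact n-gram matching strategy are all assumptions made for illustration, and the real classifier draws on the full CSO ontology rather than a hand-written set.

```python
import re

# Hypothetical toy set of topic labels. These are illustrative only and
# are not a dump of the real Computer Science Ontology.
TOY_TOPICS = {
    "semantic web",
    "ontology",
    "natural language processing",
    "data mining",
}

def ngrams(tokens, max_n=3):
    """Yield every word n-gram of length 1..max_n as a space-joined string."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def classify(title, abstract, keywords, topics=TOY_TOPICS):
    """Return the sorted topic labels that appear verbatim in the metadata.

    Assumption: plain exact matching of n-grams against topic labels,
    used here as a stand-in for the method described in the paper.
    """
    text = " ".join([title, abstract, *keywords]).lower()
    tokens = re.findall(r"[a-z0-9]+", text)
    return sorted({gram for gram in ngrams(tokens) if gram in topics})

if __name__ == "__main__":
    print(classify(
        "Ontology matching for the Semantic Web",
        "We apply data mining to align vocabularies.",
        ["natural language processing"],
    ))
```

The function mirrors the abstract's description: paper metadata (title, abstract, keywords) goes in, and a selection of concepts from a controlled vocabulary comes out.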


Notes

1. Medical Subject Headings: https://www.nlm.nih.gov/mesh/.
2. PhySH - Physics Subject Headings: https://physh.aps.org.
3. STW Thesaurus for Economics: http://zbw.eu/stw.
4. Scopus - https://www.scopus.com.
5. Dimensions - https://www.dimensions.ai.
6. Semantic Scholar - https://www.semanticscholar.org.
7. CSO is available for download at https://w3id.org/cso/downloads.
8. CSO Data Model - https://cso.kmi.open.ac.uk/schema/cso.
9. SKOS Simple Knowledge Organization System - http://www.w3.org/2004/02/skos.
10. Computer Science Ontology Portal - https://cso.kmi.open.ac.uk.
11. Microsoft Academic Graph - https://www.microsoft.com/en-us/research/project/microsoft-academic-graph/.
12. In particular, for the collocation analysis, we used min-count = 5 and threshold = 10.
13. The final parameters of the word2vec model are: method = skipgram, embedding-size = 128, window-size = 10, min-count-cutoff = 10, max-iterations = 5.
14. These three fields are well covered by CSO, which includes a total of 35 sub-topics for the Semantic Web, 173 for Natural Language Processing, and 396 for Data Mining.
15. Medline dataset: https://www.nlm.nih.gov/bsd/medline.html.
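Note 12 gives only two parameter values for the collocation analysis (min-count = 5, threshold = 10). The Python sketch below shows one plausible way such parameters are used. The scoring formula is an assumption: it is the count-based bigram score commonly used for phrase detection alongside word2vec, and this page does not confirm that the authors used it. The function name `find_collocations` is hypothetical.

```python
from collections import Counter

MIN_COUNT = 5     # value stated in note 12
THRESHOLD = 10.0  # value stated in note 12

def find_collocations(sentences, min_count=MIN_COUNT, threshold=THRESHOLD):
    """Score adjacent word pairs and keep those scoring above `threshold`.

    Assumed score: (count(a, b) - min_count) * vocab_size
                   / (count(a) * count(b))
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    vocab_size = len(unigrams)

    kept = {}
    for (a, b), pair_count in bigrams.items():
        score = (pair_count - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        if score > threshold:
            kept[(a, b)] = score
    return kept

if __name__ == "__main__":
    # Synthetic corpus: one frequent pair plus many one-off filler pairs.
    corpus = [["semantic", "web"]] * 20
    corpus += [[f"w{i}", f"v{i}"] for i in range(500)]
    print(find_collocations(corpus))
```

Under this score, a pair that occurs min-count times or fewer can never pass a positive threshold, so rare pairs are filtered out before the threshold is even relevant.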


Author information

Authors and Affiliations

Knowledge Media Institute, The Open University, MK7 6AA, Milton Keynes, UK
Angelo A. Salatino, Francesco Osborne, Thiviyan Thanapalasingam & Enrico Motta


Corresponding author

Correspondence to Angelo A. Salatino.

Editor information

Editors and Affiliations

University of La Rochelle, La Rochelle, France
Antoine Doucet
VU University Amsterdam, Amsterdam, The Netherlands
Antoine Isaac
Linnaeus University, Växjö, Sweden
Koraljka Golub
OsloMet – Oslo Metropolitan University, Oslo, Norway
Trond Aalberg
Kyoto University, Kyoto, Japan
Adam Jatowt


Cite this paper

Salatino, A.A., Osborne, F., Thanapalasingam, T., Motta, E. (2019). The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly Articles. In: Doucet, A., Isaac, A., Golub, K., Aalberg, T., Jatowt, A. (eds) Digital Libraries for Open Knowledge. TPDL 2019. Lecture Notes in Computer Science, vol 11799. Springer, Cham. https://doi.org/10.1007/978-3-030-30760-8_26

DOI: https://doi.org/10.1007/978-3-030-30760-8_26
Published: 30 August 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-30759-2
Online ISBN: 978-3-030-30760-8
eBook Packages: Computer Science, Computer Science (R0)
