Non-Parametric Subject Prediction

Wang, Shenghui; Koopman, Rob; Englebienne, Gwenn

doi:10.1007/978-3-030-30760-8_27

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11799)

Included in the following conference series:

International Conference on Theory and Practice of Digital Libraries

Abstract

Automatic subject prediction is a desirable feature for modern digital library systems, as manual indexing can no longer cope with the rapid growth of digital collections. This is an “extreme multi-label classification” problem, where the objective is to assign a small subset of the most relevant subjects from an extremely large label set. Data sparsity and model scalability are the major challenges we need to address to solve it automatically. In this paper, we describe an efficient and effective embedding method that embeds terms, subjects and documents into the same semantic space, where similarity can be computed easily. We then propose a novel Non-Parametric Subject Prediction (NPSP) method and show how effectively it predicts even very specialised subjects, which are associated with few documents in the training set and are not predicted by state-of-the-art classifiers.
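The idea of a shared semantic space in which similarity "can be computed easily" can be illustrated with a minimal sketch. This is not the NPSP method itself, which the abstract does not specify; it only ranks candidate subjects by cosine similarity to a document vector. The subject names, the three-dimensional toy vectors and the function names `cosine` and `rank_subjects` are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_subjects(doc_vec, subject_vecs, k=2):
    """Return the k subjects whose vectors are most similar to doc_vec."""
    scored = [(name, cosine(doc_vec, vec)) for name, vec in subject_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical toy embeddings: subjects and the document live in the same space.
subject_vecs = {
    "genetics": [0.9, 0.1, 0.0],
    "astronomy": [0.0, 0.2, 0.9],
    "biochemistry": [0.7, 0.6, 0.1],
}
doc_vec = [0.8, 0.3, 0.0]

top = rank_subjects(doc_vec, subject_vecs, k=2)
print([name for name, _ in top])
```

Because every subject has a vector regardless of how many training documents carry it, a similarity ranking of this kind can in principle surface rare subjects, which is the property the abstract emphasises.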

Notes

1. http://manikvarma.org/downloads/XC/XMLRepository.html.
2. In general, a document could therefore be a sentence, a paragraph, a fixed-size window, a bibliographic record, etc.; in our case, documents are scientific publications.
3. Terms could be words, n-grams or phrases. In our work, common phrases are automatically detected using the method described in [19].
4. We use projection rather than subtracting \(\mathbf{v}_a\) to prevent orthogonal vectors from gaining undue importance.
5. http://www.worldcat.org/.
6. We extracted terms from titles and abstracts and removed those that occurred in fewer than 10 articles.
7. https://www.ncbi.nlm.nih.gov/pubmed/14670424.
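The contrast drawn in note 4 between projection and subtraction can be made concrete with a small sketch. The vector the note calls v_a is not defined in this excerpt, so the code assumes it is some common direction to be removed from other vectors; the helper names and the two-dimensional toy vectors are hypothetical. Removing the component along v_a leaves a vector orthogonal to v_a unchanged, whereas plain subtraction lengthens it.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def subtract(v, a):
    """Plain subtraction: v - a."""
    return [x - y for x, y in zip(v, a)]

def remove_projection(v, a):
    """Remove from v its component along a: v - (v.a / a.a) * a."""
    aa = dot(a, a)
    if aa == 0.0:
        return list(v)
    scale = dot(v, a) / aa
    return [x - scale * y for x, y in zip(v, a)]

v_a = [1.0, 0.0]          # assumed common direction (hypothetical values)
orthogonal = [0.0, 1.0]   # shares nothing with v_a
aligned = [2.0, 1.0]      # has a large component along v_a

by_subtraction = subtract(orthogonal, v_a)          # norm grows to sqrt(2)
by_projection = remove_projection(orthogonal, v_a)  # unchanged, norm stays 1
aligned_residual = remove_projection(aligned, v_a)  # only the orthogonal part remains

print(norm(by_subtraction), norm(by_projection), aligned_residual)
```

Under this reading, subtraction inflates the norm of a vector that had no component along v_a in the first place, which is one way such a vector could gain undue weight in later similarity computations.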

References

Achlioptas, D.: Database-friendly random projections: Johnson-Lindenstrauss with binary coins. J. Comput. Syst. Sci. 66(4), 671–687 (2003). https://doi.org/10.1016/S0022-0000(03)00025-4
Joorabchi, A., Mahdi, A.E.: Classification of scientific publications according to library controlled vocabularies: a new concept matching-based approach. Libr. Hi Tech 31, 725–747 (2013). https://doi.org/10.1108/LHT-03-2013-0030
Bhatia, K., Jain, H., Kar, P., Varma, M., Jain, P.: Sparse local embeddings for extreme multi-label classification. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems 28, pp. 730–738. Curran Associates, Inc. (2015)
Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 5, 135–146 (2017). https://doi.org/10.1162/tacl_a_00051
Foster, D.V., Grassberger, P.: Lower bounds on mutual information. Phys. Rev. E 83, 010101 (2011). https://doi.org/10.1103/PhysRevE.83.010101
Frommholz, I., Abbasi, M.K.: Automated text categorization and clustering. In: Golub, K. (ed.) Subject Access to Information: An Interdisciplinary Approach, pp. 117–131. ABC-CLIO (2014)
Godby, J., Reighart, R.: The wordsmith indexing system. J. Libr. Adm. 34(3–4), 375–385 (2001). https://doi.org/10.1300/J111v34n03_18
Godby, J., Smith, D.: Scorpion. https://www.oclc.org/research/activities/scorpion.html. Accessed Apr 2019
Golub, K.: Automatic subject indexing of text. In: ISKO Encyclopedia of Knowledge Organization. http://www.isko.org/cyclo/automatic. Version 07 Mar 2019
Jain, H., Prabhu, Y., Varma, M.: Extreme multi-label loss functions for recommendation, tagging, ranking & other missing label applications. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2016, pp. 935–944. ACM, New York (2016). https://doi.org/10.1145/2939672.2939756
Johnson, W., Lindenstrauss, J.: Extensions of Lipschitz mappings into a Hilbert space. Contemp. Math. 26, 189–206 (1984)
Joulin, A., Grave, E., Bojanowski, P., Mikolov, T.: Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759 (2016)
Koopman, R., Wang, S., Englebienne, G.: Fast and discriminative semantic embedding. In: Proceedings of the 13th International Conference on Computational Semantics - Long Papers, Gothenburg, Sweden, 23–27 May 2019, pp. 235–246. ACL (2019)
Koopman, R., Wang, S., Scharnhorst, A.: Contextualization of topics: browsing through the universe of bibliographic information. Scientometrics 111(2), 1119–1139 (2017). https://doi.org/10.1007/s11192-017-2303-4
Koopman, R., Wang, S., Scharnhorst, A., Englebienne, G.: Ariadne’s thread: interactive navigation in a world of networked information. In: Proceedings of the ACM Conference Extended Abstracts on Human Factors in Computing Systems, pp. 1833–1838 (2015)
Le, Q.V., Mikolov, T.: Distributed representations of sentences and documents. In: International Conference on Machine Learning - ICML 2014, vol. 32, pp. 1188–1196, March 2014. https://doi.org/10.1145/2740908.2742760
Liu, J., Chang, W.C., Wu, Y., Yang, Y.: Deep learning for extreme multi-label text classification. In: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2017, pp. 115–124. ACM, New York (2017). https://doi.org/10.1145/3077136.3080834
Manning, C.D., Schütze, H.: Foundations of Statistical Natural Language Processing. MIT Press, Cambridge (1999)
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS 2013, pp. 3111–3119. Curran Associates Inc., USA (2013)
Pagliardini, M., Gupta, P., Jaggi, M.: Unsupervised learning of sentence embeddings using compositional n-gram features. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 528–540. Association for Computational Linguistics (2018). https://doi.org/10.18653/v1/N18-1049
Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. In: Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014). https://doi.org/10.3115/v1/D14-1162
Prabhu, Y., Kag, A., Harsola, S., Agrawal, R., Varma, M.: Parabel: partitioned label trees for extreme classification with application to dynamic search advertising. In: Proceedings of the International World Wide Web Conference, April 2018
Prabhu, Y., Varma, M.: FastXML: a fast, accurate and stable tree-classifier for extreme multi-label learning. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2014, pp. 263–272. ACM, New York (2014). https://doi.org/10.1145/2623330.2623651
Sebastiani, F.: Machine learning in automated text categorization. ACM Comput. Surv. 34(1), 1–47 (2002). https://doi.org/10.1145/505282.505283
Tagami, Y.: AnnexML: approximate nearest neighbor search for extreme multi-label classification. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2017, pp. 455–464. ACM, New York (2017). https://doi.org/10.1145/3097983.3097987
Wang, S., Koopman, R.: Semantic embedding for information retrieval. In: Proceedings of the Fifth Workshop on Bibliometric-enhanced Information Retrieval, pp. 122–132 (2017)
Weston, J., Bengio, S., Usunier, N.: WSABIE: Scaling up to large vocabulary image annotation. In: Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Three, IJCAI 2011, pp. 2764–2770. AAAI Press (2011). https://doi.org/10.5591/978-1-57735-516-8/IJCAI11-460

Author information

Authors and Affiliations

OCLC Research, Schipholweg 99, 2316XA, Leiden, The Netherlands
Shenghui Wang & Rob Koopman
University of Twente, Hallenweg 19, 7522NH, Enschede, The Netherlands
Gwenn Englebienne

Corresponding author

Correspondence to Shenghui Wang.

Editor information

Editors and Affiliations

University of La Rochelle, La Rochelle, France
Antoine Doucet
VU University Amsterdam, Amsterdam, The Netherlands
Antoine Isaac
Linnaeus University, Växjö, Sweden
Koraljka Golub
OsloMet – Oslo Metropolitan University, Oslo, Norway
Trond Aalberg
Kyoto University, Kyoto, Japan
Adam Jatowt

About this paper

Cite this paper

Wang, S., Koopman, R., Englebienne, G. (2019). Non-Parametric Subject Prediction. In: Doucet, A., Isaac, A., Golub, K., Aalberg, T., Jatowt, A. (eds) Digital Libraries for Open Knowledge. TPDL 2019. Lecture Notes in Computer Science, vol 11799. Springer, Cham. https://doi.org/10.1007/978-3-030-30760-8_27

DOI: https://doi.org/10.1007/978-3-030-30760-8_27
Published: 30 August 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-30759-2
Online ISBN: 978-3-030-30760-8
eBook Packages: Computer Science, Computer Science (R0)