Multi-label Wikipedia Classification with Textual and Link Features

Chidlovskii, Boris

doi:10.1007/978-3-642-14556-8_38

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6203)

Included in the following conference series:

International Workshop of the Initiative for the Evaluation of XML Retrieval

Abstract

We address the problem of categorizing a large set of linked documents in which both content and structure matter, specifically the Wikipedia collection proposed for the INEX 2009 XML Mining challenge. We analyze the link network of the collection's pages and turn it into features for classification. We combine the content-based and link-based features of pages to train an accurate categorizer for unlabelled pages. For the multi-label setting, we revisit a number of existing techniques and test those that scale well. We report evaluation results obtained with a variety of learning methods and techniques on the training set of the Wikipedia corpus.
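
The idea of combining content-based and link-based features for multi-label categorization can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the chapter: the toy pages, the choice of in-degree and out-degree as link features, the binary-relevance decomposition (one classifier per label), and the perceptron learner. The names `PAGES`, `build_features`, `train_perceptron`, `train_binary_relevance` and `predict` are hypothetical.

```python
from collections import Counter

# Hypothetical toy corpus: page id -> (text, outgoing links, set of labels).
PAGES = {
    "a": ("football match goal", ["b"], {"sport"}),
    "b": ("football league politics", ["a", "c"], {"sport", "politics"}),
    "c": ("election vote politics", ["d"], {"politics"}),
    "d": ("election parliament vote", ["c"], {"politics"}),
}


def build_features(pages):
    """Concatenate bag-of-words counts with out-degree and in-degree."""
    vocab = sorted({w for text, _, _ in pages.values() for w in text.split()})
    in_deg = Counter(t for _, links, _ in pages.values() for t in links)
    feats = {}
    for pid, (text, links, _) in pages.items():
        counts = Counter(text.split())
        content = [float(counts[w]) for w in vocab]        # content-based part
        link = [float(len(links)), float(in_deg[pid])]     # link-based part
        feats[pid] = content + link + [1.0]                # trailing 1.0 is a bias term
    return feats


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def train_perceptron(X, y, epochs=50):
    """Train one binary perceptron; y holds +1 / -1 targets."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for x, t in zip(X, y):
            if t * dot(w, x) <= 0:                         # misclassified: update
                w = [wi + t * xi for wi, xi in zip(w, x)]
    return w


def train_binary_relevance(pages, feats):
    """Assumed multi-label strategy: one independent binary model per label."""
    labels = sorted({l for _, _, ls in pages.values() for l in ls})
    ids = sorted(pages)
    X = [feats[i] for i in ids]
    return {
        l: train_perceptron(X, [1 if l in pages[i][2] else -1 for i in ids])
        for l in labels
    }


def predict(models, x):
    """Return every label whose binary model scores the page positively."""
    return {l for l, w in models.items() if dot(w, x) > 0}


if __name__ == "__main__":
    feats = build_features(PAGES)
    models = train_binary_relevance(PAGES, feats)
    for pid in sorted(PAGES):
        print(pid, sorted(predict(models, feats[pid])))
```

Binary relevance is used here only because it is simple and scales linearly with the number of labels; richer link features or other multi-label techniques would plug into the same feature-concatenation step.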

Author information

Authors and Affiliations

Xerox Research Centre Europe, 6, chemin de Maupertuis, F–38240, Meylan, France
Boris Chidlovskii

Editor information

Editors and Affiliations

Faculty of Science and Technology, Queensland University of Technology, GPO Box 2434, 4001, Brisbane, Qld, Australia
Shlomo Geva
Archives and Information Studies/Humanities, University of Amsterdam, Turfdraagsterpad 9, 1012 XT, Amsterdam, The Netherlands
Jaap Kamps
Department of Computer Science, University of Otago, P.O. Box 56, 9054, Dunedin, New Zealand
Andrew Trotman

About this paper

Cite this paper

Chidlovskii, B. (2010). Multi-label Wikipedia Classification with Textual and Link Features. In: Geva, S., Kamps, J., Trotman, A. (eds) Focused Retrieval and Evaluation. INEX 2009. Lecture Notes in Computer Science, vol 6203. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14556-8_38

DOI: https://doi.org/10.1007/978-3-642-14556-8_38
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-14555-1
Online ISBN: 978-3-642-14556-8
eBook Packages: Computer Science, Computer Science (R0)
