A Supervised Method of Feature Weighting for Measuring Semantic Relatedness

Kennedy, Alistair; Szpakowicz, Stan

doi:10.1007/978-3-642-21043-3_27

Alistair Kennedy & Stan Szpakowicz

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6657)

Included in the following conference series:

Canadian Conference on Artificial Intelligence

Abstract

The clustering of related words is crucial for a variety of Natural Language Processing applications. Many known techniques of word clustering use the context of a word to determine its meaning: words that frequently appear in similar contexts are assumed to have similar meanings. Word clustering usually weights contexts by some measure of their importance. One of the most popular measures is Pointwise Mutual Information. It increases the weight of contexts where a word appears regularly but other words do not, and decreases the weight of contexts where many words may appear. Essentially, it is unsupervised feature weighting. We present a method of supervised feature weighting. It identifies contexts shared by pairs of words known to be semantically related or unrelated, and then uses Pointwise Mutual Information to weight these contexts according to how well they indicate closely related words. We use Roget’s Thesaurus as a source of training and evaluation data. This work is a step towards adding new terms to Roget’s Thesaurus automatically, and doing so with high confidence.
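
The two weighting schemes described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so the function names (`pmi_weights`, `supervised_context_pmi`), the toy counts, and in particular the supervised variant (PMI between "a pair shares this context" and "the pair is labelled related") are assumptions made for illustration only.

```python
import math
from collections import Counter


def pmi_weights(counts):
    """Unsupervised weighting: PMI of each (word, context) pair.

    counts maps (word, context) -> co-occurrence count (hypothetical input format).
    PMI = log2( P(word, context) / (P(word) * P(context)) ).
    """
    total = sum(counts.values())
    word_totals = Counter()
    context_totals = Counter()
    for (word, context), n in counts.items():
        word_totals[word] += n
        context_totals[context] += n

    weights = {}
    for (word, context), n in counts.items():
        if n == 0:
            continue  # PMI is undefined for unseen pairs
        weights[(word, context)] = math.log2(
            n * total / (word_totals[word] * context_totals[context])
        )
    return weights


def supervised_context_pmi(pairs, contexts_of):
    """Supervised weighting (assumed reading of the abstract, not the paper's formula).

    pairs is a list of (word_a, word_b, related) with related a bool label.
    contexts_of maps word -> set of contexts the word appears in.
    Returns context -> PMI between the events "the pair shares this context"
    and "the pair is related".
    """
    n_pairs = len(pairs)
    n_related = sum(1 for _, _, related in pairs if related)
    shared_total = Counter()    # pairs sharing the context, any label
    shared_related = Counter()  # related pairs sharing the context
    for a, b, related in pairs:
        for context in contexts_of[a] & contexts_of[b]:
            shared_total[context] += 1
            if related:
                shared_related[context] += 1

    weights = {}
    for context, n_shared in shared_total.items():
        n_both = shared_related[context]
        if n_both == 0:
            continue  # no related pair shares it; log of zero is undefined
        weights[context] = math.log2(n_both * n_pairs / (n_shared * n_related))
    return weights


# Toy data, invented for this example.
counts = {("cat", "purr"): 4, ("cat", "eat"): 4,
          ("dog", "eat"): 4, ("dog", "bark"): 4}
unsupervised = pmi_weights(counts)

contexts_of = {"cat": {"purr", "eat"}, "kitten": {"purr", "eat"},
               "dog": {"eat", "bark"}, "puppy": {"eat", "bark"},
               "car": {"drive"}}
pairs = [("cat", "kitten", True), ("cat", "dog", False),
         ("dog", "puppy", True), ("kitten", "car", False)]
supervised = supervised_context_pmi(pairs, contexts_of)
```

In this toy data the context "purr" occurs only with "cat" and so receives a higher unsupervised weight than "eat", which both words share. In the supervised sketch, "eat" is shared by an unrelated pair as well as by related ones, so it is weighted lower than "purr" or "bark", which only related pairs share.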

Author information

Authors and Affiliations

SITE, University of Ottawa, Ottawa, Ontario, Canada
Alistair Kennedy & Stan Szpakowicz
Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland
Stan Szpakowicz

Editor information

Editors and Affiliations

Department of Computer Science, University of Regina, 3737 Wascana Parkway, Regina, S4S 0A2, Saskatchewan, Canada
Cory Butz
Department of Mathematics and Computing Science, Saint Mary’s University, B3H 3C3, Halifax, Nova Scotia, Canada
Pawan Lingras

About this paper

Cite this paper

Kennedy, A., Szpakowicz, S. (2011). A Supervised Method of Feature Weighting for Measuring Semantic Relatedness. In: Butz, C., Lingras, P. (eds) Advances in Artificial Intelligence. Canadian AI 2011. Lecture Notes in Computer Science (LNAI), vol 6657. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-21043-3_27

DOI: https://doi.org/10.1007/978-3-642-21043-3_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-21042-6
Online ISBN: 978-3-642-21043-3
eBook Packages: Computer Science, Computer Science (R0)
