Abstract
The clustering of related words is crucial for a variety of Natural Language Processing applications. Many known techniques of word clustering use the context of a word to determine its meaning: words which frequently appear in similar contexts are assumed to have similar meanings. Word clustering usually weights contexts by some measure of their importance. One of the most popular measures is Pointwise Mutual Information. It increases the weight of contexts where a word appears regularly but other words do not, and decreases the weight of contexts where many words may appear. Essentially, it is unsupervised feature weighting. We present a method of supervised feature weighting. It identifies contexts shared by pairs of words known to be semantically related or unrelated, and then uses Pointwise Mutual Information to weight these contexts according to how well they indicate closely related words. We use Roget’s Thesaurus as a source of training and evaluation data. This work is a step towards adding new terms to Roget’s Thesaurus automatically, and doing so with high confidence.
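The two weighting schemes contrasted in the abstract can be illustrated with a minimal sketch. The first function is standard Pointwise Mutual Information over word-context counts. The second is only one possible reading of the supervised idea, not the authors' actual procedure: it assumes that every word pair not listed as related counts as unrelated, that a context is "shared" when both words occur with it at least once, and that contexts never shared by a related pair get weight 0.0. The function names, data layout, and toy data are all hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_weights(counts):
    """Unsupervised PMI for each (word, context) pair.

    counts: dict mapping (word, context) -> co-occurrence frequency.
    Returns a dict mapping (word, context) -> PMI (natural logarithm).
    """
    total = sum(counts.values())
    word_totals = Counter()
    context_totals = Counter()
    for (word, context), n in counts.items():
        word_totals[word] += n
        context_totals[context] += n
    return {
        (word, context): math.log(
            n * total / (word_totals[word] * context_totals[context])
        )
        for (word, context), n in counts.items()
        if n > 0
    }


def supervised_context_weights(word_contexts, related_pairs):
    """Hypothetical supervised weighting: PMI between the events
    "a word pair shares this context" and "the word pair is related".

    word_contexts: dict mapping word -> set of contexts it occurs in.
    related_pairs: set of frozenset({w1, w2}) known to be related.
    Assumption: all other pairs of words are treated as unrelated.
    """
    shared_total = Counter()    # pairs sharing the context
    shared_related = Counter()  # related pairs sharing the context
    n_pairs = 0
    n_related = 0
    for w1, w2 in combinations(sorted(word_contexts), 2):
        n_pairs += 1
        is_related = frozenset((w1, w2)) in related_pairs
        n_related += int(is_related)
        for context in word_contexts[w1] & word_contexts[w2]:
            shared_total[context] += 1
            if is_related:
                shared_related[context] += 1

    weights = {}
    for context, n_shared in shared_total.items():
        n_both = shared_related[context]
        if n_both == 0:
            # Assumption: clip the undefined (minus infinity) case to 0.0.
            weights[context] = 0.0
        else:
            weights[context] = math.log(
                n_both * n_pairs / (n_shared * n_related)
            )
    return weights


# Toy data, invented for illustration only.
toy_counts = {
    ("cat", "purr"): 2, ("cat", "eat"): 2,
    ("dog", "eat"): 2, ("dog", "bark"): 2,
}
toy_pmi = pmi_weights(toy_counts)

toy_contexts = {"cat": {"pet", "see"}, "dog": {"pet", "see"}, "car": {"see"}}
toy_related = {frozenset(("cat", "dog"))}
toy_supervised = supervised_context_weights(toy_contexts, toy_related)
```

In the toy data, the context "purr" occurs only with "cat" and receives a positive PMI, while "eat" occurs equally with both words and receives a PMI of zero. In the supervised sketch, "pet" is shared only by the related pair and is weighted above "see", which every pair shares.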
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kennedy, A., Szpakowicz, S. (2011). A Supervised Method of Feature Weighting for Measuring Semantic Relatedness. In: Butz, C., Lingras, P. (eds) Advances in Artificial Intelligence. Canadian AI 2011. Lecture Notes in Computer Science, vol 6657. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-21043-3_27
DOI: https://doi.org/10.1007/978-3-642-21043-3_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-21042-6
Online ISBN: 978-3-642-21043-3
eBook Packages: Computer Science (R0)