Measuring Term Representativeness

Hisamitsu, Toru; Tsujii, Jun-ichi

doi:10.1007/978-3-540-45092-4_3

Conference paper
First Online: 01 January 2003

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2700)

Abstract

This report introduces several measures of term representativeness and a scheme, called the baseline method, for defining them. The representativeness of a term T is measured by a normalized characteristic value that indicates how biased the distribution of words is in D(T), the set of all documents containing T. Dist(D(T)), the distance between the distribution of words in D(T) and that in the whole corpus, was found after normalization to be an effective characteristic value for this bias. Experiments showed that the measure based on the normalized value of Dist(D(∙)) strongly outperforms existing measures in evaluating the representativeness of terms in newspaper articles. The measure was also effective, in combination with term frequency, as a means of automatically extracting terms from abstracts of papers on artificial intelligence.
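
The idea in the abstract can be illustrated with a minimal sketch. The abstract does not specify the distance function or the normalization procedure, so everything below is an assumption made for illustration: the distance is taken to be the Kullback-Leibler divergence between the word distribution of D(T) and that of the whole corpus, and the normalizer is the average distance obtained for randomly drawn document sets containing the same number of documents as D(T). The names `word_counts`, `dist`, `representativeness`, and `toy_docs` are hypothetical, and the toy corpus is invented purely to make the sketch runnable.

```python
import math
import random
from collections import Counter


def word_counts(docs):
    """Count word occurrences over a collection of tokenized documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc)
    return counts


def dist(sub_docs, corpus_counts):
    """Distance between the word distribution of sub_docs and the corpus.

    ASSUMPTION: Kullback-Leibler divergence is used here as a stand-in;
    the abstract does not say which distance the authors use.
    sub_docs must be drawn from the corpus, so every word in it has a
    non-zero corpus count.
    """
    sub = word_counts(sub_docs)
    n_sub = sum(sub.values())
    n_all = sum(corpus_counts.values())
    if n_sub == 0:
        return 0.0
    total = 0.0
    for word, k in sub.items():
        p = k / n_sub                      # probability within D(T)
        q = corpus_counts[word] / n_all    # probability within the corpus
        total += p * math.log(p / q)
    return total


def representativeness(term, docs, samples=200, seed=0):
    """Dist(D(T)) divided by a baseline from random document sets.

    ASSUMPTION: the baseline is the mean distance of `samples` random
    document sets with as many documents as D(T); this only approximates
    the spirit of a baseline-style normalization.
    """
    corpus_counts = word_counts(docs)
    d_t = [doc for doc in docs if term in doc]  # D(T)
    if not d_t:
        return 0.0
    rng = random.Random(seed)
    baseline = sum(
        dist(rng.sample(docs, len(d_t)), corpus_counts) for _ in range(samples)
    ) / samples
    return dist(d_t, corpus_counts) / baseline if baseline > 0 else 0.0


# Hypothetical toy corpus: each document is a list of tokens.
toy_docs = [
    ["the", "stock", "market", "fell", "today"],
    ["the", "stock", "price", "rose", "sharply"],
    ["the", "team", "won", "the", "match"],
    ["the", "weather", "was", "mild", "today"],
    ["the", "mayor", "opened", "the", "bridge"],
    ["the", "film", "won", "an", "award"],
]

for term in ("stock", "the"):
    print(term, representativeness(term, toy_docs))
```

Dividing by a size-matched baseline is meant to keep terms of very different frequency comparable, since the raw distance of a small document set tends to be large regardless of how topical the term is. A term that occurs in every document has D(T) equal to the whole corpus, so its raw distance is zero.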

Author information

Authors and Affiliations

Central Research Laboratory, Hitachi, Ltd., 1-280, Higashi-koigakubo, Kokubunji, Tokyo, 185-8601, Japan
Toru Hisamitsu
Graduate School of Science, The University of Tokyo and CREST, (Japan Science and Technology Corporation), 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-8654, Japan
Jun-ichi Tsujii

Editor information

Editors and Affiliations

DISP, University of Tor Vergata, Via del Politecnico 1, Rome, Italy
Maria Teresa Pazienza

Cite this paper

Hisamitsu, T., Tsujii, J. (2003). Measuring Term Representativeness. In: Pazienza, M.T. (ed.) Information Extraction in the Web Era. SCIE 2002. Lecture Notes in Computer Science, vol 2700. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-45092-4_3

DOI: https://doi.org/10.1007/978-3-540-45092-4_3
Published: 28 August 2003
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-40579-5
Online ISBN: 978-3-540-45092-4
eBook Packages: Springer Book Archive