Measuring Term Representativeness

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2700)

Abstract

This report introduces several measures of term representativeness and a scheme, called the baseline method, for defining them. The representativeness of a term T is measured by a normalized characteristic value that indicates the bias of the distribution of words in D(T), the set of all documents that contain T. After normalization, Dist(D(T)), the distance between the distribution of words in D(T) and that in the whole corpus, was found to be an effective characteristic value for this bias. Experiments showed that the measure based on the normalized value of Dist(D(∙)) strongly outperforms existing measures in evaluating the representativeness of terms in newspaper articles. The measure was also effective, in combination with term frequency, as a means of automatically extracting terms from abstracts of papers on artificial intelligence.
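The idea in the abstract can be illustrated with a minimal sketch. The abstract does not say which distance Dist uses or how the normalization is computed, so everything specific here is an assumption made for illustration: Kullback-Leibler divergence stands in for Dist, and the raw value is normalized by the average distance of randomly drawn document sets containing the same number of documents as D(T). The function names (`word_distribution`, `dist`, `representativeness`) and the toy corpus are hypothetical and do not come from the paper.

```python
import math
import random
from collections import Counter


def word_distribution(docs):
    """Relative frequency of each word over a list of tokenized documents."""
    counts = Counter(word for doc in docs for word in doc)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def dist(subset, corpus_dist):
    """Assumed distance: KL divergence of the subset's word distribution
    from the corpus distribution. The subset must be drawn from the corpus."""
    p = word_distribution(subset)
    return sum(pw * math.log(pw / corpus_dist[w]) for w, pw in p.items())


def representativeness(term, corpus, trials=50, seed=0):
    """Dist(D(T)) divided by an assumed baseline: the mean distance of
    random document sets of the same size as D(T)."""
    docs_with_term = [doc for doc in corpus if term in doc]
    if not docs_with_term:
        return 0.0
    corpus_dist = word_distribution(corpus)
    raw = dist(docs_with_term, corpus_dist)
    rng = random.Random(seed)
    n = len(docs_with_term)
    baseline = sum(
        dist(rng.sample(corpus, n), corpus_dist) for _ in range(trials)
    ) / trials
    # A zero baseline means random sets of this size are indistinguishable
    # from the corpus, so no bias can be measured at this size.
    return raw / baseline if baseline > 0 else 0.0


# Toy corpus: two topics, plus the word "report" spread evenly over both.
topic_a = ["the", "neural", "network", "training"]
topic_b = ["the", "stock", "market", "price"]
corpus = (
    [topic_a] * 5
    + [topic_a + ["report"]] * 5
    + [topic_b] * 5
    + [topic_b + ["report"]] * 5
)

for term in ("neural", "report"):
    print(term, round(representativeness(term, corpus), 3))
```

In this toy corpus, "neural" occurs only in documents on one topic, so the words around it are distributed very differently from the corpus as a whole, while "report" is spread evenly over both topics and scores lower. Dividing by a size-matched random baseline is meant to keep terms with different document frequencies comparable, since small document sets deviate from the corpus distribution by chance alone.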

Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Hisamitsu, T., Tsujii, Ji. (2003). Measuring Term Representativeness. In: Pazienza, M.T. (ed.) Information Extraction in the Web Era. SCIE 2002. Lecture Notes in Computer Science, vol 2700. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-45092-4_3

  • DOI: https://doi.org/10.1007/978-3-540-45092-4_3

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-40579-5

  • Online ISBN: 978-3-540-45092-4

  • eBook Packages: Springer Book Archive
