Discovering Characteristic Expressions from Literary Works: a New Text Analysis Method beyond N-Gram Statistics and KWIC

  • Conference paper
Discovery Science (DS 2000)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 1967)

Abstract

We attempt to extract characteristic expressions from literary works. That is, given literary works by a particular writer as positive examples and works by another writer as negative examples, our problem is to find expressions that appear frequently in the positive examples but not in the negative examples. This can be regarded as a special case of optimal pattern discovery from textual data in which only substring patterns are considered. One reasonable approach is to create a list of substrings arranged in descending order of their goodness and to have a human expert examine the top of the list. Since Japanese texts have no word boundaries, a substring is often a fragment of a word or a phrase, so how to assist the human expert is a key to successful discovery. In this paper, we propose (1) restricting the list to the prime substrings in order to remove redundancy, and (2) a way of browsing the neighborhood of a focused string as well as its context. Using this method, we report successful results on two pairs of anthologies of classical Japanese poems. We expect that the extracted expressions may lead to the discovery of overlooked aspects of individual poets.
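The ranking step described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration rather than the authors' method: the goodness score (difference of document frequencies between positive and negative examples), the `max_len` cutoff, and the function names are hypothetical. The abstract also does not define "prime substring", so the redundancy filter below is a stand-in: it drops a substring when a one-character extension occurs in exactly the same number of positive and negative texts.

```python
from collections import Counter


def substrings(text, max_len):
    """Distinct character substrings of text, up to max_len characters."""
    return {
        text[i:j]
        for i in range(len(text))
        for j in range(i + 1, min(i + max_len, len(text)) + 1)
    }


def doc_counts(docs, max_len):
    """Number of documents in which each substring occurs at least once."""
    counts = Counter()
    for doc in docs:
        counts.update(substrings(doc, max_len))
    return counts


def rank_substrings(pos, neg, max_len=6):
    """Rank substrings of the positive texts by an assumed goodness score.

    Assumed score: fraction of positive texts containing the substring
    minus the fraction of negative texts containing it.
    """
    pc = doc_counts(pos, max_len)
    nc = doc_counts(neg, max_len)

    # Stand-in redundancy filter (NOT the paper's definition of "prime"):
    # a substring is dropped if extending it by one character on either
    # side leaves both document counts unchanged.
    redundant = set()
    for t in set(pc) | set(nc):
        if len(t) < 2:
            continue
        for s in (t[1:], t[:-1]):
            if pc[s] == pc[t] and nc[s] == nc[t]:
                redundant.add(s)

    scored = [
        (pc[s] / len(pos) - nc[s] / len(neg), s)
        for s in pc
        if s not in redundant
    ]
    # Descending score; ties broken by longer string, then alphabetically.
    scored.sort(key=lambda item: (-item[0], -len(item[1]), item[1]))
    return scored
```

Working on raw character substrings rather than tokens matches the setting in the abstract, where the text has no word boundaries. Calling `rank_substrings(positive_texts, negative_texts)` returns `(score, substring)` pairs whose top entries a human expert would then inspect.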

Copyright information

© 2000 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Takeda, M., Matsumoto, T., Fukuda, T., Nanri, I. (2000). Discovering Characteristic Expressions from Literary Works: a New Text Analysis Method beyond N-Gram Statistics and KWIC. In: Arikawa, S., Morishita, S. (eds) Discovery Science. DS 2000. Lecture Notes in Computer Science, vol 1967. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44418-1_10

  • DOI: https://doi.org/10.1007/3-540-44418-1_10

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-41352-3

  • Online ISBN: 978-3-540-44418-3

  • eBook Packages: Springer Book Archive