Discovering Characteristic Expressions from Literary Works: a New Text Analysis Method beyond N-Gram Statistics and KWIC

Takeda, Masayuki; Matsumoto, Tetsuya; Fukuda, Tomoko; Nanri, Ichirō

doi:10.1007/3-540-44418-1_10

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 1967)

Included in the following conference series:

International Conference on Discovery Science

Abstract

We attempt to extract characteristic expressions from literary works. Given literary works by one writer as positive examples and works by another writer as negative examples, the problem is to find expressions that appear frequently in the positive examples but not in the negative examples. This is a special case of optimal pattern discovery from textual data in which only substring patterns are considered. One reasonable approach is to create a list of substrings in descending order of goodness and to have a human expert examine the first part of the list. Because Japanese texts have no word boundaries, a substring is often a fragment of a word or phrase, so assisting the human expert is key to successful discovery. In this paper, we propose (1) restricting the list to prime substrings in order to remove redundancy, and (2) a way of browsing the neighborhood of a focused string together with its context. We report successful results with this method on two pairs of anthologies of classical Japanese poems. We expect that the extracted expressions may lead to the discovery of overlooked aspects of individual poets.
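The pipeline outlined in the abstract (rank substrings by goodness, prune redundant ones, then inspect a chosen string in context) can be illustrated with a minimal Python sketch. Several parts are assumptions, since the abstract does not give them: the goodness measure (here, the difference in the fraction of texts containing the substring), the redundancy test (a substring is dropped when one longer string accounts for every one of its occurrences, which is only a stand-in for the paper's prime substrings), and all function names. The brute-force enumeration is for illustration only and is not how the authors compute their list.

```python
from collections import Counter


def substring_counts(texts, max_len):
    """Return (total occurrences, number of texts containing) per substring."""
    occ = Counter()  # total occurrences across all texts
    df = Counter()   # number of texts that contain the substring
    for t in texts:
        seen = set()
        for i in range(len(t)):
            for j in range(i + 1, min(len(t), i + max_len) + 1):
                s = t[i:j]
                occ[s] += 1
                seen.add(s)
        df.update(seen)
    return occ, df


def characteristic_substrings(pos, neg, max_len=4, top=10):
    """Rank non-redundant substrings by an ASSUMED goodness score."""
    # Count one character beyond max_len so extensions of the longest
    # reported substrings are visible to the redundancy test.
    occ, _ = substring_counts(pos + neg, max_len + 1)
    _, df_pos = substring_counts(pos, max_len)
    _, df_neg = substring_counts(neg, max_len)

    # Assumed stand-in for "prime substrings": s is redundant if a string
    # one character longer occurs exactly as often as s does.
    redundant = set()
    for u, n in occ.items():
        if len(u) >= 2:
            if occ[u[1:]] == n:
                redundant.add(u[1:])
            if occ[u[:-1]] == n:
                redundant.add(u[:-1])

    scored = []
    for s, d in df_pos.items():
        if s in redundant:
            continue
        # Assumed goodness: share of positive texts minus share of negative texts.
        score = d / len(pos) - df_neg[s] / len(neg)
        scored.append((score, s))
    scored.sort(key=lambda p: (-p[0], -len(p[1]), p[1]))
    return scored[:top]


def kwic(texts, key, width=3):
    """List every occurrence of key with `width` characters of context."""
    lines = []
    for t in texts:
        start = t.find(key)
        while start != -1:
            left = t[max(0, start - width):start]
            right = t[start + len(key):start + len(key) + width]
            lines.append(f"{left}[{key}]{right}")
            start = t.find(key, start + 1)
    return lines


# Toy romanized strings standing in for poems (hypothetical data).
positive = ["abcd", "zabc"]
negative = ["bcx", "xyz"]
for score, s in characteristic_substrings(positive, negative):
    print(f"{score:+.2f}  {s}")
print(kwic(positive, "abc", width=1))
```

On the toy data, "ab" is pruned because every occurrence of it is part of "abc", so the expert sees one entry instead of several overlapping fragments. The context listing then shows where the retained string sits in each text.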

Author information

Authors and Affiliations

Department of Informatics, Kyushu University 33, 812-8581, Fukuoka, Japan
Masayuki Takeda & Tetsuya Matsumoto
Fukuoka Jo Gakuin College, 838-0141, Ogōri, Japan
Tomoko Fukuda
Junshin Women’s Junior College, 815-0036, Fukuoka, Japan
Ichirō Nanri

Editor information

Editors and Affiliations

Faculty of Information Science and Electrical Engineering, Department of Informatics, Kyushu University, 6-10-1 Hakozaki, Higashi-ku, 812-8581, Fukuoka, Japan
Setsuo Arikawa
Faculty of Science Department of Information Science, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, 113-0033, Tokyo, Japan
Shinichi Morishita

About this paper

Cite this paper

Takeda, M., Matsumoto, T., Fukuda, T., Nanri, I. (2000). Discovering Characteristic Expressions from Literary Works: a New Text Analysis Method beyond N-Gram Statistics and KWIC. In: Arikawa, S., Morishita, S. (eds) Discovery Science. DS 2000. Lecture Notes in Computer Science, vol 1967. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44418-1_10

DOI: https://doi.org/10.1007/3-540-44418-1_10
Published: 19 October 2001
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-41352-3
Online ISBN: 978-3-540-44418-3
eBook Packages: Springer Book Archive
