Handling Orthographic Varieties in Japanese IR: Fusion of Word-, N-Gram-, and Yomi-Based Indices Across Different Document Collections

Kummer, Nina; Womser-Hacker, Christa; Kando, Noriko

doi:10.1007/11562382_65

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3689)

Included in the following conference series:

Asia Information Retrieval Symposium

Abstract

Orthographic varieties are common in Japanese and pose a serious problem for Japanese information retrieval (IR): a system risks missing documents that contain a variant form of the search term. We propose two strategies for handling orthographic varieties: pronunciation-based (yomi-based) indexing, and “Fuzzy Querying”, which compares katakana terms by edit distance. Both strategies were integrated into our multiple index and fusion system [1] and tested on two test collections, newspaper articles (Mainichi Shimbun ’98) and scientific abstracts (NTCIR-1), to compare their performance across text genres.
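To make the edit-distance comparison of katakana terms concrete, the following is a minimal sketch and not the authors' implementation. It assumes plain Levenshtein distance counted over Unicode code points; the function names `edit_distance` and `fuzzy_variants`, the `max_distance` threshold, and the sample vocabulary are all hypothetical choices made for illustration.

```python
import unicodedata


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, counted in code points.

    Both inputs are NFC-normalized first so that a voiced kana typed as
    base character + combining mark compares equal to its precomposed form.
    """
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion from a
                            curr[j - 1] + 1,      # insertion into a
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def fuzzy_variants(term, vocabulary, max_distance=1):
    """Return vocabulary entries within max_distance edits of term.

    max_distance=1 is an assumed threshold, not a value from the paper.
    """
    return [v for v in vocabulary if edit_distance(term, v) <= max_distance]


# Hypothetical index vocabulary containing two spellings of "computer".
vocabulary = ["コンピューター", "コンピュータ", "コンピューティング", "バイオリン"]
matches = fuzzy_variants("コンピュータ", vocabulary)
```

Here the query コンピュータ also retrieves コンピューター, which differs only by a trailing long-vowel mark. A pair such as ヴァイオリン and バイオリン is two edits apart, so a threshold of 1 would miss it; the choice of threshold trades recall against false matches.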

References

1. Womser-Hacker, C.: An Information Retrieval Prototype for Research and Teaching. In: Eibl, M., Wolff, C., Womser-Hacker, C. (eds.) To appear in Designing Information Systems. Festschrift für Jürgen Krause. Konstanz: Universitätsverlag [Schriften zur Informationswissenschaft] (2005)
2. Halpern, J.: Lexicon-Based Orthographic Disambiguation in CJK Intelligent Information Retrieval. In: Proceedings of the 19th Conference on Computational Linguistics, COLING 2002, Taipei, Taiwan, August 24–September 1 (2002)
3. Halpern, J.: The Challenges of Intelligent Japanese Searching. In: Working paper. The CJK Dictionary Institute, Saitama (2000), www.cjk.org/cjk/joa/joapaper.htm (revised 2003)
4. Kummer, N., Womser-Hacker, C., Kando, N.: Handling Orthographic Varieties in Japanese Information Retrieval: Fusion of Word-, N-gram-, and Yomi-Based Indices across Different Document Collections. NII Technical Report (2005)
5. Gospodnetić, O., Hatcher, E.: Lucene in Action. Manning, Canada (2004)
6. Yoshioka, M., Kuriyama, K., Kando, N.: Analysis of the Usage of Japanese Segmented Texts in NTCIR Workshop 2. In: Proceedings of the Second NTCIR Workshop on Research in Chinese and Japanese Text Retrieval and Text Summarization, pp. 291–296. National Institute of Informatics, Tokyo (2002)
7. Ozawa, T., Yamamoto, M., Umemura, K., Church, K.W.: Japanese Word Segmentation Using Similarity Measure for IR. In: Proceedings of the First NTCIR Workshop on Research in Japanese Text Retrieval and Term Recognition, Tokyo, Japan, August 30–September 1, pp. 89–96 (1999)
8. Jones, G.J.F., Sakai, T., Kajiura, M., Sumita, K.: Experiments in Japanese Text Retrieval and Routing Using the NEAT System. In: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia, pp. 197–205 (1998)
9. Sakai, T., Shibazaki, Y., Suzuki, M., Kajiura, M., Manabe, T., Sumita, K.: Cross-Language Information Retrieval for NTCIR at Toshiba. In: Proceedings of the First NTCIR Workshop on Research in Japanese Text Retrieval and Term Recognition, Tokyo, Japan, August 30–September 1, pp. 137–144 (1999)
10. Vines, P., Wilkinson, R.: Experiments with Japanese Text Retrieval Using mg. In: Proceedings of the First NTCIR Workshop on Research in Japanese Text Retrieval and Term Recognition, Tokyo, Japan, August 30–September 1, pp. 97–100 (1999)
11. Chow, K.C.W., Luk, R.W.P., Wong, K.-F., Kwok, K.-L.: Hybrid Term Indexing for Different IR Models. In: Proceedings of the Fifth International Workshop on Information Retrieval with Asian Languages, Hong Kong, China, pp. 49–54 (2000)
12. Luk, R.W.P., Wong, K.-F., Kwok, K.-L.: Hybrid Term Indexing: An Evaluation. In: Proceedings of the Second NTCIR Workshop on Research in Chinese and Japanese Text Retrieval and Text Summarization, pp. 130–136. National Institute of Informatics, Tokyo (2001)
13. Savoy, J.: Report on CLIR Task for the NTCIR-4 Evaluation Campaign. In: Proceedings of the Fourth NTCIR Workshop on Research in Information Retrieval, Automatic Text Summarization and Question Answering, pp. 178–185 (2004)
14. Kummer, N., Womser-Hacker, C., Kando, N.: Re-Examination of Japanese Indexing: Fusion of Word-, N-gram- and Yomi-Based Indices. In: Proceedings of the 11th Annual Meeting of The Association for Natural Language Processing, March 14–18, pp. 221–224. University of Kagawa, Kagawa Prefecture (2005)

Author information

Authors and Affiliations

Universität Hildesheim, Germany
Nina Kummer & Christa Womser-Hacker
National Institute of Informatics, Tokyo, Japan
Nina Kummer & Noriko Kando

Editor information

Editors and Affiliations

Department of Computer Science and Engineering, Pohang University of Science and Technology, San 31, Hyoja-dong, Nam-gu, 790-784, Pohang, Korea
Gary Geunbae Lee
Computer and Communication Media Research, NEC Corp., Miyazaki 4-1-1, Miyamae-ku, 216-8555, Kawasaki, Japan
Akio Yamada
Human-Computer Communications Laboratory, Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Hong Kong
Helen Meng
School of Engineering, Information and Communications University, 119, Munjiro, Yuseong-gu, 305-732, Daejeon, Korea
Sung Hyon Myaeng

Cite this paper

Kummer, N., Womser-Hacker, C., Kando, N. (2005). Handling Orthographic Varieties in Japanese IR: Fusion of Word-, N-Gram-, and Yomi-Based Indices Across Different Document Collections. In: Lee, G.G., Yamada, A., Meng, H., Myaeng, S.H. (eds) Information Retrieval Technology. AIRS 2005. Lecture Notes in Computer Science, vol 3689. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11562382_65

DOI: https://doi.org/10.1007/11562382_65
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29186-2
Online ISBN: 978-3-540-32001-2
eBook Packages: Computer Science, Computer Science (R0)
