Abstract
Orthographic varieties are common in Japanese and pose a serious problem for Japanese information retrieval (IR): a system risks missing documents that contain variant forms of the search term. We propose two strategies for handling orthographic varieties: pronunciation-based (yomi-based) indexing, and “Fuzzy Querying”, which compares katakana terms by edit distance. Both strategies were integrated into our multiple-index and fusion system [1] and tested on two test collections, newspaper articles (Mainichi Shimbun ’98) and scientific abstracts (NTCIR-1), to compare their performance across text genres.
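To illustrate comparing katakana terms by edit distance, the following is a minimal sketch and not the authors' implementation. It assumes plain Levenshtein distance counted over Unicode code points; the function names `edit_distance` and `fuzzy_match`, the sample vocabulary, and the threshold `max_distance=1` are hypothetical choices made for this example.

```python
import unicodedata


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, counted in code points."""
    # Normalise to NFC so that voiced kana such as ピ or ヴ are single
    # code points regardless of how the input was encoded.
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    # prev[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def fuzzy_match(query: str, vocabulary: list[str], max_distance: int = 1) -> list[str]:
    """Return vocabulary terms within max_distance edits of the query.

    max_distance is an assumed threshold chosen for illustration only.
    """
    return [term for term in vocabulary
            if edit_distance(query, term) <= max_distance]


if __name__ == "__main__":
    # Hypothetical index vocabulary with two spellings of "computer".
    vocabulary = ["コンピューター", "コンピュータ", "コンビニ"]
    print(fuzzy_match("コンピュータ", vocabulary))
```

With this threshold, a query for コンピュータ also retrieves the long-vowel variant コンピューター, while the unrelated term コンビニ is excluded. A pair such as ヴァイオリン and バイオリン differs by two edits, so a threshold of 1 would miss it; choosing the threshold is a trade-off between recall and false matches.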
References
Womser-Hacker, C.: An Information Retrieval Prototype for Research and Teaching. In: Eibl, M., Wolff, C., Womser-Hacker, C. (eds.) To appear in Designing Information Systems. Festschrift für Jürgen Krause. Konstanz: Universitätsverlag [Schriften zur Informationswissenschaft] (2005)
Halpern, J.: Lexicon-Based Orthographic Disambiguation in CJK Intelligent Information Retrieval. In: Proceedings of the 19th Conference on Computational Linguistics, COLING 2002, Taipei, Taiwan, August 24–September 1 (2002)
Halpern, J.: The Challenges of Intelligent Japanese Searching. In: Working paper. The CJK Dictionary Institute, Saitama (2000), www.cjk.org/cjk/joa/joapaper.htm (revised 2003)
Kummer, N., Womser-Hacker, C., Kando, N.: Handling Orthographic Varieties in Japanese Information Retrieval: Fusion of Word-, N-gram-, and Yomi-Based Indices across Different Document Collections. NII Technical Report (2005)
Gospodnetić, O., Hatcher, E.: Lucene in Action. Manning, Canada (2004)
Yoshioka, M., Kuriyama, K., Kando, N.: Analysis of the Usage of Japanese Segmented Texts in NTCIR Workshop 2. In: Proceedings of the Second NTCIR Workshop on Research in Chinese and Japanese Text Retrieval and Text Summarization, pp. 291–296. National Institute of Informatics, Tokyo (2002)
Ozawa, T., Yamamoto, M., Umemura, K., Church, K.W.: Japanese Word Segmentation Using Similarity Measure for IR. In: Proceedings of the First NTCIR Workshop on Research in Japanese Text Retrieval and Term Recognition, Tokyo, Japan, August 30–September 1, pp. 89–96 (1999)
Jones, G.J.F., Sakai, T., Kajiura, M., Sumita, K.: Experiments in Japanese Text Retrieval and Routing Using the NEAT System. In: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia, pp. 197–205 (1998)
Sakai, T., Shibazaki, Y., Suzuki, M., Kajiura, M., Manabe, T., Sumita, K.: Cross-Language Information Retrieval for NTCIR at Toshiba. In: Proceedings of the First NTCIR Workshop on Research in Japanese Text Retrieval and Term Recognition, Tokyo, Japan, August 30–September 1, pp. 137–144 (1999)
Vines, P., Wilkinson, R.: Experiments with Japanese Text Retrieval Using mg. In: Proceedings of the First NTCIR Workshop on Research in Japanese Text Retrieval and Term Recognition, Tokyo, Japan, August 30–September 1, pp. 97–100 (1999)
Chow, K.C.W., Luk, R.W.P., Wong, K.-F., Kwok, K.-L.: Hybrid Term Indexing for Different IR Models. In: Proceedings of the Fifth International Workshop on Information Retrieval with Asian Languages, Hong Kong, China, pp. 49–54 (2000)
Luk, R.W.P., Wong, K.-F., Kwok, K.-L.: Hybrid Term Indexing: An Evaluation. In: Proceedings of the Second NTCIR Workshop on Research in Chinese and Japanese Text Retrieval and Text Summarization, pp. 130–136. National Institute of Informatics, Tokyo (2001)
Savoy, J.: Report on CLIR Task for the NTCIR-4 Evaluation Campaign. In: Proceedings of the Fourth NTCIR Workshop on Research in Information Retrieval, Automatic Text Summarization and Question Answering, pp. 178–185 (2004)
Kummer, N., Womser-Hacker, C., Kando, N.: Re-Examination of Japanese Indexing: Fusion of Word-, N-gram- and Yomi-Based Indices. In: Proceedings of the 11th Annual Meeting of The Association for Natural Language Processing, March 14–18, pp. 221–224. University of Kagawa, Kagawa Prefecture (2005)
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kummer, N., Womser-Hacker, C., Kando, N. (2005). Handling Orthographic Varieties in Japanese IR: Fusion of Word-, N-Gram-, and Yomi-Based Indices Across Different Document Collections. In: Lee, G.G., Yamada, A., Meng, H., Myaeng, S.H. (eds) Information Retrieval Technology. AIRS 2005. Lecture Notes in Computer Science, vol 3689. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11562382_65
DOI: https://doi.org/10.1007/11562382_65
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29186-2
Online ISBN: 978-3-540-32001-2
eBook Packages: Computer Science (R0)