Generating Search Term Variants for Text Collections with Historic Spellings

Ernst-Gerlach, Andrea; Fuhr, Norbert

doi:10.1007/11735106_6

Andrea Ernst-Gerlach & Norbert Fuhr

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3936)

Abstract

In this paper, we describe a new approach for retrieval in texts with non-standard spelling, which is important for historic texts in English or German. For this purpose, we present a new algorithm for generating search term variants in ancient orthography. By applying a spell checker to a corpus of historic texts, we generate a list of candidate terms, to which the contemporary spellings have to be assigned manually. From these pairs, our algorithm produces a set of probabilistic rules. The rule probabilities can be used for ranking in the retrieval stage. An experimental comparison shows that our approach outperforms competing methods.
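The idea of expanding a contemporary search term into ranked historic variants can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the rule format, the two rules, their probabilities, and the function name `generate_variants` are assumptions made for illustration only. Each hypothetical rule rewrites a modern substring into a historic one, and a variant's score is taken as the product of the probabilities of the rules applied to it.

```python
from typing import Dict, List, Tuple

# Hypothetical rules: (modern substring, historic substring, probability).
# The values are invented for illustration and are not taken from the paper.
RULES: List[Tuple[str, str, float]] = [
    ("t", "th", 0.4),   # e.g. German "teil" -> "theil"
    ("ei", "ey", 0.3),  # e.g. German "sein" -> "seyn"
]


def generate_variants(
    term: str, rules: List[Tuple[str, str, float]] = RULES
) -> List[Tuple[str, float]]:
    """Return (variant, score) pairs sorted by descending score.

    The unchanged term keeps score 1.0. Each rule is applied at every
    position where its modern substring occurs, one position at a time,
    to all variants produced by the earlier rules.
    """
    variants: Dict[str, float] = {term: 1.0}
    for modern, historic, prob in rules:
        # Iterate over a snapshot so variants created by this rule
        # are not rewritten again by the same rule.
        for variant, score in list(variants.items()):
            start = variant.find(modern)
            while start != -1:
                candidate = (
                    variant[:start] + historic + variant[start + len(modern):]
                )
                new_score = score * prob
                # Keep the best score if a variant is reachable in several ways.
                if new_score > variants.get(candidate, 0.0):
                    variants[candidate] = new_score
                start = variant.find(modern, start + 1)
    return sorted(variants.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    for variant, score in generate_variants("teil"):
        print(f"{variant}\t{score:.2f}")
```

In a retrieval setting, the scores could serve as weights on the expanded query terms, so that documents matching a more probable variant rank higher. How the paper actually derives and combines rule probabilities is not specified in the abstract.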

Author information

Authors and Affiliations

University of Duisburg-Essen, Germany
Andrea Ernst-Gerlach & Norbert Fuhr

Editor information

Editors and Affiliations

Queen Mary, University of London, London, UK
Mounia Lalmas
Department of Information Science, City University, Northampton Square, EC1V 0HB, London, UK
Andy MacFarlane
Knowledge Media Institute, The Open University, MK7 6AA, Milton Keynes, UK
Stefan Rüger
Queen Mary University of London, UK
Anastasios Tombros
CWI, Amsterdam, The Netherlands
Theodora Tsikrika
Department of Computing, Imperial College London, South Kensington Campus, SW7 2AZ, London, UK
Alexei Yavlinsky

About this paper

Cite this paper

Ernst-Gerlach, A., Fuhr, N. (2006). Generating Search Term Variants for Text Collections with Historic Spellings. In: Lalmas, M., MacFarlane, A., Rüger, S., Tombros, A., Tsikrika, T., Yavlinsky, A. (eds) Advances in Information Retrieval. ECIR 2006. Lecture Notes in Computer Science, vol 3936. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11735106_6

DOI: https://doi.org/10.1007/11735106_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-33347-0
Online ISBN: 978-3-540-33348-7
eBook Packages: Computer Science, Computer Science (R0)
