
Generating Search Term Variants for Text Collections with Historic Spellings

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3936)

Abstract

In this paper, we describe a new approach for retrieval in texts with non-standard spelling, which is important for historic texts in English or German. For this purpose, we present a new algorithm for generating search term variants in historic orthography. By applying a spell checker to a corpus of historic texts, we generate a list of candidate terms, to which the contemporary spellings have to be assigned manually. From these pairs of historic and contemporary spellings, our algorithm produces a set of probabilistic rules. The rule probabilities can then be used for ranking in the retrieval stage. An experimental comparison shows that our approach outperforms competing methods.
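The abstract does not spell out the rule format or how the probabilities are estimated, so the following Python sketch only illustrates the general pipeline under simplifying assumptions. It aligns each manually assigned pair of contemporary and historic spellings with `difflib.SequenceMatcher`, treats every differing substring as a rewrite rule, and estimates a rule's probability as the share of contemporary training words containing its left-hand side in which the rule was actually observed. The function names, the training pairs, and the estimator are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from difflib import SequenceMatcher

# Illustrative (contemporary, historic) pairs; in the paper's workflow such
# pairs come from spell-checker candidates that were assigned manually.
TRAINING_PAIRS = [
    ("teil", "theil"), ("tal", "thal"), ("tür", "thür"),
    ("tag", "tag"), ("sein", "seyn"), ("bei", "bey"),
]


def extract_rules(modern, historic):
    """Return (modern_substring, historic_substring) rewrites for one word pair."""
    rules = []
    opcodes = SequenceMatcher(a=modern, b=historic, autojunk=False).get_opcodes()
    for tag, i1, i2, j1, j2 in opcodes:
        if tag == "equal":
            continue
        if tag == "insert":
            # A pure insertion has an empty left side; anchor it on a
            # neighbouring character so the rule can be applied later.
            if i1 > 0:
                i1, j1 = i1 - 1, j1 - 1
            elif i2 < len(modern):
                i2, j2 = i2 + 1, j2 + 1
        if i1 < i2:
            rules.append((modern[i1:i2], historic[j1:j2]))
    return rules


def learn_rules(pairs):
    """Assumed estimator: pairs showing the rule / modern words containing its left side."""
    observed = Counter()
    for modern, historic in pairs:
        observed.update(set(extract_rules(modern, historic)))
    probabilities = {}
    for (lhs, rhs), count in observed.items():
        support = sum(1 for modern, _ in pairs if lhs in modern)
        probabilities[(lhs, rhs)] = count / support
    return probabilities


def generate_variants(term, probabilities):
    """Apply each rule once at each matching position; rank by rule probability."""
    variants = {}
    for (lhs, rhs), p in probabilities.items():
        start = term.find(lhs)
        while start != -1:
            variant = term[:start] + rhs + term[start + len(lhs):]
            variants[variant] = max(p, variants.get(variant, 0.0))
            start = term.find(lhs, start + 1)
    return sorted(variants.items(), key=lambda item: item[1], reverse=True)


rules = learn_rules(TRAINING_PAIRS)
ranked = generate_variants("teil", rules)
```

With these toy pairs, the learned rules are `t → th` and `i → y`, and the query term `teil` expands to `theil` ranked ahead of `teyl`. A retrieval engine could weight matches on each variant by its score. Applying only one rule per variant keeps the sketch short; combining several rules in one variant would need an additional assumption about how their probabilities interact.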


Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ernst-Gerlach, A., Fuhr, N. (2006). Generating Search Term Variants for Text Collections with Historic Spellings. In: Lalmas, M., MacFarlane, A., Rüger, S., Tombros, A., Tsikrika, T., Yavlinsky, A. (eds) Advances in Information Retrieval. ECIR 2006. Lecture Notes in Computer Science, vol 3936. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11735106_6

  • DOI: https://doi.org/10.1007/11735106_6

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-33347-0

  • Online ISBN: 978-3-540-33348-7

  • eBook Packages: Computer Science, Computer Science (R0)
