A Cross-Language Approach to Historic Document Retrieval

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3936)

Abstract

Our cultural heritage, as preserved in libraries, archives and museums, is made up of documents written many centuries ago. Large-scale digitization initiatives make these documents available to non-expert users through digital libraries and vertical search engines. For a user, querying a historic document collection may be a disappointing experience: queries involving modern words may not be very effective for retrieving documents that contain many historic terms. We propose a cross-language approach to historic document retrieval, and investigate (1) the automatic construction of translation resources for historic languages, and (2) the retrieval of historic documents using cross-language information retrieval techniques. Our experimental evidence is based on a collection of 17th-century Dutch documents and a set of 25 known-item topics in modern Dutch. Our main findings are as follows. First, we are able to automatically construct rules for modernizing historic language based on comparing (a) phonetic sequence similarity, (b) the relative frequency of consonant and vowel sequences, and (c) the relative frequency of character n-gram sequences, of historic and modern corpora. Second, modern queries are not very effective for retrieving historic documents, but the historic language tools lead to a substantial improvement in retrieval effectiveness. The improvements are above and beyond the improvement due to using a modern stemming algorithm (whose effectiveness actually goes up when the historic language is modernized).
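To make item (c) of the first finding more concrete, the following is a minimal sketch of comparing relative character n-gram frequencies between a historic and a modern word sample, followed by a simple substring rewrite step. It is not the authors' implementation: the function names, the ratio threshold, the toy word lists and the single rewrite rule are all assumptions made for illustration, and the abstract does not say how candidate sequences are turned into rules.

```python
from collections import Counter


def ngram_rel_freq(words, n):
    """Relative frequency of each character n-gram across a list of words."""
    counts = Counter(
        word[i:i + n] for word in words for i in range(len(word) - n + 1)
    )
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def historic_ngrams(historic_words, modern_words, n=2, ratio=2.0):
    """N-grams at least `ratio` times more frequent in the historic sample.

    N-grams absent from the modern sample are treated as typical of the
    historic one. The threshold is an illustrative assumption.
    """
    hist = ngram_rel_freq(historic_words, n)
    mod = ngram_rel_freq(modern_words, n)
    return sorted(
        gram for gram, freq in hist.items()
        if gram not in mod or freq / mod[gram] >= ratio
    )


def modernize(word, rules):
    """Apply (historic, modern) substring rewrite rules in the given order."""
    for old, new in rules:
        word = word.replace(old, new)
    return word


# Toy word lists, invented for illustration (not taken from the paper's corpora).
historic = ["coninck", "volck", "ick"]
modern = ["koning", "volk", "ik"]

candidates = historic_ngrams(historic, modern)

# Hand-written hypothetical rule; deriving rules automatically is the paper's
# contribution and is not reproduced here.
rules = [("ck", "k")]
modernized = [modernize(word, rules) for word in historic]
```

In this toy setting the bigram `ck` is flagged as typical of the historic sample, and the hypothetical rule rewrites `volck` to `volk`. A single substring rule is not enough for every word (`coninck` becomes `conink`, not `koning`), which suggests why the abstract combines several kinds of evidence.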

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Koolen, M., Adriaans, F., Kamps, J., de Rijke, M. (2006). A Cross-Language Approach to Historic Document Retrieval. In: Lalmas, M., MacFarlane, A., Rüger, S., Tombros, A., Tsikrika, T., Yavlinsky, A. (eds) Advances in Information Retrieval. ECIR 2006. Lecture Notes in Computer Science, vol 3936. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11735106_36

  • DOI: https://doi.org/10.1007/11735106_36

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-33347-0

  • Online ISBN: 978-3-540-33348-7

  • eBook Packages: Computer Science, Computer Science (R0)
