A Cross-Language Approach to Historic Document Retrieval

Koolen, Marijn; Adriaans, Frans; Kamps, Jaap; de Rijke, Maarten

doi:10.1007/11735106_36

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3936)

Abstract

Our cultural heritage, as preserved in libraries, archives and museums, is made up of documents written many centuries ago. Large-scale digitization initiatives make these documents available to non-expert users through digital libraries and vertical search engines. For a user, querying a historic document collection may be a disappointing experience: queries involving modern words may not be very effective for retrieving documents that contain many historic terms. We propose a cross-language approach to historic document retrieval, and investigate (1) the automatic construction of translation resources for historic languages, and (2) the retrieval of historic documents using cross-language information retrieval techniques. Our experimental evidence is based on a collection of 17th century Dutch documents and a set of 25 known-item topics in modern Dutch. Our main findings are as follows: First, we are able to automatically construct rules for modernizing historic language based on comparing (a) phonetic sequence similarity, (b) the relative frequency of consonant and vowel sequences, and (c) the relative frequency of character n-gram sequences, of historic and modern corpora. Second, modern queries are not very effective for retrieving historic documents, but the historic language tools lead to a substantial improvement in retrieval effectiveness. The improvements are above and beyond the improvement due to using a modern stemming algorithm (whose effectiveness actually goes up when the historic language is modernized).
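
As a rough illustration of finding (c), the sketch below compares the relative frequency of character n-grams in a historic and a modern word sample and ranks the n-grams that are most over-represented in the historic sample. It is a minimal sketch under stated assumptions, not the authors' rule-construction algorithm: the function names, the difference-based score, and the four-word toy samples are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def ngram_rel_freq(words, n):
    """Relative frequency of each character n-gram over a list of words."""
    counts = Counter(
        word[i:i + n]
        for word in words
        for i in range(len(word) - n + 1)
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


def overrepresented_ngrams(historic_words, modern_words, n=2, top=5):
    """Rank n-grams by (historic relative frequency - modern relative frequency).

    The difference score is an assumption made for this sketch; the abstract
    does not specify how the frequencies are compared.
    """
    hist = ngram_rel_freq(historic_words, n)
    mod = ngram_rel_freq(modern_words, n)
    score = {gram: hist[gram] - mod.get(gram, 0.0) for gram in hist}
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(score, key=lambda gram: (-score[gram], gram))
    return ranked[:top]


# Toy samples (illustrative spellings, not data from the paper).
historic = ["jaer", "staet", "volck", "ghemaeckt"]
modern = ["jaar", "staat", "volk", "gemaakt"]

print(overrepresented_ngrams(historic, modern, n=2, top=2))  # ['ae', 'ck']
```

On samples this small, only the top few n-grams are meaningful. A realistic setting would compare full corpora and then pair each historic sequence with a modern replacement (for example "ae" with "aa") to obtain a rewrite rule.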

Author information

Authors and Affiliations

ISLA, University of Amsterdam, The Netherlands
Marijn Koolen, Frans Adriaans, Jaap Kamps & Maarten de Rijke
Archives and Information Studies, University of Amsterdam, The Netherlands
Marijn Koolen & Jaap Kamps
Utrecht Institute of Linguistics OTS, Utrecht University, The Netherlands
Frans Adriaans

Editor information

Editors and Affiliations

Queen Mary, University of London, London, UK
Mounia Lalmas
Department of Information Science, City University, Northampton Square, EC1V 0HB, London, UK
Andy MacFarlane
Knowledge Media Institute, The Open University, MK7 6AA, Milton Keynes, UK
Stefan Rüger
Queen Mary University of London, UK
Anastasios Tombros
CWI, Amsterdam, The Netherlands
Theodora Tsikrika
Department of Computing, Imperial College London, South Kensington Campus, SW7 2AZ, London, UK
Alexei Yavlinsky

About this paper

Cite this paper

Koolen, M., Adriaans, F., Kamps, J., de Rijke, M. (2006). A Cross-Language Approach to Historic Document Retrieval. In: Lalmas, M., MacFarlane, A., Rüger, S., Tombros, A., Tsikrika, T., Yavlinsky, A. (eds) Advances in Information Retrieval. ECIR 2006. Lecture Notes in Computer Science, vol 3936. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11735106_36

DOI: https://doi.org/10.1007/11735106_36
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-33347-0
Online ISBN: 978-3-540-33348-7
eBook Packages: Computer Science, Computer Science (R0)
