
Information Retrieval Strategies for Digitized Handwritten Medieval Documents

  • Conference paper
Information Retrieval Technology (AIRS 2011)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 7097)


Abstract

This paper describes and evaluates different IR models and search strategies for digitized manuscripts. Written during the thirteenth century, these manuscripts were digitized using an imperfect recognition system with a word error rate of around 6%. Having access to the internal representation used during the recognition stage, we were able to produce four automatic transcriptions, each introducing some form of spelling correction in an attempt to improve retrieval effectiveness. We evaluated the retrieval effectiveness of each version using three text representations combined with five IR models, three stemming strategies and two query formulations. We employed a manually transcribed, error-free version to define the ground truth. Based on our experiments, we conclude that taking into account only the single best recognition word, or all possible top-k recognition alternatives, does not provide the best performance. Selecting all words whose log-likelihood is close to that of the best alternative yields the best text surrogate. Within this representation, the different retrieval strategies tend to produce similar performance levels.
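The central finding of the abstract concerns how the text surrogate is built from the recognizer output. The Python sketch below (not the authors' code; the data layout, the margin value and all identifiers are illustrative assumptions) shows one way to keep, at each word position, every recognition alternative whose log-likelihood lies within a margin of the best-scoring alternative, rather than keeping only the single best word or a fixed top-k list.

# A minimal sketch of the surrogate-building idea described in the abstract:
# for each word position, keep every recognition alternative whose
# log-likelihood is within a chosen margin of the best alternative.
# The data layout, margin value and names below are illustrative assumptions.

from typing import List, Tuple

# One word position = a list of (candidate_word, log_likelihood) pairs,
# e.g. the top-k output of a handwriting recognizer for that position.
Alternatives = List[Tuple[str, float]]


def build_surrogate(positions: List[Alternatives], margin: float = 1.0) -> List[List[str]]:
    """For each position, keep all candidates whose log-likelihood is
    within `margin` of the best-scoring candidate at that position."""
    surrogate = []
    for candidates in positions:
        if not candidates:
            continue
        best_ll = max(ll for _, ll in candidates)
        kept = [word for word, ll in candidates if best_ll - ll <= margin]
        surrogate.append(kept)
    return surrogate


if __name__ == "__main__":
    # Toy example: three word positions with recognizer alternatives.
    recognized = [
        [("lancelot", -2.1), ("lancelet", -2.3), ("lanzelot", -5.0)],
        [("ritter", -1.0)],
        [("gral", -3.2), ("graal", -3.25), ("grab", -7.9)],
    ]
    print(build_surrogate(recognized, margin=1.0))
    # -> [['lancelot', 'lancelet'], ['ritter'], ['gral', 'graal']]

In this toy run, near-ties such as "gral" and "graal" are both retained for indexing while clearly weaker alternatives are discarded; the paper's actual selection criterion and threshold may differ.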







Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Naji, N., Savoy, J. (2011). Information Retrieval Strategies for Digitized Handwritten Medieval Documents. In: Salem, M.V.M., Shaalan, K., Oroumchian, F., Shakery, A., Khelalfa, H. (eds) Information Retrieval Technology. AIRS 2011. Lecture Notes in Computer Science, vol 7097. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25631-8_10


  • DOI: https://doi.org/10.1007/978-3-642-25631-8_10

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-25630-1

  • Online ISBN: 978-3-642-25631-8

  • eBook Packages: Computer Science (R0)
