Retrieving Time from Scanned Books

  • Conference paper
Advances in Information Retrieval (ECIR 2015)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9022)

Abstract

While millions of scanned books have become available in recent years, this vast collection of data remains under-utilized. Book search is often limited to summaries or metadata, and connecting information to primary sources can be a challenge.

Even though digital books provide rich historical information on all subjects, leveraging this data is difficult. To explore how we can access this historical information, we study the problem of identifying relevant times for a given query. That is, given a user query or a description of an event, we attempt to use historical sources to locate that event in time.

We use state-of-the-art NLP tools to identify and extract mentions of times present in our corpus, and then propose a number of models for organizing this historical information.
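As a rough illustration of this kind of pipeline, the sketch below finds four-digit year mentions with a regular expression (a simple stand-in for a full temporal tagger, not the tools used in the paper), pools the text of every sentence that mentions a year under that year, and ranks years for a query by term overlap. The function names, the toy sentences, and the scoring rule are all assumptions made for illustration; they are not the models proposed by the authors.

```python
import re
from collections import Counter, defaultdict

# Assumption: a 4-digit number from 1000 to 1999 counts as a time mention.
# A real temporal tagger also handles full dates, ranges, and relative phrases.
YEAR_RE = re.compile(r"\b(1[0-9]{3})\b")
TOKEN_RE = re.compile(r"[a-z]+")


def tokenize(text):
    """Lowercase a string and split it into alphabetic tokens."""
    return TOKEN_RE.findall(text.lower())


def build_year_index(sentences):
    """Pool the terms of every sentence that mentions a year under that year."""
    index = defaultdict(Counter)
    for sentence in sentences:
        terms = tokenize(sentence)
        for year in set(YEAR_RE.findall(sentence)):
            index[int(year)].update(terms)
    return index


def rank_years(query, index):
    """Rank years by the fraction of query terms found in their pooled text."""
    q_terms = set(tokenize(query))
    scores = {}
    for year, counts in index.items():
        matched = sum(1 for term in q_terms if counts[term] > 0)
        if matched:
            scores[year] = matched / len(q_terms)
    # Highest score first; ties broken by earlier year.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy sentences standing in for book text (hypothetical, not from the corpus).
sentences = [
    "King Philip's War began in 1675 in New England.",
    "The treaty was signed in 1678, ending the war in the north.",
    "A great fire destroyed much of London in 1666.",
]
index = build_year_index(sentences)
ranking = rank_years("great fire of London", index)
print(ranking)
```

Pooling text per year is only one possible way to organize the extracted mentions; keeping each mention with a smaller window of surrounding text is another, and the two differ in how much context each year carries and how large the index becomes.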

Since no ground-truth data is readily available for our task, we automatically derive dated event descriptions from Wikipedia, leveraging both the wisdom of the crowd and the wisdom of experts. Using 15,000 events dated between the years 1000 and 1925 as queries, we evaluate our approach on a collection of 50,000 books from the Internet Archive. We discuss the tradeoffs between context, retrieval performance, and efficiency.
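To make the evaluation setup concrete, here is a minimal sketch of how a ranked list of years could be scored against the known year of each event. Mean reciprocal rank is used purely as an assumed example measure, since the abstract does not say which measures the paper reports, and the function names and sample runs are hypothetical.

```python
def reciprocal_rank(ranked_years, true_year):
    """Return 1/position of the true year in the ranking, or 0.0 if absent."""
    for position, year in enumerate(ranked_years, start=1):
        if year == true_year:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs):
    """Average the reciprocal rank over (ranked_years, true_year) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, t) for r, t in runs) / len(runs)


# Hypothetical runs: each pairs a system's ranked years with the event's year.
runs = [
    ([1666, 1665], 1666),        # correct year ranked first
    ([1676, 1675, 1678], 1675),  # correct year ranked second
    ([1492], 1500),              # correct year not retrieved
]
mrr = mean_reciprocal_rank(runs)
print(mrr)
```

A measure like this rewards only an exact year match; a graded measure that gives partial credit to nearby years would be a natural alternative for this task.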

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Foley, J., Allan, J. (2015). Retrieving Time from Scanned Books. In: Hanbury, A., Kazai, G., Rauber, A., Fuhr, N. (eds) Advances in Information Retrieval. ECIR 2015. Lecture Notes in Computer Science, vol 9022. Springer, Cham. https://doi.org/10.1007/978-3-319-16354-3_24

  • DOI: https://doi.org/10.1007/978-3-319-16354-3_24

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-16353-6

  • Online ISBN: 978-3-319-16354-3

  • eBook Packages: Computer Science, Computer Science (R0)
