Using LDA and Time Series Analysis for Timestamping Documents

Advances in Time Series Analysis and Forecasting (ITISE 2016)

Part of the book series: Contributions to Statistics ((CONTRIB.STAT.))

Abstract

Identifying when a book was published is an important problem: solving it might help with authorship identification and could also shed light on the realities of human society in different periods. In this paper, we present an attempt to estimate the publication date of books based on a time series analysis of their content. The main assumption of this experiment is that the subject of a book is often specific to a time period. Therefore, given many training books from similar periods, it should be possible to use topic modeling to learn a model that can timestamp other books. To validate the assumption, we built a corpus of 10 thousand books and used Latent Dirichlet Allocation (LDA) to extract topics from them. Then, we extracted the time series of particular terms from each topic using the Google Books N-gram Corpus. By heuristically combining the words’ time series and the topics from a document, we built that document’s time series. Finally, we applied peak detection algorithms to timestamp the document.
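
The abstract does not say how the words' time series and the topics are combined, nor which peak detection algorithm is used. The following is only a minimal sketch of one possible reading, under loud assumptions: the combination is a plain weighted sum (document-topic weight times topic-term weight), peak detection is reduced to taking the maximum of a smoothed series, and every name and number below (`document_series`, `estimate_year`, the toy frequencies) is hypothetical and invented for illustration, not taken from the paper.

```python
from typing import Dict, List

# Toy yearly grid and term frequencies (invented; a real run would read
# these from the Google Books N-gram Corpus).
YEARS: List[int] = [1800, 1850, 1900, 1950, 2000]
TERM_SERIES: Dict[str, List[float]] = {
    "telegraph": [0, 1, 5, 2, 0],
    "railway": [0, 2, 6, 3, 1],
    "internet": [0, 0, 0, 1, 9],
}
# Hypothetical LDA output: top terms per topic and topic mix of one document.
TOPIC_TERMS: Dict[str, Dict[str, float]] = {
    "transport": {"telegraph": 0.5, "railway": 0.5},
    "computing": {"internet": 1.0},
}
DOC_TOPICS: Dict[str, float] = {"transport": 0.8, "computing": 0.2}


def normalize(series: List[float]) -> List[float]:
    """Scale a series so its values sum to 1 (unchanged if all zero)."""
    total = sum(series)
    return [v / total for v in series] if total > 0 else list(series)


def document_series(doc_topics, topic_terms, term_series, n_years):
    """Assumed heuristic: sum term series weighted by topic and term weights."""
    combined = [0.0] * n_years
    for topic, topic_weight in doc_topics.items():
        for term, term_weight in topic_terms[topic].items():
            series = term_series.get(term)
            if series is None:  # term has no n-gram data; skip it
                continue
            for i, value in enumerate(normalize(series)):
                combined[i] += topic_weight * term_weight * value
    return combined


def moving_average(series: List[float], window: int = 3) -> List[float]:
    """Centered moving average; the window shrinks at the edges."""
    half = window // 2
    smoothed = []
    for i in range(len(series)):
        chunk = series[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def estimate_year(series: List[float], years: List[int], window: int = 3) -> int:
    """Stand-in for peak detection: year of the highest smoothed value."""
    smoothed = moving_average(series, window)
    best = max(range(len(smoothed)), key=smoothed.__getitem__)
    return years[best]


doc_series = document_series(DOC_TOPICS, TOPIC_TERMS, TERM_SERIES, len(YEARS))
print(estimate_year(doc_series, YEARS, window=1))
```

Normalizing each term series before summing keeps very frequent words from dominating the document series; whether the original heuristic does this is not stated. With so few grid points the smoothing window can shift the estimated year, so the usage line above sets `window=1` to disable smoothing.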

Acknowledgements

This work has been funded by University Politehnica of Bucharest, through the “Excellence Research Grants” Program, UPB – GEX. Identifier: UPB–EXCELENȚĂ–2016 Aplicarea metodelor de învățare automată în analiza seriilor de timp (Applying machine learning techniques in time series analysis), Contract number 09/26.09.2016.

Author information

Corresponding author

Correspondence to Costin-Gabriel Chiru.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Chiru, CG., Sarker, B. (2017). Using LDA and Time Series Analysis for Timestamping Documents. In: Rojas, I., Pomares, H., Valenzuela, O. (eds) Advances in Time Series Analysis and Forecasting. ITISE 2016. Contributions to Statistics. Springer, Cham. https://doi.org/10.1007/978-3-319-55789-2_4
