Sub-document Timestamping: A Study on the Content Creation Dynamics of Web Documents

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9819)

Abstract

The creation time of documents is an important kind of information in temporal information retrieval, especially for document clustering, timeline construction, and search engine improvements. Because content on the Web is continually created, updated, and deleted, the common assumption that each document has only one creation time does not hold for Web documents. In this paper, we investigate to what extent this assumption is wrong. We introduce two methods to timestamp individual parts (sub-documents) of Web documents and analyze in detail the creation and update dynamics of three classes of Web documents.

Notes

  1. https://archive.org/.

  2. This time range was chosen due to our experimental data, cf. Sect. 4.

  3. http://nlp.stanford.edu/software/corenlp.shtml.

  4. http://www.biography.com/people/barack-obama-12782369.

  5. CRF++: https://taku910.github.io/crfpp/.

  6. http://www.lemurproject.org/clueweb12.php/.

  7. Specifically, we sampled from Disk1 of the ClueWeb12 corpus.

  8. We mean here all versions available on IA, not just those with changed content.

  9. http://www.mansci.uwaterloo.ca/~msmucker/cw12spam/.

  10. \(max\_features\) is 3 and 6, C-value is \(9 \times 10^{-6}\).

  11. McNemar’s test was employed for statistical significance testing, with \(p < 0.01\).

  12. \(max\_features\) is 5 and 13, C-value is \(9 \times 10^{-5}\).

  13. \(max\_features\) is 7 and 11, C-value is \(1 \times 10^{-6}\).
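Note 11 states that McNemar’s test was used at \(p < 0.01\), but this page does not say which variant of the test was applied. As a rough illustration only, the sketch below assumes the two-sided exact (binomial) form, computed from the discordant pairs of two classifiers evaluated on the same items. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
from math import comb


def discordant_counts(gold, pred_a, pred_b):
    """Count items that exactly one of the two systems labels correctly.

    Returns (b, c): b = only system A correct, c = only system B correct.
    """
    b = c = 0
    for g, a, p in zip(gold, pred_a, pred_b):
        a_ok, p_ok = (a == g), (p == g)
        if a_ok and not p_ok:
            b += 1
        elif p_ok and not a_ok:
            c += 1
    return b, c


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value (assumed variant, see lead-in).

    Under the null hypothesis, each discordant pair falls on either side
    with probability 0.5, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Toy data, purely illustrative.
    gold = [1, 1, 0, 0]
    system_a = [1, 0, 0, 1]
    system_b = [1, 1, 1, 1]
    b, c = discordant_counts(gold, system_a, system_b)
    p_value = mcnemar_exact(b, c)
    print(b, c, p_value, p_value < 0.01)
```

Only the discordant pairs enter the statistic; items that both systems get right or both get wrong do not affect the p-value.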

Author information

Corresponding author

Correspondence to Yue Zhao.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Zhao, Y., Hauff, C. (2016). Sub-document Timestamping: A Study on the Content Creation Dynamics of Web Documents. In: Fuhr, N., Kovács, L., Risse, T., Nejdl, W. (eds) Research and Advanced Technology for Digital Libraries. TPDL 2016. Lecture Notes in Computer Science, vol 9819. Springer, Cham. https://doi.org/10.1007/978-3-319-43997-6_16

  • DOI: https://doi.org/10.1007/978-3-319-43997-6_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-43996-9

  • Online ISBN: 978-3-319-43997-6

  • eBook Packages: Computer Science, Computer Science (R0)
