Sub-document Timestamping: A Study on the Content Creation Dynamics of Web Documents

Zhao, Yue; Hauff, Claudia

doi:10.1007/978-3-319-43997-6_16


Conference paper
First Online: 10 August 2016

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9819)

Abstract

The creation time of a document is an important piece of information in temporal information retrieval, especially for document clustering, timeline construction, and search engine improvements. Given the manner in which content on the Web is created, updated, and deleted, the common assumption that each document has only one creation time is not suitable for Web documents. In this paper, we investigate to what extent this assumption is wrong. We introduce two methods to timestamp individual parts (sub-documents) of Web documents and analyze in detail the creation and update dynamics of three classes of Web documents.

Notes

1. https://archive.org/.
2. This time range was chosen due to our experimental data, cf. Sect. 4.
3. http://nlp.stanford.edu/software/corenlp.shtml.
4. http://www.biography.com/people/barack-obama-12782369.
5. CRF++: https://taku910.github.io/crfpp/.
6. http://www.lemurproject.org/clueweb12.php/.
7. Specifically, we sampled from Disk1 of the ClueWeb12 corpus.
8. We mean here all versions available on IA, not just those with changed content.
9. http://www.mansci.uwaterloo.ca/~msmucker/cw12spam/.
10. \(max\_features\) is 3 and 6, C-value is \(9 \times 10^{-6}\).
11. McNemar’s test was employed for statistical significance testing, with \(p < 0.01\).
12. \(max\_features\) is 5 and 13, C-value is \(9 \times 10^{-5}\).
13. \(max\_features\) is 7 and 11, C-value is \(1 \times 10^{-6}\).

Author information

Authors and Affiliations

Delft University of Technology, Delft, The Netherlands
Yue Zhao & Claudia Hauff

Corresponding author

Correspondence to Yue Zhao.

Editor information

Editors and Affiliations

Universität Duisburg-Essen, Duisburg, Germany
Norbert Fuhr
Hungarian Academy of Science, Budapest, Hungary
László Kovács
Leibniz Universität Hannover, Hannover, Germany
Thomas Risse
Leibniz Universität Hannover, Hannover, Germany
Wolfgang Nejdl

Cite this paper

Zhao, Y., Hauff, C. (2016). Sub-document Timestamping: A Study on the Content Creation Dynamics of Web Documents. In: Fuhr, N., Kovács, L., Risse, T., Nejdl, W. (eds) Research and Advanced Technology for Digital Libraries. TPDL 2016. Lecture Notes in Computer Science, vol 9819. Springer, Cham. https://doi.org/10.1007/978-3-319-43997-6_16

DOI: https://doi.org/10.1007/978-3-319-43997-6_16
Published: 10 August 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-43996-9
Online ISBN: 978-3-319-43997-6
eBook Packages: Computer Science, Computer Science (R0)
