Creating Synthetic Temporal Document Collections for Web Archive Benchmarking

Nørvåg, Kjetil; Nybø, Albert Overskeid

doi:10.1007/3-540-33880-2_18

Part of the book series: Studies in Computational Intelligence (SCI, volume 23)

Abstract

Research on web archives requires large temporal document collections in order to compare and evaluate new strategies and algorithms. Because such collections are not easily available, an alternative is to create synthetic ones. This paper describes how to generate synthetic temporal document collections and how this is realized in the TDocGen temporal document generator, and it presents a study of the quality of the document collections created by TDocGen.
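To make the idea of a synthetic temporal document collection concrete, the following is a minimal sketch and not TDocGen's actual algorithm, which the abstract does not detail. It assumes that words are drawn from a Zipf-like frequency distribution and that each new version of a document replaces a fixed fraction of the previous version's words. All names and parameters (`generate_collection`, `change_rate`, the `w0`, `w1`, ... vocabulary) are hypothetical and chosen only for illustration.

```python
import random
from itertools import accumulate


def zipf_cum_weights(vocab_size, s=1.0):
    """Cumulative weights proportional to 1 / rank**s (a Zipf-like assumption)."""
    return list(accumulate(1.0 / (rank ** s) for rank in range(1, vocab_size + 1)))


def make_document(rng, vocab, cum_weights, length):
    """Draw `length` words independently from the weighted vocabulary."""
    return rng.choices(vocab, cum_weights=cum_weights, k=length)


def next_version(rng, words, vocab, cum_weights, change_rate):
    """Create a new version: each word is redrawn with probability `change_rate`."""
    return [
        rng.choices(vocab, cum_weights=cum_weights, k=1)[0]
        if rng.random() < change_rate
        else word
        for word in words
    ]


def generate_collection(num_docs=3, num_versions=4, doc_length=50,
                        vocab_size=1000, change_rate=0.1, seed=0):
    """Return {doc_id: [(timestamp, words), ...]} with one entry per version."""
    rng = random.Random(seed)  # seeded so the collection is reproducible
    vocab = [f"w{i}" for i in range(vocab_size)]  # hypothetical vocabulary
    cum_weights = zipf_cum_weights(vocab_size)
    collection = {}
    for d in range(num_docs):
        versions = [make_document(rng, vocab, cum_weights, doc_length)]
        for _ in range(num_versions - 1):
            versions.append(
                next_version(rng, versions[-1], vocab, cum_weights, change_rate)
            )
        # Timestamps here are simply version numbers 0, 1, 2, ...
        collection[f"doc{d}"] = list(enumerate(versions))
    return collection


if __name__ == "__main__":
    for doc_id, versions in generate_collection(num_docs=2, doc_length=8).items():
        for timestamp, words in versions:
            print(doc_id, timestamp, " ".join(words))
```

Seeding the generator makes the collection reproducible, which matters when the same synthetic data must be regenerated to compare different strategies. A realistic generator would also need to model document creation and deletion, varying document lengths, and non-uniform update times, none of which this sketch attempts.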

Author information

Authors and Affiliations

Norwegian University of Science and Technology, 7491, Trondheim, Norway
Kjetil Nørvåg & Albert Overskeid Nybø

Editor information

Editors and Affiliations

Department of Information Systems Engineering, Ben-Gurion University of the Negev, Beer-Sheva, 84105, Israel
Mark Last
Institute of Computer Sciences, Technical University of Lodz, ul. Wolczanska 215, 93-1005, Lodz, Poland
Piotr S. Szczepaniak
Systems Research Institute, Polish Academy of Sciences, ul. Newelska 6, 01-447, Warsaw, Poland
Piotr S. Szczepaniak
Department of Software Engineering, ORT Braude College, POB. 78, 21982, Karmiel, Israel
Zeev Volkovich
Department of Computer Science and Engineering, University of South Florida, 4202 E. Fowler Ave., ENB 118, Tampa, FL, 33620, USA
Abraham Kandel

About this chapter

Cite this chapter

Nørvåg, K., Nybø, A.O. (2006). Creating Synthetic Temporal Document Collections for Web Archive Benchmarking. In: Last, M., Szczepaniak, P.S., Volkovich, Z., Kandel, A. (eds) Advances in Web Intelligence and Data Mining. Studies in Computational Intelligence, vol 23. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-33880-2_18

DOI: https://doi.org/10.1007/3-540-33880-2_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-33879-6
Online ISBN: 978-3-540-33880-2
eBook Packages: Engineering, Engineering (R0)
