Abstract
Research on web archives requires large temporal document collections in order to compare and evaluate new strategies and algorithms. Such collections are not easily available, and an alternative is to create synthetic document collections. In this paper we describe how to generate synthetic temporal document collections and how this is realized in the TDocGen temporal document generator, and we present a study of the quality of the document collections created by TDocGen.
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Nørvåg, K., Nybø, A.O. (2006). Creating Synthetic Temporal Document Collections for Web Archive Benchmarking. In: Last, M., Szczepaniak, P.S., Volkovich, Z., Kandel, A. (eds) Advances in Web Intelligence and Data Mining. Studies in Computational Intelligence, vol 23. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-33880-2_18
DOI: https://doi.org/10.1007/3-540-33880-2_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-33879-6
Online ISBN: 978-3-540-33880-2
eBook Packages: Engineering, Engineering (R0)