Creating Synthetic Temporal Document Collections for Web Archive Benchmarking

  • Chapter
Advances in Web Intelligence and Data Mining

Part of the book series: Studies in Computational Intelligence (SCI, volume 23)

Abstract

Research on web archives requires large temporal document collections in order to compare and evaluate new strategies and algorithms. Because such collections are not easily available, an alternative is to create synthetic ones. In this paper we describe how to generate synthetic temporal document collections and how this is realized in the TDocGen temporal document generator, and we present a study of the quality of the document collections created by TDocGen.

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Nørvåg, K., Nybø, A.O. (2006). Creating Synthetic Temporal Document Collections for Web Archive Benchmarking. In: Last, M., Szczepaniak, P.S., Volkovich, Z., Kandel, A. (eds) Advances in Web Intelligence and Data Mining. Studies in Computational Intelligence, vol 23. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-33880-2_18

  • DOI: https://doi.org/10.1007/3-540-33880-2_18

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-33879-6

  • Online ISBN: 978-3-540-33880-2

  • eBook Packages: Engineering, Engineering (R0)
