Abstract
Conventional Web archives are created by periodically crawling a Web site and archiving the responses from the Web server. Although easy to implement and commonly deployed, this form of archiving typically misses updates that occur between crawls and may not be suitable for all preservation scenarios, for example a site that is required (perhaps for records compliance) to keep a copy of every page it has served. In contrast, transactional archives work in conjunction with a Web server to record all content that has been served. Los Alamos National Laboratory has developed SiteStory, an open-source transactional archive written in Java that runs on Apache Web servers and provides a Memento-compatible access interface and WARC file export features. We used Apache's ApacheBench utility on a pre-release version of SiteStory to measure response time and content delivery time in different environments. The performance tests were designed to determine the feasibility of SiteStory as a production-level solution for high-fidelity automatic Web archiving. We found that SiteStory does not significantly affect content server performance when it is performing transactional archiving. Content server performance slows from 0.076 seconds to 0.086 seconds per Web page access when the content server is under load, and from 0.15 seconds to 0.21 seconds when the resource has many embedded and changing resources.
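The reported timings can also be read as relative overhead. The following is a minimal sketch that uses only the four per-page access times quoted in the abstract; the function name `overhead` and the scenario labels are illustrative and do not come from the paper or from ApacheBench.

```python
def overhead(baseline_s, archived_s):
    """Return (seconds added per page access, percent increase over baseline)."""
    added = archived_s - baseline_s
    return added, 100.0 * added / baseline_s


# Per-page access times in seconds, as (without archiving, with archiving).
# The labels are shorthand for the two scenarios named in the abstract.
scenarios = {
    "content server under load": (0.076, 0.086),
    "many embedded, changing resources": (0.15, 0.21),
}

for name, (base, with_archive) in scenarios.items():
    added, pct = overhead(base, with_archive)
    print(f"{name}: +{added * 1000:.0f} ms per page ({pct:.1f}% slower)")
```

Under these figures, the added cost is about 10 ms per page (roughly 13%) under load and about 60 ms per page (roughly 40%) for pages with many embedded and changing resources.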
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Brunelle, J.F., Nelson, M.L., Balakireva, L., Sanderson, R., Van de Sompel, H. (2013). Evaluating the SiteStory Transactional Web Archive with the ApacheBench Tool. In: Aalberg, T., Papatheodorou, C., Dobreva, M., Tsakonas, G., Farrugia, C.J. (eds) Research and Advanced Technology for Digital Libraries. TPDL 2013. Lecture Notes in Computer Science, vol 8092. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40501-3_20
DOI: https://doi.org/10.1007/978-3-642-40501-3_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40500-6
Online ISBN: 978-3-642-40501-3
eBook Packages: Computer Science, Computer Science (R0)