Abstract
With the popularity of the World Wide Web and the growing recognition that it is worth archiving, numerous projects aim to create large-scale repositories containing excerpts and snapshots of Web data. Interfaces are being created that allow users to surf through time, analyze the evolution of Web pages, or retrieve information through search. Yet the timeline and metadata available in such a Web archive make possible additional analyses that go beyond mere information exploration. In this paper we present the AOLAP project, which builds a Data Warehouse of such a Web archive and allows it to be analyzed and explored from different points of view using OLAP technologies. Specifically, technological aspects such as the operating systems and Web servers used, geographic location, and Web technology such as the use of file types, forms, or scripting languages may be used to infer, for example, technology maturation or impact.
Part of this work was done while the author was an ERCIM Research Fellow at IEI, Consiglio Nazionale delle Ricerche (CNR), Pisa, Italy.
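As a rough illustration of the kind of multidimensional analysis the abstract describes, the sketch below counts archived documents along freely chosen dimensions, which is the basic roll-up operation in OLAP. The fact rows, the field names (`year`, `server`, `domain`, `file_type`), and the `roll_up` helper are hypothetical and invented for this example; they are not taken from the AOLAP system or its schema.

```python
from collections import Counter

# Hypothetical fact rows, one per archived document (all values are invented).
FACTS = [
    {"year": 2001, "server": "Apache", "domain": "ac.at", "file_type": "html"},
    {"year": 2001, "server": "IIS", "domain": "co.at", "file_type": "pdf"},
    {"year": 2002, "server": "Apache", "domain": "co.at", "file_type": "html"},
    {"year": 2002, "server": "Apache", "domain": "ac.at", "file_type": "pdf"},
]


def roll_up(facts, *dimensions):
    """Count documents grouped by the chosen dimensions (an OLAP-style roll-up)."""
    return Counter(tuple(row[d] for d in dimensions) for row in facts)


# Two views of the same facts: by (year, server) and by file type alone.
by_year_and_server = roll_up(FACTS, "year", "server")
by_file_type = roll_up(FACTS, "file_type")
```

Choosing a different set of dimensions in the call to `roll_up` yields a different point of view on the same data, for example Web server share per year versus file type usage overall.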
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Rauber, A., Aschenbrenner, A., Witvoet, O. (2002). Austrian Online Archive Processing: Analyzing Archives of the World Wide Web. In: Agosti, M., Thanos, C. (eds) Research and Advanced Technology for Digital Libraries. ECDL 2002. Lecture Notes in Computer Science, vol 2458. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45747-X_2
DOI: https://doi.org/10.1007/3-540-45747-X_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44178-6
Online ISBN: 978-3-540-45747-3
eBook Packages: Springer Book Archive