Abstract
Electronic journals are becoming a major source of scientific information. Researchers interested only in certain topics do not have time to scan all possibly relevant journals on a regular basis. A digital library can assist them by providing a uniform, searchable interface for electronic journals. For this purpose, the digital library must establish a catalogue of metadata on the available journals, such as the authors and titles of articles. If there is no cooperation with journal publishers, this metadata must be extracted from the publishers’ Web Sites, overcoming the intrinsic heterogeneity problems.
Within the framework of the ongoing Natural Sciences Digital Library project at the Free University of Berlin, we have designed a wrapper-mediator mechanism that copes with the heterogeneity problems of automatic metadata acquisition. It is based on our generic HyperView methodology for the integration of Web Sites, from which it inherits two elegant and effective features. First, the structure of the publisher site is specified with abstract graph-schemata, instead of being hard-coded in scripts for data acquisition. Second, a powerful view concept based on declarative graph-transformation rules is used for information extraction.
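The two features named above can be illustrated with a minimal sketch. It is not HyperView's actual notation or implementation: the schema layout, the rule format, and all names (`SITE_SCHEMA`, `Node`, `conforms`, `extract`, `RULE`, the sample graph) are hypothetical and chosen only to show the idea of describing a site's structure as data and extracting metadata with a declarative rule rather than a hand-written script.

```python
from dataclasses import dataclass, field

# Hypothetical graph schema: for each node type, the allowed outgoing
# edge labels and the node type each label must lead to.
SITE_SCHEMA = {
    "JournalPage": {"issue": "IssuePage"},
    "IssuePage": {"article": "ArticleEntry"},
    "ArticleEntry": {},
}


@dataclass
class Node:
    """One node of the instance graph built from a publisher site."""
    type: str
    attrs: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (label, target node id)


def conforms(graph, schema):
    """Return True if every edge in the graph is permitted by the schema."""
    for node in graph.values():
        allowed = schema[node.type]
        for label, target in node.edges:
            if allowed.get(label) != graph[target].type:
                return False
    return True


# Hypothetical declarative rule: follow the edge labels in "match" from
# the root, then copy the listed attributes into a metadata record.
RULE = {
    "match": ["issue", "article"],
    "emit": {"title": "title", "authors": "authors"},
}


def extract(graph, root, rule):
    """Apply a rule to the graph and return the resulting metadata records."""
    frontier = [root]
    for label in rule["match"]:
        frontier = [
            target
            for node_id in frontier
            for (edge_label, target) in graph[node_id].edges
            if edge_label == label
        ]
    return [
        {out: graph[node_id].attrs[src] for out, src in rule["emit"].items()}
        for node_id in frontier
    ]


# Invented sample data standing in for one publisher site.
graph = {
    "j1": Node("JournalPage", {"name": "Example Journal"}, [("issue", "i1")]),
    "i1": Node("IssuePage", {"volume": "1"},
               [("article", "a1"), ("article", "a2")]),
    "a1": Node("ArticleEntry",
               {"title": "First Paper", "authors": ["A. Author"]}),
    "a2": Node("ArticleEntry",
               {"title": "Second Paper", "authors": ["B. Writer", "C. Coauthor"]}),
}

records = extract(graph, "j1", RULE)
```

In this sketch, supporting a differently structured publisher site would mean supplying another schema and rule, while `conforms` and `extract` stay unchanged.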
Supported by the German Research Society, Berlin-Brandenburg Graduate School on Distributed Information Systems (DFG grant no. GRK 316)
© 1998 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Faulstich, L.C., Spiliopoulou, M. (1998). Building HyperView Wrappers for Publisher Web Sites. In: Nikolaou, C., Stephanidis, C. (eds) Research and Advanced Technology for Digital Libraries. ECDL 1998. Lecture Notes in Computer Science, vol 1513. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-49653-X_8
DOI: https://doi.org/10.1007/3-540-49653-X_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-65101-7
Online ISBN: 978-3-540-49653-3
eBook Packages: Springer Book Archive