Abstract
The driving force behind the technology revolution has always been just one thing: information. Almost every computer-related invention since the transistor has been made to help transfer a piece of information, or data, from one place to another. Although a primitive form of what we now know as the Internet already existed, less than one generation ago digital information mostly had to be carried around on magnetic media such as tapes and disks. Fortunately, the rise of the Internet and the World Wide Web in the mid-1990s removed the barrier that the physical transportation of data had placed on us.
Copyright information
© 2005 Kluwer Academic Publishers
About this chapter
Cite this chapter
Myllymaki, J., Jackson, J. (2005). Web Data Extraction Techniques and Applications Using the Extensible Markup Language (XML). In: Leondes, C.T. (eds) Intelligent Knowledge-Based Systems. Springer, Boston, MA. https://doi.org/10.1007/978-1-4020-7829-3_18
DOI: https://doi.org/10.1007/978-1-4020-7829-3_18
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4020-7746-3
Online ISBN: 978-1-4020-7829-3
eBook Packages: Computer Science, Computer Science (R0)