Strategies for Extracting Data from HTML and XML Content
In this chapter, we compare different approaches to parsing XML and HTML documents and extracting data from them into R. We illustrate these approaches with comprehensive, real-world examples that use XPath and R functions for processing XML documents. We also introduce event-driven parsing, in which a collection of R functions responds to events in the XML parser. Such handler functions work with both tree-based (DOM) parsing and SAX parsing, where we avoid building the tree. By the end of the chapter, the reader should have a good understanding of the strategies available in R for parsing XML documents and extracting content.
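The contrast between the two main strategies can be sketched in a few lines of R. This is a minimal illustration rather than an example taken from the chapter: it assumes the XML package is installed, and the sample document, the element names (catalog, book, title) and the helper make_title_collector() are all hypothetical. The first part builds the tree and queries it with XPath; the second part uses SAX-style handler functions and never builds the tree.

```r
library(XML)  # assumption: the XML package is installed

# Hypothetical sample document, invented for this sketch.
txt <- '<catalog>
  <book id="b1"><title>XML in a Nutshell</title></book>
  <book id="b2"><title>Beginning XML</title></book>
</catalog>'

# Strategy 1: tree-based (DOM) parsing, then an XPath query.
doc <- xmlParse(txt, asText = TRUE)
titles_dom <- xpathSApply(doc, "//book/title", xmlValue)

# Strategy 2: event-driven (SAX) parsing with handler functions.
# The handlers share state through the enclosing function's environment.
make_title_collector <- function() {
  titles <- character()
  in_title <- FALSE
  buffer <- ""
  handlers <- list(
    startElement = function(name, attrs, ...) {
      # Start collecting text when a <title> element opens.
      if (name == "title") {
        in_title <<- TRUE
        buffer <<- ""
      }
    },
    text = function(content, ...) {
      # Text may arrive in several pieces, so append to a buffer.
      if (in_title) buffer <<- paste0(buffer, content)
    },
    endElement = function(name, ...) {
      # Store the collected text when the <title> element closes.
      if (name == "title") {
        titles <<- c(titles, buffer)
        in_title <<- FALSE
      }
    }
  )
  list(handlers = handlers, result = function() titles)
}

collector <- make_title_collector()
invisible(xmlEventParse(txt, handlers = collector$handlers, asText = TRUE))
titles_sax <- collector$result()
```

Both titles_dom and titles_sax should hold the same titles. The XPath version is shorter and usually the easier place to start; the handler version needs more bookkeeping, but it avoids holding the whole tree in memory, which matters for very large documents.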