Parsing XML Content
In this chapter, we explore approaches to parsing XML content within R and extracting information from the various types of elements in an XML document. The primary approach is to parse the document into a hierarchical tree object. We show how the tree representation of an XML document (described in Chapter 2) can be treated as a list in R, which makes it easy to navigate the document's nodes and branches. In addition, we demonstrate how to use functions in the XML package that are designed to work with different elements of the tree, such as functions for accessing node names, text content, attribute values, and namespaces. Subsequent chapters introduce XPath (Chapter 4), a powerful XML technology for locating content in an XML document, and describe more complex strategies for extracting XML content (Chapter 5).
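As a rough illustration of the list-style navigation described above, the following minimal sketch parses a small document with the XML package and reads a node name, attributes, and text content. The sample document, its element names, and its attribute names are hypothetical and invented for this example; only the XML package functions (xmlParse, xmlRoot, xmlName, xmlSize, xmlAttrs, xmlValue, xmlSApply, xmlGetAttr) are real.

```r
library(XML)

# Hypothetical sample document, invented for illustration only.
txt <- paste0(
  '<quakes source="example">',
  '<quake id="a1" mag="4.5"><place>Alaska</place></quake>',
  '<quake id="b2" mag="5.1"><place>Chile</place></quake>',
  '</quakes>'
)

# Parse the text into a tree and get its top-level node.
doc  <- xmlParse(txt, asText = TRUE)
root <- xmlRoot(doc)

root_name  <- xmlName(root)   # name of the root element
n_children <- xmlSize(root)   # number of child nodes

# List-style indexing: by position, then by child element name.
first       <- root[[1]]
first_attrs <- xmlAttrs(first)             # named character vector
first_place <- xmlValue(first[["place"]])  # text content of <place>

# Apply a function to every child, here reading one attribute from each.
mags <- as.numeric(xmlSApply(root, xmlGetAttr, "mag"))
```

Indexing with `[[` by position or by name mirrors how an ordinary R list is accessed, which is why the tree can be walked without any XML-specific query language. For larger documents, the XPath approach introduced in Chapter 4 is likely to be more concise.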