Scraping Data from HTML Forms
In Chapter 5, we saw how to scrape data from HTML pages. In this chapter, we focus on a variation in which we obtain the Web page containing the data by submitting an HTML form. Rather than using a Web browser, we submit the form from R, supplying the inputs that parameterize the request from data in R. We use functions in the RCurl package, such as getForm() and postForm(), to request the Web pages, and then we scrape the data using the XML package and XPath. Since forms are themselves described in HTML documents, in many cases we can programmatically query the form to learn its parameters, their default values, and their possible values. We can convert this information into an R function that acts as a proxy for the HTML form. Such generated functions aim to provide higher-level facilities and concepts than getForm() and postForm(). The key ideas in this chapter are (1) to get data programmatically via HTML forms as part of a reproducible workflow, and (2) to further automate the creation of the code that we use to get these data.
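As a rough illustration of querying a form for its parameters and wrapping it in a proxy function, the following is a minimal sketch, not the chapter's own code. The form markup, the URL http://www.example.com/search, and the names buildParams() and submitForm() are all hypothetical; only the XML and RCurl functions named in the comments come from those packages. The form is inlined so that everything except the final request runs without network access.

```r
library(XML)    # htmlParse(), getNodeSet(), xpathSApply(), xmlGetAttr()

# Hypothetical form, inlined as text (assumption: a real form would be
# fetched from the page that hosts it).
formHTML <- '
<html><body>
<form action="http://www.example.com/search" method="get">
  <input type="text" name="carrier" value="AA"/>
  <input type="radio" name="year" value="2007" checked="checked"/>
  <input type="radio" name="year" value="2008"/>
  <select name="origin">
    <option value="SFO" selected="selected">San Francisco</option>
    <option value="JFK">New York</option>
  </select>
</form>
</body></html>'

doc  <- htmlParse(formHTML, asText = TRUE)
form <- getNodeSet(doc, "//form")[[1]]

# Where and how the form is submitted.
action <- xmlGetAttr(form, "action")
method <- tolower(xmlGetAttr(form, "method", "get"))

# Collect the default value of each parameter.
defaults <- list()
for (n in getNodeSet(form, ".//input[@type='text']"))
  defaults[[xmlGetAttr(n, "name")]] <- xmlGetAttr(n, "value", "")
for (n in getNodeSet(form, ".//input[@type='radio'][@checked]"))
  defaults[[xmlGetAttr(n, "name")]] <- xmlGetAttr(n, "value")
for (s in getNodeSet(form, ".//select")) {
  opt <- getNodeSet(s, "./option[@selected]")
  if (length(opt) > 0)
    defaults[[xmlGetAttr(s, "name")]] <- xmlGetAttr(opt[[1]], "value")
}

# Possible values for the radio-button parameter "year".
yearChoices <- xpathSApply(form, ".//input[@type='radio' and @name='year']",
                           xmlGetAttr, "value")

# Merge caller-supplied values over the form's defaults (hypothetical helper).
buildParams <- function(...) modifyList(defaults, list(...))

# Hypothetical proxy for the form: submit with GET or POST as the form asks.
# Not called here, since it requires network access and a real server.
submitForm <- function(...) {
  params <- buildParams(...)
  if (method == "post")
    RCurl::postForm(action, .params = params, style = "POST")
  else
    RCurl::getForm(action, .params = unlist(params))
}
```

A call such as submitForm(year = "2008") would then override one default and leave the others as the form specifies them; the returned page could be parsed with htmlParse() and queried with XPath as in Chapter 5.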
Keywords: Data Frame, Developer Tool, XPath Query, Query String, Radio Button