Abstract
In this chapter, you will learn how to use Beautiful Soup, a lightweight Python library, to navigate HTML and extract content from it easily, without resorting to overly complex regular expressions and hand-written text parsing.
Notes
- 1.
Unless you are lucky. Once I encountered a site where all the links to the remaining pages were there in the HTML code but had been hidden with some JS-magic.
- 2.
OOP: object-oriented programming
- 3.
For example, the Builder or Factory patterns, or a constructor that takes all arguments.
- 4.
- 5.
I have to admit, every time I write CSV files I use spamwriter as my variable’s name. I guess this gives me a global understanding of what’s happening.
- 6.
Set theory: https://en.wikipedia.org/wiki/Union_(set_theory)
- 7.
- 8.
- 9.
Object-relational mapping
- 10.
I have worked since 2007 with ORM tools, and I like the idea, but some queries can become quite complex.
- 11.
- 12.
Hard cache: Get all information from the cache, and refuse any attempt to gather anything from the Internet. This makes scraping more consistent between runs.
- 13.
For more information, visit: https://blake2.net/
- 14.
Alternatively, to be more consistent, you can create a downloader, which hides the cache from the users of your code.
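The hard cache and downloader ideas from notes 12 and 14 can be sketched roughly as follows. This is a minimal illustration and not the chapter's own code: the names `Downloader`, `HardCacheMiss`, and `fetch_from_web` are hypothetical, and using a BLAKE2 digest of the URL as the cache file name is an assumption about how the hash function from note 13 might be applied.

```python
import hashlib
from pathlib import Path
from urllib.request import urlopen


class HardCacheMiss(Exception):
    """Raised when hard-cache mode would have to touch the network."""


def fetch_from_web(url):
    # Plain standard-library download; any HTTP client could be swapped in.
    with urlopen(url) as response:
        return response.read()


class Downloader:
    """Hypothetical downloader that hides the cache from its callers."""

    def __init__(self, cache_dir, hard_cache=False, fetch=fetch_from_web):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.hard_cache = hard_cache
        self.fetch = fetch

    def _path_for(self, url):
        # Assumption: a BLAKE2 digest of the URL serves as a fixed-length,
        # filesystem-safe cache file name.
        digest = hashlib.blake2b(url.encode("utf-8"), digest_size=16).hexdigest()
        return self.cache_dir / digest

    def get(self, url):
        path = self._path_for(url)
        if path.exists():
            # Cache hit: never go to the network.
            return path.read_bytes()
        if self.hard_cache:
            # Hard cache: refuse to download anything that is not cached.
            raise HardCacheMiss(url)
        content = self.fetch(url)
        path.write_bytes(content)
        return content
```

Callers only ever see `get(url)`, so switching between a normal cache and a hard cache is a constructor argument rather than a change to the scraping code.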
Copyright information
© 2018 Gábor László Hajba
About this chapter
Cite this chapter
Hajba, G.L. (2018). Using Beautiful Soup. In: Website Scraping with Python. Apress, Berkeley, CA. https://doi.org/10.1007/978-1-4842-3925-4_3
DOI: https://doi.org/10.1007/978-1-4842-3925-4_3
Publisher Name: Apress, Berkeley, CA
Print ISBN: 978-1-4842-3924-7
Online ISBN: 978-1-4842-3925-4
eBook Packages: Professional and Applied Computing, Apress Access Books, Professional and Applied Computing (R0)