Using Beautiful Soup

Hajba, Gábor László

doi:10.1007/978-1-4842-3925-4_3

Abstract

In this chapter, you will learn how to use Beautiful Soup, a lightweight Python library, to extract and navigate HTML content easily, so that you can forget about overly complex regular expressions and text parsing.

Notes

1. Unless you are lucky. I once encountered a site where all the links to the remaining pages were present in the HTML code but had been hidden with some JS magic.
2. OOP: object-oriented programming
3. For example, the Builder or Factory patterns, or a constructor that takes all arguments.
4. https://docs.python.org/3/library/csv.html
5. I have to admit, every time I write CSV files I use spamwriter as my variable's name. I guess this gives me a global understanding of what's happening.
6. Set theory: https://en.wikipedia.org/wiki/Union_(set_theory)
7. https://docs.python.org/3/library/json.html
8. https://github.com/coleifer/peewee
9. ORM: object-relational mapping
10. I have worked with ORM tools since 2007, and I like the idea, but some queries can become quite complex.
11. https://docs.mongodb.com/getting-started/python/
12. Hard cache: get all information from the cache, and refuse any attempt to gather anything from the Internet. This makes scraping more consistent between runs.
13. For more information, visit: https://blake2.net/
14. Alternatively, to be more consistent, you can create a downloader that hides the cache from the users of your code.
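
Notes 12 to 14 describe a hard cache, BLAKE2, and a downloader that hides the cache from its callers. The chapter's own implementation is not shown here, so the following is only a minimal sketch of how these ideas could fit together. The names `Downloader`, `CacheMiss`, `fetch`, and `hard_cache` are hypothetical, and using a BLAKE2 digest of the URL as the cache file name is an assumption, not something the notes confirm.

```python
import hashlib
from pathlib import Path
from typing import Callable


class CacheMiss(Exception):
    """Raised in hard-cache mode when a URL is not in the cache (hypothetical name)."""


class Downloader:
    """Hides an on-disk cache behind a single get() call (illustrative sketch)."""

    def __init__(self, cache_dir: str, fetch: Callable[[str], str],
                 hard_cache: bool = False) -> None:
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        # fetch is any function that returns the page text for a URL;
        # it is injected so the sketch does not depend on an HTTP library.
        self.fetch = fetch
        self.hard_cache = hard_cache

    def _path_for(self, url: str) -> Path:
        # Assumption: a BLAKE2b digest of the URL serves as a file-system-safe key.
        digest = hashlib.blake2b(url.encode("utf-8"), digest_size=16).hexdigest()
        return self.cache_dir / f"{digest}.html"

    def get(self, url: str) -> str:
        path = self._path_for(url)
        if path.exists():
            # Cache hit: never touch the network.
            return path.read_text(encoding="utf-8")
        if self.hard_cache:
            # Hard cache: refuse any attempt to go to the Internet.
            raise CacheMiss(url)
        content = self.fetch(url)
        path.write_text(content, encoding="utf-8")
        return content
```

Callers only ever see `get()`, so switching a run to hard-cache mode is a matter of passing `hard_cache=True` when the downloader is created.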

Author information

Authors and Affiliations

Sopron, Hungary
Gábor László Hajba

About this chapter

Cite this chapter

Hajba, G.L. (2018). Using Beautiful Soup. In: Website Scraping with Python. Apress, Berkeley, CA. https://doi.org/10.1007/978-1-4842-3925-4_3

DOI: https://doi.org/10.1007/978-1-4842-3925-4_3
Published: 15 September 2018
Publisher Name: Apress, Berkeley, CA
Print ISBN: 978-1-4842-3924-7
Online ISBN: 978-1-4842-3925-4
eBook Packages: Professional and Applied Computing; Apress Access Books; Professional and Applied Computing (R0)