MCS 275 Spring 2022
python3 -m pip install beautifulsoup4
HTML is a language for making documents, meant to be displayed to humans. Avoid having programs read HTML if at all possible.
e.g. look for an API that serves the same data in a structured format (CSV, JSON, ...)
We discuss what you can do if there is no API, and you need to extract information from an HTML document.
Some of these ideas generalize to exploring and extracting data from other file formats.
Level 0: Treat the HTML document as a string and use search operations (
str.find or regexes) to locate something you care about, like
HTML is complicated, and this approach is very error-prone.
Level 1: Use a parser that knows how to recognize start/end tags, attributes, etc., and tell it what to do when it finds them (e.g. call this function...)
html.parser is in the standard library.
This approach is event-based. You specify functions to handle things when they are found, but you don't get an overall picture of the entire document.
These frameworks create a data structure that represents the entire document, supporting various kinds of searching, traversal, and extraction.
The Document Object Model or DOM is a language-independent model for representing a HTML document as a tree of nodes.
Each node represents part of the document, such as a tag, an attribute, or text appearing inside a tag.
<html><head><title>My title</title></head><body><h1>A heading</h1> <a href="https://example.com">Link text</a></body></html>
<p>I <strong>really</strong>like Python.</p>
This package provides a module called
bs4 for turning HTML into a DOM-like data structure.
Widely used, e.g. at one point Reddit's backend software used it to select a representative image from a web page when a URL appeared in a post*.
Requires an HTML parser. We'll use
html.parser from the standard library (slow but always available).
* As of 2014. Perhaps they still use it?
Parse HTML file into DOM:
from bs4 import BeautifulSoup with open("lecture40.html") as fobj: soup = BeautifulSoup(fobj,"html.parser")
Parse web page into DOM:
from urllib.request import urlopen from bs4 import BeautifulSoup with urlopen("https://example.com/") as response: soup = BeautifulSoup(response,"html.parser")
Be careful about the ethics of connecting to web servers from programs.
A program that extracts data from HTML is a scraper
A program that visits all pages on a site is a spider.
All forms of automated access should:
Parse string into DOM:
from bs4 import BeautifulSoup soup = BeautifulSoup( "
That was durian?!", "html.parser" )
str(soup) # show as HTML soup.prettify() # prettier HTML soup.title # first (and only) title tag soup.p # first p tag soup.find("p") # first p tag (alternative) soup.p.strong # first strong tag within the first p tag soup.find_all("a") # list of all a tags
str(tag) # HTML for this tag and everything inside it tag.name # name of the tag, e.g. "a" or "ul" tag.attrs # dict of tag's attributes tag["href"] # get a single attribute tag.text # All the text nodes inside tag, concatenated tag.string # If tag has only text inside it, returns that text # But if it has other tags as well, returns None tag.parent # enclosing tag tag.contents # list of the children of this tag tag.children # iterable of children of this tag tag.banana # first descendant banana tag (sub actual tag name!) tag.find(...) # first descendant meeting criteria tag.find_all(...) # descendants meeting criteria tag.find_next_sibling(...) # next sibling tag meeting criteria
Arguments supported by all the
tag.find_all(True) # all descendants tag.find_all("tagname") # descendants by tag name tag.find_all(href="https://example.com/") # by attribute tag.find_all(class_="post") # by class tag.find_all(re.compile("^fig")) # tag name regex match tag.find_all("a",limit=15) # first 15 a tags tag.find_all("a",recursive=False) # all a *children*
Also work with
soup.select(SELECTOR) returns a list of tags that match a CSS selector, e.g.
soup.select(".wide") # all tags of class "wide" # ul tags within divs of class messagebox soup.select("div.messagebox ul")
There are many CSS selectors and functions we haven't discussed, so this gives a powerful alternative search syntax.
# all third elements of unordered lists soup.select("ul > li:nth-of-type(3)")