{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# MCS 275 Spring 2022 Worksheet 15\n", "\n", "* Course instructor: David Dumas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Topics\n", "\n", "This worksheet focuses on **urllib** and **Beautiful Soup**.\n", "\n", "## Resources\n", "\n", "These resources might be helpful while working on the problems. Remember that for worksheets, we don't strictly limit what resources you can consult, so these are only suggestions.\n", "\n", "\n", "* [Lecture 39 - HTML and CSS](https://dumas.io/teaching/2022/spring/mcs275/slides/lecture39.html)\n", "* [Lecture 40 - Parsing and scraping HTML](https://dumas.io/teaching/2022/spring/mcs275/slides/lecture40.html)\n", "* [Lecture 41 - Beautiful Soup](https://dumas.io/teaching/2022/spring/mcs275/slides/lecture41.html)\n", "* [urllib examples notebook](https://dumas.io/teaching/2022/spring/mcs275/nbview/samplecode/scraping/scraping-demos.html)\n", "* [Beautiful Soup example scraper notebook](https://dumas.io/teaching/2022/spring/mcs275/nbview/samplecode/scraping/mathgradscrape.html)\n", "* [Beautiful Soup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)\n", "* [w3schools HTML tutorial](https://www.w3schools.com/html/)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. HTML prettifier and warning utility\n", "\n", "Use Beautiful Soup to write a script that reads an HTML file and writes equivalent, more nicely indented HTML to an output file. (Recall that Beautiful Soup has a [method to generate nicely indented HTML](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#pretty-printing) for any tag or BeautifulSoup object.)\n", "\n", "Also, if there is no `