{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# MCS 275 Spring 2022 Worksheet 15 Solutions\n", "\n", "* Course instructor: David Dumas\n", "* Solutions prepared by: Jennifer Vaccaro, Johnny Joyce" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Topics\n", "\n", "This worksheet focuses on **urllib** and **Beautiful Soup**.\n", "\n", "## Resources\n", "\n", "These things might be helpful while working on the problems. Remember that for worksheets, we don't strictly limit what resources you can consult, so these are only suggestions.\n", "\n", "\n", "* [Lecture 39 - HTML and CSS](https://dumas.io/teaching/2022/spring/mcs275/slides/lecture39.html)\n", "* [Lecture 40 - Parsing and scraping HTML](https://dumas.io/teaching/2022/spring/mcs275/slides/lecture40.html)\n", "* [Lecture 41 - Beautiful soup](https://dumas.io/teaching/2022/spring/mcs275/slides/lecture41.html)\n", "* [urllib examples notebook](https://dumas.io/teaching/2022/spring/mcs275/nbview/samplecode/scraping/scraping-demos.html)\n", "* [Beautiful Soup example scraper notebook](https://dumas.io/teaching/2022/spring/mcs275/nbview/samplecode/scraping/mathgradscrape.html)\n", "* [Beautiful Soup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)\n", "* [w3schools HTML tutorial](https://www.w3schools.com/html/)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. HTML prettifier and warning utility\n", "\n", "Use Beautiful Soup to write a script that takes an HTML file and writes a version of it with nicer indentation to an output HTML file. 
Also, if there is no `<title>` tag in the input HTML file, the script should print a warning.\n", "\n", "The input HTML filename should be expected as the first command line argument, and the filename to which the prettified HTML is written is the second command line argument.\n", "\n", "For example, if the `in.html` contains\n", "```\n", "<!doctype html><html><head></head><body>\n", "<h1>MCS 275 HTML file</h1></body></html>\n", "```\n", "\n", "Then running\n", "```\n", "python3 prettify.py in.html out.html\n", "```\n", "should print a message\n", "```\n", "Warning: This HTML file has no <title>.\n", "```\n", "and should write the following to `out.html`:\n", "```\n", "<!DOCTYPE html>\n", "<html>\n", " <head>\n", " </head>\n", " <body>\n", "  <h1>\n", "   MCS 275 HTML file\n", "  </h1>\n", " </body>\n", "</html>\n", "```" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# MCS 275 Worksheet 15 Problem 1\n", "# J Vaccaro\n", "# I completed this work myself, in accordance with the syllabus.\n", "\n", "from bs4 import BeautifulSoup\n", "import sys\n", "\n", "# Create a beautiful soup from the filename in the first command line arg\n", "with open(sys.argv[1], \"r\") as infile:\n", "    soup = BeautifulSoup(infile, \"html.parser\")\n", "\n", "# Check whether the soup has a title in the head section, and print a message\n", "if soup.head.title is None:\n", "    print(\"Warning: This HTML file has no <title>. Prettifying anyway!\")\n", "else:\n", "    print(\"Prettifying html with title...\", soup.head.title.string)\n", "\n", "# Write out the prettified soup to the filename in the 2nd command line arg\n", "with open(sys.argv[2], \"wt\") as outfile:\n", "    outfile.write(soup.prettify())" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Side note:\n", "\n", "It's also possible to use this code with a live website, as in the following code:" ] },
{ "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<!DOCTYPE html>\n", "<html>\n", " <head>\n", "  <title>\n", "   Example Domain\n", "  </title>\n", "  <meta charset=\"utf-8\"/>\n", "  <meta content=\"text/html; charset=utf-8\" http-equiv=\"Content-type\"/>\n", "  <meta content=\"width=device-width, initial-scale=1\" name=\"viewport\"/>\n", " </head>\n", " <body>\n", "  <div>\n", "   <h1>\n", "    Example Domain\n", "   </h1>\n", "   <p>\n", "    This domain is for use in illustrative examples in documents. You may use this\n", "    domain in literature without prior coordination or asking for permission.\n", "   </p>\n", "   <p>\n", "    <a href=\"https://www.iana.org/domains/example\">\n", "     More information...\n", "    </a>\n", "   </p>\n", "  </div>\n", " </body>\n", "</html>\n", "\n" ] } ], "source": [ "from bs4 import BeautifulSoup\n", "from urllib.request import urlopen\n", "\n", "with urlopen(\"https://example.com/\") as response:\n", "    soup = BeautifulSoup(response, \"html.parser\")\n", "    print(soup.prettify())" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Complex analysis homework scraper\n", "\n", "Consider this web page for a graduate complex analysis class that was taught at UIC in 2016:\n", "* [Math 535 Spring 2016](https://www.dumas.io/teaching/2016/spring/math535/)\n", "\n", "One section of the page lists weekly homework. Each homework assignment has a number, a title, and a list of problems from various sections of the textbook. Write a scraper that downloads this course web site's HTML, parses it with Beautiful Soup, and creates one dictionary for each homework assignment having the following format\n", "```\n", "{\n", "    \"number\": 10,\n", "    \"title\": \"Harmonic functions\",\n", "    \"problems\": \"Sec 4.6.2(p166): 1,2\\nSec 4.6.4(p171): 1,2,3,4\"\n", "}\n", "```\n", "It should then put these dictionaries into a list and save the list to a JSON file called `math535spring2016homework.json`.\n", "\n", "**Note:** If you finish this problem early, you might find it fun to watch this [animation of the UIC logo distortion](https://www.dumas.io/teaching/2016/spring/math535/images/inverted-logo-animation.gif) that appears on the Math 535 course web page, and see if you can figure out what's going on." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# It's good practice to save the html locally during development. 
\n", "# Here's a short script that saves the html as 'math535.html'\n", "\n", "from urllib.request import urlopen\n", "from bs4 import BeautifulSoup\n", "\n", "with urlopen(\"https://www.dumas.io/teaching/2016/spring/math535/\") as response:\n", "    soup = BeautifulSoup(response, \"html.parser\")\n", "\n", "with open(\"math535.html\", \"wt\") as fout:\n", "    fout.write(soup.prettify())" ] },
{ "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "invalid literal for int() with base 10: 'exercises'\n" ] } ], "source": [ "# MCS 275 Worksheet 15 Problem 2\n", "# J Vaccaro\n", "# I completed this work myself, in accordance with the syllabus.\n", "\n", "from urllib.request import urlopen\n", "from bs4 import BeautifulSoup\n", "\n", "# First, create a beautiful soup, either from the url or a local copy\n", "\n", "## --- Comment this out for the final version ------------\n", "# with open(\"math535.html\", \"rt\") as infile:\n", "#     soup = BeautifulSoup(infile, \"html.parser\")\n", "## -------------------------------------------------------\n", "\n", "# --- Comment this out during development ----------------\n", "with urlopen(\"https://www.dumas.io/teaching/2016/spring/math535/\") as response:\n", "    soup = BeautifulSoup(response, \"html.parser\")\n", "# --------------------------------------------------------\n", "\n", "# We want to make a list of dictionaries, so start with an empty list\n", "hw_data = []\n", "\n", "# The relevant section is in an unordered list inside the \"homework\" div.\n", "hw_ul_tag = soup.find(\"div\", id=\"homework\").ul\n", "\n", "# Iterate through each bullet item in the homeworks list\n", "for hw in hw_ul_tag.find_all(\"li\"):\n", "\n", "    # Not every 535 homework assignment fits the expected format. 
\n", " # If there's an issue parsing, just skip that assignment and continue.\n", " # A sweeping try/except is not always recommended, but neither\n", " # is parsing html.\n", "\n", " try:\n", " # The problems are inside the contents, on lines without other tags.\n", " problems = \"\"\n", "\n", " for prob in hw.contents:\n", " # Convert to string and strip out starting/ending white space\n", " prob = str(prob).strip()\n", " #If the content line has a tag or is whitespace, then skip\n", " if \"<\" in prob or prob == \"\": \n", " continue\n", " #Otherwise, concatenate to problems\n", " else:\n", " problems += \"\\n\" + prob\n", "\n", " # The assignment number and title are all inside the \"b\" tag\n", " heading = hw.b.string.strip()\n", " words = heading.split()\n", " number = int(words[1])\n", " title = \" \".join(words[7:])\n", "\n", " # Create a dictionary with the fields we collected\n", " d = {\"number\":number, \"title\":title, \"problems\":problems}\n", "\n", " # Append the dictionary to the list of dictionaries\n", " hw_data.append(d)\n", "\n", " except Exception as e:\n", " # Skip the homework assignments that don't have the expected format, \n", " # but print the error message.\n", " print(e)\n", " continue\n", "\n", "# Dump out the list-of-dictionaries into a json file.\n", "import json\n", "with open(\"math535spring2016homework.json\", \"wt\") as outf:\n", " json.dump(hw_data, outf)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Capture the tag\n", "\n", "Here is a link to an HTML file:\n", "\n", "* [capture.html](https://www.dumas.io/teaching/2021/spring/mcs275/data/capture.html)\n", "\n", "If you open it in a browser, you won't see anything. The document contains nothing but `` tags, and no text. Some of the `` tags are nested inside other `` tags. How deeply are they nested?\n", "\n", "Every `` tag in this file has an `id` attribute. There is exactly one `` that has greater depth in the the DOM tree than any other. 
What is its `id` attribute?\n", "\n", "Write a Python script to load the HTML file with Beautiful Soup and traverse the DOM to answer these questions." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# It's good practice to save the html locally during development.\n", "# Here's a short script that saves the html as 'capture.html'\n", "\n", "from urllib.request import urlopen\n", "from bs4 import BeautifulSoup\n", "\n", "with urlopen(\"https://www.dumas.io/teaching/2021/spring/mcs275/data/capture.html\") as response:\n", "    soup = BeautifulSoup(response, \"html.parser\")\n", "\n", "with open(\"capture.html\", \"wt\") as fout:\n", "    fout.write(soup.prettify())" ] },
{ "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Maximum depth: 61\n", "Maximum depth: 61 Leaf id: dec0ded\n" ] } ], "source": [ "# MCS 275 Worksheet 15 Problem 3\n", "# J Vaccaro\n", "# I completed this work myself in accordance with the syllabus.\n", "\n", "from urllib.request import urlopen\n", "from bs4 import BeautifulSoup\n", "\n", "def span_tag_depth(tag):\n", "    \"\"\"Recursively walk the span tree under `tag` and return its maximum depth.\"\"\"\n", "    # Track the deepest child depth seen so far\n", "    max_span_depth = 0\n", "\n", "    # Iterate through the child span tags WITHOUT RECURSING\n", "    # i.e. 
only immediate children, not deeper descendants\n", "    for t in tag.find_all(\"span\", recursive=False):\n", "        depth = span_tag_depth(t)\n", "        # If the child's depth is the deepest so far, then replace.\n", "        if depth > max_span_depth:\n", "            max_span_depth = depth\n", "\n", "    # Pass up the maximum depth\n", "    return 1 + max_span_depth\n", "\n", "def span_tag_depth_id(tag):\n", "    \"\"\"Recursively walk the span tree under `tag`; return its maximum depth and the deepest leaf's span id.\"\"\"\n", "    # Set the default depth and id\n", "    max_span_depth = 0\n", "    max_span_id = None\n", "\n", "    # If the current tag is a span and has an id, set it as the default id\n", "    # Then, we will \"pass up\" the leaf id from the longest branch\n", "    if tag.name == \"span\" and tag.has_attr(\"id\"):\n", "        max_span_id = tag[\"id\"]\n", "\n", "    # Iterate through the child span tags WITHOUT RECURSING\n", "    # i.e. only immediate children, not deeper descendants\n", "    for t in tag.find_all(\"span\", recursive=False):\n", "        # Recurse through t's children for the max branch length and leaf id\n", "        t_depth, t_id = span_tag_depth_id(t)\n", "\n", "        # If t has the deepest depth so far, replace the max depth/id.\n", "        if t_depth > max_span_depth:\n", "            max_span_depth = t_depth\n", "            max_span_id = t_id  # leaf id\n", "\n", "    # Return the augmented max_depth and the id of the leaf.\n", "    return 1 + max_span_depth, max_span_id\n", "\n", "# Create a beautiful soup, either from the url or a local copy\n", "\n", "# --- Comment this out for the final version ------------\n", "with open(\"capture.html\", \"rt\") as infile:\n", "    soup = BeautifulSoup(infile, \"html.parser\")\n", "# -------------------------------------------------------\n", "\n", "# # --- Comment this out during development ---------------\n", "# with urlopen(\"https://www.dumas.io/teaching/2021/spring/mcs275/data/capture.html\") as response:\n", "#     soup = BeautifulSoup(response, \"html.parser\")\n", "# # -------------------------------------------------------\n", 
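"\n", "# Quick sanity check on a tiny, hand-built soup (illustrative only, not\n", "# part of the worksheet input): a chain of three nested spans should have\n", "# depth 3, with the innermost id reported as the leaf.\n", "tiny_soup = BeautifulSoup(\"<span id='a'><span id='b'><span id='c'></span></span></span>\", \"html.parser\")\n", "assert span_tag_depth(tiny_soup.span) == 3\n", "assert span_tag_depth_id(tiny_soup.span) == (3, \"c\")\n", "\n", 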
"print(\"Maximum depth:\", span_tag_depth(soup.span))\n", "\n", "depth, span_id = span_tag_depth_id(soup.span)\n", "print(\"Maximum depth:\", depth, \"Leaf id:\", span_id)" ] },
{ "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\"dec0ded\"\n" ] } ], "source": [ "# Here's a bonus solution from Melanie Wertz!\n", "# It uses the indentation properties of prettify()\n", "# and checks for the line with the most whitespace before a tag.\n", "\n", "from bs4 import BeautifulSoup\n", "# The copy of capture.html saved above was written with prettify(),\n", "# so its indentation already reflects nesting depth.\n", "with open(\"capture.html\") as fobj:\n", "    max_depth = 0\n", "    max_id = 0\n", "    for line in fobj:\n", "        if \"id\" in line:\n", "            opening_whitespace = line.index(\"<\")\n", "            opening_slice = line[0:opening_whitespace]\n", "            if opening_slice.count(\" \") > max_depth:\n", "                max_depth = opening_slice.count(\" \")\n", "                id_start = line.index(\"=\")\n", "                id_end = line.index(\">\")\n", "                max_id = line[id_start+1:id_end]\n", "    print(max_id)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Course evaluation reminder\n", "\n", "If you finish the exercises above, this would be a good time to complete your MCS 275 course evaluation. This anonymous survey helps UIC improve its courses and teaching. Every student enrolled in the course received a link to complete such a survey by email to their uic.edu account.\n", "\n", "Evaluations for 15-week Spring 2022 courses are due by 11:55pm on Sunday, May 1." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.2" } }, "nbformat": 4, "nbformat_minor": 4 }
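As a final aside to Problem 3: the same depth question can also be answered with the standard library alone, using an event-driven `html.parser.HTMLParser` instead of Beautiful Soup's tree traversal. This is a minimal sketch; the inline document and the class name `SpanDepthCounter` are made up for illustration (the worksheet's real input is `capture.html`):

```python
from html.parser import HTMLParser

class SpanDepthCounter(HTMLParser):
    """Track the deepest <span> seen while parsing, and its id attribute."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # current <span> nesting depth
        self.best_depth = 0     # deepest nesting seen so far
        self.best_id = None     # id attribute of the tag at that depth

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.depth += 1
            if self.depth > self.best_depth:
                self.best_depth = self.depth
                self.best_id = dict(attrs).get("id")

    def handle_endtag(self, tag):
        if tag == "span":
            self.depth -= 1

# Tiny stand-in document (illustrative; capture.html is the real input)
doc = '<span id="a"><span id="b"><span id="c"></span></span><span id="d"></span></span>'

p = SpanDepthCounter()
p.feed(doc)
print(p.best_depth, p.best_id)  # prints: 3 c
```

Feeding the text of the saved `capture.html` through the same counter should reproduce the depth and leaf `id` found by the recursive Beautiful Soup search above, since the start-tag and end-tag events are the same whether or not the file was prettified.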