{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# MCS 275 Spring 2022 Worksheet 15\n", "\n", "* Course instructor: David Dumas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Topics\n", "\n", "This worksheet focuses on **urllib** and **Beautiful Soup**.\n", "\n", "## Resources\n", "\n", "These things might be helpful while working on the problems. Remember that for worksheets, we don't strictly limit what resources you can consult, so these are only suggestions.\n", "\n", "\n", "* [Lecture 39 - HTML and CSS](https://dumas.io/teaching/2022/spring/mcs275/slides/lecture39.html)\n", "* [Lecture 40 - Parsing and scraping HTML](https://dumas.io/teaching/2022/spring/mcs275/slides/lecture40.html)\n", "* [Lecture 41 - Beautiful soup](https://dumas.io/teaching/2022/spring/mcs275/slides/lecture41.html)\n", "* [urllib examples notebook](https://dumas.io/teaching/2022/spring/mcs275/nbview/samplecode/scraping/scraping-demos.html)\n", "* [Beautiful Soup example scraper notebook](https://dumas.io/teaching/2022/spring/mcs275/nbview/samplecode/scraping/mathgradscrape.html)\n", "* [Beautiful Soup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)\n", "* [w3schools HTML tutorial](https://www.w3schools.com/html/)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. HTML prettifier and warning utility\n", "\n", "Use Beautiful Soup to write a script that takes an HTML file and writes equivalent HTML that is more nicely indented to an output file. 
(Recall that Beautiful Soup has a [method to generate nicely indented HTML](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#pretty-printing) for any tag or BeautifulSoup object.)\n", "\n", "Also, if there is no `<title>` tag in the input HTML file, the script should print a warning.\n", "\n", "The input HTML filename should be expected as the first command line argument, and the filename to which the \"prettified\" HTML is written as the second command line argument.\n", "\n", "For example, if `in.html` contains\n", "```\n", "<!doctype html><html><head></head><body>\n", "<h1>MCS 275 HTML file</h1></body></html>\n", "```\n", "\n", "then running\n", "```\n", "python3 prettify.py in.html out.html\n", "```\n", "should print the message\n", "```\n", "Warning: This HTML file has no <title>.\n", "```\n", "and should write the following to `out.html`:\n", "```\n", "<!DOCTYPE html>\n", "<html>\n", " <head>\n", " </head>\n", " <body>\n", "  <h1>\n", "   MCS 275 HTML file\n", "  </h1>\n", " </body>\n", "</html>\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Complex analysis homework scraper\n", "\n", "Consider this web page for a graduate complex analysis class that was taught at UIC in 2016:\n", "* [Math 535 Spring 2016](https://www.dumas.io/teaching/2016/spring/math535/)\n", "\n", "One section of the page lists weekly homework. Each homework assignment has a number, a title, and a list of problems from various sections of the textbook. 
Write a scraper that downloads this course web site's HTML, parses it with Beautiful Soup, and creates one dictionary for each homework assignment in the following format:\n", "```\n", "{\n", " \"number\": 10,\n", " \"title\": \"Harmonic functions\",\n", " \"problems\": \"Sec 4.6.2(p166): 1,2\\nSec 4.6.4(p171): 1,2,3,4\"\n", "}\n", "```\n", "It should then put these dictionaries into a list and save the list to a JSON file called `math535spring2016homework.json`.\n", "\n", "**Note:** If you finish this problem early, you might find it fun to watch this [animation of the UIC logo distortion](https://www.dumas.io/teaching/2016/spring/math535/images/inverted-logo-animation.gif) that appears on the Math 535 course web page, and see if you can figure out what's going on." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Capture the tag\n", "\n", "Here is a link to an HTML file:\n", "\n", "* [capture.html](https://www.dumas.io/teaching/2022/spring/mcs275/data/capture.html)\n", "\n", "If you open it in a browser, you won't see anything. The document contains nothing but `<span>` tags, and no text. Some of the `<span>` tags are nested inside other `<span>` tags. How deeply are they nested?\n", "\n", "Every `<span>` tag in this file has an `id` attribute. There is exactly one `<span>` that has greater depth in the DOM tree than any other. What is its `id` attribute?\n", "\n", "Write a Python script to load the HTML file with Beautiful Soup and traverse the DOM to answer these questions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Course evaluation reminder\n", "\n", "If you finish the exercises above, this would be a good time to complete your MCS 275 course evaluation. This anonymous survey helps UIC improve its courses and teaching. 
Every student enrolled in the course received a link to complete such a survey by email to their uic.edu account.\n", "\n", "Evaluations for 15-week Spring 2022 courses are due by 11:55pm on Sunday, May 1." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 4 }