Lecture 29

Regular expressions 2;

Encodings and binary files

MCS 260 Fall 2020
David Dumas

Reminders

  • I hope you have worked on Project 3
  • Quiz 10 due Monday (Nov 2)
  • Nov 3: No discussions
  • Nov 5: Discussion converted to TA office hours

Regex quick reference

    • . — matches any character except newline
    • \s — matches any whitespace character
    • \d — matches a decimal digit
    • + — previous item must repeat 1 or more times
    • * — previous item must repeat 0 or more times
    • ? — previous item must repeat 0 or 1 times
    • {n} — previous item must appear n times
    • (...) — treat part of a pattern as a unit and capture its match into a group
    • [...] — match any one of a set of characters
    • A|B — match either pattern A or pattern B.
    • ^ — match the beginning of the string.
    • $ — match the end of the string or the end of the line.
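
A few of the items above can be tried out in a short session. This is only a sketch; the sample strings are made up for illustration.

```python
import re

# Patterns are written as raw strings so backslashes reach the re module unchanged.

# \d+ : one or more decimal digits
digits = re.search(r"\d+", "Room 260 is open")
print(digits.group())  # 260

# \d{3} between ^ and $ : the whole string must be exactly three digits
exactly_three = re.search(r"^\d{3}$", "260")
too_long = re.search(r"^\d{3}$", "2600")
print(exactly_three is not None, too_long is None)  # True True

# colou?r : the letter u may appear 0 or 1 times
spellings = re.findall(r"colou?r", "color and colour")
print(spellings)  # ['color', 'colour']
```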

    re module quick reference

    • re.search(pattern,string) — does string contain a match to the pattern? Return a match object or None.
    • re.finditer(pattern,string) — Return an iterable containing all non-overlapping matches as match objects.
    • re.findall(pattern,string) — return a list of all non-overlapping matches as strings.
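
To compare the three functions, here is a small sketch that runs the same pattern through each of them. The sample text is invented for the example.

```python
import re

text = "3 apples, 12 pears, 7 plums"
pattern = r"\d+"

# re.search returns a match object for the first match (or None if there is none)
m = re.search(pattern, text)
first = m.group()
print(first)  # 3

# re.finditer yields one match object per match; each knows its position
spans = [match.span() for match in re.finditer(pattern, text)]
print(spans)  # [(0, 1), (10, 12), (20, 21)]

# re.findall returns just the matched text as strings
all_matches = re.findall(pattern, text)
print(all_matches)  # ['3', '12', '7']
```

One thing to keep in mind: if the pattern contains parentheses, re.findall returns the captured groups rather than the full matches.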

    Example problem

    Find all of the phone numbers in a string that are written in the format 319-555-1012, and split each one into area code (e.g. 319), exchange (e.g. 555), and line number (e.g. 1012).
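
One possible solution is sketched below. The names PHONE and find_phone_numbers are chosen here for illustration, and the second phone number in the sample string is made up.

```python
import re

# Three capturing groups: area code, exchange, line number
PHONE = r"(\d{3})-(\d{3})-(\d{4})"

def find_phone_numbers(text):
    """Return a list of (area code, exchange, line number) tuples."""
    return [m.groups() for m in re.finditer(PHONE, text)]

sample = "Call 319-555-1012 or 312-555-0199 after 5pm."
numbers = find_phone_numbers(sample)
for area, exchange, line in numbers:
    print("area code:", area, "exchange:", exchange, "line:", line)
```

This simple pattern will also match inside a longer run of digits (for example, part of 1319-555-10123), so a more careful version might check what comes before and after the match.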

    Square brackets

    Give a list of characters inside square brackets to match any one of them.

    [abc] matches any of the characters a,b,c.

    [^abc] matches any character except a,b,c.

    [A-Za-z] matches any unaccented letter of the English alphabet, upper or lower case.

    [0-9a-fA-F] matches any hex digit.
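
The four character sets above can be checked with re.findall. The test strings here are just illustrations.

```python
import re

# Any one of a, b, c
abc = re.findall(r"[abc]", "a big cab")
print(abc)  # ['a', 'b', 'c', 'a', 'b']

# Runs of characters that are NOT a, b, or c
not_abc = re.findall(r"[^abc]+", "abcxyzcba")
print(not_abc)  # ['xyz']

# Runs of letters A-Z or a-z; the accented letter (written as \u00fc) is not in this range
letters = re.findall(r"[A-Za-z]+", "Fr\u00fchst\u00fcck")
print(letters)  # ['Fr', 'hst', 'ck']

# Runs of hex digits; U and + are not hex digits
hex_runs = re.findall(r"[0-9a-fA-F]+", "U+1F60A")
print(hex_runs)  # ['1F60A']
```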

    Or

    A|B matches either pattern A or pattern B.

    Use this inside parentheses to limit how much of the pattern is considered to be part of A or B, e.g.

    [Hh](ello|i),? my name is (.*).
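
Here is a sketch of that pattern in use, taking the final . to be part of the pattern (so it matches any one character, such as closing punctuation). The greetings and the cat/dog strings are made-up test data.

```python
import re

pattern = r"[Hh](ello|i),? my name is (.*)."

results = []
for greeting in ["Hello, my name is David.", "hi my name is Ada!"]:
    m = re.search(pattern, greeting)
    # group(1) is whichever alternative matched; group(2) is the name
    results.append((m.group(1), m.group(2)))
print(results)  # [('ello', 'David'), ('i', 'Ada')]

# Without parentheses, | splits the entire pattern in two
without_parens = re.search(r"cat|dog food", "cat food").group()
with_parens = re.search(r"(cat|dog) food", "cat food").group()
print(without_parens)  # cat
print(with_parens)     # cat food
```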

    Finding functions

    Let's make a program to find function definitions in a Python source file and print the function names.
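
A minimal sketch of such a program follows. The pattern DEF_PATTERN and the function function_names are names chosen here, not necessarily what was written in lecture, and the sample lines stand in for the contents of a source file.

```python
import re

# Optional indentation, the word def, whitespace, then a name (captured),
# then optional whitespace and an opening parenthesis.
DEF_PATTERN = r"^\s*def\s+([A-Za-z_][A-Za-z0-9_]*)\s*\("

def function_names(source_lines):
    """Return the names of functions defined in an iterable of source lines."""
    names = []
    for line in source_lines:
        m = re.search(DEF_PATTERN, line)
        if m:
            names.append(m.group(1))
    return names

example = [
    "import re",
    "def greet(name):",
    "    print('hi', name)",
    "class A:",
    "    def method_2(self):",
    "        pass",
    "x = 'undefined'",
]
found = function_names(example)
print(found)  # ['greet', 'method_2']
```

Since a file object opened in text mode is an iterable of lines, the same function can be applied to a real file with something like function_names(open(filename)). This sketch only looks at one line at a time, so it can be fooled by text inside multi-line strings, and it ignores names containing non-ASCII letters.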

    Encoding preview

    What is the size of a file if we open and write one of these words to it?

    • Hello (5 characters)
    • Frühstück (9 characters)
    • 😊 (1 character, U+1F60A)

    Note: The last item in the list above has an emoji which doesn't render correctly in the PDF slides.

    Encoding

    As the OS sees it, a file is a sequence of bytes. To write text, we need to decide how to represent code points as bytes.

    A scheme to do this is an encoding. Encodings can also specify which code points are allowed.

    When opening text files, the default encoding in Python is usually UTF-8, though officially this is platform-dependent. Passing encoding="utf-8" to open() removes the ambiguity.

    In UTF-8, the first 128 code points are stored as a single byte. Others become two, three, or four bytes.
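
The byte counts can be checked by encoding a few strings and comparing lengths. This sketch uses the three words from the encoding preview, writing the non-ASCII characters as escapes so the result does not depend on how this file itself is stored. It assumes each u-umlaut is the single code point U+00FC.

```python
words = ["Hello", "Fr\u00fchst\u00fcck", "\U0001F60A"]

sizes = []
for w in words:
    data = w.encode("utf-8")
    # len(w) counts code points; len(data) counts bytes
    sizes.append((len(w), len(data)))
print(sizes)  # [(5, 5), (9, 11), (1, 4)]
```

A file containing just one of these words, written with the UTF-8 encoding, would have the size given by the second number in each pair.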

    Binary files

    Opening a file with "b" in its mode string will make it a binary file. E.g. "rb" reads a binary file, "wb" writes to one.

    Reading from a binary file gives a bytes object, a sequence of ints in the range 0 to 255.

    We can decode bytes into a string with the method .decode(), and can encode a string as bytes with .encode(). Each takes an optional encoding parameter, which defaults to "utf-8".
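
As a sketch of how these pieces fit together, the following writes a word to a text file, reads it back in binary mode, and converts between bytes and str. The file name word.txt is arbitrary, and a temporary directory is used so nothing is left behind.

```python
import os
import tempfile

text = "Fr\u00fchst\u00fcck"

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "word.txt")

    # Write as text, choosing the encoding explicitly
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Read the same file back as raw bytes
    with open(path, "rb") as f:
        raw = f.read()

print(type(raw).__name__)  # bytes
print(len(raw))            # 11
print(raw[2], raw[3])      # 195 188 (the two bytes of the first u-umlaut)

# decode and encode convert between bytes and str
print(raw.decode("utf-8") == text)  # True
print(text.encode("utf-8") == raw)  # True
```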

    References

    • In Downey:
      • Regular expressions, character encoding, and binary files are not discussed.
    • The official Python tutorial has a section about reading and writing files which discusses binary files and encoding.
    • Pythex is a free online regular expression editor and tester that can be very helpful for debugging patterns.
    • Google's free online Python course has a unit on regular expressions.
      • This course was developed for Python 2, so its print statements lack parentheses. Otherwise, the code should work.
    • The documentation of the re module is good as a reference, but may not be ideal to learn from.

    Revision history

    • 2020-10-29 Initial publication