Lecture 7

strings and integers

MCS 260 Fall 2021
David Dumas

Reminders

  • Worksheet 3 available
  • Project 1 description posted

Bytes

We've discussed the bit (b) or binary digit (0 or 1).

A byte (B) is a sequence of 8 bits, equivalently, an 8-digit binary number or a 2-digit hex number. It can represent an integer between 0=$\texttt{0x00}$ and 255=$\texttt{0xff}$.

Computers store information as sequences of bytes.

Unicode

Basic problem: How to turn written language into a sequence of bytes?

Unicode (1991) splits this into two steps:

  • Make a central directory of characters of most written languages; these are code points
  • Specify ways to encode code points into sequences of bytes (not discussed today)

Every code point has a number (an integer between 0 and 0x10ffff=1,114,111).

Code point numbers are always written $\texttt{U+}$ followed by hexadecimal digits.

$\texttt{U+41}$A
$\texttt{U+109}$ĉ
$\texttt{U+1f612}$😒

The first 128 code points, U+0 to U+7F, include all "en-us" keyboard keys, and follow the ASCII code (1969).

strings

In Python 3, a str is a sequence of code points.

Several syntaxes are supported for literals:


        'Hello world'  # single quotes
        "Hello world"  # double quotes
        
        # multi-line string with triple single quote
        '''This is a string
        that contains line breaks'''
        
        # multi-line string with triple double quote
        """François: How is MCS 260?
        Binali: It's going ok.  Too many slides.
        François: ¯\_(ツ)_/¯"""
    

Escape sequences

The $\texttt{\\}$ character has special meaning; it begins an escape sequence, such as:
  • $\texttt{\\n}$ - the newline character
  • $\texttt{\\'}$ - a single quote
  • $\texttt{\\"}$ - a double quote
  • $\texttt{\\\\}$ - a backslash
  • $\texttt{\\u0107}$ - Code point $\texttt{U+107}$
  • $\texttt{\\U0001f612}$ - Code point $\texttt{U+1f612}$

(There is a full list of escape sequences.)

Note $\texttt{\\}$ appears a lot in Windows paths!


>>> print("I \"like\":\n\u0050\u0079\u0074\u0068\u006f\u006e")
I "like":
Python
>>> 

Operations on strings

Most arithmetic operations forbid strings. Exceptions:

  • $\texttt{+}$ joins strings, e.g. "cat"+"erpillar"
  • $\texttt{*}$ joins a specified number of copies, e.g. "doo"*6

>>> "Hello" + " " + "world!"
'Hello world!'
>>> "Hello" - "llo"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for -: 'str' and 'str'
>>> "Ha" * 4
'HaHaHaHa'
>>> prefix = "Dr. "
>>> fullname = "Ramanujan"
>>> prefix+fullname
'Dr. Ramanujan'

sequence stuff

Reminder: Like lists, strings are sequences.

You can use indexing to get individual characters, slices to get substrings, and len(...) to get the length.

str

Python's str() function converts any other value to a string, e.g.


        >>> str(5678)
        '5678'
        >>> str(5678)[1]
        '6'
        >>> int(str(5678)[1])
        6
    

str() is rarely needed, but it does give a way to access decimal digits of an integer individually.

int

When converting from a string, $\texttt{int()}$ defaults to base $10$. But it supports other bases as well. The base is given as the second argument of the function.

>>> int("1001",2)
9
>>> int("3e",16)
62
    
Integer literal prefixes you'd use in code ($\texttt{0b}$, $\texttt{0x}$, etc.) must not be present here. The $\texttt{int()}$ function works with just digits when you specify the base.
However, if a base of $0$ is specified, then this signals that the string should be read as a Python literal, i.e. the base is determined by its prefix.

>>> int("0b1001",0)
9
>>> int("0x3e",0)
62
>>> int("77",0)
77
    

Bitwise operators

There are certain operators that only work on ints, and which are based on the bits in the binary expression:

$\texttt{<<}$ $\texttt{>>}$ $\texttt{&}$ $\texttt{|}$ $\texttt{^}$
left shift right shift bitwise AND bitwise OR bitwise XOR

$\texttt{a << b}$ moves the bits of $\texttt{a}$ left by $\texttt{b}$ positions.

$\texttt{a >> b}$ moves the bits of $\texttt{a}$ right by $\texttt{b}$ positions.
(This detroys the lowest $\texttt{b}$ bits of $\texttt{a}$.)


>>> 9 << 3  # 9 = 0b1001 becomes 0b1001000 = 72
72
>>> 7 << 1  # 7 = 0b111 becomes 0b1110 = 14
14
>>> 9 >> 2  # 9 = 0b1001 becomes 0b10
2
Notice $\texttt{a << b}$ is equivalent to $\texttt{a * 2**b}$.
Bitwise AND compares corresponding bits, and the output bit is $1$ if both input bits are $1$:

>>> 9 & 5  # 9 = 0b1001,  5 = 0b0101
1

1 0 0 1
0 1 0 1
AND: 0 0 0 1
Bitwise OR is similar, but the output bit is $1$ if at least one of the input bits is $1$.

    >>> 9 | 5  # 9 = 0b1001,  5 = 0b0101
    13
    

1 0 0 1
0 1 0 1
OR: 1 1 0 1
Bitwise XOR makes the output bit $1$ if exactly one of the input bits is $1$.

    >>> 9 ^ 5  # 9 = 0b1001,  5 = 0b0101
    12
    

1 0 0 1
0 1 0 1
XOR: 1 1 0 0

Logic gates

Circuits that perform logic operations on bits, logic gates, are fundamental building blocks of computers.

Thus the Python operators $\texttt{<<}$,$\texttt{>>}$,$\texttt{&}$,$\texttt{|}$,$\texttt{^}$ are especially low-level operations.

74LS08PC photo by Trio3D CC-BY-SA 3.0

This chip (or integrated circuit / IC) contains four AND gates built from about $50$ transistors. The processor in an iPhone 11 has about $8,\!500,\!000,\!000$ transistors.

References

Revision history

  • 2021-09-08 Initial publication
  • 2021-09-09 Fix typo