Lecture 4

strings and integers

MCS 260 Fall 2020
David Dumas

Reminders

  • Quiz 1 due today at 6pm Central
    • Excuse requests must be sent to TA before deadline
  • Python 3 and editor working?
    • If not, tell me immediately
  • Worksheet 2 available, Quiz 2 will be posted soon

Storage units

We've discussed the bit (b), a binary digit (0 or 1).

A byte (B) is a sequence of 8 bits, equivalently, an 8-digit binary number or a 2-digit hex number. It can represent an integer between 0=$\texttt{0x00}$ and 255=$\texttt{0xff}$.
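As a quick check of the ranges above, the following sketch prints a few byte values as 8 binary digits and 2 hex digits. It uses the built-in `format()`; the variable names are chosen only for illustration.

```python
# Every byte value fits in exactly 8 binary digits or 2 hex digits.
for n in (0, 9, 62, 255):
    bits = format(n, "08b")   # zero-padded to 8 binary digits
    hexd = format(n, "02x")   # zero-padded to 2 hex digits
    print(n, bits, hexd)

smallest = 0b00000000   # all 8 bits are 0
largest = 0b11111111    # all 8 bits are 1
print(smallest, largest)
```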

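A word is a longer sequence of bits of a length fixed by the hardware or operating system. The size depends on context: in x86 and Windows terminology a word means 16 bits = 2 bytes, while the natural word size of most current processors is 32 or 64 bits.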

Computers store information as sequences of bytes.

Counting bytes to measure the size of data often leads to large numbers.

Coarser units based on SI prefixes:

  • kilobyte (KB) = 1,000 bytes
  • megabyte (MB) = 1,000,000 bytes
  • gigabyte (GB) = 1,000,000,000 bytes

Based on powers of 2 (IEC system), useful in CS:
  • kibibyte (KiB) = $2^{10}$ bytes = 1024 bytes
  • mebibyte (MiB) = 1024 KiB = 1,048,576 bytes
  • gibibyte (GiB) = 1024 MiB = 1,073,741,824 bytes
Unfortunate current reality:
  • Occasionally, SI abbreviations are used for IEC units; in Windows, "GB" means GiB.
  • Very often, IEC units are read aloud using SI names; e.g. write 16GiB and read aloud as "16 gigabytes"
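To see how far apart the two systems are, here is a small sketch that defines both families of units as Python ints and measures the same quantity both ways. The variable names are illustrative, not standard.

```python
# SI units are powers of 10; IEC units are powers of 2.
KB, MB, GB = 10**3, 10**6, 10**9
KiB, MiB, GiB = 2**10, 2**20, 2**30

size_in_bytes = 16 * GiB          # e.g. 16 GiB of memory
print(size_in_bytes)              # number of bytes
print(size_in_bytes / GB)         # the same size measured in SI gigabytes
```

The gap between a GiB and a GB is about 7%, so the choice of unit matters for large sizes.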

Unicode

Basic problem: How to turn written language into a sequence of bytes?

Unicode (1991) splits this into two steps:

  • Enumerate characters[1] of most[2] written languages; these are code points
  • Specify a way of encoding each code point as a sequence of bytes (not discussed today)

[1] There are also code points for many non-character entities, such as an indicator of whether the language is read left-to-right or right-to-left.
[2] Coverage is not perfect and the standard is regularly revised, adding new code points. Unicode 13.0 was released in March 2020.

Every code point has a number (a nonnegative integer between 0 and 0x10ffff=1,114,111).

Code point numbers are always written $\texttt{U+}$ followed by hexadecimal digits.

  • $\texttt{U+41}$ - A
  • $\texttt{U+109}$ - ĉ
  • $\texttt{U+1f612}$ - 😒

The first 128 code points, U+0 to U+7F, include all the printable characters on an "en-us" keyboard, numbered exactly as in the older ASCII code (1969).
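Python can report code point numbers directly. The sketch below uses the built-ins `ord()` (character to code point number) and `chr()` (number to character), which are not otherwise covered in this lecture, to reproduce the three examples above.

```python
# ord() gives the code point number of a one-character string;
# chr() converts a number back to the character.
for ch in ("A", "ĉ", "😒"):
    print(ch, "U+" + format(ord(ch), "x"))

code_points = [format(ord(ch), "x") for ch in ("A", "ĉ", "😒")]
letter = chr(0x41)   # the character with code point U+41
print(code_points, letter)
```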

strings

In Python 3, a str is a sequence of code points.

A string literal is a way of writing a str in code.

Several syntaxes are supported:


        'Hello world'  # single quotes
        "Hello world"  # double quotes
        
        # multi-line string with triple single quote
        '''This is a string
        that contains line breaks'''
        
        # multi-line string with triple double quote
        """François: How is MCS 260?
        Binali: It's going ok, I guess.
        François: [shrugs]"""
    

Escape sequences

The $\texttt{\\}$ character has special meaning; it begins an escape sequence, such as:
  • $\texttt{\\n}$ - the newline character
  • $\texttt{\\'}$ - a single quote
  • $\texttt{\\"}$ - a double quote
  • $\texttt{\\\\}$ - a backslash
  • $\texttt{\\u0107}$ - Code point $\texttt{U+107}$
  • $\texttt{\\U0001f612}$ - Code point $\texttt{U+1f612}$

(There is a full list of escape sequences.)
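One point worth checking: an escape sequence takes several keystrokes to type but stands for a single character in the resulting str. A minimal sketch, with variable names chosen for illustration:

```python
newline = "\n"
backslash = "\\"
quote = "\""
# Each of these strings contains exactly one character.
print(len(newline), len(backslash), len(quote))

# The \u and \U escapes produce the same str as typing the character itself.
print("\u0107" == "ć")
print("\U0001f612" == "😒")
```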


>>> print("I \"like\":\n\u0050\u0079\u0074\u0068\u006f\u006e")
I "like":
Python
>>> 

Operations on strings

Most arithmetic operations forbid str operands.

$\texttt{+}$ is allowed between two strings. It concatenates the strings (meaning joins them).

$\texttt{*}$ is allowed with a string and an int. It concatenates $n$ copies of the string, where $n$ is the int argument.


>>> "Hello" + " " + "world!"
'Hello world!'
>>> "Hello" - "llo"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for -: 'str' and 'str'
>>> "Ha" * 4
'HaHaHaHa'
>>> prefix = "Dr. "
>>> fullname = "Ramanujan"
>>> prefix+fullname
'Dr. Ramanujan'

len and indexing

The built-in $\texttt{len()}$ can be applied to a string to find the length of the string (a nonnegative int):

>>> len("MCS 260")
7
A single character from a string $\texttt{s}$ can be extracted using $\texttt{s[i]}$ where $\texttt{i}$ is the $0$-based index. So $0$=first character, $1$=second, etc.

>>> s = "lorem ipsum"
>>> s[2]
'r'
We'll say much more about indexing next time.
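Because indices start at 0, the last valid index is one less than the length. The following sketch illustrates this with the same example string; it uses `try`/`except` only to show the error without stopping the program.

```python
s = "lorem ipsum"
first = s[0]
last = s[len(s) - 1]   # valid indices run from 0 to len(s) - 1
print(first, last)

try:
    s[len(s)]          # one past the end is not a valid index
except IndexError as err:
    print("IndexError:", err)
```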

int

When converting from a string, $\texttt{int()}$ defaults to base $10$. But it supports other bases as well. The base is given as the second argument of the function.

>>> int("1001",2)
9
>>> int("3e",16)
62
    
Notice that the strings above contain just digits, with no integer literal prefix like $\texttt{0b}$ or $\texttt{0x}$. A prefix is accepted only if it matches the given base: $\texttt{int("0x3e",16)}$ works, but $\texttt{int("0x3e")}$ with the default base $10$ is an error.
If a base of $0$ is specified, then this signals that the string should be read as a Python literal, i.e. the base is determined by its prefix.

>>> int("0b1001",0)
9
>>> int("0x3e",0)
62
>>> int("77",0)
77
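To make the role of the base concrete, here is a rough sketch of the positional arithmetic behind such a conversion. The function `parse_digits` and the constant `DIGITS` are hypothetical names introduced for illustration; this is not how `int()` is actually implemented, and it handles only unsigned strings without a prefix.

```python
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"

def parse_digits(text, base):
    """Hypothetical helper: convert a string of digits (no prefix, no sign) to an int."""
    value = 0
    for ch in text.lower():
        d = DIGITS.index(ch)       # numeric value of this digit (ValueError if unknown)
        if d >= base:
            raise ValueError("digit " + repr(ch) + " is not valid in base " + str(base))
        value = value * base + d   # move earlier digits one place left, then add this one
    return value

print(parse_digits("1001", 2))
print(parse_digits("3e", 16))
```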
    

Bitwise operators

There are certain operators that only work on ints, and which are based on the bits in the binary expression:

  • $\texttt{<<}$ - left shift
  • $\texttt{>>}$ - right shift
  • $\texttt{&}$ - bitwise AND
  • $\texttt{|}$ - bitwise OR
  • $\texttt{^}$ - bitwise XOR

$\texttt{a << b}$ moves the bits of $\texttt{a}$ left by $\texttt{b}$ positions.

$\texttt{a >> b}$ moves the bits of $\texttt{a}$ right by $\texttt{b}$ positions.
(This destroys the lowest $\texttt{b}$ bits of $\texttt{a}$.)


>>> 9 << 3  # 9 = 0b1001 becomes 0b1001000 = 72
72
>>> 7 << 1  # 7 = 0b111 becomes 0b1110 = 14
14
>>> 9 >> 2  # 9 = 0b1001 becomes 0b10
2
Notice $\texttt{a << b}$ is equivalent to $\texttt{a * 2**b}$.
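The equivalence can be checked directly. The sketch below also checks the matching statement for right shifts, namely that for Python ints `a >> b` agrees with floor division `a // 2**b`; that second fact is an addition here, not something stated above.

```python
a, b = 9, 3
left = a << b            # shift left by 3 positions
product = a * 2**b       # multiply by 2**3
print(left, product)

right = 9 >> 2           # shift right by 2 positions, dropping the lowest 2 bits
quotient = 9 // 2**2     # floor division by 2**2
print(right, quotient)
```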
Bitwise AND compares corresponding bits, and the output bit is $1$ if both input bits are $1$:

>>> 9 & 5  # 9 = 0b1001,  5 = 0b0101
1

     1 0 0 1
     0 1 0 1
AND: 0 0 0 1

Bitwise OR is similar, but the output bit is $1$ if at least one of the input bits is $1$.

    >>> 9 | 5  # 9 = 0b1001,  5 = 0b0101
    13
    

    1 0 0 1
    0 1 0 1
OR: 1 1 0 1

Bitwise XOR makes the output bit $1$ if exactly one of the input bits is $1$.

    >>> 9 ^ 5  # 9 = 0b1001,  5 = 0b0101
    12
    

     1 0 0 1
     0 1 0 1
XOR: 1 1 0 0
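The three bit tables can be reproduced in Python by printing each result as a 4-digit binary string. The sketch also shows one common way to read a single bit, by combining a right shift with AND; the helper `bit` is a hypothetical name used only for illustration.

```python
a, b = 9, 5   # 0b1001 and 0b0101
for name, result in (("AND", a & b), ("OR", a | b), ("XOR", a ^ b)):
    print(name, format(result, "04b"), result)

def bit(n, k):
    """Hypothetical helper: bit k of n, where k = 0 is the lowest bit."""
    return (n >> k) & 1

bits_of_9 = [bit(9, k) for k in (3, 2, 1, 0)]   # highest bit first
print(bits_of_9)
```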

Logic gates

Circuits that perform logic operations on bits, logic gates, are fundamental building blocks of computers.

Thus the Python operators $\texttt{<<}$,$\texttt{>>}$,$\texttt{&}$,$\texttt{|}$,$\texttt{^}$ are especially low-level operations.

74LS08PC photo by Trio3D CC-BY-SA 3.0

This chip (or integrated circuit / IC) contains four AND gates built from about $50$ transistors. The processor in an iPhone 11 has about $8,\!500,\!000,\!000$ transistors.

Revision history

  • 2020-08-31 Typos fixed, explanation of bitwise operators slightly expanded.
  • 2020-08-30 Initial publication