MAT 370: Binary and Floating Point Arithmetic

Dr. Gilbert

December 24, 2025

A Reminder From Last Time

We concluded our Day 1 discussion by seeing that Python does not recognize that 0.1 + 0.2 is equal to 0.3.

We certainly recognize the equality, but why doesn’t Python?

Let’s start with why it matters…

▶ Video from the Rocket Surgery YouTube Channel.

Binary Representations of Integers

  • Humans evolved to use a base-10 number system simply because we have 10 fingers and 10 toes to count on.

  • Computers don’t have fingers or toes and can only “think” in terms of electrical pulses.

    • For this reason, they work in binary – you can think of this as \(0\) indicating a low voltage value and \(1\) indicating a high voltage value.
  • Since computers work in binary, they can’t natively understand our base-10 number system.

  • All information stored in a computer must be stored in binary.

    • Luckily, all numbers have a corresponding binary representation.

Below we show how repeated division by \(2\) (the division algorithm) converts an integer (\(84\)) into its binary representation.

\[\begin{align*} 84 &= 2\left(42\right) + 0\\ 42 &= 2\left(21\right) + 0\\ 21 &= 2\left(10\right) + 1\\ 10 &= 2\left(5\right) + 0\\ 5 &= 2\left(2\right) + 1\\ 2 &= 2\left(1\right) + 0\\ 1 &= 2\left(0\right) + 1 \end{align*}\]

Reading the remainders from bottom to top, we find that \(\left(84\right)_{10} = \left(1010100\right)_{2}\).

Each binary digit is referred to as a bit, and it is common to consider collections of \(8\) bits together – known as a byte.

That is, we would more commonly write \(\left(01010100\right)_{2}\) as the binary representation of \(84\).

Example: Use repeated division by \(2\) to find the binary representation of \(\left(118\right)_{10}\).
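The repeated-division procedure above translates directly into a short Python function (`to_binary` is a hypothetical helper name, not a standard function); Python's built-in `bin` can be used to check the result.

```python
def to_binary(n):
    """Convert a non-negative integer to a binary string by repeated division by 2."""
    if n == 0:
        return "0"
    bits = []
    while n > 0:
        n, remainder = divmod(n, 2)  # quotient carries forward; remainder is the next bit
        bits.append(str(remainder))
    return "".join(reversed(bits))   # remainders are read from bottom to top

print(to_binary(84))   # 1010100
print(bin(84))         # 0b1010100 -- Python's built-in agrees
```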

Binary Representations of Real Numbers

There is more to the story with binary representations of integers.

  • For example, there are several schemes for representing positive integers, negative integers, and \(0\).

    • Three popular schemes are sign-and-magnitude, ones' complement, and two's complement.

We’ll be more interested in floating point numbers than in integers in our course, so we’ll leave our discussion on integer representations here.

It is possible to represent non-integer values using binary.

For example, \(\frac{1}{8} = 0.125\) can be represented as \(\left(0.001\right)_{2}\) with the justification shown below.

\[\begin{align*} 2\left(0.125\right) &= 0.25 + 0\\ 2\left(0.25\right) &= 0.5 + 0\\ 2\left(0.5\right) &= 0 + 1 \end{align*}\]

Collecting the integer components from top to bottom, we see that \(0.125 = \left(0.001\right)_{2}\).

Let’s try one more.

Example: Convert \(53.7\) to binary.

Important Takeaway: Notice that the binary expansion of the fractional part (0.7) is repeating and non-terminating.
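The repeated-doubling procedure can be sketched in Python (`frac_to_binary` is a hypothetical helper name); running it on \(0.7\) makes the repeating, non-terminating expansion visible.

```python
def frac_to_binary(x, max_bits=20):
    """Expand the fractional part of x (0 <= x < 1) in binary by repeated doubling."""
    bits = []
    while x > 0 and len(bits) < max_bits:
        x *= 2
        bit = int(x)       # integer part of the doubled value is the next binary digit
        bits.append(str(bit))
        x -= bit
    return "0." + "".join(bits)

print(frac_to_binary(0.125))  # 0.001 -- terminates after three bits
print(frac_to_binary(0.7))    # the expansion keeps going; it never terminates
```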

Floating Point Representations

Computers don't store real numbers as pure fixed-point binary.

Such a scheme would be too limited: with a fixed number of bits split around the radix point, the range of magnitudes we could represent would be small.

The standard for representing floating point numbers is IEEE 754.

  • Floating point numbers can be represented with single precision (32 bits) or double precision (64 bits).
  • A floating point representation consists of a mantissa (significant digits) and an exponent (magnitude).

Single Precision Floating Points

Single precision uses 32 bits to store each number. Those bits include…

  • One sign bit (0 is positive, 1 is negative)

  • An 8-bit exponent

    • A bias of \(127\) is subtracted from the stored exponent, allowing for exponents ranging from \(-126\) to \(127\).
    • Exponent fields of all \(0\)s and all \(1\)s are reserved for special cases such as \(0\), subnormal numbers, \(\pm\infty\), and NaN.
  • A 23-bit mantissa which determines the precision of the number

    • The mantissa is normalized so that there is exactly one \(1\) to the left of the radix point. Since that leading bit is always \(1\), it does not need to be stored, giving a free extra bit of precision.

The range for single-precision floats is approximately between \(\pm 1.2\times 10^{-38}\) and \(\pm 3.4\times 10^{38}\) with a precision of around \(7\) decimal digits.
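Python's floats are doubles, but the standard-library `struct` module can round-trip a value through the 32-bit format, showing single precision's limited accuracy directly:

```python
import struct

# Round-trip 0.1 through the 32-bit (single precision) format.
single = struct.unpack(">f", struct.pack(">f", 0.1))[0]
print(single)         # 0.10000000149011612 -- only about 7 decimal digits survive
print(single == 0.1)  # False: precision was lost in the narrower format
```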

Double Precision Floating Points

Double precision uses 64 bits, including

  • One sign bit.

  • An 11-bit exponent.

    • A bias of \(1023\) is subtracted from the stored exponent, allowing for exponents between \(-1022\) and \(1023\).
    • As with single precision, exponent fields of all \(0\)s and all \(1\)s are reserved for special values such as \(0\), subnormals, \(\pm\infty\), and NaN.
  • A 52-bit mantissa determines the precision of the number.

    • Again, the mantissa is normalized to gain an extra bit of precision.

Similarly, the range for double-precision floats is approximately between \(\pm 2.2\times 10^{-308}\) and \(\pm 1.8\times 10^{308}\), with a precision of around \(15\) decimal digits.
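Since Python floats are IEEE 754 doubles, these limits are available at runtime via `sys.float_info`:

```python
import sys

print(sys.float_info.max)  # largest finite double, about 1.8e308
print(sys.float_info.min)  # smallest positive normalized double, about 2.2e-308
print(sys.float_info.dig)  # decimal digits reliably representable: 15
```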

An Example: The Float \(5\)

Example: Computers represent the integer \(5\) in double precision as follows:

  • Notice that \(5 = \left(101\right)_{2}\).

  • We can represent \(5\) in double precision as \(1.\underbrace{0100\cdots 0}_{\text{52 bits}}\times 2^2\) or \(1.\underbrace{0100\cdots 0}_{\text{52 bits}}\times 2^\underbrace{100\cdots 01}_{\text{11 bits}}\)

    • Remember that the stored exponent includes the bias of \(1023\): here it is \(2 + 1023 = 1025 = \left(10000000001\right)_{2}\).
  • According to IEEE 754, the number \(5\) is represented in double-precision by

\[\overbrace{0}^{\text{sign}}\underbrace{100\cdots 01}_{\text{Exponent, 11 bits}}\overbrace{0100\cdots 0}^{\text{mantissa, 52 bits}}\]
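We can verify this bit pattern by packing \(5.0\) as a big-endian double with the standard-library `struct` module:

```python
import struct

# The 64 bits of 5.0 as an IEEE 754 double, split into its three fields.
bits = "".join(f"{byte:08b}" for byte in struct.pack(">d", 5.0))
sign, exponent, mantissa = bits[0], bits[1:12], bits[12:]

print(sign)                     # 0 (positive)
print(exponent)                 # 10000000001
print(int(exponent, 2) - 1023)  # 2 -- the unbiased exponent
print(mantissa)                 # 0100 followed by 48 zeros
```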

Closing Comments

Tying Up a Loose End: The real numbers \(0.1\), \(0.2\), and \(0.3\) all have non-terminating binary expansions, so their floating point representations are rounded. This is the reason that 0.1 + 0.2 != 0.3 – the rounding errors in the floating point representations are the issue!
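You can see the rounding error directly in Python; `decimal.Decimal` reveals the exact value the double nearest \(0.1\) actually stores, and `math.isclose` shows the standard remedy of comparing with a tolerance rather than `==`.

```python
import math
from decimal import Decimal

print(0.1 + 0.2)         # 0.30000000000000004
print(0.1 + 0.2 == 0.3)  # False
print(Decimal(0.1))      # the exact binary value stored for the float 0.1

# The usual fix: compare floats with a tolerance, not with ==.
print(math.isclose(0.1 + 0.2, 0.3))  # True
```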

Important Concept (Machine Epsilon): Machine epsilon is the smallest step you can take from the number \(1\) to the next representable value. In double precision, that gap is \(2^{-52}\), since the mantissa contains \(52\) precision bits. Some texts instead define machine epsilon as the unit roundoff \(2^{-53}\), half the gap, which accounts for the extra bit of precision due to normalization.
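Python exposes the gap between \(1\) and the next double as `sys.float_info.epsilon`; adding anything well below half that gap to \(1.0\) simply rounds back to \(1.0\):

```python
import sys

eps = sys.float_info.epsilon
print(eps == 2 ** -52)       # True -- gap between 1.0 and the next double
print(1.0 + eps > 1.0)       # True -- a full step is visible
print(1.0 + eps / 4 == 1.0)  # True -- a quarter step rounds away entirely
```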

Question: What is machine epsilon for a single precision float?

Next Time: A Crash Course in Numerical Python.