MAT 370: Binary and Floating Point Arithmetic

Dr. Gilbert

December 24, 2025

A Reminder From Last Time

We concluded our Day 1 discussion by seeing that Python does not recognize that 0.1 + 0.2 is equal to 0.3.

We certainly recognize the equality, but why doesn’t Python?

Let’s start with why it matters…

▶ Video from the Rocket Surgery YouTube Channel.

Binary Representations of Integers

  • Humans evolved to use a base-10 number system simply because we have 10 fingers and 10 toes to count on.

  • Computers don’t have fingers or toes and can only “think” in terms of electrical pulses.

    • For this reason, they work in binary – you can think of this as \(0\) indicating a low voltage value and \(1\) indicating a high voltage value.
  • Since computers work in binary, they can’t natively understand our base-10 number system.

  • All information stored in a computer must be stored in binary.

    • Luckily, all numbers have a corresponding binary representation.

Below we show how repeated division by \(2\) (the division algorithm) converts an integer (\(84\)) into its binary representation.

\[\begin{align*} 84 &= 2\left(42\right) + 0\\ 42 &= 2\left(21\right) + 0\\ 21 &= 2\left(10\right) + 1\\ 10 &= 2\left(5\right) + 0\\ 5 &= 2\left(2\right) + 1\\ 2 &= 2\left(1\right) + 0\\ 1 &= 2\left(0\right) + 1 \end{align*}\]

Reading the remainders from bottom to top, we find that \(\left(84\right)_{10} = \left(1010100\right)_{2}\).

Each binary digit is referred to as a bit, and it is common to consider collections of \(8\) bits together – known as a byte.

That is, we would more commonly write \(\left(01010100\right)_{2}\) as the binary representation of \(84\).

Example: Use repeated division by \(2\) to find the binary representation of \(\left(118\right)_{10}\).
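The repeated-division procedure above translates directly into a short Python function (`to_binary` is a hypothetical helper name, not a standard function); Python's built-in `bin` can be used to check the result.

```python
def to_binary(n):
    """Convert a non-negative integer to a binary string by repeated division by 2."""
    if n == 0:
        return "0"
    bits = []
    while n > 0:
        n, remainder = divmod(n, 2)  # quotient carries forward; remainder is the next bit
        bits.append(str(remainder))
    return "".join(reversed(bits))   # remainders are read from bottom to top

print(to_binary(84))   # 1010100
print(bin(84))         # 0b1010100 -- Python's built-in agrees
```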

Binary Representations of Real Numbers

There is more to the story with binary representations of integers.

  • For example, there are several schemes for representing positive integers, negative integers, and \(0\).

    • Three popular schemes are sign-and-magnitude, ones' complement, and two's complement.

We’ll be more interested in floating point numbers than in integers in our course, so we’ll leave our discussion on integer representations here.

It is possible to represent non-integer values using binary.

For example, \(\frac{1}{8} = 0.125\) can be represented as \(\left(0.001\right)_{2}\) with the justification shown below.

\[\begin{align*} 2\left(0.125\right) &= 0.25 + 0\\ 2\left(0.25\right) &= 0.5 + 0\\ 2\left(0.5\right) &= 0 + 1 \end{align*}\]

Collecting the integer components from top to bottom, we see that \(0.125 = \left(0.001\right)_{2}\).

Let’s try one more.

Example: Convert \(53.7\) to binary.

Important Takeaway: Notice that the binary expansion of the fractional part (0.7) is repeating and non-terminating.
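The repeated-doubling procedure can be sketched in Python (`frac_to_binary` is a hypothetical helper name); running it on \(0.7\) makes the repeating, non-terminating expansion visible.

```python
def frac_to_binary(x, max_bits=20):
    """Expand the fractional part of x (0 <= x < 1) in binary by repeated doubling."""
    bits = []
    while x > 0 and len(bits) < max_bits:
        x *= 2
        bit = int(x)       # integer part of the doubled value is the next binary digit
        bits.append(str(bit))
        x -= bit
    return "0." + "".join(bits)

print(frac_to_binary(0.125))  # 0.001 -- terminates after three bits
print(frac_to_binary(0.7))    # the expansion keeps going; it never terminates
```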

Floating Point Representations

Computers don't store real numbers as pure fixed-point binary.

Such a scheme would be too limited: with a fixed number of bits split around the radix point, the range of magnitudes we could represent would be small.

The standard for representing floating point numbers is IEEE 754.

  • Floating point numbers can be represented with single precision (32 bits) or double precision (64 bits).
  • A floating point representation consists of a mantissa (significant digits) and an exponent (magnitude).

Single Precision Floating Points

Single precision uses 32 bits to store each number. Those bits include…

  • One sign bit (0 is positive, 1 is negative)

  • An 8-bit exponent

    • A bias of \(127\) is subtracted from the stored exponent, allowing for exponents ranging from \(-126\) to \(127\).
    • Exponent fields of all \(0\)s and all \(1\)s are reserved for special cases such as \(0\), subnormal numbers, \(\pm\infty\), and NaN.
  • A 23-bit mantissa which determines the precision of the number

    • The mantissa is normalized so that there is exactly one \(1\) to the left of the radix point. Since that leading bit is always \(1\), it does not need to be stored, giving a free extra bit of precision.

The range for single-precision floats is approximately between \(\pm 1.2\times 10^{-38}\) and \(\pm 3.4\times 10^{38}\) with a precision of around \(7\) decimal digits.
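Python's floats are doubles, but the standard-library `struct` module can round-trip a value through the 32-bit format, showing single precision's limited accuracy directly:

```python
import struct

# Round-trip 0.1 through the 32-bit (single precision) format.
single = struct.unpack(">f", struct.pack(">f", 0.1))[0]
print(single)         # 0.10000000149011612 -- only about 7 decimal digits survive
print(single == 0.1)  # False: precision was lost in the narrower format
```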

Double Precision Floating Points

Double precision uses 64 bits, including

  • One sign bit.

  • An 11-bit exponent.

    • A bias of \(1023\) is subtracted from the stored exponent, allowing for exponents between \(-1022\) and \(1023\).
    • As with single precision, exponent fields of all \(0\)s and all \(1\)s are reserved for special values such as \(0\), subnormals, \(\pm\infty\), and NaN.
  • A 52-bit mantissa determines the precision of the number.

    • Again, the mantissa is normalized to gain an extra bit of precision.

Similarly, the range for double-precision floats is approximately between \(\pm 2.2\times 10^{-308}\) and \(\pm 1.8\times 10^{308}\), with a precision of around \(15\) decimal digits.
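Since Python floats are IEEE 754 doubles, these limits are available at runtime via `sys.float_info`:

```python
import sys

print(sys.float_info.max)  # largest finite double, about 1.8e308
print(sys.float_info.min)  # smallest positive normalized double, about 2.2e-308
print(sys.float_info.dig)  # decimal digits reliably representable: 15
```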

An Example: The Float \(5\)

Example: Computers represent the integer \(5\) in double precision as follows:

  • Notice that \(5 = \left(101\right)_{2}\).

  • We can represent \(5\) in double precision as \(1.\underbrace{0100\cdots 0}_{\text{52 bits}}\times 2^2\) or \(1.\underbrace{0100\cdots 0}_{\text{52 bits}}\times 2^\underbrace{100\cdots 01}_{\text{11 bits}}\)

    • Remember that the stored exponent includes the bias of \(1023\): here it is \(2 + 1023 = 1025 = \left(10000000001\right)_{2}\).
  • According to IEEE 754, the number \(5\) is represented in double-precision by

\[\overbrace{0}^{\text{sign}}\underbrace{100\cdots 01}_{\text{Exponent, 11 bits}}\overbrace{0100\cdots 0}^{\text{mantissa, 52 bits}}\]
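We can verify this bit pattern by packing \(5.0\) as a big-endian double with the standard-library `struct` module:

```python
import struct

# The 64 bits of 5.0 as an IEEE 754 double, split into its three fields.
bits = "".join(f"{byte:08b}" for byte in struct.pack(">d", 5.0))
sign, exponent, mantissa = bits[0], bits[1:12], bits[12:]

print(sign)                     # 0 (positive)
print(exponent)                 # 10000000001
print(int(exponent, 2) - 1023)  # 2 -- the unbiased exponent
print(mantissa)                 # 0100 followed by 48 zeros
```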

Closing Comments

Tying Up a Loose End: The real numbers \(0.1\), \(0.2\), and \(0.3\) all have non-terminating binary expansions, so their floating point representations are rounded. This is the reason that 0.1 + 0.2 != 0.3 – the rounding errors in the floating point representations are the issue!
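You can see the rounding error directly in Python; `decimal.Decimal` reveals the exact value the double nearest \(0.1\) actually stores, and `math.isclose` shows the standard remedy of comparing with a tolerance rather than `==`.

```python
import math
from decimal import Decimal

print(0.1 + 0.2)         # 0.30000000000000004
print(0.1 + 0.2 == 0.3)  # False
print(Decimal(0.1))      # the exact binary value stored for the float 0.1

# The usual fix: compare floats with a tolerance, not with ==.
print(math.isclose(0.1 + 0.2, 0.3))  # True
```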

Important Concept (Machine Epsilon): Machine epsilon is the smallest step you can take from the number \(1\) to the next representable value. In double precision, that gap is \(2^{-52}\), since the mantissa contains \(52\) precision bits. Some texts instead define machine epsilon as the unit roundoff \(2^{-53}\), half the gap, which accounts for the extra bit of precision due to normalization.
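Python exposes the gap between \(1\) and the next double as `sys.float_info.epsilon`; adding anything well below half that gap to \(1.0\) simply rounds back to \(1.0\):

```python
import sys

eps = sys.float_info.epsilon
print(eps == 2 ** -52)       # True -- gap between 1.0 and the next double
print(1.0 + eps > 1.0)       # True -- a full step is visible
print(1.0 + eps / 4 == 1.0)  # True -- a quarter step rounds away entirely
```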

Question: What is machine epsilon for a single precision float?

Next Time: A Crash Course in Numerical Python.