**Floating Point Numbers**

**REVIEW OF EXPONENTIAL NOTATION**

Numbers with fractional parts are often called rational numbers or real numbers in mathematics. In computer science, numbers with fractional parts are called **floating point numbers**. Floating point numbers are easy: they work just like **scientific notation**, only in binary instead of decimal.

**EXAMPLE:**

Represent the number -12,005 in scientific notation. There are infinitely many ways to represent this number. Some possibilities are:

- -12005 x 10^{0}
- -12005000 x 10^{-3}
- -.12005 x 10^{5}
- -.00012005 x 10^{8}

Note that the second and fourth examples have zeros on the end and beginning, respectively. These zeros contribute nothing to the value of the number and are considered **insignificant zeros**. The zeros between the 2 and the 5, however, are important, as are the three non-zero digits. Everything between the leftmost non-zero digit and the rightmost non-zero digit, inclusive, is considered a **significant digit** (if you left out one of them, it would affect your calculations). Because there are so many different ways to represent any number in scientific notation, we usually decide on a standard (**normalized**) way to represent the number. A common normalized format is to place the decimal point immediately to the left of the leftmost significant digit. There is only one normalized way to represent any given number in scientific notation. Our example, when normalized, would be -.12005 x 10^{5}.

Scientific notation requires the specification of four separate components to define the number:

- The sign of the number ("-" in the example above).
- The magnitude, or value, of the number, called the **mantissa** (the digits .12005 in the normalized example).
- The sign of the exponent ("+" in the example).
- The magnitude of the exponent (5 in the normalized example).
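As a sketch, the decomposition into these components can be written in Python. The helper name `normalize_decimal` is chosen here for illustration, and it assumes its input is a nonzero integer:

```python
def normalize_decimal(n):
    """Hypothetical helper: split a nonzero integer n into (sign, mantissa, exponent)
    so that n = sign * mantissa * 10**exponent, with the decimal point immediately
    to the left of the leftmost significant digit (so 0.1 <= mantissa < 1)."""
    sign = -1 if n < 0 else 1
    digits = str(abs(n))
    exponent = len(digits)                 # the point moves left past every digit
    digits = digits.rstrip('0')            # drop insignificant trailing zeros
    mantissa = int(digits) / 10**len(digits)
    return sign, mantissa, exponent

print(normalize_decimal(-12005))     # (-1, 0.12005, 5), i.e. -.12005 x 10^5
print(normalize_decimal(-12005000))  # (-1, 0.12005, 8), i.e. -.12005 x 10^8
```

Note that both example inputs normalize to the same mantissa digits, differing only in the exponent, just as in the list above.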

**FLOATING POINT FORMAT**

Like integers, floating point numbers could be represented in many different formats. However, it makes life easier for everybody if a standard format is used. Obviously the sign of the mantissa and the sign of the exponent only require one bit each. The remaining bits are divided between the mantissa and the exponent. The more bits we use for the exponent, the greater the range of values that can be represented; but this means fewer bits for the mantissa and fewer bits of **precision** (the number of significant digits that can be represented). If we increase the precision by devoting more bits to the mantissa, that leaves fewer bits for the exponent and a smaller range of values that can be represented.

There
are many possible ways of dividing up the bits among exponent and mantissa. The
important thing, however, is to agree on a **standard representation**.
Such a standard is the **IEEE Standard 754**.

**FLOATING POINT IN THE COMPUTER & CONVERSION BETWEEN BASE 10 AND BASE 2**

Floating point numbers in a computer are stored and manipulated in
exactly the same way as we just discussed, except that the numbers are
represented in binary instead of decimal. Again, the representation of floating
point numbers is completely arbitrary, and the only reason for standards is to
make it easy to move floating point data from one system to another.

**The 754 standard**

**Single precision:**

- Bit 31 (1 bit) – sign bit of the mantissa
- Bits 30-23 (8 bits) – biased exponent, 127 more than the actual exponent
- Bits 22-0 (23 bits) – mantissa, stored in normalized form, with an *implied* digit of 1 to the left of the binary point (the 1 is not stored)

**Double precision:**

- Bit 63 (1 bit) – sign bit of the mantissa
- Bits 62-52 (11 bits) – biased exponent, 1023 more than the actual exponent
- Bits 51-0 (52 bits) – mantissa, stored in normalized form, with an *implied* digit of 1 to the left of the binary point (the 1 is not stored)
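As a sketch of how these three fields can be inspected, the following Python snippet uses the standard `struct` module to get the 32 bits of a single-precision value and split them apart (`float_bits` is a name chosen here for illustration):

```python
import struct

def float_bits(x):
    """Return the IEEE 754 single-precision fields of x as
    (sign, biased exponent, mantissa) bit strings."""
    (bits,) = struct.unpack('>I', struct.pack('>f', x))  # 32 bits as an unsigned int
    s = f'{bits:032b}'
    return s[0], s[1:9], s[9:]   # bit 31 | bits 30-23 | bits 22-0

sign, exponent, mantissa = float_bits(-6.5)
print(sign, exponent, mantissa)  # 1 10000001 10100000000000000000000
```

The mantissa string shows only the stored 23 bits; the implied leading 1 is, as described above, not present in the bit pattern.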

**There are several reserved bit patterns with special meanings:**

- Exponent is 0 and mantissa is 0: the value is 0 (plus or minus).
- Exponent is 255 and mantissa is 0: the value is infinity (plus or minus).
- Exponent is 255 and mantissa is *not* zero: the value is NaN (Not a Number).

IEEE reserves exponent field values of all 0s and all
1s to denote special values in the floating-point scheme.

As mentioned above, zero is not directly representable
in the straight format, due to the assumption of a leading 1 (we'd need to
specify a true zero mantissa to yield a value of zero). Zero is a special value
denoted with an exponent field of zero and a fraction field of zero. Note that
-0 and +0 are distinct values, though they both compare as equal.
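Python's built-in floats (IEEE 754 doubles) show this behavior directly:

```python
import math

# +0.0 and -0.0 have different bit patterns (the sign bit differs),
# but they compare as equal.
print(0.0 == -0.0)               # True
print(math.copysign(1.0, -0.0))  # -1.0: the sign is still there,
                                 # even though the values compare equal
```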

If the exponent is all 0s, but the fraction is non-zero
(else it would be interpreted as zero), then the value is a *denormalized*
number, which does *not* have an assumed leading 1 before the binary
point. Thus, this represents a number (-1)^{s}
x 0.*f* x 2^{-126}, where *s* is the sign bit and *f* is
the fraction.
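A quick way to see a denormalized value, using the standard `struct` module: the pattern with exponent field all 0s and fraction 00...01 is the smallest positive single-precision number:

```python
import struct

# Sign 0, exponent field all 0s, fraction 00...01:
# the smallest positive denormalized single-precision value.
(tiny,) = struct.unpack('>f', bytes([0x00, 0x00, 0x00, 0x01]))
print(tiny == 2**-149)  # True: 0.00000000000000000000001b x 2**-126 = 2**-149
```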

The values +infinity and -infinity are denoted with an
exponent of all 1s and a fraction of all 0s. The sign bit distinguishes between
negative infinity and positive infinity. Being able to denote infinity as a
specific value is useful because it allows operations to continue past overflow
situations. *Operations with infinite values are well defined in IEEE
floating point.*
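For example, in Python:

```python
inf = float('inf')

# Arithmetic continues past overflow instead of failing.
print(1.0 / inf)   # 0.0
print(inf + 1.0)   # inf
print(-inf < inf)  # True: the sign bit distinguishes the two infinities
```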

The value *NaN* (*Not a Number*) is used to represent a value that is not a real number. There are two categories of NaN: QNaN (*Quiet NaN*) and SNaN (*Signalling NaN*).

A QNaN is a NaN with the most significant fraction bit set. QNaNs propagate freely through most arithmetic operations without raising an exception.

An SNaN is a NaN with the most significant fraction bit clear. An SNaN signals an exception when used in an operation.

Semantically, QNaNs denote *indeterminate* operations, while SNaNs denote *invalid* operations.
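NaN's unusual comparison behavior can be observed directly:

```python
import math

nan = float('nan')
print(nan == nan)       # False: NaN compares unequal to everything, even itself
print(math.isnan(nan))  # True: the reliable way to test for NaN
```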

**EXAMPLE**

What IEEE 754-format floating point number is this? 1100 0000 1101 0000 0000 0000 0000 0000

First partition the digits to identify the sign, biased exponent, and mantissa:

**1 | 1000 0001 | 101 0000 0000 0000 0000 0000**

The first digit is the sign of the mantissa (1 = negative). The next eight digits are the biased exponent: 1000 0001 is 129. Subtract the bias of 127 and the exponent is 2. The remaining 23 digits are the mantissa. Remember that the leftmost 1 is not stored, so we must insert it. The actual value of the mantissa is **1.**101 0000 0000 0000 0000 0000. This must be multiplied by 2^{2}, giving

110**.**1 0000 0000 0000 0000 0000, which is 4 + 2 + ½, or 6.5. Since the sign bit was 1, the result is negative: -6.5. Note that the base for our exponent is 2, not 10.
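The same decoding steps can be sketched as a Python function. `decode_single` is a name used here for illustration, and it handles normalized numbers only (not the reserved patterns):

```python
def decode_single(bits):
    """Decode a 32-bit integer as an IEEE 754 single-precision value
    (normalized numbers only)."""
    sign = (bits >> 31) & 1              # bit 31
    biased_exp = (bits >> 23) & 0xFF     # bits 30-23
    fraction = bits & 0x7FFFFF           # bits 22-0
    mantissa = 1 + fraction / 2**23      # restore the implied leading 1
    return (-1)**sign * mantissa * 2**(biased_exp - 127)

print(decode_single(0b11000000110100000000000000000000))  # -6.5
```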

**EXAMPLE**

Let's go the other way. Convert 17.75 to IEEE 754-format floating point. First, convert it to binary: 10001.11. In scientific notation, this is 10001.11 x 2^{0}. Shift the "binary point" four places to the left to normalize the number and we get 1**.**000111 x 2^{4}. The number is positive, so the sign bit is **0**. The exponent is 4, so the biased exponent is 4 + 127 = 131, which in binary is **1000 0011**. The mantissa would be 1**.**000 1110 0000 0000 0000 0000 except for our trick of not storing the leftmost digit of the mantissa. After we discard the leftmost digit, we get: **000 1110 0000 0000 0000 0000**. Put the digits together and we get: **0 1000 0011 000 1110 0000 0000 0000 0000**. A number with this many bits is easier to represent in hex: 41 8E 00 00.
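The standard `struct` module performs this conversion directly, so the hand calculation can be checked (`encode_single` is a hypothetical helper name):

```python
import struct

def encode_single(x):
    """Return the IEEE 754 single-precision bytes of x as a hex string."""
    return struct.pack('>f', x).hex(' ').upper()

print(encode_single(17.75))  # 41 8E 00 00
```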

**Special cases:**

0 is represented like this: 0 for sign, 0 for exponent, 0 for mantissa.

Infinity is represented like this: 255 for exponent, 0 for mantissa (either + or -).

**AN ALTERNATIVE REPRESENTATION: PACKED DECIMAL**

Packed decimal uses 4 bits per digit, allowing 2 decimal digits to be "packed" into a single byte. The sign is also stored in 4 bits, appended after the rightmost digit: 1100 indicates a positive number, and 1101 indicates a negative number. If the number of digits is odd, the digits plus the sign exactly fill a whole number of bytes. If the number of digits is even, the sign leaves half a byte left over; the four unused bits are filled with a zero digit appended on the left (the location of the high-order [leading] digit).

**EXAMPLE: DECIMAL to PACKED DECIMAL**

Convert
+12345 to packed decimal. Convert each digit to a
4-bit binary number, and append the 4-bit sign on the right end:

` 1 2 3 4 5 +`

`0001 0010 0011 0100 0101 1100`

**EXAMPLE: DECIMAL to PACKED DECIMAL**

Convert -7890 to packed decimal. Convert each digit to a 4-bit binary number, and append the 4-bit sign on the right end. Append a 4-bit leading 0 to fill out the extra half-byte:

` 0 7 8 9 0 -`

`0000 0111 1000 1001 0000 1101`
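Both packing examples can be reproduced with a short Python sketch; `pack_decimal` is a hypothetical helper that follows the rules above:

```python
def pack_decimal(n):
    """Hypothetical helper: pack integer n as packed decimal, 4 bits per digit,
    with the sign nibble (0b1100 = +, 0b1101 = -) appended on the right and a
    leading zero digit added if needed to fill whole bytes."""
    nibbles = [int(d) for d in str(abs(n))]
    nibbles.append(0b1101 if n < 0 else 0b1100)
    if len(nibbles) % 2:              # even digit count leaves half a byte over
        nibbles.insert(0, 0)          # pad with a leading zero digit
    return bytes(hi << 4 | lo for hi, lo in zip(nibbles[::2], nibbles[1::2]))

print(pack_decimal(12345).hex())  # 12345c
print(pack_decimal(-7890).hex())  # 07890d
```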

**EXAMPLE: PACKED DECIMAL TO DECIMAL**

Convert the packed decimal number 0100 1001 0101
1000 1101 to base 10. To solve this problem,
convert each 4-bit number to its equivalent decimal digit, and convert the
rightmost 4 bits to its equivalent sign:

-4958
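Going the other direction, a matching sketch (`unpack_decimal`, again a hypothetical name) decodes packed bytes. The input below is the example value with a leading zero digit added so that it fills whole bytes:

```python
def unpack_decimal(data):
    """Hypothetical helper: decode packed-decimal bytes back to an int.
    The rightmost nibble is the sign (0b1101 = negative)."""
    nibbles = [n for b in data for n in (b >> 4, b & 0xF)]
    *digits, sign = nibbles
    value = int(''.join(str(d) for d in digits))
    return -value if sign == 0b1101 else value

print(unpack_decimal(bytes([0x04, 0x95, 0x8D])))  # -4958
```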

**PROGRAMMING CONSIDERATIONS**

I have one comment to add: In the 21^{st} century,
you don't need to worry very much about saving a few bits or CPU cycles here
and there. If you think that an integer will never be larger than 255, you can
store it in a byte, but your savings in both memory and CPU cycles will be
minimal. However, if the value of the integer should get larger than you were
able to foresee, and it exceeds 255, your program will fail. And this is the
type of error that (on many systems) will not generate an error message; it
will simply "wrap around" (i.e., throw away the carry bit, so that
256 becomes 0, 257 becomes 1, etc.) and give the user incorrect output. If the
output is not obviously wrong, the user will be making decisions based on
incorrect calculations (just to save a few bits!). Today's computers use a word
size of 32 or 64 bits, meaning that they both fetch data in 32- or 64-bit
chunks and perform arithmetic operations on 32- or 64-bit chunks, so the time
saved by using a data size smaller than the word size of the CPU you are using
would probably either be 0 or insignificant. And, of course, saving time on a
single instruction is meaningless; the only time that it matters is if the
instruction is in a loop that is executed many, many times (e.g. millions of
times).
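The wrap-around described above can be demonstrated with an 8-bit counter from Python's standard `ctypes` module:

```python
import ctypes

# An 8-bit unsigned counter silently wraps instead of reporting overflow.
counter = ctypes.c_uint8(255)
counter.value += 1
print(counter.value)  # 0: the carry bit is simply discarded, with no error message
```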