## The Architecture of Computer Hardware and Systems Software

Floating Point Numbers

REVIEW OF EXPONENTIAL NOTATION

Numbers with fractional parts are often called rational numbers or real numbers in mathematics. Numbers with fractional parts are called floating point numbers in computer science. Floating point numbers are easy. They're just like scientific notation, only in binary instead of decimal.

EXAMPLE:

Represent the number -12,005 in scientific notation. There are an infinite number of ways to represent this number. Some possibilities are:

• -12005 x 10⁰
• -12005000 x 10⁻³
• -.12005 x 10⁵
• -.00012005 x 10⁸

Note that the second and fourth examples have zeros on the end and beginning, respectively. These zeros contribute nothing to the value of the number and are considered insignificant zeros. The zeros between the 2 and the 5, however, are important, as are the three non-zero digits. Everything between the leftmost non-zero digit and the rightmost non-zero digit is considered a significant digit (if you left out one of them, it would affect your calculations). Because there are so many different ways to represent any number in scientific notation, we usually decide on a standard (normalized) way to represent it. A common normalized format is to place the decimal point immediately to the left of the leftmost significant digit. There is only one normalized way to represent any given number in scientific notation. Our example, when normalized, would be -.12005 x 10⁵.

Scientific notation requires the specification of four separate components to define the number:

• The sign of the number ("-" in the example above).
• The magnitude, or value, of the number, called the mantissa (the digits .12005 in the normalized example).
• The sign of the exponent ("+" in the example).
• The magnitude of the exponent (5 in the normalized example).
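The four components above can be extracted mechanically. Here is a minimal sketch (the function name `normalize` is my own) that normalizes a number so the decimal point sits immediately left of the leftmost significant digit, then reports the sign, mantissa digits, and exponent:

```python
import math

def normalize(x):
    """Return (sign, mantissa_digits, exponent) with the decimal point
    placed immediately left of the leftmost significant digit."""
    sign = "-" if x < 0 else "+"
    # Keep only the significant digits: drop the point and any
    # insignificant leading/trailing zeros.
    digits = str(abs(x)).replace(".", "").lstrip("0").rstrip("0")
    # Moving the point to the left of the first significant digit means
    # the exponent is one more than floor(log10(|x|)).
    exponent = math.floor(math.log10(abs(x))) + 1
    return sign, "." + digits, exponent

print(normalize(-12005))  # ('-', '.12005', 5)
```

The sign of the exponent is carried in the integer itself (a negative exponent for numbers smaller than 1).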

FLOATING POINT FORMAT

Like integers, floating point numbers could be represented in many different formats. However, it makes life easier for everybody if a standard format is used. Obviously the sign of the mantissa and the sign of the exponent only require one bit each. The remaining bits will be divided between the mantissa and the exponent. The more bits we use for the exponent, the greater will be the range of values that can be represented; but this means fewer bits for the mantissa and fewer bits of precision (number of significant digits that can be represented). If we increase the precision by devoting more bits to the mantissa, that leaves fewer bits for the exponent and a smaller range of values that can be represented.

There are many possible ways of dividing up the bits among exponent and mantissa. The important thing, however, is to agree on a standard representation. Such a standard is the IEEE Standard 754.

FLOATING POINT IN THE COMPUTER & CONVERSION BETWEEN BASE 10 AND BASE 2

Floating point numbers in a computer are stored and manipulated in exactly the same way as we just discussed, except that the numbers are represented in binary instead of decimal. Again, the representation of floating point numbers is completely arbitrary, and the only reason for standards is to make it easy to move floating point data from one system to another.

The 754 standard

Single precision:

Bit 31 (1 bit) – sign bit of the mantissa

Bits 30-23 (8 bits) – biased exponent, 127 more than the actual exponent

Bits 22-0 (23 bits) – mantissa, stored in normalized form, with an implied digit of 1 to the left of the binary point (the 1 is not stored)

Double precision:

Bit 63 (1 bit) – sign bit of the mantissa

Bits 62-52 (11 bits) – biased exponent, 1023 more than the actual exponent

Bits 51-0 (52 bits) – mantissa, stored in normalized form, with an implied digit of 1 to the left of the binary point (the 1 is not stored)
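The single-precision layout above can be inspected directly with Python's `struct` module, which gives access to the raw IEEE 754 bit pattern of a value. This is a sketch (the function name `fields_single` is my own) that splits the 32 bits into the three fields:

```python
import struct

def fields_single(x):
    """Split a value (stored as IEEE 754 single precision) into
    sign bit, biased exponent, and stored mantissa bits."""
    # Reinterpret the 4-byte single-precision encoding as an unsigned int.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign     = bits >> 31            # bit 31
    exponent = (bits >> 23) & 0xFF   # bits 30-23: actual exponent + 127
    mantissa = bits & 0x7FFFFF       # bits 22-0: leading 1 not stored
    return sign, exponent, mantissa

# -6.5 = -1.101 x 2^2: sign 1, biased exponent 2 + 127 = 129,
# stored mantissa 101 followed by 20 zeros (0b101 << 20 = 5242880).
print(fields_single(-6.5))  # (1, 129, 5242880)
```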

There are several reserved bit patterns with special meanings:

Exponent is 0 and mantissa is 0 (plus or minus) equals 0.

Exponent is 255 and mantissa is 0 (plus or minus) equals infinity.

Exponent is 255 and mantissa is not zero equals NaN (Not a Number).

## Special Values

IEEE reserves exponent field values of all 0s and all 1s to denote special values in the floating-point scheme.

### Zero

As mentioned above, zero is not directly representable in the straight format, due to the assumption of a leading 1 (we'd need to specify a true zero mantissa to yield a value of zero). Zero is a special value denoted with an exponent field of zero and a fraction field of zero. Note that -0 and +0 are distinct values, though they both compare as equal.

### Denormalized

If the exponent is all 0s, but the fraction is non-zero (else it would be interpreted as zero), then the value is a denormalized number, which does not have an assumed leading 1 before the binary point. Thus, this represents a number (-1)ˢ x 0.f x 2⁻¹²⁶, where s is the sign bit and f is the fraction.
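Both the zero and denormalized encodings can be checked by reinterpreting hand-built bit patterns as single-precision values. A small sketch (the helper name `single_from_bits` is my own):

```python
import struct

def single_from_bits(bits):
    """Reinterpret a 32-bit pattern as an IEEE 754 single-precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

print(single_from_bits(0x00000000))  # 0.0   (all fields zero)
print(single_from_bits(0x80000000))  # -0.0  (sign bit only; equal to +0.0)
# Smallest positive denormal: exponent field 0, fraction 0...001,
# i.e. 0.00000000000000000000001 (binary) x 2^-126 = 2^-149.
print(single_from_bits(0x00000001) == 2.0 ** -149)  # True
```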

### Infinity

The values +infinity and -infinity are denoted with an exponent of all 1s and a fraction of all 0s. The sign bit distinguishes between negative infinity and positive infinity. Being able to denote infinity as a specific value is useful because it allows operations to continue past overflow situations. Operations with infinite values are well defined in IEEE floating point.

### Not A Number

The value NaN (Not a Number) is used to represent a value that does not represent a real number. NaN's are represented by a bit pattern with an exponent of all 1s and a non-zero fraction. There are two categories of NaN: QNaN (Quiet NaN) and SNaN (Signalling NaN).

A QNaN is a NaN with the most significant fraction bit set. QNaN's propagate freely through most arithmetic operations. These values pop out of an operation when the result is not mathematically defined.

An SNaN is a NaN with the most significant fraction bit clear. It is used to signal an exception when used in operations. SNaN's can be handy to assign to uninitialized variables to trap premature usage.

Semantically, QNaN's denote indeterminate operations, while SNaN's denote invalid operations.
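The QNaN bit pattern and its propagation behavior can be demonstrated in a few lines. This sketch builds a quiet NaN by hand (exponent all 1s, most significant fraction bit set):

```python
import math
import struct

# A quiet NaN: sign 0, exponent 1111 1111, fraction 100 0000 ... 0000.
qnan = struct.unpack(">f", struct.pack(">I", 0x7FC00000))[0]

print(math.isnan(qnan))        # True
print(qnan == qnan)            # False: NaN compares unequal even to itself
print(math.isnan(qnan + 1.0))  # True: QNaNs propagate through arithmetic
```

Note that Python's runtime does not trap SNaNs the way bare hardware can, so only the quiet variety is shown here.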

EXAMPLE

What IEEE 754-format floating point number is this? 1100 0000 1101 0000 0000 0000 0000 0000
First partition the digits to identify the sign, biased exponent, and mantissa:
1 | 1000 0001 | 101 0000 0000 0000 0000 0000
The first digit is the sign of the mantissa (1 = negative). The next eight digits are the biased exponent: 1000 0001 is 129. Subtract the bias of 127 and the exponent is 2. The remaining 23 digits are the mantissa. Remember that the leftmost 1 is not stored, so we must insert it. The actual value of the mantissa is 1.101 0000 0000 0000 0000 0000. This must be multiplied by 2², giving
110.1 0000 0000 0000 0000 0000, which is 4 + 2 + ½, or 6.5. Since the sign bit was 1, the result is negative, or -6.5. Note that the base for our exponent is 2, not 10.
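The decoding steps above can be mirrored in code. This is a minimal sketch (the function name `decode_single` is my own) that handles normal numbers only, ignoring the special exponent values 0 and 255:

```python
def decode_single(bits):
    """Decode a 32-bit IEEE 754 single-precision pattern by hand
    (normal numbers only)."""
    sign     = -1 if (bits >> 31) else 1
    exponent = ((bits >> 23) & 0xFF) - 127        # remove the bias of 127
    mantissa = 1 + (bits & 0x7FFFFF) / 2.0 ** 23  # reinsert the implied 1
    return sign * mantissa * 2.0 ** exponent

print(decode_single(0b1100_0000_1101_0000_0000_0000_0000_0000))  # -6.5
```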

EXAMPLE

Let's go the other way. Convert 17.75 to IEEE 754-format floating point. First, convert it to binary:
10001.11. In scientific notation, this is 10001.11 x 2⁰. Shift the "binary point" four places to the left to normalize the number and we get 1.000111 x 2⁴. The number is positive, so the sign bit is
0. The exponent is 4, so the biased exponent is 4 + 127 = 131, which in binary is 1000 0011. The mantissa would be 1.000 1110 0000 0000 0000 0000 except for our trick of not storing the leftmost digit of the mantissa. After we discard the leftmost digit, we get: 000 1110 0000 0000 0000 0000. Put the digits together and we get: 0100 0001 1000 1110 0000 0000 0000 0000. A number with this many bits is easier to represent in hex: 41 8E 00 00.
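The encoding worked out above can be verified by letting Python's `struct` module produce the single-precision bit pattern for 17.75:

```python
import struct

# Encode 17.75 as IEEE 754 single precision and view the raw bits.
bits = struct.unpack(">I", struct.pack(">f", 17.75))[0]
print(hex(bits))  # 0x418e0000
```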

Special cases:

0 is represented like this:

0 for sign, 0 for exponent, 0 for mantissa

Infinity is represented like this:

255 for exponent, 0 for mantissa (either + or -)

AN ALTERNATIVE REPRESENTATION: PACKED DECIMAL

Packed decimal uses 4 bits per digit, allowing 2 decimal digits to be "packed" into a single byte. 1100 is used to indicate a positive number, and 1101 is used to indicate a negative number. If the number of digits is odd (leaving half a byte left over), the sign bits will occupy the remaining half-byte, so the digits and sign fill whole bytes exactly. If the number of digits is even, the digits and sign bits leave half a byte left over; four zero bits are appended on the left (the location of the high-order [leading] digits) to fill out the extra half-byte.

EXAMPLE: DECIMAL to PACKED DECIMAL

Convert +12345 to packed decimal. Convert each digit to a 4-bit binary number, and append the 4-bit sign on the right end:

`   1    2    3    4    5    +`
`0001 0010 0011 0100 0101 1100`

EXAMPLE: DECIMAL to PACKED DECIMAL

Convert -7890 to packed decimal. Convert each digit to a 4-bit binary number, and append the 4-bit sign on the right end. Append a 4-bit leading 0 to fill out the extra half-byte:

`   0    7    8    9    0    -`
`0000 0111 1000 1001 0000 1101`

EXAMPLE: PACKED DECIMAL TO DECIMAL

Convert the packed decimal number 0100 1001 0101 1000 1101 to base 10. To solve this problem, convert each 4-bit number to its equivalent decimal digit, and convert the rightmost 4 bits to its equivalent sign:

-4958
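The packing rule from the examples above can be sketched in a few lines (the function name `pack_decimal` is my own): one 4-bit nibble per digit, a sign nibble (1100 = +, 1101 = -) appended on the right, and a leading zero nibble added when needed to fill out whole bytes.

```python
PLUS, MINUS = 0b1100, 0b1101  # packed-decimal sign nibbles

def pack_decimal(n):
    """Return the packed-decimal encoding of integer n as a nibble string."""
    nibbles = [int(d) for d in str(abs(n))]    # one nibble per digit
    nibbles.append(MINUS if n < 0 else PLUS)   # sign on the right end
    if len(nibbles) % 2:                       # odd nibble count: pad on
        nibbles.insert(0, 0)                   # the left (leading zeros)
    return " ".join(f"{nib:04b}" for nib in nibbles)

print(pack_decimal(12345))  # 0001 0010 0011 0100 0101 1100
print(pack_decimal(-7890))  # 0000 0111 1000 1001 0000 1101
```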

PROGRAMMING CONSIDERATIONS

I have one comment to add: In the 21st century, you don't need to worry very much about saving a few bits or CPU cycles here and there. If you think that an integer will never be larger than 255, you can store it in a byte, but your savings in both memory and CPU cycles will be minimal. However, if the value of the integer should get larger than you were able to foresee, and it exceeds 255, your program will fail. And this is the type of error that (on many systems) will not generate an error message; it will simply "wrap around" (i.e., throwing away the carry bit, so that 256 becomes 0, 257 becomes 1, etc.) and give the user incorrect output. If the output is not obviously wrong, the user will be making decisions based on incorrect calculations (just to save a few bits!).

Today's computers use a word size of 32 or 64 bits, meaning that they both fetch data in 32- or 64-bit chunks and perform arithmetic operations on 32- or 64-bit chunks, so the time saved by using a data size smaller than the word size of the CPU you are using would probably either be 0 or insignificant. And, of course, saving time on a single instruction is meaningless; the only time that it matters is if the instruction is in a loop that is executed many, many times (e.g. millions of times).
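The silent wrap-around described above can be sketched with an 8-bit value; masking with 0xFF mimics a byte-sized counter that discards the carry bit without raising any error (the function name `add_byte` is my own):

```python
def add_byte(a, b):
    """Add two values as unsigned 8-bit integers, silently
    discarding any carry out of the top bit."""
    return (a + b) & 0xFF

print(add_byte(255, 1))  # 0 -- 256 wraps around to 0, no error raised
print(add_byte(255, 2))  # 1 -- 257 wraps around to 1
```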