The Architecture of Computer Hardware and Systems Software

Chapter 5 Notes

Floating Point Numbers

 

5.1 REVIEW OF EXPONENTIAL NOTATION

In mathematics, numbers with fractional parts are called real numbers (or rational numbers, when they can be written as a ratio of two integers). In computer science, numbers with fractional parts are called floating point numbers. Floating point numbers are easy: they're just like scientific notation, only in binary instead of decimal.

 


 

EXAMPLE:

Represent the number -12,005 in scientific notation. There are an infinite number of ways to represent this number. Some possibilities are:

-1.2005 x 10^4
-120.0500 x 10^2
-.12005 x 10^5
-.0012005 x 10^7

Note that the second and fourth examples have zeros on the end and beginning, respectively. These zeros contribute nothing to the value of the number and are considered insignificant zeros. The zeros between the 2 and the 5, however, are important, as are the three non-zero digits. Every digit from the leftmost non-zero digit through the rightmost non-zero digit is considered a significant digit (if you left out one of them, it would affect your calculations). Because there are so many different ways to represent any number in scientific notation, we usually decide on a standard (normalized) way to represent the number. A common normalized format is to place the decimal point immediately to the left of the leftmost significant digit. There will be only one normalized way to represent any given number in scientific notation. Our example, when normalized, would be -.12005 x 10^5.
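Here is a quick sketch of base-10 normalization in Python (the function name is my own, and the result is subject to the usual floating point rounding):

    import math

    def normalize_base10(x):
        # return (mantissa, exponent) with 0.1 <= |mantissa| < 1
        # and x == mantissa * 10**exponent
        if x == 0:
            return 0.0, 0
        exponent = math.floor(math.log10(abs(x))) + 1
        return x / 10**exponent, exponent

    print(normalize_base10(-12005))   # approximately (-0.12005, 5)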

 


 

Scientific notation requires the specification of four separate components to define the number:

the sign of the mantissa
the magnitude of the mantissa
the sign of the exponent
the magnitude of the exponent

(The base is not stored; it is agreed upon in advance.)


 

5.2 FLOATING POINT FORMAT

Like integers, floating point numbers could be represented in many different formats. However, it makes life easier for everybody if a standard format is used. Obviously the sign of the mantissa and the sign of the exponent only require one bit each. The remaining bits are divided between the mantissa and the exponent. The more bits we use for the exponent, the greater the range of values that can be represented; but this leaves fewer bits for the mantissa, and therefore fewer bits of precision (the number of significant digits that can be represented). If we increase the precision by devoting more bits to the mantissa, that leaves fewer bits for the exponent and a smaller range of values that can be represented.

There are many possible ways of dividing up the bits among exponent and mantissa. The important thing, however, is to agree on a standard representation. Such a standard is the IEEE Standard 754.
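To get a rough feel for the tradeoff, here is a quick back-of-the-envelope calculation in Python, using the split that IEEE 754 single precision uses (8 exponent bits, 23 mantissa bits; described in section 5.6):

    import math

    print(2.0**127)              # about 1.7e38: the rough magnitude range 8 exponent bits buy
    print(23 * math.log10(2))    # about 6.9: decimal digits of precision from 23 mantissa bits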

 


 

5.3 NORMALIZATION AND FORMATTING OF FLOATING POINT NUMBERS (BASE 10)

Before looking at how floating point numbers are represented in binary, we will consider something more familiar: floating point in base 10. Although any representation could be used, we will use the format described on page 124 of Englander: SEEMMMMM. The "S" is a sign (0=positive, 5=negative), "EE" is an exponent (00 through 99), and "MMMMM" is the mantissa with an assumed decimal point to the left of the first digit (.00000 through .99999). Since the sign is the sign of the mantissa, and there is no provision for a sign for the exponent, it appears that there is no way to represent negative exponents. However, we can use the odometer trick that we used earlier. Choose the midpoint (50) to represent 0. Then 51 represents 1, 52 represents 2, ... and 99 represents an exponent of 49. Negative numbers begin with 49 representing -1, 48 representing -2,... and 0 representing -50. The digits that are stored in the representation of the exponent are actually 50 more than the actual value of the exponent. Such an exponent is called a biased exponent.
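A tiny illustration of the excess-50 idea (the function names are my own):

    def to_biased(actual_exponent):
        # the stored digits are the actual exponent plus the bias of 50
        return actual_exponent + 50

    def from_biased(stored_exponent):
        return stored_exponent - 50

    print(to_biased(5))     # 55
    print(to_biased(-2))    # 48
    print(from_biased(49))  # -1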

 


 

EXAMPLE

What number does 05523687 represent? The biased exponent of 55 is 50 more than the actual exponent, so the actual exponent is 5. The mantissa is .23687. If we multiply .23687 by 10^5, we get 23,687.
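Here is a minimal decoder for this format in Python (my own sketch; it assumes a well-formed 8-digit string):

    def decode_seemmmmm(digits):
        # SEEMMMMM: sign digit, excess-50 exponent, 5-digit mantissa
        sign = -1 if digits[0] == '5' else 1
        exponent = int(digits[1:3]) - 50
        # the assumed decimal point sits to the left of the first mantissa digit
        return sign * int(digits[3:]) * 10.0**(exponent - 5)

    print(decode_seemmmmm('05523687'))   # 23687.0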

 


 

EXAMPLE

How would the number -1234.5678 be represented in Englander's floating point format? The number is negative, so the sign is 5. The mantissa field holds only 5 digits, so the mantissa becomes .12345 if you truncate the extra digits (.12346 if you round). In normalizing the mantissa, the decimal point is moved 4 places to the left, which is the equivalent of dividing by 10^4. Therefore, to restore the decimal point to its correct position, the number must be multiplied by 10^4. The actual exponent is 4, but the biased exponent is the actual exponent plus 50, or 54. The final result is 55412345.
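And a matching encoder (again my own sketch; for simplicity it assumes a nonzero number with |x| >= 1, and truncates rather than rounds):

    def encode_seemmmmm(x):
        sign = '5' if x < 0 else '0'
        int_part, _, frac_part = str(abs(x)).partition('.')
        exponent = len(int_part)                          # places the decimal point moved left
        mantissa = (int_part + frac_part + '00000')[:5]   # keep 5 digits, truncating
        return sign + str(exponent + 50) + mantissa

    print(encode_seemmmmm(-1234.5678))   # 55412345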

 


 

5.4 A PROGRAMMING EXAMPLE

You can skip this section if you are not familiar with Pascal (although it should be easy to follow for anybody with programming experience).

 


 

5.5 FLOATING POINT CALCULATIONS

Multiplication and Division

Multiplication and division are easiest, so we'll do them first. Consider how multiplication of numbers in scientific notation is done. Multiply 250 times 5. The result should be 1250. In scientific notation, the numbers are (.25 x 10^3) x (.5 x 10^1). If we group the mantissas together and the exponents together, we get (.25 x .5) x (10^3 x 10^1). (.25 x .5) is .125 and (10^3 x 10^1) is 10^4. So the result is .125 x 10^4. All you have to do is multiply mantissas and add exponents. When doing division, divide mantissas and subtract exponents. If the resulting mantissa isn't normalized, normalize it.
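A sketch of the multiply rule, working on (mantissa, exponent) pairs (the names and normalization convention are mine):

    def fp_multiply(m1, e1, m2, e2):
        # multiply mantissas, add exponents
        m, e = m1 * m2, e1 + e2
        while m != 0 and abs(m) < 0.1:   # renormalize so that 0.1 <= |m| < 1
            m, e = m * 10, e - 1
        return m, e

    print(fp_multiply(0.25, 3, 0.5, 1))   # (0.125, 4)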

 

Addition and Subtraction

Addition and subtraction are slightly more difficult because only numbers with the same place values can be added or subtracted. We will look at an example in scientific notation.

 


 

EXAMPLE

Let's add the two numbers from the previous example: 250 + 5. The result should be 255. In scientific notation, the numbers are: 250 = .25 x 10^3, 5 = .5 x 10^1, and 255 = .255 x 10^3. We must take the number with the smaller exponent and change it so that its exponent is the same as the larger exponent. This means that we must take .5 x 10^1 and make its exponent 3, which multiplies the exponential part by 100. If we multiply the exponential part by 100, we must divide the mantissa by 100 to keep from changing the value of the number. So the number 5 is rewritten as .005 x 10^3. Now we can add the mantissas, giving a result of .255. The exponent remains 3, so the result is .255 x 10^3. The result is already in normalized form, so we are done.

 


 

EXAMPLE

Let's add the numbers 975 and 50. The result should be 1025. In scientific notation, these three numbers are: 975 = .975 x 10^3, 50 = .5 x 10^2, and 1025 = .1025 x 10^4. Before adding, take the number with the smaller exponent, and change it so that its exponent is the same as the larger exponent. So the 50 can be rewritten as .05 x 10^3. Now that the exponents are the same, we can add the mantissas: .975 + .05 = 1.025. The exponent is 3, so the result is 1.025 x 10^3. However, this result is not normalized. To normalize it, we must divide the mantissa by 10, and if we divide one part of the number by 10, we must multiply another part by 10 or we will change the value of the number. The "other part" that we multiply by 10 is the exponential part: 10^3 x 10 = 10^4. The final, normalized result is .1025 x 10^4.
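Both addition examples follow the same recipe, sketched below (my own names; results are subject to the usual binary rounding):

    def fp_add(m1, e1, m2, e2):
        if e1 < e2:                      # make (m1, e1) the number with the larger exponent
            m1, e1, m2, e2 = m2, e2, m1, e1
        m2 = m2 / 10**(e1 - e2)          # align: shift the smaller mantissa right
        m, e = m1 + m2, e1
        while abs(m) >= 1:               # renormalize if the mantissa overflowed
            m, e = m / 10, e + 1
        return m, e

    print(fp_add(0.25, 3, 0.5, 1))       # (0.255, 3)
    print(fp_add(0.975, 3, 0.5, 2))      # approximately (0.1025, 4)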

 


 

5.6 FLOATING POINT IN THE COMPUTER &
5.7 CONVERSION BETWEEN BASE 10 AND BASE 2

Floating point numbers in a computer are stored and manipulated in exactly the same way as we just discussed, except that the numbers are represented in binary instead of decimal. Again, the representation of floating point numbers is completely arbitrary, and the only reason for standards is to make it easy to move floating point data from one system to another.

 

The IEEE 754 standard

Single precision:

Bit 31 (1 bit) sign bit of the mantissa

Bits 30-23 (8 bits) biased exponent, 127 more than the actual exponent

Bits 22-0 (23 bits) mantissa, stored in normalized form, with an implied digit of 1 to the left of the binary point (the 1 is not stored)

 

Double precision:

Bit 63 (1 bit) sign bit of the mantissa

Bits 62-52 (11 bits) biased exponent, 1023 more than the actual exponent

Bits 51-0 (52 bits) mantissa, stored in normalized form, with an implied digit of 1 to the left of the binary point (the 1 is not stored)
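Python's struct module makes it easy to look at these fields directly. A small sketch for single precision (the function name is mine):

    import struct

    def fields_single(x):
        # reinterpret x's single precision bits as a 32-bit unsigned integer
        bits = struct.unpack('>I', struct.pack('>f', x))[0]
        sign = bits >> 31                       # bit 31
        biased_exponent = (bits >> 23) & 0xFF   # bits 30-23
        mantissa = bits & 0x7FFFFF              # bits 22-0
        return sign, biased_exponent, mantissa

    print(fields_single(-6.5))   # (1, 129, 5242880): exponent is 129 - 127 = 2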

 

There are several reserved bit patterns with special meanings:

 

Exponent is 0 and mantissa is 0 (plus or minus) equals 0.

 

Exponent is 255 and mantissa is 0 (plus or minus) equals infinity.

 

Exponent is 255 and mantissa is not zero equals NaN (Not a Number; see below).

Special Values

IEEE reserves exponent field values of all 0s and all 1s to denote special values in the floating-point scheme.

Zero

As mentioned above, zero is not directly representable in the straight format, due to the assumption of a leading 1 (we'd need to specify a true zero mantissa to yield a value of zero). Zero is a special value denoted with an exponent field of zero and a fraction field of zero. Note that -0 and +0 are distinct values, though they both compare as equal.

Denormalized

If the exponent is all 0s, but the fraction is non-zero (else it would be interpreted as zero), then the value is a denormalized number, which does not have an assumed leading 1 before the binary point. Thus, this represents a number (-1)^s x 0.f x 2^(-126), where s is the sign bit and f is the fraction.

Infinity

The values +infinity and -infinity are denoted with an exponent of all 1s and a fraction of all 0s. The sign bit distinguishes between negative infinity and positive infinity. Being able to denote infinity as a specific value is useful because it allows operations to continue past overflow situations. Operations with infinite values are well defined in IEEE floating point.

Not A Number

The value NaN (Not a Number) is used to represent a value that does not represent a real number. NaN's are represented by a bit pattern with an exponent of all 1s and a non-zero fraction. There are two categories of NaN: QNaN (Quiet NaN) and SNaN (Signalling NaN).

A QNaN is a NaN with the most significant fraction bit set. QNaN's propagate freely through most arithmetic operations. These values pop out of an operation when the result is not mathematically defined.

An SNaN is a NaN with the most significant fraction bit clear. It is used to signal an exception when used in operations. SNaN's can be handy to assign to uninitialized variables to trap premature usage.

Semantically, QNaN's denote indeterminate operations, while SNaN's denote invalid operations.
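Python floats are IEEE 754 doubles, so these special values are easy to demonstrate:

    import math

    print(1e308 * 10)            # inf: overflow produces infinity rather than an error
    print(math.inf - math.inf)   # nan: an indeterminate operation yields NaN
    print(math.nan == math.nan)  # False: NaN never compares equal, even to itself
    print(0.0 == -0.0)           # True: +0 and -0 are distinct but compare equal
    print(2.0**-1074)            # 5e-324: the smallest positive denormalized double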

 

EXAMPLE

What IEEE 754-format floating point number is this? 1100 0000 1101 0000 0000 0000 0000 0000
First partition the digits to identify the sign, biased exponent, and mantissa:
1 1000 0001 101 0000 0000 0000 0000 0000. The first bit is the sign of the mantissa (1=negative). The next eight bits are the biased exponent. 1000 0001 is 129. Subtract the bias of 127 and the exponent is 2. The remaining 23 bits are the mantissa. Remember that the leftmost 1 is not stored, so we must insert it. The actual value of the mantissa is 1.101 0000 0000 0000 0000 0000. This must be multiplied by 2^2, giving
110.1 0000 0000 0000 0000 0000, which is 4 + 2 + .5, or 6.5. Since the sign bit was 1, the result is negative, or -6.5. Note that the base for our exponent is 2, not 10.
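We can check this result mechanically with the same struct trick used above:

    import struct

    bits = 0b11000000110100000000000000000000
    print(struct.unpack('>f', struct.pack('>I', bits))[0])   # -6.5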

 


 

EXAMPLE

Let's go the other way. Convert 17.75 to IEEE 754-format floating point. First, convert it to binary:
10001.11. In scientific notation, this is 10001.11 x 2^0. Shift the "binary point" four places to the left to normalize the number and we get 1.000111 x 2^4. The number is positive, so the sign bit is
0. The exponent is 4, so the biased exponent is 4 + 127 = 131, which in binary is 1000 0011. The mantissa would be 1.000 1110 0000 0000 0000 0000 except for our trick of not storing the leftmost digit of the mantissa. After we discard the leftmost digit, we get: 000 1110 0000 0000 0000 0000. Put the digits together and we get: 0100 0001 1000 1110 0000 0000 0000 0000. A number with this many bits is easier to represent in hex: 41 8E 00 00.
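Again, struct confirms the hand conversion:

    import struct

    bits = struct.unpack('>I', struct.pack('>f', 17.75))[0]
    print(hex(bits))   # 0x418e0000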

 

Special cases:

0 is represented like this:

0 for sign, 0 for exponent, 0 for mantissa

 

Infinity is represented like this:

255 for exponent, 0 for mantissa (either + or -)

 


 

5.8 AN ALTERNATIVE REPRESENTATION: PACKED DECIMAL

Packed decimal uses 4 bits per digit, allowing 2 decimal digits to be "packed" into a single byte. 1100 is used to indicate a positive number, and 1101 is used to indicate a negative number. If the number of digits is odd, the digits plus the sign nibble fill out whole bytes exactly, with the sign occupying the last half-byte. If the number of digits is even, half a byte is left over; a zero half-byte is appended on the left (the location of the high-order [leading] digits) to fill it out.

 

EXAMPLE: DECIMAL to PACKED DECIMAL

Convert +12345 to packed decimal. Convert each digit to a 4-bit binary number, and append the 4-bit sign on the right end:

   1    2    3    4    5    +
0001 0010 0011 0100 0101 1100

 

EXAMPLE: DECIMAL to PACKED DECIMAL

Convert -7890 to packed decimal. Convert each digit to a 4-bit binary number, and append the 4-bit sign on the right end. Append a 4-bit leading 0 to fill out the extra half-byte:

   0    7    8    9    0    -
0000 0111 1000 1001 0000 1101

 

EXAMPLE: PACKED DECIMAL TO DECIMAL

Convert the packed decimal number 0100 1001 0101 1000 1101 to base 10. To solve this problem, convert each 4-bit group to its equivalent decimal digit, and convert the rightmost 4 bits to the equivalent sign:

-4958
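Here is a small pack/unpack sketch in Python covering all three examples (the names and the space-separated nibble format are my own):

    def pack_decimal(n):
        digits = str(abs(n))
        if len(digits) % 2 == 0:     # even digit count: pad on the left to fill the byte
            digits = '0' + digits
        nibbles = [int(d) for d in digits] + [0b1101 if n < 0 else 0b1100]
        return ' '.join(format(nib, '04b') for nib in nibbles)

    def unpack_decimal(bits):
        nibbles = bits.split()
        sign = -1 if nibbles[-1] == '1101' else 1
        return sign * int(''.join(str(int(nib, 2)) for nib in nibbles[:-1]))

    print(pack_decimal(12345))    # 0001 0010 0011 0100 0101 1100
    print(pack_decimal(-7890))    # 0000 0111 1000 1001 0000 1101
    print(unpack_decimal('0100 1001 0101 1000 1101'))   # -4958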

 


 

5.9 PROGRAMMING CONSIDERATIONS

I have one comment to add: In the 21st century, you don't need to worry very much about saving a few bits or CPU cycles here and there. If you think that an integer will never be larger than 255, you can store it in a byte, but your savings in both memory and CPU cycles will be minimal. However, if the value of the integer should get larger than you were able to foresee, and it exceeds 255, your program will fail. And this is the type of error that (on many systems) will not generate an error message; it will simply "wrap around" (i.e., throw away the carry bit, so that 256 becomes 0, 257 becomes 1, etc.) and give the user incorrect output. If the output is not obviously wrong, the user will be making decisions based on incorrect calculations (just to save a few bits!). Today's computers use a word size of 32 or 64 bits, meaning that they fetch data and perform arithmetic operations in 32- or 64-bit chunks, so the time saved by using a data size smaller than the word size of the CPU would probably be zero or insignificant. And, of course, saving time on a single instruction is meaningless; it only matters if the instruction is in a loop that is executed many, many times (e.g., millions of times).
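A small illustration of the wraparound (simulating an unsigned byte with modular arithmetic):

    count = 255
    count = (count + 1) % 256   # an 8-bit counter discards the carry out of bit 7
    print(count)                # 0, not 256 -- silently wrong, with no error message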