Floating Point Numbers
REVIEW OF EXPONENTIAL NOTATION
Numbers with fractional parts are called rational numbers or real numbers in mathematics; in computer science, they are called floating point numbers. Floating point numbers work just like scientific notation, only in binary instead of decimal.
Represent the number -12,005 in scientific notation. There are an infinite number of ways to represent this number. Some possibilities are:

-1.2005 x 10^4
-120050 x 10^-1
-12.005 x 10^3
-0.012005 x 10^6
Note that the second and fourth examples have zeros on the end and beginning, respectively. These zeros contribute nothing to the value of the number and are considered insignificant zeros. The zeros between the 2 and the 5, however, are important, as are the three non-zero digits. Everything between the leftmost non-zero digit and the rightmost non-zero digit is considered a significant digit (if you left out one of them, it would affect your calculations). Because there are so many different ways to represent any number in scientific notation, we usually decide on a standard (normalized) way to represent the number. A common normalized format is to place the decimal point immediately to the left of the leftmost significant digit. There will only be one normalized way to represent any given number in scientific notation. Our example, when normalized, would be -.12005 x 10^5.
Scientific notation requires the specification of four separate components to define the number:

the sign of the mantissa
the mantissa (the significant digits)
the sign of the exponent
the exponent
FLOATING POINT FORMAT
Like integers, floating point numbers could be represented in many different formats. However, it makes life easier for everybody if a standard format is used. Obviously the sign of the mantissa and the sign of the exponent only require one bit each. The remaining bits will be divided between the mantissa and the exponent. The more bits we use for the exponent, the greater will be the range of values that can be represented; but this means fewer bits for the mantissa and fewer bits of precision (number of significant digits that can be represented). If we increase the precision by devoting more bits to the mantissa, that leaves fewer bits for the exponent and a smaller range of values that can be represented.
There are many possible ways of dividing up the bits among exponent and mantissa. The important thing, however, is to agree on a standard representation. Such a standard is the IEEE Standard 754.
FLOATING POINT IN THE COMPUTER & CONVERSION BETWEEN BASE 10 AND BASE 2
Floating point numbers in a computer are stored and manipulated in exactly the same way as we just discussed, except that the numbers are represented in binary instead of decimal. Again, the representation of floating point numbers is completely arbitrary, and the only reason for standards is to make it easy to move floating point data from one system to another.
The 754 standard defines two formats: single precision (32 bits) and double precision (64 bits).

Single precision (32 bits):
Bit 31 (1 bit) – sign bit of the mantissa
Bits 30-23 (8 bits) – biased exponent, 127 more than the actual exponent
Bits 22-0 (23 bits) – mantissa, stored in normalized form, with an implied digit of 1 to the left of the binary point (the 1 is not stored)
Double precision (64 bits):

Bit 63 (1 bit) – sign bit of the mantissa
Bits 62-52 (11 bits) – biased exponent, 1023 more than the actual exponent
Bits 51-0 (52 bits) – mantissa, stored in normalized form, with an implied digit of 1 to the left of the binary point (the 1 is not stored)
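As a quick sanity check of the single-precision layout, the sketch below uses Python's standard struct module to pull the three fields out of a number's 32-bit representation (the helper name float32_fields is my own):

```python
import struct

def float32_fields(x):
    """Split a number's 32-bit IEEE 754 representation into its three fields."""
    bits = struct.unpack('>I', struct.pack('>f', x))[0]  # raw 32-bit pattern
    sign     = bits >> 31            # bit 31
    exponent = (bits >> 23) & 0xFF   # bits 30-23 (biased exponent)
    mantissa = bits & 0x7FFFFF       # bits 22-0 (stored fraction, implied 1 dropped)
    return sign, exponent, mantissa

print(float32_fields(17.75))   # (0, 131, 917504)
```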
There are several reserved bit patterns with special meanings:
Exponent is 0 and mantissa is 0 (plus or minus) equals 0.
Exponent is 255 and mantissa is 0 (plus or minus) equals infinity
Exponent is 255 and mantissa is not zero represents NaN (Not a Number)
IEEE reserves exponent field values of all 0s and all 1s to denote special values in the floating-point scheme.
As mentioned above, zero is not directly representable in the straight format, due to the assumption of a leading 1 (we'd need to specify a true zero mantissa to yield a value of zero). Zero is a special value denoted with an exponent field of zero and a fraction field of zero. Note that -0 and +0 are distinct values, though they both compare as equal.
If the exponent is all 0s, but the fraction is non-zero (else it would be interpreted as zero), then the value is a denormalized number, which does not have an assumed leading 1 before the binary point. Thus, this represents a number (-1)^s x 0.f x 2^-126, where s is the sign bit and f is the fraction.
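A quick check of the denormalized formula, again using Python's struct module: the smallest positive denormal has an exponent field of all 0s and a fraction field of 1 (only the lowest bit set), which should equal 2^-23 x 2^-126 = 2^-149:

```python
import struct

# exponent field all 0s, fraction field = 1 (the lowest bit set)
smallest = struct.unpack('>f', bytes.fromhex('00000001'))[0]
print(smallest == 2 ** -149)   # True: 0.f = 2^-23, times 2^-126
```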
The values +infinity and -infinity are denoted with an exponent of all 1s and a fraction of all 0s. The sign bit distinguishes between negative infinity and positive infinity. Being able to denote infinity as a specific value is useful because it allows operations to continue past overflow situations. Operations with infinite values are well defined in IEEE floating point.
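These rules are easy to observe from Python (math and struct are standard modules):

```python
import math
import struct

# +infinity: sign 0, exponent all 1s, fraction all 0s
print(struct.pack('>f', math.inf).hex())   # '7f800000'

# Operations with infinite values are well defined
print(math.inf + 1.0)    # inf
print(-math.inf < 0.0)   # True
print(1.0 / math.inf)    # 0.0
```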
A QNaN is a quiet NaN, a NaN whose most significant fraction bit is set. QNaNs propagate through most arithmetic operations without raising an exception.
An SNaN is a signaling NaN, a NaN whose most significant fraction bit is clear. Using an SNaN in an arithmetic operation signals an invalid-operation exception.
Semantically, QNaNs denote indeterminate operations, while SNaNs denote invalid operations.
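A NaN behaves unlike any ordinary value; for instance, it is the only value that compares unequal to itself. A small sketch in Python (0x7FC00000 is a quiet NaN bit pattern, since its top fraction bit is set):

```python
import math
import struct

# exponent all 1s, fraction non-zero -> NaN
qnan = struct.unpack('>f', bytes.fromhex('7fc00000'))[0]
print(math.isnan(qnan))   # True
print(qnan == qnan)       # False: NaN is unequal even to itself
```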
What IEEE 754-format floating point number is this? 1100 0000 1101 0000 0000 0000 0000 0000
First partition the digits to identify the sign, biased exponent, and mantissa:
1 1000 0001 101 0000 0000 0000 0000 0000. The first digit is the sign of the mantissa (1 = negative). The next eight digits are the biased exponent: 1000 0001 is 129. Subtract the bias of 127 and the exponent is 2. The remaining 23 digits are the mantissa. Remember that the leftmost 1 is not stored, so we must insert it. The actual value of the mantissa is 1.101 0000 0000 0000 0000 0000. This must be multiplied by 2^2, giving
110.1 0000 0000 0000 0000 0000, which is 4 + 2 + ½ , or 6.5. Since the sign bit was 1, the result is negative, or -6.5. Note that the base for our exponent is 2, not 10.
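We can confirm this decoding with Python's struct module; the bit pattern above, written in hex, is C0D00000:

```python
import struct

# unpack the raw bytes C0 D0 00 00 as a big-endian 32-bit float
value = struct.unpack('>f', bytes.fromhex('c0d00000'))[0]
print(value)   # -6.5
```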
Let's go the other way. Convert 17.75 to IEEE 754-format floating
point. First, convert it to binary:
10001.11. In scientific notation, this is 10001.11 x 2^0. Shift the "binary point" four places to the left to normalize the number and we get 1.000111 x 2^4. The number is positive, so the sign bit is 0. The exponent is 4, so the biased exponent is 4 + 127 = 131, which in binary is 1000 0011. The mantissa would be 1.000 1110 0000 0000 0000 0000 except for our trick of not storing the leftmost digit of the mantissa. After we discard the leftmost digit, we get: 000 1110 0000 0000 0000 0000. Put the digits together and we get: 0100 0001 1000 1110 0000 0000 0000 0000. A number with this many bits is easier to represent in hex: 41 8E 00 00 .
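Again, Python's struct module confirms the hand conversion:

```python
import struct

# pack 17.75 as a big-endian 32-bit float and show its bytes in hex
print(struct.pack('>f', 17.75).hex())   # '418e0000'
```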
0 is represented like this:
0 for sign, 0 for exponent, 0 for mantissa
Infinity is represented like this:
255 for exponent, 0 for mantissa (either + or -)
AN ALTERNATIVE REPRESENTATION: PACKED DECIMAL
Packed decimal uses 4 bits per digit, allowing 2 decimal digits to be "packed" into a single byte. 1100 is used to indicate a positive number, and 1101 is used to indicate a negative number. The 4-bit sign is appended to the right of the digits. If the number of digits is odd, the digits plus the sign fill a whole number of bytes exactly. If the number of digits is even, half a byte is left over; a 4-bit pad of zeros is appended on the left (the location of the high-order [leading] digits) to fill out the extra byte.
EXAMPLE: DECIMAL to PACKED DECIMAL
Convert +12345 to packed decimal. Convert each digit to a 4-bit binary number, and append the 4-bit sign on the right end:
1 2 3 4 5 +
0001 0010 0011 0100 0101 1100
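The packing procedure can be sketched in Python (the function name to_packed_decimal is my own invention):

```python
def to_packed_decimal(n):
    """Pack an integer: one nibble per digit, sign nibble (0xC = +, 0xD = -)
    on the right, and a zero pad nibble on the left if needed to fill
    a whole number of bytes."""
    sign = 0xC if n >= 0 else 0xD
    nibbles = [int(d) for d in str(abs(n))] + [sign]
    if len(nibbles) % 2:                 # odd nibble count -> pad on the left
        nibbles.insert(0, 0)
    return bytes((nibbles[i] << 4) | nibbles[i + 1]
                 for i in range(0, len(nibbles), 2))

print(to_packed_decimal(12345).hex())   # '12345c'
print(to_packed_decimal(-7890).hex())   # '07890d'
```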
EXAMPLE: DECIMAL to PACKED DECIMAL
Convert -7890 to packed decimal. Convert each digit to a 4-bit binary number, and append the 4-bit sign on the right end. Append a 4-bit leading 0 to fill out the extra half-byte:
0 7 8 9 0 -
0000 0111 1000 1001 0000 1101
EXAMPLE: PACKED DECIMAL TO DECIMAL
Convert the packed decimal number 0100 1001 0101 1000 1101 to base 10. To solve this problem, convert each 4-bit number to its equivalent decimal digit, and convert the rightmost 4 bits to its equivalent sign: 0100 is 4, 1001 is 9, 0101 is 5, 1000 is 8, and the sign nibble 1101 means negative, so the answer is -4958.
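The reverse procedure can be sketched the same way (the name from_packed_decimal is mine; a leading pad nibble of 0 is assumed so the input fills whole bytes):

```python
def from_packed_decimal(data):
    """Decode packed-decimal bytes back to a signed integer."""
    nibbles = []
    for b in data:
        nibbles.append(b >> 4)    # high-order digit of the byte
        nibbles.append(b & 0xF)   # low-order digit of the byte
    sign = nibbles.pop()          # rightmost nibble is the sign
    value = int(''.join(str(d) for d in nibbles))
    return -value if sign == 0xD else value

# 0000 0100 1001 0101 1000 1101 -> hex 04958d
print(from_packed_decimal(bytes.fromhex('04958d')))   # -4958
```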
I have one comment to add: In the 21st century, you don't need to worry very much about saving a few bits or CPU cycles here and there. If you think that an integer will never be larger than 255, you can store it in a byte, but your savings in both memory and CPU cycles will be minimal. However, if the value of the integer should get larger than you were able to foresee, and it exceeds 255, your program will fail.

And this is the type of error that (on many systems) will not generate an error message; it will simply "wrap around" (i.e., throw away the carry bit, so that 256 becomes 0, 257 becomes 1, etc.) and give the user incorrect output. If the output is not obviously wrong, the user will be making decisions based on incorrect calculations (just to save a few bits!).

Today's computers use a word size of 32 or 64 bits, meaning that they both fetch data in 32- or 64-bit chunks and perform arithmetic operations on 32- or 64-bit chunks, so the time saved by using a data size smaller than the word size of the CPU you are using would probably be either 0 or insignificant. And, of course, saving time on a single instruction is meaningless; the only time that it matters is if the instruction is in a loop that is executed many, many times (e.g., millions of times).
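The silent wrap-around is easy to demonstrate. In Python, the standard ctypes module provides fixed-width integer types that behave like a single byte:

```python
import ctypes

counter = ctypes.c_uint8(255)   # one-byte unsigned integer, at its maximum
counter.value += 1              # 256 does not fit; the carry bit is discarded
print(counter.value)            # 0 -- no error message, just a wrong answer
```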