Floating Point Arithmetic

 

We have included floating-point arithmetic for the sake of completeness. Some readers might wish to skip this section on a first reading. The numbers we have dealt with so far have been integers, or numbers divided into an integer part and a fractional part by an imaginary binary point. A computer with an n-bit wordlength can handle unsigned integers in the range 0 to 2^n - 1 in a single word. Larger integers can be created by chaining words together (i.e., regarding two words in memory as parts of the same number). By chaining words, the programmer can handle numbers of any size; for example, a 32-bit computer can work with 64-bit binary numbers by taking two 32-bit values and regarding one as the upper half of a 64-bit value and the other as the lower half.
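The chaining of two 32-bit words into one 64-bit value can be sketched as follows. This is a minimal illustration, not how any particular processor does it; the names MASK32, join_words, and add64 are invented for this example.

```python
MASK32 = 0xFFFFFFFF  # keeps only the low 32 bits of a value

def join_words(upper, lower):
    """Treat two 32-bit words as the upper and lower halves of one 64-bit value."""
    return ((upper & MASK32) << 32) | (lower & MASK32)

def add64(a_upper, a_lower, b_upper, b_lower):
    """Add two 64-bit values held as (upper, lower) pairs of 32-bit words."""
    low_sum = a_lower + b_lower
    carry = low_sum >> 32                       # 1 if the lower halves overflowed
    upper = (a_upper + b_upper + carry) & MASK32
    return upper, low_sum & MASK32

print(hex(join_words(0x00000001, 0x00000000)))  # 0x100000000, i.e. 2^32
print(add64(0, 0xFFFFFFFF, 0, 1))               # (1, 0): the carry moves into the upper word
```

The only link between the two halves is the carry out of the lower word, which is why chained arithmetic costs extra instructions.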

 

Sometimes we need to handle numbers with an immense range of values (e.g., from the mass of an electron to the mass of a star). Integer or fixed-point arithmetic is not cost-effective in such applications because it would take hundreds of bits to represent each number. A means of representing very large and very small numbers has been devised, and is called floating point arithmetic. It is so called because the binary point is not located at a fixed position in the number. Instead, a floating point value is stored as two components: a number and the location of the binary point within the number.
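To get a feel for the "hundreds of bits" claim, the short sketch below estimates how many bits an integer would need just to span the ratio between the two masses. The mass values are approximate figures assumed for illustration (roughly one solar mass for the star), and the calculation ignores any bits needed for precision.

```python
import math

ELECTRON_MASS_KG = 9.109e-31   # approximate value, assumed for illustration
STAR_MASS_KG = 1.989e30        # roughly one solar mass, assumed for illustration

ratio = STAR_MASS_KG / ELECTRON_MASS_KG
bits_needed = math.ceil(math.log2(ratio))   # bits to count from 1 up to the ratio
print(bits_needed)                          # on the order of 200 bits
```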

 

The floating point representation of numbers is also called scientific notation because it is widely used by scientists and engineers. Typical examples of decimal floating point numbers are: 1.2345 x 10^20, 0.4599 x 10^-50, and -8.5 x 10^3. In decimal arithmetic, scientific numbers are written in the form mantissa x 10^exponent, where the mantissa describes the number and the exponent scales it by the appropriate power of ten.

 

A binary floating point number is represented by mantissa x 2^exponent. For example, you could represent the binary number 101010.111110 by 0.101010111110 x 2^6. In this case, the mantissa is 0.101010111110 and the exponent is 6, because the binary point has been moved six places to the left.
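Python's standard math.frexp function performs this kind of split, returning a mantissa in the range 0.5 to 1 (i.e., of the form 0.1xxx in binary) together with a power of two. The sketch below applies it to the binary number 101010.111110; with this 0.1xxx convention the point moves six places, so the exponent comes out as 6.

```python
import math

# 101010.111110 in binary: integer part plus six fractional bits
value = int("101010", 2) + int("111110", 2) / 2**6

mantissa, exponent = math.frexp(value)   # value == mantissa * 2**exponent
print(value)                             # 42.96875
print(mantissa, exponent)                # 0.67138671875 6
```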

 

Although several different systems have been used to represent the mantissa and exponent of floating point numbers in computers, the most common representation used today is called the IEEE format for floating point because it was standardized by the Institute of Electrical and Electronics Engineers. There are three versions of this format: one for 32-bit numbers (single precision), one for 64-bit numbers (double precision), and one for 128-bit numbers (quad precision). We consider only single-precision floating point numbers here. The other formats are similar; they just use more bits to extend both the range and the precision of the number.

 

 

A 32-bit single precision IEEE floating point number is represented by the bit sequence

 

S EEEEEEEE 1.MMMMMMMMMMMMMMMMMMMMMMM

 

where S is the sign bit that tells you whether the number is positive or negative, E is the eight-bit biased exponent that tells you where to put the binary point, and M is the 23-bit fractional mantissa. If you count the bits in this number, you will find that there are 33, not 32. The leading 1 in front of the mantissa is omitted when the number is stored in memory. Only the fractional part of the mantissa, the M bits, is stored in memory (the reason for this will soon become clear).

 

The S-bit is a sign bit that determines the sign of the number. If S = 0, the number is positive, and if S = 1, it is negative. The mantissa uses the sign and magnitude representation of signed numbers.

 

The exponent is the power of two that scales the mantissa in a floating point number.

 

 

The exponent used in the IEEE format is a biased exponent, because it is the actual exponent plus 127. For example, if the floating point number is +1.110010 x 2^12, the exponent is stored not as 12, but as 12 + 127 = 139₁₀ = 10001011₂. This arrangement is also called excess 127 and raises the question: why should the exponent be too big by 127?

 

·         If the stored exponent is 127, the actual exponent is 127 - 127 = 0.

·         If the stored exponent is 126, the actual exponent is 126 - 127 = -1.

·         If the stored exponent is 0, the actual exponent is 0 - 127 = -127.

 

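Biased exponents remove the need to deal with both positive and negative exponents when processing floating point numbers, because the stored exponent is always an unsigned value: the most negative actual exponent (i.e., -127) is represented by 0. Note that the IEEE standard reserves the stored exponents 0 and 255 for special values (zero, denormalized numbers, infinities, and not-a-number codes), so normalized numbers use actual exponents from -126 to +127.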
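The excess-127 conversion is simple enough to sketch in a few lines. The function names below are invented for this illustration, and the range check only enforces that the stored value fits in eight bits.

```python
BIAS = 127  # the excess-127 offset used by single precision

def to_stored(actual_exponent):
    """Return the 8-bit stored (biased) exponent for an actual exponent."""
    stored = actual_exponent + BIAS
    if not 0 <= stored <= 255:
        raise ValueError("exponent does not fit in eight bits")
    return stored

def to_actual(stored_exponent):
    """Recover the actual exponent from the stored (biased) exponent."""
    return stored_exponent - BIAS

print(format(to_stored(12), "08b"))             # 10001011
print(to_actual(127), to_actual(126), to_actual(0))   # 0 -1 -127
```

Because stored exponents are all non-negative, a larger stored exponent always means a larger actual exponent, so exponents can be compared as plain unsigned integers.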

 

The mantissa of an IEEE floating point number is always normalized and is expressed in the form 1.xxxx; that is, the mantissa ranges between 1.0000...00 and 1.1111...11 (unless the floating point number is zero, in which case it is represented by 0.000...00). Normalizing a floating point number takes maximum advantage of the available bits of the mantissa. Consider the two mantissas 0.00011010000001010101111 and 1.10100000010101011111010. The first mantissa is not normalized, whereas the second is. Note how the normalized mantissa has gained an extra four bits of precision: these are its final four bits, 1010, which take the place of the leading zeros that the first mantissa wastes. Because the mantissa is always normalized and always begins with a leading 1, it is not necessary to include the leading 1 when the number is stored in memory.
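Normalization itself amounts to shifting the mantissa until it lies between 1 and 2 while adjusting the exponent to compensate. The sketch below is a hypothetical helper written for this illustration; it assumes a non-negative mantissa held in an ordinary Python float.

```python
def normalize(mantissa, exponent):
    """Scale mantissa into the range 1.0 <= m < 2.0 without changing the value
    mantissa * 2**exponent. Assumes a non-negative mantissa."""
    if mantissa < 0:
        raise ValueError("this sketch handles non-negative mantissas only")
    if mantissa == 0:
        return 0.0, 0                 # zero cannot be normalized
    while mantissa >= 2:              # point moves left: exponent goes up
        mantissa /= 2
        exponent += 1
    while mantissa < 1:               # point moves right: exponent goes down
        mantissa *= 2
        exponent -= 1
    return mantissa, exponent

# 0.0001101 in binary is 0.1015625; normalized it becomes 1.101 (1.625) x 2^-4
print(normalize(0.1015625, 0))        # (1.625, -4)
```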

 

We can now take a 32-bit IEEE floating point number and unpack it into a sign bit, a biased exponent, and a mantissa. Consider the 32-bit floating point value 11000001100110011000000000000000, which can be unpacked into S = 1, E = 10000011, and M = 1.00110011000000000000000 (note that we inserted a leading one in front of the stored mantissa when we unpacked it). The number is therefore equal to -1.00110011 x 2^(10000011₂ - 01111111₂) = -1.00110011 x 2^4 = -10011.0011₂ (i.e., -19.1875₁₀).
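Unpacking is a matter of shifting and masking. The sketch below pulls the three fields out of a 32-bit word; the function name is invented here, and the test word 0xC1998000 is the example bit pattern above written out to a full 32 bits with trailing zeros, which is an assumption about the intended value.

```python
def unpack_fields(word):
    """Split a 32-bit single-precision word into (sign, stored exponent, fraction)."""
    sign = (word >> 31) & 0x1              # bit 31
    stored_exponent = (word >> 23) & 0xFF  # bits 30..23
    fraction = word & 0x7FFFFF             # bits 22..0 (leading 1 not stored)
    return sign, stored_exponent, fraction

sign, stored_exponent, fraction = unpack_fields(0xC1998000)
print(sign, format(stored_exponent, "08b"), format(fraction, "023b"))
# 1 10000011 00110011000000000000000
```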

 

Suppose we wish to convert the decimal value 4100.125₁₀ into a 32-bit single precision IEEE floating point binary number. We first convert 4100.125 into binary form by converting the integer part to get 4100₁₀ = 1000000000100₂ and the fractional part to get 0.125₁₀ = 0.001₂. Therefore, 4100.125₁₀ = 1000000000100.001₂. The next step is to normalize this binary number to get 1.000000000100001 x 2^12. Normalization is carried out by moving the binary point to the left until the mantissa is in the form 1.xxx... and incrementing the exponent for each place the binary point is moved. Now we can assemble the three components of the number:

 

·         The sign bit, S, is 0 because the number is positive

·         The stored exponent is the true exponent plus 127; that is, 12 + 127 = 139₁₀ = 10001011₂

·         The stored mantissa is 00000000010000100000000 (the leading 1 is stripped and the mantissa expanded to 23 bits).

·         The final number is therefore 01000101100000000010000100000000, or in hexadecimal format, 45802100₁₆.
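The assembly steps above can be sketched in Python and checked against the standard struct module, which packs floats in IEEE single precision. The function encode_single is a hypothetical helper for this illustration; it assumes a nonzero value in the normalized range whose mantissa fits in 23 bits, and it truncates rather than rounds.

```python
import math
import struct

def encode_single(value):
    """Hand-assemble the 32-bit pattern for a nonzero, normalized value.
    Truncates extra mantissa bits instead of rounding."""
    sign = 1 if value < 0 else 0
    m, e = math.frexp(abs(value))           # abs(value) == m * 2**e, 0.5 <= m < 1
    mantissa, exponent = m * 2, e - 1       # rescale to the 1.xxx form
    stored_exponent = exponent + 127        # apply the excess-127 bias
    fraction = int((mantissa - 1) * 2**23)  # strip the leading 1, keep 23 bits
    return (sign << 31) | (stored_exponent << 23) | fraction

word = encode_single(4100.125)
print(format(word, "032b"))   # 01000101100000000010000100000000
print(format(word, "08X"))    # 45802100

# Cross-check against the library's own packing of the same value
library_word = struct.unpack(">I", struct.pack(">f", 4100.125))[0]
print(word == library_word)   # True
```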

 

Let’s carry out the reverse operation for the sake of completeness. Consider the hexadecimal value C46C000016. In binary form, this number is 11000100011011000000000000000000. The first step is to unpack the number into sign bit, biased exponent, and fractional mantissa.

 

·         S = 1

·         E = 10001000

·         M = 11011000000000000000000

 

The sign bit is 1, which means that the number is negative. The biased exponent is 10001000₂, and we have to subtract 127 to get the actual exponent; that is, 10001000₂ - 01111111₂ = 00001001₂ = 9₁₀. The fractional mantissa is .11011000000000000000000₂. To get the actual mantissa, we reinsert the leading one to give 1.11011000000000000000000₂. The number is therefore -1.11011000000000000000000₂ x 2^9, or -1110110000₂ (i.e., -944.0₁₀).
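The reverse conversion can be checked the same way. The sketch below (decode_single is a name invented here, and it handles normalized numbers only) rebuilds the value from the three fields and compares it with what Python's struct module reports for the same bit pattern. For C46C0000 the stored exponent is 136, so the actual exponent is 136 - 127 = 9.

```python
import struct

def decode_single(word):
    """Rebuild the value of a normalized single-precision word from its fields."""
    sign = (word >> 31) & 0x1
    stored_exponent = (word >> 23) & 0xFF
    fraction = word & 0x7FFFFF
    mantissa = 1 + fraction / 2**23           # reinsert the leading 1
    value = mantissa * 2.0 ** (stored_exponent - 127)
    return -value if sign else value

print(decode_single(0xC46C0000))              # -944.0

# Cross-check: let the library interpret the same 32 bits as a float
print(struct.unpack(">f", struct.pack(">I", 0xC46C0000))[0])   # -944.0
```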

 

When the computer uses an IEEE number in a calculation, it unpacks the number to separate the mantissa and exponent (the leading 1 before the stored mantissa must be reinserted before calculations can be carried out). Whole texts have been written on floating point arithmetic. All we want to say here is that floating point arithmetic is slower than fixed point arithmetic, because two floating point numbers can take part in addition or subtraction only if they have the same exponent. Whenever two floating point numbers are added (or subtracted), the smaller exponent is made equal to the larger exponent by denormalizing the corresponding mantissa. Then the mantissas are added (subtracted), and the result is re-normalized. This sequence of operations takes time.

Let’s look at an example. Suppose you wish to subtract 1.11101...0 x 2^123 from 1.01010...0 x 2^124. Since these numbers don’t have the same exponent, the number with the smaller exponent is first denormalized to get 0.111101...0 x 2^124. The subtraction can now take place to give 1.010100...0 x 2^124 - 0.111101...0 x 2^124 = 0.010111...0 x 2^124. This result is not normalized and must be re-normalized to give 1.01110...0 x 2^122.
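The align, subtract, and re-normalize sequence can be sketched with integer mantissas. This is a simplified illustration rather than a full IEEE implementation: the names are invented here, both operands are assumed positive with the first larger than the second, and bits shifted out during alignment are simply dropped rather than rounded.

```python
FRACTION_BITS = 23
ONE = 1 << FRACTION_BITS     # the integer that stands for the mantissa 1.000...0

def mantissa_from_bits(bits):
    """Convert a string such as '1.01010' into an integer scaled by 2**23."""
    digits = bits.replace(".", "")
    return int(digits, 2) << (FRACTION_BITS - (len(digits) - 1))

def subtract(m_a, e_a, m_b, e_b):
    """Return (mantissa, exponent) of a - b for positive normalized a > b."""
    # Step 1: denormalize the operand with the smaller exponent
    if e_a < e_b:
        m_a >>= (e_b - e_a)
        exponent = e_b
    else:
        m_b >>= (e_a - e_b)
        exponent = e_a
    # Step 2: subtract the aligned mantissas
    mantissa = m_a - m_b
    if mantissa == 0:
        return 0, 0
    # Step 3: re-normalize until the leading 1 is back in place
    while mantissa < ONE:
        mantissa <<= 1
        exponent -= 1
    return mantissa, exponent

m, e = subtract(mantissa_from_bits("1.01010"), 124,
                mantissa_from_bits("1.11101"), 123)
print(m / ONE, e)    # 1.4375 122, i.e. 1.0111 x 2^122
```

Each of the three steps is a loop or shift that integer subtraction does not need, which is where the extra time goes.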

 

You might be surprised to find that not all microprocessors contain floating point instructions (i.e., on some processors all operations apply only to integer values). If you wish to perform floating point operations on such a processor, you have to write programs that handle floating point numbers in terms of very primitive operations.

 

Floating point is a good example of the computer trade-off between time and money. Carrying out floating point operations rapidly requires more complex and expensive hardware. If you want to minimize cost, you can write a program to carry out floating point operations using only primitive binary operations, an approach that slows down the computer.

 

We have now provided a brief overview of computer arithmetic. Most of the time, the computer programmer doesn’t have to worry about how arithmetic is performed inside a computer. However, when you are dealing with very large or very small numbers, or when you require answers to a high degree of precision, you cannot neglect what goes on inside the computer.