Back Home Up Next

Numbers Fractions Negative numbers Floating point Binary arithmetic Codes

Binary Codes and Information Representation

 

Because a computer uses only patterns of 1’s and 0’s to represent information, we have to encode all the information we want a computer to process in the form of binary strings. How do we do this? The short answer is “However we want to do it”. There’s no fundamental rule of computing that tells us how to encode information. We are guided only by common sense. Consider the representation of alphanumeric characters (i.e., the symbols found on a keyboard — A to Z, a to z, 0 to 9, and the special symbols !@#$%^&*() etc.) Computer scientists and engineers have designed a standard method of encoding these symbols—the so called ASCII code (American Standard Code for Information Interchange). This code is widely used and has been recognized by the various national and international bodies concerned with standardization. The Java language uses unicode that represents characters by 16-bit values, allowing a much richer character set than ASCII.

 

 

 

0

000

1

001

2

010

3

011

4

100

5

101

6

110

7

111

 

 

 

 

 

 

 

 

 

0  0000

NULL

DCL

SP

0

@

P

p

1  0001

SOH

DC1

!

1

A

Q

a

q

2  0010

STX

DC2

2

B

R

b

r

3  0011

ETX

DC3

#

3

C

S

c

s

4  0100

EOT

DC4

$

4

D

T

d

t

5  0101

ENQ

NAK

%

5

E

U

e

u

6  0110

ACK

SYN

&

6

F

V

f

v

7  0111

BEL

ETB

7

G

W

g

w

8  1000

BS

CAN

(

8

H

X

h

x

9  1001

HT

EM

)

9

I

Y

i

y

A  1010

LF

SUB

*

:

J

Z

j

z

B  1011

VT

ESC

+

;

K

[

k

}

C  1100

FF

FS

,

L

\

l

|

D  1101

CR

GS

-

=

M

]

m

}

E  1110

SO

RS

.

N

^

n

~

F  1111

SI

US

/

?

O

_

o

DEL

 

 

 

 

 

 

 

 

 

The ASCII code

To convert an ASCII character into its corresponding 7-bit binary code, you read the upper-order three bits of the code from the column in which the character falls, and the lower-order bits of code from the row. This table indicates the row and column numbering in both binary and hexadecimal forms; for example, the ASCII representation of the letter “Z” is given by 5A16 or 10110102. Since most computers use 8-bit bytes, the ASCII code for “Z” would be 01011010 (some systems might not set the most-significant bit to zero as we’ve done — but that doesn’t matter here). If you wish to print the letter “Z” on a printer, you transmit the ASCII code for Z, 01011010, to the printer.

 

 

The ASCII codes for the decimal digits 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9, are 3016, 3116, 3216, 3316, 3416, 3516, 3616, 3716, 3816, and 3916, respectively. For example, the symbol for the number 4 is represented by the ASCII code 001101002, whereas the binary value for 4 is represented by 000001002. When you hit the key “4” on a keyboard, the computer receives the input 00110100 and not 00000100. Whenever you read input from a keyboard or send output to a display, you have to convert between the codes for the numbers and the values of the numbers. In high-level language this translation takes place automatically.

 

The two left-hand columns of table 1.4, representing ASCII codes 0000000 to 0011111, do not contain letters, numbers, or symbols. These columns contain special codes that are used either to control printers and display devices or to control data transmission links. Typical printer control codes are CR (carriage return) and BS (back space). Data link control characters such as ACK (acknowledge) and SYN (synchronous idle) are associated with communications systems that mix the text being transmitted with the special codes required to regulate the flow of the information.

 

Binary Codes

 

We can design endless binary codes, each with its own particular characteristics.

 

Binary pattern

Unsigned number

Binary coded decimal

Gray code

2-out-of-5 code

 

 

 

 

 

0000

0

0000 0000

0000

00011

0001

1

0000 0001

0001

00101

0010

2

0000 0010

0011

00110

0011

3

0000 0011

0010

01001

0100

4

0000 0100

0110

01010

0101

5

0000 0101

0111

01100

0110

6

0000 0110

0101

10001

0111

7

0000 0111

0100

10010

1000

8

0000 1000

1100

10100

1001

9

0000 1001

1101

11000

1010

10

0001 0000

1111

 

1011

11

0001 0001

1110

 

1100

12

0001 0010

1010

 

1101

13

0001 0011

1011

 

1110

14

0001 0100

1001

 

1111

15

0001 0101

1000

 

Examples of binary codes

 

This table describes four typical binary codes. Binary coded decimal, or BCD, is used to represent decimal numbers in applications such as pocket calculators and digital watches. BCD allows you to operate on decimal numbers using decimal arithmetic, but with digits that are represented by 4-bit binary values. Each digit of a decimal number is replaced by its BCD value; for example 1234110= 0001 0010 0011 0100. BCD values in the range 1010 to 1111 are illegal and do not exist. Consider the operation 6 x 7 = 42. In 8-bit binary arithmetic, this operation is represented by 00000110 x 00000111 = 00101010. In BCD arithmetic, the same operation is represented by 0000 0110 x 0000 0111 = 0100 0010.

 

The Gray code is called a nonweighted code because the value of a bit is not determined by the column in which it falls. The special characteristic of a Gray code is that two consecutive values in the code differ by exactly one bit—as you step through the Gray code, only one bit at a time changes. For example, the Gray codes for 7 and 8 are 0100 and 1100, respectively, and differ in only one bit position. Contrast the Gray code with the natural 8421 weighted binary codes for 7 and 8—these are 0111 and 1000, respectively, and differ in all four positions.

 

 

Why would anyone want to use the Gray code? Suppose a machine has a rotating shaft and you want to transmit the shaft’s current angle to a computer. Mechanical devices used to measure angles are not able to change two or more bits in a word at exactly the same time. If the shaft rotates from 5° to 6°, a simple 8421 binary code might go from 0101 to 0111 to 0110,  because the least-significant bit was slow to change. Even though the intermediate code, 0111, is spurious, the computer may act on it—such a short-term spurious error is called a glitch. Because successive values in the Gray code change a bit at a time, glitches are eliminated. The Gray code is also call a unit distance code (we will have more to say about the distance between codes later).

 

The 2-out-of-5 code in table 1.5 uses five bits to represent the ten states 0 to 9. This is another nonweighted code, whose characteristic is that each value has exactly three zeros and two ones.

 

Clever Things you can do with Binary Numbers

 

There’s more to binary codes than the representation of numbers as strings of 1’s and 0’s. Some binary codes have special properties of great interest to both the computer scientist and engineer. In this section were are going to look at two aspects of binary codes: their ability to detect and correct errors, and their ability to compress binary strings into smaller binary strings.

Error Detecting Codes

 

In an ideal world, errors don’t occur. In reality a bit that should be a 1 sometimes gets changed into a 0, and a bit that should be a 0 sometimes gets changed into a 1. Electrical interference can corrupt data being transmitted over long paths; a particle of radiation passing through a memory chip can change the contents of a memory cell; the magnetic surface of a disk or tape may contain a flaw. Whatever the cause, random errors do occur—albeit very infrequently. An error may lead to a program crashing (i.e., the computer stopping), or it may lead to the use of incorrect data. At best this is inconvenient; at worst it can be catastrophic (e.g., in a flight control system).

 

We can use some of the properties of binary numbers to detect errors, or even to correct errors. Suppose we take the binary pattern 01101011 and ask whether or not it is in error (has one of the bits been changed?). We can’t answer this question, because, as far as we’re concerned, one binary pattern is just as good as another. Now consider the word Jamuary. You will immediately realize that it should be spelt January, because there is no word “Jamuary” in the English language. You can correct this spelling error because the closest valid word to “Jamuary” is January. It’s exactly the same with binary codes.

 

 

Encoding a sourceword

In order to create an error detecting code, we have to construct a code in such a way that a single error always leaves a noticeable trace or marker. An error detecting code increases the length of the source code by adding one (or more) redundant bits, so called because they carry no new information. This figure  demonstrates how r redundant bits are added to an m-bit sourceword to create an (m + r)-bit codeword. The redundant bits are also called check bits because they are used to check whether the codeword is valid or not.

 

 

The simplest error detecting code is called a parity check code. We take an m-bit source word and append a parity bit to the front of the source word to produce an (m+1)-bit codeword. The parity bit is chosen to make the total number of 1’s in the code word even (an even parity bit), or odd (an odd parity bit). We will assume an even parity bit here.

 

 

Creating a codeword with even parity

This shows an 8-bit source word, 01101001 that is convert into a 9-bit code with even parity. This binary string has a total of four 1’s, so the parity bit must be selected as 0 to keep the total number of 1s even. Assuming that the parity bit is appended to the left-hand end of the string, the 9-bit codeword is 001101001. If, when this value is stored in memory or transmitted over a data link, any one of the bits is changed, the resulting parity will no longer be even. Imagine that bit 2 is changed from 0 to 1, and the codeword becomes 001101101. This word now contains five 1’s which indicates an error, because the parity is odd.

 

 

 

Before continuing, we need to introduce a new concept—the Hamming distance, which is a measure of the difference between two binary strings. The Hamming distance between two codewords is the number of locations (columns) in which the two binary strings differ. For an m-bit codeword, the smallest Hamming distance is 0 when the two numbers are identical, and the largest is m when the two numbers are complements (i.e., logical opposites). Table 1.6 provides five examples of the calculation of Hamming distances. The Gray code we introduced in the previous section is a unit distance code so that the Hamming distance between consecutive codes is 1.

 

 

 

Case 1

Case 2

Case 3

Case 4

Case 5

Codeword 1

00110111

11110011

11110011

11110011

11110011

Codeword 2

00110011

01010011

11110011

10000010

00001100

Places different

     x

x x

 

 xxx   x

xxxxxxxx

Hamming distance

1

2

0

4

8

Calculating the Hamming distance

 

 

 

 A 3-bit error detecting code

 

This provides a graphical representation of a 2-bit source word with an even parity bit. Although three bits provide eight possible binary values, only four of them are valid codewords with an even parity  (i.e., the black circles representing codes 000, 101, 110, 011). As you can see, a valid codeword is separated from another nearest valid codeword by two unit lengths. In figure 1.7 a unit length corresponds to an edge of the cube. If one of the valid codewords suffers a single error (i.e., only one bit changes), the codeword changes from one of the black circles to one of the white circles one unit length away. Since these codewords all have an odd parity, you can always detect a single error. Two errors (or any even number of errors) cannot be detected because you move from one valid codeword to another valid codeword. Fortunately, if one error is a rare event, two errors are correspondingly more rare (unless the nature of the error inducing mechanism affects more than one bit at a time).

 

 

 

If you detect an error, you must ask for the correct data to be retransmitted (if the corruption occurred over a data link). If the data was stored in memory, there’s little you can do other than to tell the operating system that something has gone wrong.

 

 

Error Correcting Codes

 

We can design an error detecting and correcting code to both locate an error and to fix it.

 

 

3-bit error correcting code

This illustrates the simplest possible 3-bit error detecting and correcting code. Only the two codewords 000 and 111 are valid. Note that the Hamming distance between these two valid codewords is three. Suppose that the valid codeword 111 is stored in memory and later read back as 110. This codeword is clearly invalid, because it is neither 000 nor 111. However, you can see that the invalid codeword 110 has a Hamming distance of 1 from the valid codeword 111, and a Hamming distance 2 from the valid codeword 000. An error correcting code selects the correct code as the nearest valid codeword to the invalid codeword (ieach invalid codeword is connected to its nearest valid codeword by a heavy line). In this case, we assume that the valid codeword was 111—we have now both detected an error and corrected it. We are now going to look at a rather more sophisticated error correcting code, the Hadamard code. As this requires a certain amount of math, you may skip this section initially.

 

 

Adjacent codewords in a 4-unit code

 

 

 

 

 

Adjacent codewords in a 8-unit code

In this case codewords differ from all other valid codewords in 8 bit positions.  Up to three in a codeword errors can still be corrected because a codeword with three errors is three Hamming units away from the correct codeword and five Hamming units away from the next nearest valid codeword.

 

Data Compressing Codes

 

 

Most computer users will have heard of data compression because it is now either part of their operating system or can be obtained as an option. Data compression operates by encoding data in such a way that the encoded version occupies less bits than the original version. Although you can’t compress random numbers, you can compress data that contains redundancy. Consider the sentence “I am going to Washington in January 1996”. If you were taking this down in note form, you might write “I’m gng to W’shtn Jan 96”. You have now compressed 40 ASCII characters into 24 ASCII characters. Digital data can also be compressed if it contains redundant information that can later be restored when the data is decompressed—e.g., English text, diagrams, and pictures.

 

Let’s look at how you might go about compressing data. The figure below shows a radar screen with three targets. Suppose that the radar image is converted into a block of space 16 units by 16 units, and a 1 used to represent a target and a 0 no target. The three targets are represented by 1’s in this block of 16 x 16 elements.

 

A radar image (0 = no target, 1 = target)
 

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

 

 

The information can be stored as a string of 16 x 16 = 256 bits. Or, we could just store the locations of the three targets as conventional x,y coordinates: 4,6 10,10 13,3 (the data block begins at location 0,0 at the top left hand, and the location is given as row,column). These three coordinates can be stored as the string of six values 4,6,10,10,13,3 or 010001101010101011010011 in binary form (each coordinate is a 4-bit value). We have just compressed a 256-bit table into 24 bits. As you can imagine, this type of data compression is effective only if the table is sparse and contains relatively few targets.

 

We can approach the coding in another way. If you look at the data, you can regard it as strings of 0’s and 1’s (even though there are no multiple 1’s in this example). If we regard the start of the string as the top left-hand bit, it can be expressed as: 70 zeros, 1 one, 99 zeros, 1 one, 40 zeros, 1 one, 44 zeros. The information needed to store this block of data is 70,1,99,1,40,1,44; i.e., seven bytes (i.e., 56 bits).

 

These two examples demonstrate that some data can be greatly compressed, although the type of data found in real computers can’t be compressed to such an extent. The figure of merit of a data compression system is its compression ratio that is defined as uncompressed size:compressed size. The compression ratio for a typical mixture of code, text and images found in a PC is approximately 2:1.

 

 

Huffman Encoding

 

One of the earliest digital codes was the Morse code that has the four states: dot, dash, the space between dots and dashes, and the space between letters. If Samuel Morse had had a basic knowledge of binary arithmetic, he might have used a binary code composed of exactly five dots and dashes to represent up to 32 letters and numbers. That is, all Morse characters would be five elements long, and range from dot dot dot dot dot to dash dash dash dash dash. Fortunately, Morse devised a much better scheme that took into account the characteristics of English text.

 

Morse argued that some letters in the English language appear very frequently, and some letters are seldom used. He sent his assistant to a printer who counted the relative frequencies of letters used by the printer.

 

Morse gave frequently used letters like E very short codes (i.e., “E” = dot). Seldom used letters like Q were given longer Morse codes (i.e., “Q” = dash,dash,dot,dash). The resulting code was a variable length code because not all letters were encoded by the same number of dots and dashes. The Morse code compresses plain text, because the most commonly used letters have short codes. Interestingly enough, Morse’s code comes very close to the theoretical optimum code for the English language.

 

The type of encoding performed by Samuel Morse is now known as Huffman encoding, and employs a probabilistic model of the sourcewords to generate codewords. Frequently occurring sourcewords are assigned short codewords, and infrequently used sourcewords are assigned long codewords. We can easily calculate the average length of a message that has been Huffman encoded. Suppose that a sourceword si has a probability pi of appearing as the next element in the information to be encoded. The average length of a Huffman-encoded message is therefore given by the sum of the probability of each sourceword in the message multiplied by the length of its codeword that is:

 

nSpisi  for i = 1 to n symbols.

 

If a system employs four codewords with the lengths 1, 2, 4, 5, and the probability of each codeword is 0.4, 0.3, 0.2, and 0.1, respectively, the average length of a message with n codewords is: 1 x 0.4 + 2 x 0.3 + 4 x 0.2 + 5 x 0.1 = 2.3n.

 

Let’s look at a simple example of Huffman encoding. Consider an alphabet of five characters A, B, C, D, and E. Table 1.10 provides the relative frequency of the occurrence of these letters in a long message. Such a table can be derived by obtaining the statistics from many messages. The values in this table are hypothetical and have been chosen to provide a simple example.

 

Symbol

A

B

C

D

E

Relative frequency

8

4

2

1

1

Relative probability

0.5

0.25

0.125

0.0625

0.0625

The relative frequency of symbols in an alphabet

Symbol A is the most common symbol (it occurs eight times more frequently than symbol D) and we will give it the shortest possible code—a single bit. It doesn’t matter whether we choose a 1 or a 0. We’ll represent A by 0. If symbol A is represented by a single-bit code 0, what is represented by a 1? The answer is, all the remaining symbols. We therefore have to qualify the code 1 by other bits in order to distinguish between the remaining four symbols.

 

 

 

 

A Huffman encoding tree

We will represent the next most common symbol B by the code 10, leaving the code 11 to be shared among symbols C, D, and E. Continuing in this manner, the code for symbol C is 110, for symbol D is 1110, and for symbol E is 1111. Figure 1.12 provides a coding tree or trellis to illustrate how the symbols are encoded. As you can see, we have now constructed a code in which symbol A is represented by a single bit, whereas symbol E is represented by four bits.

Consider the encoding of the string BAEAABDA. We begin at the point labeled start and follow the tree until we get to the terminal symbol we wish to encode (i.e., A, B, C, D, or E); for example, the bit 0 takes us immediately to the terminal symbol A, whereas you have to take the path 1,1,1,0 to reach the terminal symbol D. The encoding of the sequence BAEAABDA is therefore B = 10, A = 0, E = 1111, A = 0, A = 0, B = 10, D = 1110, A = 0, to give the string 1001111001011100.

 

 

The average number of bits/symbol required to encode a long message is given by calculating the sum of the length of each symbol multiplied by the symbol’s probability. In this example there are five symbols and the average length of a message is: 1 x .5 + 2 x .25 + 3 x .125 + 4 x .0625 + 4 x .0625 = 1.875. If the same five symbols had been coded conventionally, three bits would have been required to represent 000 = A to E = 100. Huffman encoding has reduced the average storage requirement from three to less than two bits per symbol.