Back Home Up

Fundamentals Architecture Operating systems Networks Performance

 

COMPUTER PERFORMANCE

This section briefly looks at how we measure computer performance hand hhow we can compare the relative performance of different computers.

 

What Determines Performance?

 This shows the structure of the computer from the point of view of the systems designer[1]. The performance of a computer is dependent on factors other than the processor (CPU) itself. These factors include the cache memory (fast memory that holds frequently used data), main memory that holds programs and data, buses that allow sub-modules to communicate with each other, and secondary storage such as hard disks and CD ROM drives. You cannot maximize the performance of a computer by improving only one of its components.

 

Performance, the computer, and the designer 

Improved computer performance can come from many sources. Indeed, this figure only hints at some of these sources. Below the level of the CPU, the device physicist and semiconductor engineer create intrinsically faster devices; for example, the maximum clock rate can be increased by reducing the propagation delay of signals through the gates in a chip. Relatively few probably appreciate that head dissipation is one of the principle limiting factors on computer speed today – the heat density (watts per cubic centimeter) in a modern high-speed processor chip now approaches the same level as the heat density of the core of a nuclear reactor! Above the level of the CPU, the software engineer develops compiler technology to better exploit the underlying architecture of the processor. Indeed, in today’s world the computer designer and the compiler writer have to work hand-in-hand.

Relative Performance 

We are often more interested in how one computer performs with respect to another. The relative performances of computers A and B is the inverse of their execution times; that is 

 

If system A executes a program in 105 seconds and system B executes the same program in 125 seconds we can calculate the value of n as 125/105 = 1.190. 

You can therefore say that machine A is 19% faster than machine B. 

The objective of the computer designer is to create a system with the greatest possible throughput (no it isn’t – it’s the desire to maximize corporate profit which may not necessarily lead to the computer with the highest performance). When you are trying to improve a system, you are often most interested in how much better the new system is in comparison with the old system. The old system may be a previous machine, the same machine without the improvement, even a competitor’s machine and is called the reference machine or baseline machine. The speedup ratio is a form of “relative performance” and is defined as 

If a reference machine takes 100 seconds to run a program and the test machine takes 50 seconds, the speedup ratio is 100/50 = 2.

Clock Rate 

The most obvious indicator of a computer’s performance is its clock rate, the speed at which fundamental operations are carried out within the computer. For example, you may buy a 3.0 GHz Pentium 4 processor in the hope that the 3.00 GHz processor is faster than the 2.8 GHz version. Using the clock rate as a metric to compare processors is better than nothing – but only just. 

This illustrates the CPU clock. At each clock cycle, the processor carries out an internal operation. At first sight it's tempting to think that the processor’s performance is proportional to its clock rate and therefore clock rate is a precise metric.

There are so many flaws in this argument that it's difficult to know where to begin. We can start by pointing out that there is no single clock in most computers. There are several clocks in nearly every computer system. Some systems may have entirely independent clocks (i.e., there’s a separate clock generator for each functional part such as the CPU, the bus, and the memory). Moreover, the CPU itself may have several functional units each driven by its own clock. Some systems have a single master clock that generates pulses at the highest rate required by any circuit and all other clocks run at a sub-multiple of this frequency.

Some systems require more clock cycles per operation than others. For example, an early CISC based processor required about 32 clock cycles to execute a single instruction. A RISC processor might require about two clock cycles per instruction. A superscalar processor might require perform two instructions per clock cycle. If all three systems have the same clock rate, the CISC system is 16 times slower than the RISC system and up to 32 times slower than the superscalar processor. Assuming, of course, that the instructions running on the three processors are approximately the same in terms of the work they do. 

An interesting comment on clock rates is provided by the behavior of Intel and AMD. Both companies market processors that are compatible with the IA32 instruction set. AMD’s processors are not IA32 clones; they interpret IA32 instructions using a RISC-like core architecture. Consequently, it is unfair to compare directly the clock rate of, for example, an Intel Pentium 4 and an AMD Athlon. However, competitive pressures have locked Intel and AMD into a clock-rate battle. AMD was first to break the 1 GHz barrier and Intel was first to reach 2 GHz. In the Fall of 2001 AMD introduced its Athlon XP 1800+ processor that was specified in terms of an effective clock-speed rating. AMD quoted the equivalent clock speed of their XP 1800+ as 1.8 GHz even though this device had an actual clock speed of 1.533 GHz. AMD's argument is that 1.8 GHz is a better representation of the processor's capability than 1.533 GHz. 

MIPS 

A slightly better metric than clock rate is MIPS, or millions of instruction per second. This metric removes the discrepancy between different systems in terms of the number of clocks per operation because it measures instructions per second rather than clocks per second. For a given computer 

 

where n is the number of instructions executed and texecute is the time taken to execute them. 

The MIPS rating is a poor metric that fails for the same reason as the clock rate. A computer’s MIPS rating tells you only how fast it executes instructions, it doesn’t tell you what is actually achieved by the instructions being executed. Consider the following hypothetical example of computation on two computers A and B, where computer A has a load/store architecture and computer B has a memory-to-register architecture. Computer A is rather more verbose than B . Both the computers evaluate the expression z = 4(x + y).  

Computer A (LOAD/STORE)

Computer B (memory-register)

 

LDR   r1,(r0)   ;load x
LDR   r2,(4,r0) ;load y
ADD   r2,r1,r2  ;x+y
ADD   r2,r1,r2  ;2(x+y)
ADD   r2,r2,r1  ;4(x+y)
STR   r2,(8,r0) ;store z

 

LDR   r1,(r0)    ;load x
ADD   r1,(4,r0)  ;x+y
MUL   r1,#4      ;4(x+y)
STR   r1,(8,r0)  ;store z

Suppose both computers A and B have the same MIPS; that is, they execute code equally rapidly. If you relied solely on MIPS as a metric of performance, you'd have to conclude that the computers offer the same level of performance. As you can see, computer B is faster than computer A because only four instructions are required to get the work done (this argument is based on the assumption that all instructions take the same time).

In practice, computer A might be faster than computer B because memory-to-register architectures are slower than register-to-register architectures. Note that we perform one multiplication by shifting twice and one by direct multiplication.

On the other hand, a CISC processor with special instructions for multimedia-oriented operations such as Intel’s streaming extensions might achieve a very low MIPS rating while executing complex instructions that would take tens of primitive instructions on a typical RISC machine.

 The MIPS metric is also sensitive to the way in which a compiler generates code. The duration of a single instruction is cycles x tcycle, where cycles is the number of machine cycles required to execute the instruction and tcycle is the cycle time (usually the clock period). The total execution time for a program is given by: 

texecution = tcycle x åni x ci  

where ni is the number of times instruction i occurs in the program and ci is the number of cycles required by instruction i.  

If we plug this formula into the equation for MIPS, we get 

 This expression tells us that the MIPS value is affected by the instruction mix (i.e., the nature of ni) and the length of each instruction executed (i.e., the ci term). In particular, if an instruction mix consists of a large number of instructions with one cycle, the MIPS value may be high even though the actual code is executed more slowly than an instruction mix that contains a lot of multiple-cycle instructions. Consider the following hypothetical example. 

A program is compiled to run on a computer and the compiler generates two million one-cycle instructions and one million two-cycle instructions. If we assume that the cycle time is 10 ns, the time taken is given by: 

2 x 106 x 1 x 10 ns + 1 x 106 x 2 x 10 ns = 4 x 106 x 10 ns = 4 x 10-2s. 

Now suppose that a different compiler generates code for the same problem but with 1.5 million one-cycle instructions and 1.2 million two-cycle instructions. In this case the time required to execute the code is 

1.5 x 106 x 1 x 10 ns + 1.2 x 106 x 2 x 10 ns = 3.9 x 106 x 10 ns = 3.9 x 10-2s. 

As you can see, the second compiler generated faster code. Now let’s evaluate the MIPS for each case. 

In the first case, the MIPS is given by ; that is 

MIPS  = 3 x 106/(10 ns x (2 x 106 x 1 + 1 x 106 x 2) x 106) = 0.75 x 102 = 75 MIPS 

In the second case, the MIPS is given by 

MIPS  = 2.7 x 106/(10 ns x (1.5 x 106 x 1 + 1.2 x 106 x 2) x 106)  = 0.69 x 10-2 = 69 MIPS 

We now have a situation in which the faster computer has a lower MIPS when it is clearly superior to the slower computer. This failure of the MIPS metric is inevitable because instruction throughput takes no account of how much work each instruction actually performs.  

MFLOPS

The MFLOPS metric is almost identical to MIPS except that MFLOPS indicates millions of floating-point instructions per second. In principal, the same objections to MIPS also apply to the MFLOPS metric and therefore you might expect MFLOPS to be as poor an indicator of performance as MIPS. 

MFLOPS can be a better metric than MIPS because MFLOPS is a measure of work done by a computer rather than a measure of instruction throughput. MIPS is calculated by counting all instructions executed by a computer, many of which perform no useful work in solving a problem (e.g., data movement operations). MFLOPS considers only floating-point operations and these floating-point operations are almost certainly at the heart of the algorithm being implemented. Consequently, MFLOPS may be an effective metric when comparing computers that are running scientific applications that make extensive use of floating-point operations. 

Of course, no matter how fast a computer is, its MFLOPS metric will be zero if the target program contains no floating-point operations. MFLOPS is most useful in comparing computers that are to be used in scientific applications when numerical calculations tend to dominate the computation. Gliadi [Giladi96] points out that even in the realm of scientific calculation the nature of the program and its data has a profound influence on the MFLOPS rating; for example, when comparing two computers, one system might yield a better rating for calculations involving sparse matrices and the other might yield better results for calculations involving full matrices. MFLOPS is also useful in comparing special-purpose processors such as DSP (digital signal processing) chips that are used in embedded applications to process audio and visual signals in real time. 

Unfortunately, the MFLOPS metric isn’t easy to calculate because all computers don’t implement floating-point arithmetic in the same way. One computer might use dedicated hardware to calculate a trigonometric function such as a sine, whereas another computer might evaluate sin(x) by directly evaluating the appropriate series for sin(x). 

Benchmarks 

The ideal program with which to evaluate a computer is the one you are going to run on that computer. The computer should be configured in exactly the same way in which you are going to use it (i.e., same memory size, cache, I/O configuration, and operating system). Unfortunately, it is normally impractical to carry out such a test. Moreover, workstations don’t run a single program – they frequently execute a complex mix of programs that changes from moment to moment. 

One approach to benchmarking is based on kernels or fragments of real programs that require intense computation such as the LINPACK benchmark. Another approach is to run synthetic benchmarks – these are programs constructed for the express purpose of evaluating computer performance and which purport to be similar to the type of code that users might actually execute. 

Benchmarks can be divided into two categories: fine-grained and course-grained. The granularity of a benchmark is a function of the property being measured; for example, a benchmark that measures the property of a complete computer system can be considered course-grained, whereas a benchmark that measures properties of, say, branch instructions can be considered fine-grained [Krishnaswami2000]. Fine-grained benchmarks measure the execution rates of specific instructions rather than entire applications or kernels. Such benchmarks are sometimes called microbenchmarks

SPEC 

Various organizations exist round the world to offer the consumer unbiased advice, whether it be about the quality of wines or the safety of automobiles. Such an organization has arisen in the computer world to create benchmarks. SPEC, the Standard Performance Evaluation Corporation, is a non-profit making organization with the status of a charity that’s been formed to "establish, maintain and endorse a standardized set of relevant benchmarks that can be applied to the newest generation of high-performance computers". 

The SPEC consortium develops suites of benchmarks intended to measure computer performance and new benchmarks are created by SPEC as technology develops. SPEC members develop test programs. Of course, some members might be tempted to create benchmarks that exploit special features of their own products. However, the scrutiny by other members of the consortium helps to reduce any product bias in the benchmarks created by SPEC. 

Lilja [Lilja2000] points out that SPEC benchmarks can have an effect on compiler writers. Since compilers writers are likely to use SPEC suites to test their compilers, it is probable that some compiler writers will tweak or tune their compilers to the characteristics of the SPEC programs.

Over the years SPEC suites have changed to remove some of their inefficiencies and errors; for example, in 2001 the SPEC CPU95 benchmark was declared obsolete or retired and replaced by CPU2000. The benchmark programs that form a SPEC suite can vary over a wide range from Lisp interpreters to compilers to data-compression programs. 

The first SPEC benchmarks appeared in 1988 with SPEC89, a suite of ten programs. The next major change was in 1992 when SPEC CINT92 (6 integer programs) and SEPC CFP92 (14 floating point programs were published). In 1995 SPEC CINT95 and PSEC CFP95 (8 integer, 10 floating-point programs, respectively) were introduced. The fourth major revision to the benchmarks was SPEC CPU2000 with SPEC CINT2000 and SSPEC CFP2000 (12 integer and 14 floating point programs). SPEC 2000 also introduced benchmarks in C, C++, Fortran 77, and Fortran 90. 

The rapid growth in computer technology in the late 1990s caused SPEC to expand its range of benchmarks by including the new market created by Java. Programs in Java are normally compiled to the so-called bytecode and then executed on a bytecode interpreter known as the Java virtual machine, JVM. In 1998 SPEC released SPECjvm98, a benchmark suite that measures computer system performance for Java virtual machine client platforms. 

By 2001 SPEC was writing very specific benchmarks for CPU-intensive applications; for example the SPEC Application Performance Characterization project group (SPECapc) released the first standardized benchmark for evaluating performance of systems running 3D Studio MAX R3.1 in February 2001. This benchmark contains four applications that test the underlying system running Studio MAX such as “An architectural visualization containing more than a million polygons, with multiple objects and light sources, glass walls for refraction and opacity tests, and multiple textures.” 

The advantage of a benchmark such as SPEC is that it is widely available and it is relatively easy to obtain. Moreover, SPEC provides real results on real data rather than simply relying on metrics such as clock rate and MIPS or MFLOPS that, at best, do not tell the whole story and, at worst, are positively misleading.  

SPEC Methodology

The SPEC methodology is to measure the time required to execute each program in the suite being used to carry out the test; for example, the times might be Tp1, Tp2, Tp3,… where the subscripts p1, p2, etc refer to the components of the suite. 

The SPEC suite is also executed on a so-called standard basis machine to give the values Bp1, Bp2, Bp3,… The measured times are divided by the reference times to give the values

Tp1/Bp1, Tp2/Bp2, Tp3/Bp3,… This step normalizes the execution times of the components of the SPEC suite. 

Finally, the normalized times are averaged to generate a final SPEC metric for the machine being tested. For example, the JVM98 Java benchmarks are normalized against a midrange IBM PowerPC 604 with a 133-MHz processor. The way in which a set of results is averaged is controversial and we will discuss averaging shortly.

Amdahl’s Law 

Improving or enhancing part of a system can increase the system’s overall performance. In everyday terms, you can reduce the time of a journey that involves a flight followed by car ride by increasing the car’s speed.  

Amdahl’s law is describes the performance increase you get when a program is run in a multiprocessor system. Essentially, Amdahl’s law tells you how much more performance you get in return for increasing parallelism. We introduce Amdahl’s law here because it’s of general application to any system where you are interested in the effect of local improvements on the system globally. You could say that Amdahl’s law highlights the effects of bottlenecks in a system. 

Although Amdahl’s law is often applied to parallelism, it can be used in any circumstance in which part of a system can be improved and part of a system cannot be improved.

This illustrates the effect of improving one part of a system (in this case by parallelizing part of the activity). The diagram demonstrates that the serial (irreducible) part of the process remains the same while the parallel (reducible or improvable) part of the system is reduced. In the limit, the system performance is dominated entirely by the serial part of the system.

Suppose a computer executes a program on a single processor in time ts seconds. If we have p processors and the program is divided into p equal chunks, the same program will run in ts/p seconds. In reality, it is difficult to divide a program up into equal parts that can be executed in parallel and part of the program may not be susceptible to parallelization. Suppose that a fraction of the program fp can run on the p processors and a fraction fs can run only on one processor. The execution time on the parallel computer system Tp will therefore be: 

Tp = ts×fs + tp×fp 

Since fs + fp = 1 and tp = ts/p, we can write 

Tp = ts(fs + (1 - fp)/p) 

The speedup ratio for this system is the ratio of the speed without parallelization to the speed with parallelization; that is, S = ts/Tp. 

 

Amdahl’s law tells us that the speedup varies from 1 (fs = 1 with no parallelism) to p (fs = 0 with perfect parallelism). More importantly, it teaches the law of diminishing returns. If we cannot reduce fs, there is little point in increasing the value of p