CSM Projects

ON MICROPROCESSORS, MEMORY HIERARCHIES, AND AMDAHL'S LAW

David O'Neal, NCSA/UIUC
J. Urbanic, Carnegie Mellon University

 

This file is available for
download in .pdf format.
Click here

While preparing material for an article describing fundamental performance characteristics of Cray PVP and MPP architectures (O'Neal and Urbanic, 1997), it became apparent that a relationship between performance and caching efficiency could be written for any computing system featuring a hierarchical memory system. After formalizing this relationship, selected results are plotted and a basic analysis is performed. To fix ideas, a relevant application is demonstrated. Findings are summarized in closing.

 

Introduction:

The premise of Amdahl's Law is that every algorithm has a sequential part that ultimately limits the speedup that can be achieved by a multiprocessor implementation. In this context, speedup is the ratio of the execution times for single and multiple processor runs. The classical form of Amdahl's Law (1967) is usually stated as follows: if the serial component of an algorithm accounts for 1/S of its execution time, then the maximum speedup that can be attained by running it on a parallel computer is S.

A variation of Amdahl's Law can be written in terms of the ratio of cycle counts required to complete data fetch operations from memory and cache. Consider a basic hierarchy comprised of a CPU, a single cache structure, and a monolithic memory system. In this setting, the ratio of fetch times (or rates) for memory, tM, and cache, tC, determine speedup. The supposition is that every code requires memory-to-processor bandwidth that will eventually limit its performance on a cache-based processing system. If this part of the code were to suddenly fit into cache, the potential performance improvement would be S. Other definitions needed to represent the method include:

  • fC fraction of time spent operating on cached data (cache efficiency)
  • fM fraction of time spent operating on memory resident data, or 1-fC
  • TN total time required to perform N operations
  • PN performance rate in operations per unit time, or N / TM
  • S ratio of data fetch times tM / tC (speedup)
Core concepts and bounds associated with application of the method are presented within. A procedure for addressing hierarchies comprised of more than three levels (e.g. systems featuring multiple caches and/or networked memory systems) has nearly been completed and will be the focus of a future article.

 

Objectives:

This research was initiated in 1997 while working on a paper describing fundamental differences between Cray PVP and MPP architectures. The recent decommissioning of the Cray C916 at ASC coupled with the acquisition of a 64-processor SMP machine regarded by some as a direct replacement for the Cray machine provided justification to investigate the proposed 4:1 purchase ratio of DEC EV6 microprocessors to the Cray C90 proprietary design.

 

Methods/procedures/apparatus:

Tools required to validate functions were available in the form of the perfex and hinv utilities featured by Silicon Graphics computing systems. We were also privy to accounts on similar systems featuring different CPU boards (195 MHz and 250 MHz) that allowed a basic throughput analysis to be completed as well. A similar study of the IBM POWER2 and POWER3 processors has been planned.

 

Results and Discussion:

After establishing a test value for cache efficiency, the corresponding relative and absolute performance figures can be developed for comparison purposes. The effect of increasing the clock rate without an accompanying increase in memory bandwidth is clearly indicated by the increase in speedup and the accompanying decrease in relative performance.

The difference between the rates at which contemporary microprocessors can fetch words from memory and cache can be used in conjunction with an estimated (or targeted) performance figure to approximate the corresponding cache efficiency. Conversely, estimated or targeted cache efficiencies can be used to predict relative performance.

A sensitivity study of the relative performance and cache efficiency functions has not yet been completed. However, results suggest that the method is more susceptible to error at lower values of caching efficiency. For indicated speedup values, diminishing cache efficiencies correlate to steeper rates of change.

 

Conclusions and Recommendations:

Increases in caching efficiency can be realized externally through vendor-initiated improvements in hardware or software design, but for a fixed set of conditions, the burden of enhancing cache utilization must be taken up by the user community. Any presumptions that imply a need to achieve higher cache efficiencies must consider the amount of work required to accomplish the task. In any case, unrealistic efficiencies in the form of near-peak performance levels should never be used for the purpose of evaluating equipment throughput.

The set of curves presented within clearly show that the production of efficient code becomes increasingly difficult as the difference between memory and cache fetch rates grows. However, as processor clock rates are quickly approaching physical limits, we expect that this situation will reverse itself. In the mean time, larger caches, improved logic, and additional layers of memory hierarchy must serve to mask imbalances.

 

References:

A. Ahi et al, R10000 Superscalar Microprocessor Processor, SGI Hardware Products Division, Hotchips (1995).

G. Amdahl, Validity of the single-processor approach to achieving large-scale computational capabilities, in AFIPS Conference Proceedings, volume 30, page 483, AFIPS Press (1967).

A. Ghanem, SGI/Cray Origin 2000 Architecture, NCSA Consulting Services Group (1997).

D. O'Neal and J. Urbanic, On Performance and Efficiency: Cray Architectures, Proceedings of the Parallel Computing Conference 1997, Ohio Supercomputing Center, Columbus, OH (1997).

D. Pressel, Methods for Achieving High Levels of Performance on SGI Origin 2000's and Other RISC- based Cache Coherent Shared Memory Symmetric Microprocessors, High Performance Computing Division, Corporate Information and Computing Center, U.S Army Research Laboratory, Aberdeen Proving Ground, MD (1998).

J. Williamson and J. Levesque, Cray Vectorization Manual, Pacific Sierra Corporation, Placerville, CA (1986).

M. Zagha et al, Performance Analysis Using the MIPS R10000 Performance Counters, Supercomputing '96 Conference Proceedings, Pittsburgh, PA (1996).

 

Related documents and images:

http://www.ncsa.uiuc.edu/EP/CSM/publications/1999/UCG99_Amdahl.pdf

http://www.ncsa.uiuc.edu/EP/CSM/presentations/UGC99_Amdahl_PPT.pdf

 

Acknowledgements:

ASC PET, NCSA, Pittsburgh Supercomputing Center

 


  [NCSA]