The purpose of writing a survey report is to study a research topic thoroughly, and to summarize the existing studies in an organized manner. It is an important step in any research project.
On-campus students may work in groups and are required to give a presentation and submit a report. Off-campus students may work alone and are only required to submit a report. The report is expected to be 15 pages plus references. LaTeX is recommended for writing the report, but a Microsoft Word file will also be accepted. The presentation is 30 minutes long. The report is due by Friday, December 9.
At the end is a list of suggested topics. You can also propose your own topic by sending e-mail to the instructor with a short description. Each topic comes with a few introductory papers to help you get started. You should search the literature for as many related papers as possible: expect at least 20 for any topic in the list, and many more for some topics.
Here is a suggestion on how to search for papers: start with a well-known paper in a good digital library, then look for both the papers it cites and the papers that cite it. The ACM and IEEE digital libraries are good choices. The CiteSeer web site is also a good source, though not as authoritative.
After a list of papers is obtained, the next step is to read them. Read the good papers and skip the bad ones. You may contact the instructor for help (but be aware that he may not have read every paper). Then find a way to organize the papers, paying attention to the contributions made by each one. A good example of a survey is "Cache Memories" by Alan Smith in 1982 (see "Reading Materials").
Suggested topics:

Given an implementation technology, there are two general approaches to extending a superscalar pipeline: using more pipeline stages and issuing more instructions per cycle. The scheduling logic is known to be the bottleneck in doing so, because it cannot be pipelined: for a sequence of instructions that form a dependence chain, pipelining the scheduling logic yields no performance improvement, since dependent instructions can no longer be scheduled in back-to-back cycles. Note that all other pipeline stages can be pipelined to increase processor frequency without a major impact on IPC.
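To see why a multi-cycle scheduler hurts dependence chains, here is a toy Python model (all latencies are made up for illustration, not taken from any real design) comparing an atomic 1-cycle scheduler with a pipelined 2-cycle one on a chain of dependent single-cycle instructions:

```python
def chain_cycles(n_insts, sched_latency, exec_latency=1):
    """Cycles to execute a chain of n_insts dependent instructions.

    Each instruction can issue only after its producer's result is
    available AND the scheduler has had sched_latency cycles to wake
    it up. With a pipelined (multi-cycle) scheduler, dependent
    instructions can no longer issue in back-to-back cycles.
    """
    issue_time = 0
    for _ in range(n_insts - 1):
        issue_time += max(exec_latency, sched_latency)
    return issue_time + exec_latency  # last instruction completes

# A 10-instruction dependence chain:
print(chain_cycles(10, sched_latency=1))  # atomic scheduler -> 10 cycles
print(chain_cycles(10, sched_latency=2))  # pipelined scheduler -> 19 cycles
```

With a 2-cycle scheduler the chain takes nearly twice as long, even though a higher clock frequency was the whole point of pipelining the scheduler.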
As the processor-memory speed gap continues to widen, and with advances in VLSI technology, it becomes attractive to integrate DRAM main memory onto the processor chip. The integration improves DRAM latency and bandwidth. The technical challenge is that, because of the different requirements in fabricating DRAM and CPU logic, such processors cannot run as fast as conventional processors. Even so, the reduction in memory stall time is impressive.
DRAM is not really random-access memory: a DRAM memory system has a complex internal architecture, containing buses or channels, multiple chips, and multiple banks with internal DRAM cell arrays and row buffers. As processor speed continues to improve, evaluation and optimization of DRAM main memory architectures are now more important than ever.
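To make the "not really random access" point concrete, here is a toy model of a single DRAM bank with one open-row buffer (the latency numbers are invented for illustration): accesses to the currently open row are cheap, while any other row pays a precharge-plus-activate penalty.

```python
def dram_latency(addr_stream, col_bits=10, t_hit=2, t_miss=10):
    """Total latency of an address stream against one DRAM bank.

    An access to the currently open row is a row-buffer hit (t_hit);
    any other row forces a precharge + activate (t_miss). Latencies
    are illustrative, not real timing parameters.
    """
    open_row = None
    total = 0
    for addr in addr_stream:
        row = addr >> col_bits  # drop column bits to get the row id
        if row == open_row:
            total += t_hit
        else:
            total += t_miss
            open_row = row
    return total

# Sequential accesses mostly hit the open row (fast); a stream that
# touches a different row every time would pay t_miss on every access.
sequential = list(range(0, 4096, 64))   # 64 accesses across 4 rows
print(dram_latency(sequential))          # 4 misses + 60 hits = 160
```

The same number of accesses can thus cost very different total latency depending on their row locality, which is exactly what evaluations of DRAM architectures try to characterize and exploit.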
Correlation-based prefetching recognizes correlations between memory addresses (either reference addresses or miss addresses) and then predicts future miss addresses for prefetching. The stream buffer is a well-known example. More sophisticated techniques have been developed to recognize complex memory access patterns.
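As a sketch of the idea, a minimal Markov-style miss-address predictor (simplified here to a single successor table that predicts the most frequent successor; real proposals track multiple candidates and confidence) might look like:

```python
from collections import defaultdict

class MarkovPrefetcher:
    """Toy Markov (correlation) prefetcher sketch.

    Records which miss address tends to follow each miss address,
    and predicts the most frequent successor as the next prefetch.
    """
    def __init__(self):
        self.successors = defaultdict(lambda: defaultdict(int))
        self.last_miss = None

    def observe_miss(self, addr):
        if self.last_miss is not None:
            self.successors[self.last_miss][addr] += 1
        self.last_miss = addr

    def predict(self, addr):
        candidates = self.successors.get(addr)
        if not candidates:
            return None  # no correlation recorded yet
        return max(candidates, key=candidates.get)

pf = MarkovPrefetcher()
for miss in [0x100, 0x200, 0x100, 0x200, 0x100, 0x300]:
    pf.observe_miss(miss)
print(hex(pf.predict(0x100)))  # 0x200 followed 0x100 twice -> 0x200
```

Note that, unlike a stream buffer, this table can capture arbitrary (non-sequential) repeating miss patterns, at the cost of storage for the correlation table.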
For many applications, correlation-based prefetching, such as stream buffers and Markov prefetching, is an effective and relatively simple approach for both instruction and data cache misses. However, it is not accurate for applications with irregular access patterns. Precomputation-based prefetching techniques instead spawn one or more threads to perform prefetching, usually when a cache miss happens. The prefetching threads are speculative, i.e., they do not change the architectural state of any register or memory word. Thus, they are not limited by processor resources such as the ROB or the issue queue, and the prefetching accuracy is high because the speculative execution closely follows the actual execution.
Wide-issue processors may execute multiple basic blocks in one cycle, which raises two issues for high-bandwidth instruction delivery. First, the instructions fetched within one cycle (a fetch group) may contain more than one branch, so branch prediction must be done more than once per cycle. Second, the instruction fetch unit may have to fetch instructions from non-contiguous locations in the cache, increasing the pressure on instruction cache bandwidth. The trace cache was proposed to address these issues, and has been used in the Intel Pentium 4 processor. In a trace cache, multiple basic blocks that are likely to execute in sequence are combined to form an instruction trace. Branch prediction then becomes the problem of predicting which trace will be used next.
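A simplified sketch of trace construction, using a made-up basic-block representation (the block table, prediction map, and size limit below are all illustrative, not a real trace cache design), could look like:

```python
def build_trace(start_block, blocks, predict_taken, max_insts=16):
    """Follow predicted control flow, concatenating basic blocks
    into one trace until the size limit is reached.

    blocks:        block id -> (instructions, taken_target, fallthrough)
    predict_taken: block id -> predicted branch direction
    """
    trace, block = [], start_block
    while block is not None and len(trace) < max_insts:
        insts, taken_target, fallthrough = blocks[block]
        trace.extend(insts)
        # follow the predicted path to the next block
        block = taken_target if predict_taken.get(block, False) else fallthrough
    return trace[:max_insts]

blocks = {
    'A': (['i1', 'i2'], 'C', 'B'),   # A ends in a branch to C
    'B': (['i3'], None, None),
    'C': (['i4', 'i5'], None, None),
}
# If A's branch is predicted taken, the trace stitches A and C together,
# even though they are non-contiguous in the instruction cache.
print(build_trace('A', blocks, {'A': True}))   # ['i1', 'i2', 'i4', 'i5']
```

The fetch unit can then deliver this whole trace in one cycle from a single contiguous trace-cache line, instead of making two non-contiguous instruction cache accesses.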
Software simulation is the most common methodology in contemporary processor design. Simulation tools are developed and extensively used not only by academic researchers but also by industry giants like Intel and IBM. A simulation tool may emulate a given ISA and produce an instruction profile for a given application. It may also simulate a particular component of the processor, e.g. the cache, branch predictor, or main memory system. The most complex simulators are the so-called "cycle-accurate" simulators, which simulate every pipeline operation of a processor. All of these simulation tools are capable of executing genuine binary code, and some of them can even boot an operating system (these are called full-system simulators).
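At the simple end of this spectrum, a component simulator can be only a few lines. The toy trace-driven, direct-mapped cache below (parameters arbitrary) shows the basic idea shared by all such tools: replay a stream of events against a model of one component and count outcomes of interest.

```python
def simulate_cache(trace, n_sets=4, block_bits=6):
    """Count misses for an address trace on a direct-mapped cache.

    n_sets sets of one block each; block_bits low-order bits select
    the byte within a 64-byte block. Far simpler than a cycle-accurate
    model, but the same trace-driven methodology.
    """
    tags = [None] * n_sets
    misses = 0
    for addr in trace:
        block = addr >> block_bits
        index = block % n_sets       # which set the block maps to
        tag = block // n_sets        # identifies the block within the set
        if tags[index] != tag:
            misses += 1
            tags[index] = tag        # fill (or replace) on miss
    return misses

trace = [0x000, 0x040, 0x000, 0x140, 0x040]
print(simulate_cache(trace))
```

In this trace, 0x040 and 0x140 map to the same set and evict each other, so the final access to 0x040 misses again: a conflict miss that a set-associative model would avoid.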
As the complexity of out-of-order superscalar processors increases, analyzing and predicting application performance becomes increasingly challenging. Application performance depends on the application's inherent ILP, branch prediction performance, cache performance, and other factors such as TLB miss rates, even if we ignore OS activity and I/O performance. Simulation can accurately report performance statistics; however, the computational cost is so high that only a limited set of applications, and a limited portion of each execution, may be studied.
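One alternative to full simulation is an analytical model. The first-order CPI stack below is a sketch only: the parameter values are invented, and the assumption that stall components add linearly (ignoring overlap between misses and useful work) is exactly the kind of simplification such models must justify.

```python
def estimate_cpi(base_cpi, l1_miss_rate, l1_penalty,
                 branch_mpki, branch_penalty):
    """First-order CPI model: base CPI plus additive stall components.

    l1_miss_rate   misses per instruction (e.g. 0.02)
    l1_penalty     cycles per cache miss
    branch_mpki    branch mispredictions per 1000 instructions
    branch_penalty cycles per misprediction
    """
    cache_stall = l1_miss_rate * l1_penalty
    branch_stall = (branch_mpki / 1000.0) * branch_penalty
    return base_cpi + cache_stall + branch_stall

# Illustrative inputs: 2% miss rate, 100-cycle memory penalty,
# 5 mispredictions per 1000 instructions, 15-cycle flush penalty.
# 0.5 (base) + 2.0 (cache) + 0.075 (branch) = 2.575 CPI
print(estimate_cpi(0.5, 0.02, 100, 5, 15))
```

Even this crude model makes the dominant bottleneck obvious (here, cache misses), which is why analytical models remain useful when simulating the full application is too expensive.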
Multimedia is an important class of applications. An important characteristic of multimedia workloads is "streaming processing," so the efficient use of memory bandwidth is important. Additionally, multimedia workloads have a high degree of ILP, which favors SIMD instruction extensions, VLIW, and vector processors.