The purpose of writing a survey report is to study a research topic thoroughly, and to summarize the existing studies in an organized manner. It is an important step in any research project.
On-campus students may work in groups and are required to give a presentation and submit a report. Off-campus students may work alone and are only required to submit a report. The report is expected to be 15 pages plus references. LaTeX is recommended for writing the report, but a Microsoft Word file will also be accepted. The presentation is 30 minutes long. The report is due by Friday, December 9.
At the end is a list of suggested topics. You can also propose your own topic by sending e-mail to the instructor with a short description. Each topic comes with a few introductory papers to help you get started. You should search the literature for as many related papers as possible: expect at least 20 for any topic in the list, and many more for some topics.
Here is a suggestion on how to search for papers: start with a well-known paper in a good digital library, then look for both the papers it cites and the papers that cite it. The ACM and IEEE digital libraries are good choices. The CiteSeer web site is also a good source, though not as authoritative.
After a list of papers is obtained, the next step is to read them. Read the good papers and skip the bad ones. You may contact the instructor for help (but be aware that he may not have read every paper). Then find a way to organize the papers, paying attention to the contributions made by each one. A good example of a survey is "Cache Memories" by Alan Smith in 1982 (see "Reading Materials").
Suggested topics:

Given an implementation technology, there are two general approaches to extending a superscalar pipeline: using more pipeline stages and issuing more instructions per cycle. The scheduling logic is known to be the bottleneck in doing so, because it cannot be pipelined: for a sequence of instructions that form a dependence chain, pipelining the scheduling logic yields no performance improvement, since dependent instructions can no longer be scheduled in back-to-back cycles. Note that all other pipeline stages can be pipelined to increase processor frequency without a major impact on IPC.
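To see why a multi-cycle scheduler hurts dependence chains, here is a toy Python model (all latencies are made up for illustration, not taken from any real design) comparing an atomic 1-cycle scheduler with a pipelined 2-cycle one on a chain of dependent single-cycle instructions:

```python
def chain_cycles(n_insts, sched_latency, exec_latency=1):
    """Cycles to execute a chain of n_insts dependent instructions.

    Each instruction can issue only after its producer's result is
    available AND the scheduler has had sched_latency cycles to wake
    it up. With a pipelined (multi-cycle) scheduler, dependent
    instructions can no longer issue in back-to-back cycles.
    """
    issue_time = 0
    for _ in range(n_insts - 1):
        issue_time += max(exec_latency, sched_latency)
    return issue_time + exec_latency  # last instruction completes

# A 10-instruction dependence chain:
print(chain_cycles(10, sched_latency=1))  # atomic scheduler -> 10 cycles
print(chain_cycles(10, sched_latency=2))  # pipelined scheduler -> 19 cycles
```

With a 2-cycle scheduler the chain takes nearly twice as long, even though a higher clock frequency was the whole point of pipelining the scheduler.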
As the processor-memory speed gap continues to widen, and with advances in VLSI technology, it becomes attractive to integrate DRAM main memory onto the processor chip. The integration improves DRAM latency and bandwidth. The technical challenge is that, because of the different requirements in fabricating DRAM and CPU logic, such processors cannot run as fast as conventional processors. Even so, the reduction in memory stall time is impressive.
DRAM is not really random-access memory: a DRAM memory system has a complex internal architecture, containing buses or channels, multiple chips, and multiple banks with internal DRAM cell arrays and row buffers. As processor speed continues to improve, evaluation and optimization of DRAM main memory architectures are now more important than ever.
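To make the "not really random access" point concrete, here is a toy model of a single DRAM bank with one open-row buffer (the latency numbers are invented for illustration): accesses to the currently open row are cheap, while any other row pays a precharge-plus-activate penalty.

```python
def dram_latency(addr_stream, col_bits=10, t_hit=2, t_miss=10):
    """Total latency of an address stream against one DRAM bank.

    An access to the currently open row is a row-buffer hit (t_hit);
    any other row forces a precharge + activate (t_miss). Latencies
    are illustrative, not real timing parameters.
    """
    open_row = None
    total = 0
    for addr in addr_stream:
        row = addr >> col_bits  # drop column bits to get the row id
        if row == open_row:
            total += t_hit
        else:
            total += t_miss
            open_row = row
    return total

# Sequential accesses mostly hit the open row (fast); a stream that
# touches a different row every time would pay t_miss on every access.
sequential = list(range(0, 4096, 64))   # 64 accesses across 4 rows
print(dram_latency(sequential))          # 4 misses + 60 hits = 160
```

The same number of accesses can thus cost very different total latency depending on their row locality, which is exactly what evaluations of DRAM architectures try to characterize and exploit.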
Correlation-based prefetching recognizes correlations between memory addresses (either reference addresses or miss addresses) and then predicts future miss addresses for prefetching. The stream buffer is a well-known example. More sophisticated techniques have been developed to recognize complex memory access patterns.
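As a sketch of the idea, a minimal Markov-style miss-address predictor (simplified here to a single successor table that predicts the most frequent successor; real proposals track multiple candidates and confidence) might look like:

```python
from collections import defaultdict

class MarkovPrefetcher:
    """Toy Markov (correlation) prefetcher sketch.

    Records which miss address tends to follow each miss address,
    and predicts the most frequent successor as the next prefetch.
    """
    def __init__(self):
        self.successors = defaultdict(lambda: defaultdict(int))
        self.last_miss = None

    def observe_miss(self, addr):
        if self.last_miss is not None:
            self.successors[self.last_miss][addr] += 1
        self.last_miss = addr

    def predict(self, addr):
        candidates = self.successors.get(addr)
        if not candidates:
            return None  # no correlation recorded yet
        return max(candidates, key=candidates.get)

pf = MarkovPrefetcher()
for miss in [0x100, 0x200, 0x100, 0x200, 0x100, 0x300]:
    pf.observe_miss(miss)
print(hex(pf.predict(0x100)))  # 0x200 followed 0x100 twice -> 0x200
```

Note that, unlike a stream buffer, this table can capture arbitrary (non-sequential) repeating miss patterns, at the cost of storage for the correlation table.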
For many applications, correlation-based prefetching, such as stream buffers and Markov prefetching, is an effective and relatively simple approach for both instruction and data cache misses. However, it is not accurate for applications with irregular access patterns. Precomputation-based prefetching techniques instead spawn one or more threads to perform prefetching, usually when a cache miss happens. The prefetching threads are speculative, i.e., they do not change the architectural state of any register or memory word. Thus, they are not limited by processor resources such as the ROB or the issue queue, and the prefetching accuracy is high because the speculative execution closely follows the actual execution.
Wide-issue processors may execute multiple basic blocks in one cycle, which raises two issues for high-bandwidth instruction delivery. First, the instructions fetched within one cycle (a fetch group) may contain more than one branch, so branch prediction must be done more than once per cycle. Second, the instruction fetch unit may have to fetch instructions from non-contiguous locations in the cache, increasing the pressure on instruction cache bandwidth. The trace cache was proposed to address these issues, and has been used in the Intel Pentium 4 processor. In a trace cache, multiple basic blocks that are likely to execute in sequence are combined to form an instruction trace. Branch prediction then becomes the problem of predicting which trace will be used next.
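A simplified sketch of trace construction, using a made-up basic-block representation (the block table, prediction map, and size limit below are all illustrative, not a real trace cache design), could look like:

```python
def build_trace(start_block, blocks, predict_taken, max_insts=16):
    """Follow predicted control flow, concatenating basic blocks
    into one trace until the size limit is reached.

    blocks:        block id -> (instructions, taken_target, fallthrough)
    predict_taken: block id -> predicted branch direction
    """
    trace, block = [], start_block
    while block is not None and len(trace) < max_insts:
        insts, taken_target, fallthrough = blocks[block]
        trace.extend(insts)
        # follow the predicted path to the next block
        block = taken_target if predict_taken.get(block, False) else fallthrough
    return trace[:max_insts]

blocks = {
    'A': (['i1', 'i2'], 'C', 'B'),   # A ends in a branch to C
    'B': (['i3'], None, None),
    'C': (['i4', 'i5'], None, None),
}
# If A's branch is predicted taken, the trace stitches A and C together,
# even though they are non-contiguous in the instruction cache.
print(build_trace('A', blocks, {'A': True}))   # ['i1', 'i2', 'i4', 'i5']
```

The fetch unit can then deliver this whole trace in one cycle from a single contiguous trace-cache line, instead of making two non-contiguous instruction cache accesses.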
Software simulation is the most common methodology in contemporary processor design. Simulation tools are developed and extensively used not only by academic researchers but also by industry giants like Intel and IBM. A simulation tool may emulate a given ISA and produce an instruction profile for a given application. It may also simulate a particular component of the processor, e.g. the cache, branch predictor, or main memory system. The most complex simulators are the so-called "cycle-accurate" simulators, which simulate every pipeline operation of a processor. All of these simulation tools are capable of executing genuine binary code, and some of them can even boot an operating system (these are called full-system simulators).
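At the simple end of this spectrum, a component simulator can be only a few lines. The toy trace-driven, direct-mapped cache below (parameters arbitrary) shows the basic idea shared by all such tools: replay a stream of events against a model of one component and count outcomes of interest.

```python
def simulate_cache(trace, n_sets=4, block_bits=6):
    """Count misses for an address trace on a direct-mapped cache.

    n_sets sets of one block each; block_bits low-order bits select
    the byte within a 64-byte block. Far simpler than a cycle-accurate
    model, but the same trace-driven methodology.
    """
    tags = [None] * n_sets
    misses = 0
    for addr in trace:
        block = addr >> block_bits
        index = block % n_sets       # which set the block maps to
        tag = block // n_sets        # identifies the block within the set
        if tags[index] != tag:
            misses += 1
            tags[index] = tag        # fill (or replace) on miss
    return misses

trace = [0x000, 0x040, 0x000, 0x140, 0x040]
print(simulate_cache(trace))
```

In this trace, 0x040 and 0x140 map to the same set and evict each other, so the final access to 0x040 misses again: a conflict miss that a set-associative model would avoid.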
As the complexity of out-of-order superscalar processors increases, analyzing and predicting application performance becomes increasingly challenging. Application performance depends on the application's inherent ILP, branch prediction performance, cache performance, and other factors such as TLB miss rates, even if we ignore OS activity and I/O performance. Simulation can accurately report performance statistics; however, the computational cost is so high that only a limited set of applications, and a limited portion of each execution, may be studied.
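One alternative to full simulation is an analytical model. The first-order CPI stack below is a sketch only: the parameter values are invented, and the assumption that stall components add linearly (ignoring overlap between misses and useful work) is exactly the kind of simplification such models must justify.

```python
def estimate_cpi(base_cpi, l1_miss_rate, l1_penalty,
                 branch_mpki, branch_penalty):
    """First-order CPI model: base CPI plus additive stall components.

    l1_miss_rate   misses per instruction (e.g. 0.02)
    l1_penalty     cycles per cache miss
    branch_mpki    branch mispredictions per 1000 instructions
    branch_penalty cycles per misprediction
    """
    cache_stall = l1_miss_rate * l1_penalty
    branch_stall = (branch_mpki / 1000.0) * branch_penalty
    return base_cpi + cache_stall + branch_stall

# Illustrative inputs: 2% miss rate, 100-cycle memory penalty,
# 5 mispredictions per 1000 instructions, 15-cycle flush penalty.
# 0.5 (base) + 2.0 (cache) + 0.075 (branch) = 2.575 CPI
print(estimate_cpi(0.5, 0.02, 100, 5, 15))
```

Even this crude model makes the dominant bottleneck obvious (here, cache misses), which is why analytical models remain useful when simulating the full application is too expensive.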
Multimedia is an important class of applications. An important characteristic of multimedia workloads is "streaming processing," so the efficient use of memory bandwidth is important. Additionally, multimedia workloads have a high degree of ILP, which favors SIMD instruction extensions, VLIW, and vector processors.