A new hardware prefetching scheme based on dynamic interpretation of the instruction stream
Date
Authors
Major Professor
Advisor
Committee Member
Journal Title
Journal ISSN
Volume Title
Publisher
Altmetrics
Authors
Research Projects
Organizational Units
Journal Issue
Is Version Of
Versions
Series
Department
Abstract
It is well known that memory latency is a major deterrent to achieving the maximum possible performance of a today's high speed RISC processors. Techniques to reduce or tolerate large memory latencies become essential for achieving high processor utilization;Many methods, ranging from software to hardware solutions, have been studied with varying amounts of success. Most techniques have concentrated on data prefetching. However, our simulations show that the CPU is stalled up to 50% of the time waiting for instructions. The instruction memory latency reduction technique typically used in CPU designs today is the one block look-ahead (OBL) method;In this thesis, I present a new hardware prefetching scheme based on dynamic interpretation of the instruction stream. This is done by adding a small pipeline to the cache that scans forward in the instruction stream interpreting each instruction and predicting the future execution path. It then prefetches what it predicts the CPU will be executing in the near future;The pipelined prefetching engine has been shown to be a very effective technique for decreasing the instruction stall cycles in typical on-chip cache memories used today. It performs well, yielding reductions in stall cycles up to 30% or more for both scientific and general purpose programs, and has been shown to reduce the number of instruction stall cycles as compared to the OBL technique as well;The idea of sub-line prefetching was also studied and presented. It was thought that prefetching full cache lines might present too much overhead in terms of bus bandwidth, so prefetches should only fill partial cache lines instead. However it was determined that prefetching partial cache lines does not show any benefit when dealing with cache lines smaller than 128 bytes.