General-purpose Reconfigurable Functional Cache architecture
A configurable machine can control its run-time environment to suit the task being executed. Customizing the processor hardware to the specific requirements of various applications extends this idea of reconfigurability. A reconfigurable cache architecture provides the flexibility to dynamically alter the execution logic of a processor and tailor it to a specific application. The Reconfigurable Functional Cache architecture dynamically configures on-chip cache memory resources by integrating Reconfigurable Functional Caches (RFCs). A limitation of the RFC is that it can be customized only for a few multimedia and DSP applications and cannot be used for real-world applications. In this thesis we propose a new scheme for using the on-chip cache resources, with the goal of serving a large domain of general-purpose applications. We map frequently used basic blocks, loops, procedures, and functions from a program onto this reconfigurable cache. These program blocks are mapped onto the cache as basic ALU instructions, together with the flow expressed by the dependencies between them. The cache module is configured into stages, each performing a specific ALU instruction. Each stage behaves like a reservation station, waiting for its operands to be ready and then performing its function. The proposed broadcast mechanism is similar to a result bus, delivering each result to the consuming instructions and stages. Architecturally visible extra registers are added to route the input operands to the stages and to read the result. In short, a program segment, along with its data flow graph, can be mapped onto the cache. Our scheme increases cache access time by 1.3%. Our studies with code accounting for 12-15% of the go benchmark show a maximum reduction of 6.5% in execution time for a 16-wide issue processor.
For the individual basic blocks, the speed-up ranged from a minimum of 1.2x to a maximum of 6.5x, with an average of 3x. Our scheme seeks to reduce the instruction count; decrease the number of fetches, issues, and commits; compute certain operations in less than a cycle; and remove the overhead of ALU instruction dependencies from the program. In doing so, the scheme offers better performance through a new method of executing basic blocks from general-purpose programs.
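
The stage-and-broadcast behaviour described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware design of the thesis: the names `Stage` and `run_block`, the two-operand restriction, and the notion of a "wave" (all stages that become ready at the same time) are hypothetical and introduced here purely for illustration. Each stage waits until both of its operand tags are available, performs its configured ALU operation, and broadcasts the result under its destination tag to any consuming stages.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Stage:
    """Hypothetical model of one cache stage configured for one ALU operation."""
    dest: str                      # tag under which the result is broadcast
    op: Callable[[int, int], int]  # ALU function the stage is configured for
    srcs: Tuple[str, str]          # operand tags the stage waits on


def run_block(stages: List[Stage], inputs: Dict[str, int]):
    """Fire every stage once its operands are ready; broadcast the results.

    Returns the final tag -> value map and the number of waves needed.
    A wave is an assumption of this model, not a cycle count from the thesis.
    """
    values = dict(inputs)   # input registers plus results broadcast so far
    pending = list(stages)
    waves = 0
    while pending:
        ready = [s for s in pending if all(t in values for t in s.srcs)]
        if not ready:
            raise ValueError("unsatisfied dependency in the data flow graph")
        # Compute all ready stages from the current snapshot, then broadcast.
        results = {s.dest: s.op(values[s.srcs[0]], values[s.srcs[1]])
                   for s in ready}
        values.update(results)
        pending = [s for s in pending
                   if not all(t in inputs or t in values for t in s.srcs)
                   or s.dest not in values]
        waves += 1
    return values, waves


# A small block computing (a + b) * (c - d) as a data flow graph.
block = [
    Stage("t1", operator.add, ("a", "b")),
    Stage("t2", operator.sub, ("c", "d")),
    Stage("out", operator.mul, ("t1", "t2")),
]
values, waves = run_block(block, {"a": 3, "b": 4, "c": 10, "d": 4})
```

In this model the two independent stages fire together in the first wave and the multiply fires in the second, which mirrors how mapping a block's data flow graph onto stages removes the need to fetch, issue, and commit each ALU instruction separately.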