Traditional ISAs in general-purpose processors expose little high-level program information to the hardware. In particular, the interface between compute and memory systems has been limited to simple scalar and vector access instructions. Yet many memory system innovations, such as prefetching, bypassing, replacement, and more recently near-data computing, can benefit from higher-level access information. Recovering this information by analyzing access patterns dynamically is imperfect and slow. Our group is exploring how to augment von Neumann ISAs with the ability to explicitly encode coarse-grained memory access patterns, called streams, and how to use this added information for more efficient prefetching, decentralized cache optimizations, and near-data computing.
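To make the idea of a stream concrete, here is a minimal sketch of how a single descriptor can stand in for a whole loop of scalar loads. The `AffineStream` class, its fields, and the `cache_lines_touched` helper are hypothetical names chosen for illustration; they are not our actual ISA encoding.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass(frozen=True)
class AffineStream:
    """Hypothetical descriptor for a 1-D strided access pattern."""
    base: int    # starting byte address
    stride: int  # bytes between consecutive elements
    length: int  # number of elements accessed

    def addresses(self) -> Iterator[int]:
        # The full address sequence is implied by three integers.
        for i in range(self.length):
            yield self.base + i * self.stride


def scalar_loop_addresses(base: int, stride: int, length: int) -> List[int]:
    """What a conventional ISA exposes: one load address per iteration."""
    return [base + i * stride for i in range(length)]


def cache_lines_touched(stream: AffineStream, line_bytes: int = 64) -> List[int]:
    """With the pattern known up front, the footprint can be computed
    before any access is issued (assumed 64-byte cache lines)."""
    return sorted({addr // line_bytes for addr in stream.addresses()})


stream = AffineStream(base=0x1000, stride=8, length=16)
print(cache_lines_touched(stream))
```

The descriptor and the scalar loop produce the same addresses. The difference is that the descriptor makes the pattern visible in advance, which is the kind of information a prefetcher or cache policy would otherwise have to infer at run time.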
Big news for this direction: NVIDIA Hopper has recently adopted streams into its ISA, with a mechanism called the “Tensor Memory Accelerator” (see the blog post and whitepaper). This is clearly just the beginning for decoupled memory access in commercial processors.
We are designing programmable accelerators to surpass other data processing accelerators like GPUs and DSPs. Some of our early accelerator ISAs were incorporated in commercial designs. See the Accelerator ISAs page for more on our philosophy.
Our long-term vision is that programmable accelerators can become useful for problems that we normally associate only with CPUs, i.e., “irregular” workloads. Irregular workloads are those that have some form of data dependence, and data dependences typically interfere with the structure exploited by traditional architectures. For example, data-dependent control flow makes vectorization difficult. However, encoding idioms for these data-dependent patterns in ISAs makes it possible to execute them much more efficiently.
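As a small illustration of what "irregular" means here, the sketch below contrasts a dense dot product with a sparse one over sorted index lists. The function names and the data layout are assumptions made for this example and are not taken from our accelerator ISAs.

```python
from typing import List


def dot_dense(a: List[int], b: List[int]) -> int:
    """Regular: every iteration does the same work on element i,
    so the loop maps directly onto vector lanes."""
    return sum(x * y for x, y in zip(a, b))


def dot_sparse(idx_a: List[int], val_a: List[int],
               idx_b: List[int], val_b: List[int]) -> int:
    """Irregular: which pointer advances depends on the index values,
    so both the control flow and the addresses are data-dependent."""
    i = j = 0
    total = 0
    while i < len(idx_a) and j < len(idx_b):
        if idx_a[i] == idx_b[j]:
            # Matching indices: multiply and advance both inputs.
            total += val_a[i] * val_b[j]
            i += 1
            j += 1
        elif idx_a[i] < idx_b[j]:
            i += 1  # a is behind, so skip ahead in a
        else:
            j += 1  # b is behind, so skip ahead in b
    return total


# The same two vectors in dense and sparse (index, value) form.
dense_a = [1, 0, 2, 0, 0, 3]
dense_b = [0, 0, 4, 5, 0, 6]
print(dot_dense(dense_a, dense_b))
print(dot_sparse([0, 2, 5], [1, 2, 3], [2, 3, 5], [4, 5, 6]))
```

Both calls compute the same result, but the sparse version's branch outcome changes with the data on every iteration. That is the property that defeats straightforward vectorization and that an ISA-level idiom would need to capture.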
Our work has explored techniques for irregular memory access, irregular control flow, and both fine-grained and coarse-grained irregular parallelism.
While general-purpose accelerators are powerful, their mechanisms are not free, and often some amount of customization is desired…
Domain-specific hardware accelerators are extremely efficient, but require extensive manual effort in hardware and software stack development. Automated ASIC generation (e.g., HLS) can be insufficient, because the resulting hardware is inflexible. An ideal accelerator generation framework would be automatable, enable deep specialization to the domain, and maintain a uniform programming interface. Our insight is that many prior accelerator architectures can be approximated by composing a small number of hardware primitives, specifically those from spatial architectures. With careful design, a compiler can understand how to use the available primitives, through modular and composable transformations, to take advantage of the features of a given program.
To this end, our group is exploring a novel paradigm for accelerator design in which accelerators are generated by searching within such a rich graph-based design space, guided by the affinity of input programs for hardware primitives and their interactions. See our tutorial page for a snapshot of our tools (or contact us for the latest version). In ongoing work, we are broadening the hardware design space, input languages, and search techniques, and we are integrating our framework with ChipYard for end-to-end silicon and FPGA overlay designs.