RICH Prefetcher

Memory is a bottleneck.

Driven by the excessive demands of modern workloads, the evolution of memory technologies increasingly prioritizes bandwidth and capacity over access latency, and this is a big issue.
One way to hide this memory latency is to prefetch.
Prefetching refers to the act of predicting subsequent memory accesses and fetching the required data ahead of the processor execution.

To Achieve this, A prefetcher needs to manage training metadata, learned memory access patterns and control metadata.

We assess the effectiveness of prefetchers using the following metrics:

Accuracy: Accuracy indicates the correctness of the prefetch actions performed by the prefetcher.
Coverage: Coverage measures the prefetcher’s ability to detect the access patterns of a program.

Issues with Conventional Prefetchers

In view of the high-bandwidth, high-capacity memory trends, we believe it is essential to harness memory as an important component in prefetcher architectures. Therefore, we categorize hardware prefetchers into two groups: those who can leverage off-chip memory, and those who don’t.

Temporal Prefetchers
- they focus on irregular access patterns.
- For instance, ISB [14], MISB [15], and STMS [16] record streams of past cache misses and issue prefetches by replaying a stream when the identical leading access address appears again.
- Although they successfully out-source the storage of long streams and can tolerate high latency, temporal prefetchers can only support a specific class of applications.
- This is because they require exact access sequence repetition to work, which are only seen in programs with pointer-rich data references [32, 33] or complex instruction control flow [34–38].
Spatial Prefetchers
- Spatial prefetchers have a remarkable strength of eliminating compulsory cache misses.
- Compulsory cache misses are a major source of performance degradation in important classes of applications.
- An application exhibits spatial correlation because its data objects or groups of data objects often share a same data structure and the same memory layout thereof.
- The bit-vector-based prefetching represents an important form of spatial data prefetchers. It learns access patterns of memory regions at the granularity of a fixed size, e.g. 4 KB pages.

RICH: Design Philosophy

There are two key aspects:

identify the opportunity arising from the memory technology trends and leverage more predictor metadata to boost prefetching performance.
address the challenge of minimizing the overheads of the new metadata by wise use of off-chip and on-chip resources altogether.

Insight No.1
Workloads prefer diverse region sizes for spatial prefetching.
e.g., half of them work best under conventional 4 KB sizes, while the other half prefer larger ones. Therefore, supporting multiple region sizes is crucial for performance.

As illustrated in Figure 1, different traces reach their highest performance at different region sizes, with 46% of the samples peak when the region size is set larger than 4 KB.

RICH: High-Level Idea

Conventional prefetchers are fundamentally constrained by on-chip storage: to keep lookup latency small, they are forced to store only a tiny amount of metadata close to the core.
RICH instead treats main memory itself as a scalable backing store for prefetch metadata and only keeps a compact “cache” of hot metadata entries on-chip.
Each region of memory has an associated metadata record that captures which cache lines inside the region tend to be accessed together; this is the basis for spatial correlation.
By allowing far more metadata than fits on-chip, RICH can learn richer, longer-lived spatial patterns without throwing old information away as aggressively.

RICH: Metadata Organization

RICH partitions the physical address space into regions and allows multiple region sizes, so that an application that prefers fine-grain correlation can use small regions, while streaming or array-based codes can use larger ones.
For each active region, RICH keeps:
- A region descriptor (region base, region size, and a few control bits).
- A compact bit-vector (or similar structure) that records which cache lines in the region have been seen and which lines should be prefetched when a particular line is accessed.
These region metadata records predominantly live in off-chip memory, in a dedicated metadata area; an on-chip structure only caches the most recently used ones to keep lookup latency low.
When a region’s metadata is not present on-chip, RICH issues a metadata fetch to memory, then uses and updates it just like a regular cached entry.

RICH: Prefetching Workflow

On every qualifying cache miss (typically at the last-level cache), RICH performs three main steps:

Region lookup
- Compute the region that contains the missed address (taking the current region-size choice into account).
- Probe the on-chip RICH metadata cache; on a miss there, fetch the region’s metadata from memory.
Prefetch decision
- Use the region’s bit-vector to find lines that are strongly correlated with the current miss.
- Filter out lines that are already present or are too far ahead to be useful, then issue prefetch requests for the remaining candidates.
Training / update
- Once the demand access is resolved, update the region metadata to reflect the new access.
- If the metadata record has been modified, write it back lazily to memory so that future references (even on a different core) can reuse the learned pattern.

Trading Memory for Latency Hiding

Using main memory to store most of the metadata increases access latency for cold regions, but this is acceptable because:
- Prefetch requests are inherently latency-tolerant, and
- The on-chip cache filters most hot regions, so common cases still see low metadata lookup latency.
The extra memory footprint is modest relative to DRAM capacity but enables much richer spatial models than purely on-chip designs.
Compared to conventional spatial prefetchers that use a fixed, small table, RICH significantly improves:
- Coverage, by tracking more regions and longer execution histories.
- Accuracy, by adapting region sizes and keeping per-region correlation information instead of coarse global patterns.

Qualitative Benefits and Limitations

Strengths
- Particularly effective for workloads with structured but non-uniform spatial locality (e.g., graph analytics, irregular scientific codes, and some server workloads).
- Naturally complements temporal prefetchers: RICH focuses on spatial correlations within regions, while temporal prefetchers capture repeating miss sequences.
- Scales with memory capacity: as DRAM sizes grow, the system can afford more and richer metadata.
Costs / caveats
- Consumes extra DRAM capacity and bandwidth for metadata accesses, which can become noticeable on memory-bandwidth-bound systems.
- Less useful when workloads have very little spatial locality or when access patterns change too rapidly for metadata to be reused.
- Requires careful hardware implementation to ensure that metadata fetches and writebacks do not interfere with critical demand traffic.

Implementation Notes

Placement in the memory hierarchy
- RICH is typically implemented at or near the last-level cache (LLC) so it can observe all misses and issue prefetches toward memory.
- The on-chip metadata cache sits alongside the LLC tags and data, but is much smaller than the full off-chip metadata space.
Metadata address mapping
- Each physical region is mapped to a metadata entry using a deterministic function (e.g., hashing the region base) so that metadata can be located in DRAM without extra tags.
- A fixed area of DRAM is reserved for metadata, and the memory controller knows how to translate a region identifier into a DRAM address for the metadata record.
On-chip structures
- The RICH metadata cache can be implemented as a set-associative structure with LRU or a simple replacement policy, similar to a small cache.
- A small queue or buffer may be used to track outstanding metadata fetches and writebacks, decoupling them from critical data requests.
Interaction with the memory controller
- Metadata reads and writebacks are scheduled with lower priority than demand loads/stores to avoid harming critical-path latency.
- Simple throttling logic can limit the number of in-flight metadata operations when the memory system is heavily loaded.
Coherence and sharing
- Since metadata is advisory (it does not affect correctness), it does not need strict cache-coherence; stale metadata at worst leads to useless or missed prefetches.
- In multicore systems, cores can either share a global RICH metadata space or keep per-core partitions, depending on area and complexity budgets.
Fallback behavior
- If metadata is not available in time (e.g., due to a long metadata miss), the prefetcher can simply skip issuing prefetches for that access.
- When RICH is disabled or bypassed, the system falls back to baseline behavior with no impact on correctness.