Authors: Yutong Huang, Zhiyuan Guo, Yiying Zhang
Up to 3.6x performance boost over state-of-the-art far-memory systems
The Far Memory Problem
Far memory architectures use cheaper, network-attached memory, but accessing it is slow, and that latency is a major bottleneck. Prefetching—predicting and fetching data before it’s needed—is the standard remedy, but traditional rule-based methods fail when access patterns get complex.
Latency Gap
- Local DRAM: ~100 ns
- Far memory: >2,000 ns
A far memory access can be over 20x slower than local DRAM.
Why Rule-Based Prefetchers Fail
- ✔️ Sequential access: easy to predict (e.g., array iteration).
- ❌ Complex access: hard to predict (e.g., graph traversal).
Workloads like graph analytics have complex, data-dependent access patterns.
The FarSight Solution: Decoupling
FarSight’s core innovation is to separate what to prefetch from where it is in memory. A deep learning model predicts the access pattern’s logic, and a lightweight runtime structure called a “Future Map” translates that logic into actual memory addresses. Click Run Prediction below to see how the Future Map works.
Step 1. DL Model Predicts
The model predicts the next semantic access as an ordinal number.
History: [A→B, B→D, A→C]
Prediction: ?
Step 2. Future Map Resolves
The ordinal resolves to a memory address via the Future Map.
- Ordinal 1: 0xAddrD
- Ordinal 2: 0xAddrE
- Ordinal 3: 0xAddrF
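To make the decoupling concrete, here is a minimal Python sketch of the resolution step. The names (`FutureMap`, `record`, `resolve`) and placeholder addresses are illustrative, not FarSight’s actual kernel data structures:

```python
# Minimal sketch of ordinal-to-address resolution; names and addresses
# are illustrative, not FarSight's actual kernel API.

class FutureMap:
    """Per-page table mapping predicted ordinals to page addresses."""

    def __init__(self):
        self.slots = {}  # ordinal -> address, learned at runtime

    def record(self, ordinal, address):
        # Remember which address followed this page at each ordinal.
        self.slots[ordinal] = address

    def resolve(self, ordinal):
        # Translate the model's semantic prediction into a concrete
        # address to prefetch; None means "not learned yet".
        return self.slots.get(ordinal)

fmap = FutureMap()
fmap.record(1, 0xD000)  # stands in for "0xAddrD" in the demo above
fmap.record(2, 0xE000)  # "0xAddrE"
fmap.record(3, 0xF000)  # "0xAddrF"

predicted = 1                        # ordinal output by the DL model
print(hex(fmap.resolve(predicted)))  # 0xd000 -> issue prefetch here
```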
Walk Through FarSight’s Prediction Process with Two Examples
Click the green buttons below to see how FarSight makes memory-access predictions in two examples: Breadth-First Search and Dijkstra’s Algorithm.
Instead of predicting volatile memory addresses (A, B, C...), FarSight predicts stable, semantic edge ordinals (1, 2, 1...). Click "Start" to see the translation for a level-by-level BFS traversal.
More complex algorithms like Dijkstra's also produce predictable semantic paths based on which edges are explored. This "visited sequence" is what FarSight learns. Click "Start" to see the translation.
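As a concrete illustration, the Python sketch below runs BFS over a toy adjacency list (the graph and node names are made up for this example, not taken from the paper) and records the ordinal of each edge taken rather than any address:

```python
from collections import deque

# Sketch: BFS emits *edge ordinals* (the index of the edge taken out of
# each node) instead of neighbor addresses. The ordinal sequence stays
# stable across runs even when addresses change (e.g., under ASLR).

graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}

def bfs_ordinals(start):
    visited, queue, trace = {start}, deque([start]), []
    while queue:
        node = queue.popleft()
        for ordinal, neighbor in enumerate(graph[node], start=1):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
                trace.append((node, ordinal))  # semantic, address-free
    return trace

print(bfs_ordinals("A"))  # [('A', 1), ('A', 2), ('B', 1)]
```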
Making FarSight Practical in Linux: A Deep Dive
FarSight combines several techniques to make its prediction framework efficient and effective. Explore the key components below.
- Retentive Network (RetNet): A Transformer-variant architecture with only ~3K parameters, small enough to fit in L1 cache for fast, constant-time inference.
- Small Vocabulary: Predicting ordinals instead of raw memory addresses shrinks the output space, simplifying the model and boosting accuracy.
- Rotational Embedding: 2π-scaled rotary positional encoding reuses computations across overlapping memory traces, speeding up predictions (see the sketch below).
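The sketch below illustrates the relative-position property that makes this reuse possible, using a single 2-D feature pair; the 2π-based scale and the dimensions are illustrative, not FarSight’s exact configuration:

```python
import math

def rotate(v, pos, theta=2 * math.pi / 64):
    # Rotate a 2-D feature pair by an angle proportional to its
    # position (the 2*pi-based scale here is illustrative).
    x, y = v
    a = pos * theta
    return (x * math.cos(a) - y * math.sin(a),
            x * math.sin(a) + y * math.cos(a))

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

q, k = (1.0, 0.5), (0.3, 0.8)
# The score depends only on the relative offset (7 - 4 == 3 - 0), so
# results computed for one window position stay valid after a shift:
s1 = dot(rotate(q, 7), rotate(k, 4))
s2 = dot(rotate(q, 3), rotate(k, 0))
assert abs(s1 - s2) < 1e-9
```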
- CPU-Core-Local Prediction: Inference runs on the application’s CPU core, avoiding slow CPU-GPU communication.
- Hidden Prediction Overhead: Prefetch requests are issued while the application waits on on-demand paging, adding no extra latency.
- Multi-Step Lookahead: The model predicts several steps into the future so prefetched data arrives before it is needed (see the sketch below).
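A rough sketch of the lookahead loop, with a stub predictor standing in for the model; the depth, names, and stubs are all illustrative rather than FarSight’s tuned values:

```python
LOOKAHEAD = 4  # illustrative depth, not FarSight's tuned value

def prefetch_window(history, predict_next, resolve):
    """Predict several steps ahead so prefetch I/O overlaps fetch latency."""
    addrs = []
    for _ in range(LOOKAHEAD):
        ordinal = predict_next(history)    # semantic prediction (model)
        addr = resolve(ordinal)            # address via the Future Map
        if addr is not None:
            addrs.append(addr)
        history = history[1:] + [ordinal]  # roll the input window forward
    return addrs  # issued asynchronously while the app keeps running

# Stub predictor and resolver just to make the sketch executable:
print(prefetch_window([1, 2, 1],
                      lambda h: h[-1],
                      {1: 0xD000, 2: 0xE000}.get))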
- Dynamic Mapping: Future Maps are created at runtime, learning the memory layout as the app runs and making the system immune to ASLR.
- Page-Specific: Each memory page has its own Future Map, tracking the most likely next pages to be accessed from it.
- Swappable & Indirected: Future Maps can be swapped to far memory; an indirection layer (Future Map Roots) manages their locations efficiently (see the sketch below).
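A Python sketch of this organization, with hypothetical names (`FutureMapRoots`, `map_for`, `evict`) and stubbed swap hooks in place of real far-memory I/O:

```python
# Sketch: each page owns a small Future Map, and a roots table tracks
# where each map currently lives; all names here are illustrative.

class FutureMapRoots:
    def __init__(self):
        self.roots = {}  # page number -> (location, map or swap handle)

    def map_for(self, page):
        loc, payload = self.roots.get(page, ("missing", None))
        if loc == "far":
            payload = self._swap_in(payload)  # fetch the map itself back
        elif loc == "missing":
            payload = {}                      # created lazily at runtime
        self.roots[page] = ("local", payload)
        return payload

    def evict(self, page):
        # Future Maps are themselves swappable; only the root entry stays.
        loc, payload = self.roots[page]
        if loc == "local":
            self.roots[page] = ("far", self._swap_out(payload))

    # Stubbed far-memory hooks for the sketch:
    def _swap_in(self, handle):
        return dict(handle)

    def _swap_out(self, fmap):
        return tuple(fmap.items())

roots = FutureMapRoots()
roots.map_for(42)[1] = 0xD000  # page 42: ordinal 1 -> address
roots.evict(42)
print(roots.map_for(42))       # {1: 53248}, restored from "far memory"
```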
- Linux Kernel Implementation: Implemented as a kernel module, introducing changes to the underlying Linux swap system.
- Asynchronous Page Prefetching: The system does not block on I/O completion for prefetched pages, eliminating additional I/O latency.
- Efficient Page Eviction: Application pages are evicted in the background, letting the system reclaim memory without blocking the application.
Evaluation Results
FarSight was evaluated against two state-of-the-art systems (FastSwap and Hermit) on four data-intensive workloads. Select a workload to see how FarSight performed across different local memory constraints.
Key Contributions
This work demonstrates the viability of applying modern ML to solve complex systems problems and introduces several key ideas:
- The first ML-based prefetching system for far memory fully implemented in the Linux kernel.
- The novel idea of decoupling memory-access semantics from runtime address layouts.
- The introduction and efficient management of “Future Maps” as a core data structure for runtime address resolution.
- A full suite of optimizations (asynchronous I/O, lookahead, etc.) to ensure low overhead and high performance.
If you are interested in this work, you can find the full paper on arXiv.
Read Full Paper on arXiv