News

Reinforcement learning post-training spends most of its wall-clock time generating answers, and a few very long generations dominate every training step. We designed DAS [MLSys ’26], a distribution-aware speculative decoding framework that speeds up RL rollouts without changing what the model learns. DAS uses a training-free drafter that rebuilds itself from recent rollouts and spends its speculation budget on the long generations that set the pace, cutting rollout time by up to 50% with identical training curves.

Computer-use agents increasingly want to branch: try several actions in parallel, keep the best, and roll back the rest. But a branch of a desktop is an entire running workspace, and cloning it with today’s tools means a synchronous checkpoint/restore on the critical path of every speculative step. We built TClone, a workspace-versioning substrate that makes a branch runnable immediately by sharing memory and filesystem state copy-on-write and pushing durable checkpointing off the fast path. TClone clones a live workspace up to 4.9x faster than VM snapshots and 3.4x faster than stock CRIU, and cuts end-to-end agent task latency by up to 3.7x.

A single LLM deployment now serves everything from latency-critical chat to relaxed background jobs under a fixed GPU budget. We designed Nitsum [arXiv ’26], the first serving system that treats tensor parallelism (TP) as a runtime control surface instead of a fixed deployment choice. By making TP switching nearly free and reconfiguring the cluster to track shifting workloads, Nitsum improves SLO-compliant goodput by up to 5.3x over state-of-the-art systems.

Most AI benchmarks focus on answer correctness but ignore the quality of cited sources, leading to a “Garbage In, Garbage Out” blind spot. To address this, we introduced SourceBench, a framework that evaluates 3,996 sources across 12 AI systems using 8 distinct metrics, including freshness and domain authority. Our results show that while GPT-5 leads in authoritative trust, engines like Gensee excel in relevance. Crucially, we found that a “dumb” model with high-quality search tools often outperforms a “smart” reasoning model with poor search tools, indicating that retrieval quality is the true bottleneck.

Modern GPUs increasingly expose asynchronous execution engines, yet today’s kernels must still linearize memory movement, computation, and control into a single SIMT program. Virtual Decoupled Cores (VDCores) decouples memory, compute, and control, reconnecting them only through explicit dependencies. VDCores virtualizes warps into software-defined memory/compute cores that communicate via queues/ports, enabling the runtime and compiler to safely schedule overlap as emergent behavior rather than hand-tuned tricks. VDCores reduces kernel code by ~70%, enables ~90% kernel reuse across variants, and delivers ~10% performance gains over existing solutions.

Reasoning models augmented with external tools have demonstrated strong abilities in solving complex tasks. Yet despite rapid gains in accuracy, the latency of reasoning and deep-research systems has been largely overlooked. We present the first systematic temporal and token study of three representative reasoning models and agents (OpenAI o3-deep-research, GPT-5, and the LangChain Deep Research Agent) on DeepResearch Bench.

FarSight is a Linux-based system that applies deep learning to far-memory prefetching, reducing high-latency memory access through accurate, low-overhead predictions. It decouples memory layout from application behavior, allowing offline-trained deep learning models to make efficient runtime decisions using lightweight mapping. Across data-intensive workloads, FarSight outperforms state-of-the-art systems by up to 3.6x, demonstrating deep learning’s practicality for performance-critical runtime optimization. The FarSight research paper is available on arXiv.

Building high-quality, cost-effective generative AI applications is challenging due to the absence of systematic methods for tuning, testing, and optimization. We introduce Cognify, a tool that automatically enhances generation quality and reduces costs for generative AI workflows, including those written with LangChain, DSPy, and annotated Python. Built on a novel foundation of hierarchical, workflow-level optimization, Cognify delivers up to a 48% improvement in generation quality and up to 9x cost reduction. Cognify is publicly available at https://github.com/GenseeAI/cognify.

Today’s LLM serving systems like vLLM and TGI primarily use a scheduling approach called iterative scheduling (or continuous batching), which decides the batch composition at every round (or every few rounds) of model forwarding. Unlike prior serving systems, which schedule the next batch only after the entire current batch finishes, iterative scheduling promises to improve GPU utilization and LLM serving rate, but it rests on a key assumption: that the scheduling overhead can be ignored. While this assumption generally held in the past, it is worth reexamining now that today’s LLM inference kernels run much faster than before and more scheduling tasks and considerations keep getting added.
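
As a rough illustration of the difference, the toy simulation below contrasts the two policies. It is a minimal sketch under stated assumptions, not how vLLM or TGI are implemented: the function names, the request lengths (measured in forward rounds), and the convention of counting one scheduling decision per scheduler invocation are all hypothetical choices made for this example.

```python
from collections import deque


def run_static_batching(lengths, batch_size):
    """Toy model: form the next batch only after the whole current batch finishes.

    `lengths` is a hypothetical list of per-request lengths in forward rounds.
    Returns (total_rounds, scheduling_decisions).
    """
    pending = deque(lengths)
    rounds = 0
    decisions = 0
    while pending:
        # One scheduling decision per batch.
        take = min(batch_size, len(pending))
        batch = [pending.popleft() for _ in range(take)]
        decisions += 1
        # Every slot is held until the longest request in the batch completes.
        rounds += max(batch)
    return rounds, decisions


def run_iterative_scheduling(lengths, batch_size):
    """Toy model: re-decide the batch composition at every forward round.

    Returns (total_rounds, scheduling_decisions).
    """
    pending = deque(lengths)
    running = []  # remaining rounds for each in-flight request
    rounds = 0
    decisions = 0
    while pending or running:
        # Scheduling step: refill free slots before each forward round.
        while pending and len(running) < batch_size:
            running.append(pending.popleft())
        decisions += 1
        # One forward round: each running request advances by one step,
        # and finished requests leave the batch immediately.
        running = [r - 1 for r in running if r - 1 > 0]
        rounds += 1
    return rounds, decisions


if __name__ == "__main__":
    requests = [8, 2, 2, 2]  # assumed workload: one long request, three short
    print("static   :", run_static_batching(requests, batch_size=2))
    print("iterative:", run_iterative_scheduling(requests, batch_size=2))
```

In this toy workload, iterative scheduling finishes in fewer rounds because short requests free their slots early, but it also invokes the scheduler at every round. That per-round cost is the overhead the assumption above treats as negligible, and it weighs more heavily as each forward round gets faster.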