Computer-use agents increasingly want to branch: try several actions in parallel, keep the best, and roll back the rest. But a branch of a desktop is an entire running workspace, and cloning it with today’s tools means a synchronous checkpoint/restore on the critical path of every speculative step. We built TClone, a workspace-versioning substrate that makes a branch runnable immediately by sharing memory and filesystem state copy-on-write and pushing durable checkpointing off the fast path. TClone clones a live workspace up to 4.9x faster than VM snapshots and 3.4x faster than stock CRIU, and cuts end-to-end agent task latency by up to 3.7x.
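As a toy analogy for the copy-on-write branching described above, the sketch below uses Python's standard `collections.ChainMap`. It is not TClone's mechanism, which shares memory and filesystem state of a live workspace; the dictionary keys and branch names are hypothetical. Each branch reads through to the shared parent state and records only its own writes, so creating a branch copies nothing up front.

```python
from collections import ChainMap

# Parent workspace state (hypothetical keys, for illustration only).
workspace = ChainMap({"doc.txt": "draft", "cursor": 0})

# Branching copies nothing: each child adds an empty overlay on top of the parent.
branch_a = workspace.new_child()
branch_b = workspace.new_child()

# Writes land in the branch's own overlay; the parent is left untouched.
branch_a["doc.txt"] = "draft + edit A"
branch_b["cursor"] = 42

# Keep the best branch by folding its overlay into the parent.
workspace.maps[0].update(branch_a.maps[0])

# Roll back the other branch by simply discarding it.
del branch_b
```

In this analogy the cost of a branch is proportional to what it writes, not to the size of the workspace, which is the property that makes speculative steps cheap.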
A single LLM deployment now serves everything from latency-critical chat to relaxed background jobs under a fixed GPU budget. We designed Nitsum [arXiv ‘26], the first serving system that treats tensor parallelism (TP) as a runtime control surface instead of a fixed deployment choice. By making TP switching nearly free and reconfiguring the cluster to track shifting workloads, Nitsum improves SLO-compliant goodput by up to 5.3x over state-of-the-art systems.
Most AI benchmarks focus on answer correctness but ignore the quality of cited sources, leading to a “Garbage In, Garbage Out” blind spot. To address this, we introduced SourceBench, a framework that evaluates 3,996 sources across 12 AI systems using 8 distinct metrics, such as freshness and domain authority. Our results show that while GPT-5 leads in authoritative trust, engines like Gensee excel in relevance. Crucially, we found that a “dumb” model with high-quality search tools often outperforms a “smart” reasoning model with poor search tools, which indicates that retrieval quality is the true bottleneck.
Modern GPUs increasingly expose asynchronous execution engines, yet today’s kernels must still linearize memory movement, computation, and control into a single SIMT program. Virtual Decoupled Cores (VDCores) decouples memory, compute, and control, reconnecting them only through explicit dependencies. VDCores virtualizes warps into software-defined memory/compute cores that communicate via queues/ports, enabling the runtime and compiler to safely schedule overlap as emergent behavior rather than hand-tuned tricks. VDCores reduces kernel code by ~70%, enables ~90% kernel reuse across variants, and delivers ~10% performance gains over existing solutions.
Reasoning models augmented with external tools have demonstrated strong abilities in solving complex tasks. Despite these rapid gains in accuracy, the latency of reasoning and deep-research systems has been largely overlooked. We present the first systematic temporal and token study of three representative reasoning models and agents (OpenAI o3-deep-research, GPT-5, and the LangChain Deep Research Agent) on DeepResearch Bench.
Computer-Use Agents (CUAs) can perform complex tasks, but their high latency makes them impractical. A task taking a human minutes can take an agent over 20 minutes. We study these bottlenecks and construct a new benchmark, OSWorld-Human, that measures both accuracy and temporal efficiency of CUAs.
FarSight is a Linux-based system that applies deep learning to far-memory prefetching, reducing high-latency memory access through accurate, low-overhead predictions. It decouples memory layout from application behavior, allowing offline-trained deep learning models to make efficient runtime decisions using lightweight mapping. Across data-intensive workloads, FarSight outperforms state-of-the-art systems by up to 3.6×, proving deep learning’s practicality for performance-critical runtime optimization. The FarSight research paper is available on arXiv.
Building high-quality, cost-effective generative AI applications is challenging due to the absence of systematic methods for tuning, testing, and optimization. We introduce Cognify, a tool that automatically enhances generation quality and reduces costs for generative AI workflows, including those written with LangChain, DSPy, and annotated Python. Built on a novel foundation of hierarchical, workflow-level optimization, Cognify delivers up to a 48% improvement in generation quality and up to 9x cost reduction. Cognify is publicly available at https://github.com/GenseeAI/cognify.
Today’s LLM serving systems, such as vLLM and TGI, primarily use a scheduling approach called iterative scheduling (or continuous batching), which decides the batch composition at every round (or every few rounds) of model forwarding. Unlike prior serving systems, which schedule the next batch only after the entire current batch finishes, iterative scheduling promises to improve GPU utilization and LLM serving rate, but it rests on a key assumption: that the scheduling overhead is negligible. While this assumption generally held in the past, it deserves reexamination now that LLM inference kernels run much faster than before and more scheduling tasks and considerations keep being added.
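To make the contrast concrete, here is a minimal simulation, not taken from vLLM or TGI, that counts model-forwarding rounds under both policies. It assumes each request is described only by the number of output tokens it still needs (at least 1), that every round produces one token per running request, and that scheduling itself is free. The function names and the workload are hypothetical.

```python
from collections import deque

def static_batching_rounds(lengths, batch_size):
    """Schedule the next batch only after the entire current batch finishes."""
    rounds = 0
    for i in range(0, len(lengths), batch_size):
        # The whole batch occupies the GPU until its longest request is done.
        rounds += max(lengths[i:i + batch_size])
    return rounds

def iterative_scheduling_rounds(lengths, batch_size):
    """Re-decide the batch composition before every forwarding round."""
    waiting = deque(lengths)
    running = []  # remaining tokens for each request currently in the batch
    rounds = 0
    while waiting or running:
        # Scheduling decision: fill any free slots from the waiting queue.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        rounds += 1
        # One token generated per request; finished requests leave the batch.
        running = [r - 1 for r in running if r > 1]
    return rounds

# Hypothetical workload: two long requests mixed with short ones.
lengths = [8, 1, 1, 1, 8, 1, 1, 1]
static_rounds = static_batching_rounds(lengths, batch_size=4)
iterative_rounds = iterative_scheduling_rounds(lengths, batch_size=4)
```

On this workload the static policy needs 16 rounds and the iterative policy needs 9, but the iterative policy makes a scheduling decision before each of its 9 rounds, whereas the static policy makes only 2. The saving therefore holds only as long as each decision is cheap relative to a forwarding round.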
LLM prompts are growing more complex and longer with agents, tool use, large documents, video clips, and detailed few-shot examples. These prompts often have content that is shared across many requests. The computed intermediate state (KV cache) from one prompt can be reused by another for their shared parts to improve request handling performance and save GPU computation resources. However, current distributed LLM serving systems treat each request as independent and miss the opportunity to reuse the computed intermediate state.
We introduce Preble, the first distributed LLM serving system that targets long and shared prompts. Preble achieves a 1.5-14.5x average and 2-10x p99 latency reduction over SOTA serving systems. The core of Preble is E2 scheduling, a new approach that optimizes load distribution and KV cache reuse. Preble is compatible with multiple serving backends such as vLLM and SGLang.
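The saving from reusing shared prompt parts can be illustrated with a small sketch. This is not Preble's scheduler or any serving backend's API: it assumes prompts are plain token lists, that only a shared prefix can be reused, and that the cost of a request is the number of tokens whose intermediate state must be computed. The `PrefixCache` class and its method are hypothetical names.

```python
def shared_prefix_len(a, b):
    """Number of leading tokens that two token sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

class PrefixCache:
    """Toy stand-in for a KV cache: remembers prompts it has already processed."""

    def __init__(self):
        self.prompts = []

    def tokens_to_compute(self, prompt):
        # Reuse the longest prefix shared with any earlier prompt.
        reused = max(
            (shared_prefix_len(prompt, p) for p in self.prompts), default=0
        )
        self.prompts.append(list(prompt))
        return len(prompt) - reused

cache = PrefixCache()
system_prompt = ["you", "are", "a", "helpful", "agent"]
first = cache.tokens_to_compute(system_prompt + ["summarize", "this"])
second = cache.tokens_to_compute(system_prompt + ["translate"])
```

The first request computes all 7 tokens, while the second computes only 1 because the 5-token system prompt is already cached. In a distributed setting this reuse is available only if the second request reaches the machine that holds the cached prefix, which is why request placement and cache reuse have to be considered together.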