News

5/17/2026 Beat the Long Tail: Distribution-Aware Speculative Decoding for RL Training (MLSys ’26)
5/16/2026 Preprint release of Nitsum: Serving Tiered LLM Requests with Adaptive Tensor Parallelism (blog post)
1/26/2026 🎉 OSWorld-Human was accepted to MLSys 2026!
1/26/2026 🎉 Beat the Long Tail: Distribution-Aware Speculative Decoding for RL Training was accepted to MLSys 2026!
9/23/2025 🎉 FarSight was accepted to the Workshop on Machine Learning for Systems at NeurIPS 2025!
9/23/2025 🎉 Demystifying Delays in Reasoning was accepted to the Workshop on Efficient Reasoning at NeurIPS 2025!

Preble: Efficient Prompt Scheduling for Augmented Large Language Models

May 7, 2024 - 5 mins read

LLM prompts are growing longer and more complex as they incorporate agents, tool use, large documents, video clips, and detailed few-shot examples. These prompts often contain content that is shared across many requests. The intermediate state (KV cache) computed for one prompt can be reused for the shared parts of another, which reduces request latency and saves GPU computation. However, current distributed LLM serving systems treat each request as independent and miss this reuse opportunity. We introduce Preble, the first distributed LLM serving system that targets long and shared prompts. Preble achieves a 1.5-14.5x average and 2-10x p99 latency reduction over SOTA serving systems. At the core of Preble is a new E2 scheduling algorithm that optimizes both load distribution and KV cache reuse. Preble is compatible with multiple serving backends, such as vLLM and SGLang.
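To make the trade-off between cache reuse and load distribution concrete, here is a minimal toy sketch in Python. It is not Preble's E2 algorithm: the `Gpu` class, the `schedule` function, the token-count load metric, and the `load_weight` parameter are all hypothetical names and assumptions chosen for illustration. The sketch only shows the general idea of sending a request to the GPU where the fewest tokens need to be recomputed, unless that GPU is already heavily loaded.

```python
from dataclasses import dataclass, field


def shared_prefix_len(a: list[int], b: list[int]) -> int:
    """Number of leading tokens two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


@dataclass
class Gpu:
    # Hypothetical model of one serving instance (not a Preble type).
    name: str
    cached_prompts: list[list[int]] = field(default_factory=list)
    load: int = 0  # toy load metric: tokens of prefill work assigned so far

    def best_reuse(self, prompt: list[int]) -> int:
        """Longest prefix of `prompt` already cached on this GPU."""
        return max(
            (shared_prefix_len(prompt, p) for p in self.cached_prompts),
            default=0,
        )


def schedule(prompt: list[int], gpus: list[Gpu], load_weight: float = 0.5) -> Gpu:
    """Pick the GPU with the lowest estimated cost (an assumed cost model).

    cost = tokens that must be recomputed + load_weight * current load
    """
    def cost(g: Gpu) -> float:
        return (len(prompt) - g.best_reuse(prompt)) + load_weight * g.load

    chosen = min(gpus, key=cost)
    # Account for the new work before recording the prompt in the cache.
    chosen.load += len(prompt) - chosen.best_reuse(prompt)
    chosen.cached_prompts.append(prompt)
    return chosen


gpus = [Gpu("gpu0"), Gpu("gpu1")]
system_prompt = list(range(100))  # a long prefix shared by many requests

first = schedule(system_prompt + [1000, 1001], gpus)
second = schedule(system_prompt + [2000], gpus)    # shares the long prefix
third = schedule(list(range(5000, 5010)), gpus)    # unrelated short prompt
print(first.name, second.name, third.name)  # prints "gpu0 gpu0 gpu1"
```

In this toy run, the second request follows the first to `gpu0` because only one of its tokens needs recomputing there, while the unrelated third request goes to the idle `gpu1`. A real system would also need to handle cache eviction, decoding load, and changing request patterns, which this sketch ignores.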

Read More…

Efficient Augmented LLM Serving With InferCept

February 10, 2024 - 6 mins read

Today’s large language models (LLMs) are being paired with various tools and environments to satisfy increasingly complex user queries. Augmenting models with these capabilities means LLM inference can be intercepted by external actions. We designed InferCept [ICML ’24], the first serving framework built for augmented LLMs. InferCept minimizes resource waste and sustains a 1.6x-2x higher serving load, completing twice as many requests as state-of-the-art serving systems. Try InferCept here.

Read More…