News

6/19/2025 Preprint release of OSWorld-Human: Benchmarking the Efficiency of Computer-Use Agents
6/10/2025 Preprint release of FarSight: Deep-Learning-Driven Prefetching for Far Memory
6/9/2025 🎉 OSWorld-Human was accepted to the Workshop on Computer-Use Agents at ICML 2025!
5/15/2025 🎉 Cognify was accepted to KDD 2025!
2/05/2024 Preprint release of Cognify: The Automated Optimizer for Generative AI Workflows
1/22/2024 🎉 Preble was accepted to ICLR 2025!
11/25/2024 Release of Cognify: The Automated Optimizer for Generative AI Workflows
5/22/2024 Preprint release of Preble: Efficient Distributed Prompt Scheduling for LLM Serving
5/1/2024 🎉 InferCept was accepted to ICML 2024!
2/4/2024 Arxiv release of InferCept: Efficient Intercept Support for Augmented LLM Inference

OSWorld-Human: Benchmarking the Efficiency of Computer-Use Agents

July 2, 2025 - 3 mins read

Computer-Use Agents (CUAs) can perform complex tasks, but their high latency makes them impractical. A task taking a human minutes can take an agent over 20 minutes. We study these bottlenecks and construct a new benchmark, OSWorld-Human, that measures both accuracy and temporal efficiency of CUAs.

Read More…

FarSight: Deep-Learning-Driven Prefetching for Far Memory

June 10, 2025 - 4 mins read

FarSight is a Linux-based system that applies deep learning to far-memory prefetching, reducing high-latency memory access through accurate, low-overhead predictions. It decouples memory layout from application behavior, allowing offline-trained deep learning models to make efficient runtime decisions using lightweight mapping. Across data-intensive workloads, FarSight outperforms state-of-the-art systems by up to 3.6×, proving deep learning’s practicality for performance-critical runtime optimization. FarSight research paper can be found on arxiv.

Read More…

Cognify: A Comprehensive, Multi-Faceted Gen AI Workflow Optimizer

November 25, 2024 - 5 mins read

Building high-quality, cost-effective generative AI applications is challenging due to the absence of systematic methods for tuning, testing, and optimization. We introduce Cognify, a tool that automatically enhances generation quality and reduces costs for generative AI workflows, including those written with LangChain, DSPy, and annotated Python. Built on a novel foundation of hierarchical, workflow-level optimization, Cognify delivers up to a 48% improvement in generation quality and up to 9x cost reduction. Cognify is publicly available at https://github.com/GenseeAI/cognify.

Read More…

Can Scheduling Overhead Dominate LLM Inference Performance? A Study of CPU Scheduling Overhead on Two Popular LLM Inference Systems

September 10, 2024 - 12 mins read

Today’s LLM serving systems like vLLM and TGI primarily use a scheduling approach called iterative scheduling (or continuous batching), which decides the batch composition at every round (or every few rounds) of model forwarding. Different from prior serving systems that schedule the next batch after the entire current batch finishes, iterative scheduling promises to improve GPU utilization and LLM serving rate, but with a key assumption: the scheduling overhead can be ignored. While this assumption generally held in the past, it is worth reexamination as today’s LLM inference kernels run much faster than before and as more scheduling tasks and considerations get added. Read More…

Preble: Efficient Prompt Scheduling for Augmented Large Language Models

May 7, 2024 - 5 mins read

LLM prompts are growing more complex and longer with agents, tool use, large documents, video clips, and detailed few-shot examples. These prompts often have content that is shared across many requests. The computed intermediate state (KV cache) from one prompt can be reused by another for their shared parts to improve request handling performance and save GPU computation resources. However, current distributed LLM serving systems treat each request as independent and miss the opportunity to reuse the computed intermediate state. We introduce Preble, the first distributed LLM serving system that targets long and shared prompts. Preble achieves a 1.5-14.5x average and 2-10x p99 latency reduction over SOTA serving systems. The core of Preble is a new E2 Scheduling that optimizes load distribution and KV cache reutilization. Preble is compatible with multiple serving backends such as vLLM and SGLang.

Read More…

Efficient Augmented LLM Serving With InferCept

February 10, 2024 - 6 mins read

Today’s large language models (LLMs) are being paired with various tools and environments to satisfy increasingly complex user queries. Augmenting models with these capabilities means LLM inference can be intercepted by external actions. We designed InferCept [ICML ‘24], the first serving framework designed for augmented LLMs. InferCept minimizes resource waste and sustains a 1.6x-2x higher serving load, completing twice as many requests compared to state-of-the-art serving systems. Try InferCept here.

Read More…