<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Posts on</title><link>https://mlsys.wuklab.io/posts/</link><description>Recent content in Posts on</description><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Wed, 18 Feb 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://mlsys.wuklab.io/posts/index.xml" rel="self" type="application/rss+xml"/><item><title>SourceBench: Can AI Answers Reference Quality Web Sources?</title><link>https://mlsys.wuklab.io/posts/sourcebench/</link><pubDate>Wed, 18 Feb 2026 00:00:00 +0000</pubDate><guid>https://mlsys.wuklab.io/posts/sourcebench/</guid><description>Most AI benchmarks focus on answer correctness but ignore the quality of cited sources, leading to a &amp;ldquo;Garbage In, Garbage Out&amp;rdquo; blind spot. To address this, we introduce SourceBench, a framework evaluating 3,996 sources across 12 AI systems using 8 metrics such as freshness and domain authority. Our results show that while GPT-5 leads in authoritative trust, engines like Gensee excel in relevance; crucially, we found that a &amp;ldquo;dumb&amp;rdquo; model with high-quality search tools often outperforms a &amp;ldquo;smart&amp;rdquo; reasoning model with poor search tools, indicating that retrieval quality is the true bottleneck. &lt;br/>&lt;br/> &lt;a href="https://mlsys.wuklab.io/posts/sourcebench/" target="_blank">Read More&amp;hellip;&lt;/a></description></item><item><title>VDCores: A Runtime for Modern Async GPUs</title><link>https://mlsys.wuklab.io/posts/vdcores/</link><pubDate>Wed, 18 Feb 2026 00:00:00 +0000</pubDate><guid>https://mlsys.wuklab.io/posts/vdcores/</guid><description>Modern GPUs increasingly expose asynchronous execution engines, yet today&amp;rsquo;s kernels must still linearize memory movement, computation, and control into a single SIMT program. 
&lt;strong>Virtual Decoupled Cores (VDCores)&lt;/strong> decouples memory, compute, and control, reconnecting them only through explicit dependencies. VDCores virtualizes warps into software-defined memory/compute cores that communicate via queues/ports, enabling the runtime and compiler to safely schedule overlap as emergent behavior rather than hand-tuned tricks. VDCores reduces kernel code by &lt;strong>~70%&lt;/strong>, enables &lt;strong>~90%&lt;/strong> kernel reuse across variants, and delivers &lt;strong>~10%&lt;/strong> performance gains over existing solutions. &lt;br/>&lt;br/> &lt;a href="https://mlsys.wuklab.io/posts/vdcores/" target="_blank">Read More&amp;hellip;&lt;/a></description></item><item><title>Why is Your AI Deep Research Slow? A Temporal and Token Analysis of Reasoning Systems</title><link>https://mlsys.wuklab.io/posts/an-reasoning/</link><pubDate>Fri, 17 Oct 2025 00:00:00 +0000</pubDate><guid>https://mlsys.wuklab.io/posts/an-reasoning/</guid><description>Reasoning models augmented with external tools have demonstrated strong abilities in solving complex tasks, yet despite rapid gains in accuracy, the latency of reasoning and deep-research systems has been largely overlooked. We present the first systematic temporal and token study of three representative reasoning models and agents (OpenAI o3-deep-research, GPT-5, and the LangChain Deep Research Agent) on DeepResearch Bench. &lt;br/>&lt;br/> &lt;a href="https://mlsys.wuklab.io/posts/an-reasoning/" target="_blank">Read More&amp;hellip;&lt;/a></description></item><item><title>OSWorld-Human: Benchmarking the Efficiency of Computer-Use Agents</title><link>https://mlsys.wuklab.io/posts/oshuman/</link><pubDate>Wed, 02 Jul 2025 00:00:00 +0000</pubDate><guid>https://mlsys.wuklab.io/posts/oshuman/</guid><description>Computer-Use Agents (CUAs) can perform complex tasks, but their high latency makes them impractical: a task that takes a human a few minutes can take an agent over 20 minutes. 
We study these bottlenecks and construct a new benchmark, OSWorld-Human, that measures both the accuracy and the temporal efficiency of CUAs. &lt;br/>&lt;br/> &lt;a href="https://mlsys.wuklab.io/posts/oshuman/" target="_blank">Read More&amp;hellip;&lt;/a></description></item><item><title>FarSight: Deep-Learning-Driven Prefetching for Far Memory</title><link>https://mlsys.wuklab.io/posts/farsight/</link><pubDate>Tue, 10 Jun 2025 00:00:00 +0000</pubDate><guid>https://mlsys.wuklab.io/posts/farsight/</guid><description>FarSight is a Linux-based system that applies deep learning to far-memory prefetching, reducing high-latency memory access through accurate, low-overhead predictions. It decouples memory layout from application behavior, allowing offline-trained deep learning models to make efficient runtime decisions using a lightweight mapping. Across data-intensive workloads, FarSight outperforms state-of-the-art systems by up to &lt;strong>3.6×&lt;/strong>, demonstrating deep learning&amp;rsquo;s practicality for performance-critical runtime optimization. The FarSight research paper can be found on &lt;a href="https://arxiv.org/abs/2506.00384" target="_blank">arXiv&lt;/a>. &lt;br/>&lt;br/> &lt;a href="https://mlsys.wuklab.io/posts/farsight/" target="_blank">Read More&amp;hellip;&lt;/a></description></item><item><title>Cognify: A Comprehensive, Multi-Faceted Gen AI Workflow Optimizer</title><link>https://mlsys.wuklab.io/posts/cognify/</link><pubDate>Mon, 25 Nov 2024 00:00:00 +0000</pubDate><guid>https://mlsys.wuklab.io/posts/cognify/</guid><description>Building high-quality, cost-effective generative AI applications is challenging due to the absence of systematic methods for tuning, testing, and optimization. We introduce &lt;strong>Cognify&lt;/strong>, a tool that automatically enhances generation quality and reduces costs for generative AI workflows, including those written with LangChain, DSPy, and annotated Python. 
Built on a novel foundation of hierarchical, workflow-level optimization, Cognify delivers up to a &lt;strong>48% improvement in generation quality&lt;/strong> and up to a &lt;strong>9x cost reduction&lt;/strong>. Cognify is publicly available at &lt;a href="https://github.com/GenseeAI/cognify" target="_blank">https://github.com/GenseeAI/cognify&lt;/a>. &lt;br/>&lt;br/> &lt;a href="https://mlsys.wuklab.io/posts/cognify/" target="_blank">Read More&amp;hellip;&lt;/a></description></item><item><title>Can Scheduling Overhead Dominate LLM Inference Performance? A Study of CPU Scheduling Overhead on Two Popular LLM Inference Systems</title><link>https://mlsys.wuklab.io/posts/scheduling_overhead/</link><pubDate>Tue, 10 Sep 2024 00:00:00 +0000</pubDate><guid>https://mlsys.wuklab.io/posts/scheduling_overhead/</guid><description>Today’s LLM serving systems like &lt;a href="https://github.com/vllm-project/vllm" target="_blank">vLLM&lt;/a> and &lt;a href="https://huggingface.co/docs/text-generation-inference/en/index" target="_blank">TGI&lt;/a> primarily use a scheduling approach called iterative scheduling (or continuous batching), which decides the batch composition at every round (or every few rounds) of model forwarding. Unlike prior serving systems, which schedule the next batch only after the entire current batch finishes, iterative scheduling promises to improve GPU utilization and LLM serving rate, but it rests on a key assumption: the scheduling overhead can be ignored. While this assumption generally held in the past, it is worth reexamining now that today’s LLM &lt;a href="https://flashinfer.ai/" target="_blank">inference kernels&lt;/a> run much faster than before and more scheduling tasks and considerations are being added.
&lt;a href="https://mlsys.wuklab.io/posts/scheduling_overhead/" target="_blank">Read More&amp;hellip;&lt;/a></description></item><item><title>Preble: Efficient Prompt Scheduling for Augmented Large Language Models</title><link>https://mlsys.wuklab.io/posts/preble/</link><pubDate>Tue, 07 May 2024 00:00:00 +0000</pubDate><guid>https://mlsys.wuklab.io/posts/preble/</guid><description>LLM prompts are growing more complex and longer with &lt;a href="https://arxiv.org/abs/2308.11432" target="_blank">agents&lt;/a>, &lt;a href="https://platform.openai.com/docs/guides/function-calling" target="_blank">tool use&lt;/a>, &lt;a href="https://arxiv.org/html/2404.07143v1" target="_blank">large documents&lt;/a>, &lt;a href="https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024/#context-window" target="_blank">video clips&lt;/a>, and detailed &lt;a href="https://arxiv.org/pdf/2210.03629" target="_blank">few-shot examples&lt;/a>. These prompts often have content that is shared across many requests. The computed intermediate state (KV cache) from one prompt can be reused by another for their shared parts to improve request handling performance and save GPU computation resources. However, current distributed LLM serving systems treat each request as independent and miss the opportunity to reuse the computed intermediate state.
We introduce &lt;a href="https://arxiv.org/abs/2407.00023" target="_blank">&lt;strong>Preble&lt;/strong>&lt;/a>, the first distributed LLM serving system that targets long and shared prompts. Preble achieves a &lt;strong>1.5-14.5x&lt;/strong> average and &lt;strong>2-10x&lt;/strong> p99 latency reduction over SOTA serving systems. The core of Preble is a new E2 Scheduling that optimizes load distribution and KV cache reutilization. Preble is compatible with multiple serving backends such as &lt;a href="https://github.com/vllm-project/vllm" target="_blank">vLLM&lt;/a> and &lt;a href="https://github.com/sgl-project/sglang" target="_blank">SGLang&lt;/a>. &lt;br/>&lt;br/> &lt;a href="https://mlsys.wuklab.io/posts/preble/" target="_blank">Read More&amp;hellip;&lt;/a></description></item><item><title>Efficient Augmented LLM Serving With InferCept</title><link>https://mlsys.wuklab.io/posts/infercept/</link><pubDate>Sat, 10 Feb 2024 00:00:00 +0000</pubDate><guid>https://mlsys.wuklab.io/posts/infercept/</guid><description>Today&amp;rsquo;s large language models (LLMs) are being paired with various tools and environments to satisfy increasingly complex user queries. Augmenting models with these capabilities means LLM &lt;ins>&lt;strong>infer&lt;/strong>&lt;/ins>ence can be inter&lt;ins>&lt;strong>cept&lt;/strong>&lt;/ins>ed by external actions. We designed &lt;a href="https://arxiv.org/pdf/2402.01869" target="_blank">InferCept [ICML &amp;lsquo;24]&lt;/a>, the first serving framework designed for augmented LLMs. InferCept minimizes resource waste and sustains a &lt;strong>1.6x-2x higher serving load&lt;/strong>, completing twice as many requests compared to &lt;a href="https://github.com/vllm-project/vllm" target="_blank">state-of-the-art serving systems&lt;/a>. Try InferCept &lt;a href="https://github.com/WukLab/InferCept" target="_blank">here&lt;/a>. 
&lt;br/>&lt;br/> &lt;a href="https://mlsys.wuklab.io/posts/infercept/" target="_blank">Read More&amp;hellip;&lt;/a></description></item></channel></rss>