Posts

2026

TClone: Decoupling Fast Branch Creation from Durable Checkpointing for Computer-Use Agents May 17, 2026
Beat the Long Tail: Distribution-Aware Speculative Decoding for RL Training May 17, 2026
Nitsum: Serving Tiered LLM Requests with Adaptive Tensor Parallelism May 16, 2026
VDCores: A Runtime for Modern Async GPUs February 18, 2026
SourceBench: Can AI Answers Reference Quality Web Sources? February 18, 2026

2025

Why is Your AI Deep Research Slow? A Temporal and Token Analysis of Reasoning Systems October 17, 2025
OSWorld-Human: Benchmarking the Efficiency of Computer-Use Agents July 2, 2025
FarSight: Deep-Learning-Driven Prefetching for Far Memory June 10, 2025

2024

Cognify: A Comprehensive, Multi-Faceted Gen AI Workflow Optimizer November 25, 2024
Can Scheduling Overhead Dominate LLM Inference Performance? A Study of CPU Scheduling Overhead on Two Popular LLM Inference Systems September 10, 2024
Preble: Efficient Prompt Scheduling for Augmented Large Language Models May 7, 2024
Efficient Augmented LLM Serving With InferCept February 10, 2024