Posts
2026
- TClone: Decoupling Fast Branch Creation from Durable Checkpointing for Computer-Use Agents
- Beat the Long Tail: Distribution-Aware Speculative Decoding for RL Training
- Nitsum: Serving Tiered LLM Requests with Adaptive Tensor Parallelism
- VDCores: A Runtime for Modern Async GPUs
- SourceBench: Can AI Answers Reference Quality Web Sources?
2025
- Why is Your AI Deep Research Slow? A Temporal and Token Analysis of Reasoning Systems
- OSWorld-Human: Benchmarking the Efficiency of Computer-Use Agents
- FarSight: Deep-Learning-Driven Prefetching for Far Memory
2024
- Cognify: A Comprehensive, Multi-Faceted Gen AI Workflow Optimizer
- Can Scheduling Overhead Dominate LLM Inference Performance? A Study of CPU Scheduling Overhead on Two Popular LLM Inference Systems
- Preble: Efficient Prompt Scheduling for Augmented Large Language Models
- Efficient Augmented LLM Serving With InferCept