Nitsum: Serving Tiered LLM Requests with Adaptive Tensor Parallelism

Sat, 16 May 2026 00:00:00 +0000

A single LLM deployment now serves everything from latency-critical chat to relaxed background jobs under a fixed GPU budget. We designed Nitsum [arXiv '26], the first serving system that treats tensor parallelism (TP) as a runtime control surface instead of a fixed deployment choice. By making TP switching nearly free and reconfiguring the cluster to track shifting workloads, Nitsum improves SLO-compliant goodput by up to 5.3x over state-of-the-art systems.
