<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Tensor Parallelism</title><link>https://mlsys.wuklab.io/tags/tensor-parallelism/</link><description>Recent content in Tensor Parallelism</description><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Sat, 16 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://mlsys.wuklab.io/tags/tensor-parallelism/index.xml" rel="self" type="application/rss+xml"/><item><title>Nitsum: Serving Tiered LLM Requests with Adaptive Tensor Parallelism</title><link>https://mlsys.wuklab.io/posts/nitsum/</link><pubDate>Sat, 16 May 2026 00:00:00 +0000</pubDate><guid>https://mlsys.wuklab.io/posts/nitsum/</guid><description>A single LLM deployment now serves everything from latency-critical chat to relaxed background jobs under a fixed GPU budget. We designed &lt;a href="https://arxiv.org/abs/2605.05467" target="_blank">Nitsum [arXiv &amp;rsquo;26]&lt;/a>, the first serving system that treats &lt;strong>tensor parallelism (TP) as a runtime control surface&lt;/strong> instead of a fixed deployment choice. By making TP switching nearly free and reconfiguring the cluster to track shifting workloads, Nitsum improves SLO-compliant &lt;strong>goodput by up to 5.3x&lt;/strong> over state-of-the-art systems. &lt;br/>&lt;br/> &lt;a href="https://mlsys.wuklab.io/posts/nitsum/" target="_blank">Read More&amp;hellip;&lt;/a></description></item></channel></rss>