News

Today’s large language models (LLMs) are being paired with various tools and environments to satisfy increasingly complex user queries. Augmenting models with these capabilities means LLM inference can be intercepted by external actions. We designed InferCept [ICML ‘24], the first serving framework designed for augmented LLMs. InferCept minimizes resource waste and sustains a 1.6x-2x higher serving load, completing twice as many requests compared to state-of-the-art serving systems. Try InferCept here.

Read More…