Technology & AI

LightSeek Foundation Releases TokenSpeed, Open-Source LLM Inference Engine Targeting TensorRT-LLM-Level Performance for Agentic Workloads.

Inference efficiency has quietly become one of the most important constraints in AI deployment. As agentic coding systems like Claude Code, Codex, and Cursor scale from developer tools to the infrastructure that powers software development in general, the underlying engines that serve those applications are under increasing pressure. LightSeek Foundation researchers have released TokenSpeed, an open-source LLM inference engine published under the MIT license and designed specifically for the needs of agentic workloads. The engine is currently available in preview.

Why Agentic Inference Is a Different Problem

To understand why TokenSpeed’s design choices matter, it helps to understand what makes agentic inference hard. Coding agents don’t behave like a traditional chatbot: context routinely exceeds 50K tokens, and conversations often span multiple sessions. This creates simultaneous pressure on two metrics: per-GPU TPM (tokens per minute), which determines how many users one GPU can serve, and per-user TPS (tokens per second), which determines how responsive each user perceives the system to be. Most public benchmarks do not fully capture this behavior.

TokenSpeed is designed to maximize both. The goal is to maximize each GPU’s TPM while holding each user’s TPS above a target floor – typically 70 TPS, and sometimes 200 TPS or more.
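As a back-of-the-envelope illustration of the tension between these two metrics, the crude sketch below uses made-up numbers and deliberately treats one GPU’s aggregate token budget as fixed (real batching dynamics shift that budget, but the trade-off is the same):

```python
# Crude model with invented numbers: a fixed per-GPU token budget divided
# among users shows how raising the per-user TPS floor cuts the number of
# concurrent users one GPU can serve.
per_gpu_tpm = 120_000  # hypothetical aggregate tokens/minute for one GPU

for per_user_tps in (70, 100, 200):  # per-user responsiveness floors cited in the article
    per_user_tpm = per_user_tps * 60
    max_users = per_gpu_tpm // per_user_tpm
    print(f"{per_user_tps} TPS/user -> at most {max_users} concurrent users per GPU")
```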

Architecture: Five Interoperable Subsystems

TokenSpeed is built around five design pillars: a compiler-assisted parallelism mechanism, a high-performance scheduler, type-safe management of KV cache resources and their reuse, a layered, pluggable kernel system that supports multiple accelerators, and SMG integration for a low-CPU-overhead execution entry point.

The modeling layer uses a native SPMD (Single Program, Multiple Data) approach. SPMD is a parallel programming model in which all processes run the same program on different subsets of the data – a common pattern in distributed deep learning. Rather than requiring developers to hand-write communication between processes, TokenSpeed lets them attach I/O placement annotations at module boundaries, and a lightweight static compiler then automatically inserts the necessary communication operations at model-construction time, eliminating hand-written communication code.
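To make the idea concrete, here is a hypothetical sketch – not TokenSpeed’s actual API, all names invented – of how placement annotations at module boundaries can let a small compiler pass decide where a collective is needed:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class Placement:
    kind: str                  # "replicated", "sharded", or "partial" (unreduced partial sums)
    dim: Optional[int] = None  # shard dimension when kind == "sharded"

@dataclass
class ModuleSpec:
    name: str
    in_placement: Placement
    out_placement: Placement

def plan_collectives(pipeline: List[ModuleSpec]) -> List[str]:
    """Walk module boundaries and emit whatever collective reconciles the layouts."""
    ops = []
    for prev, nxt in zip(pipeline, pipeline[1:]):
        produced, expected = prev.out_placement, nxt.in_placement
        if produced == expected:
            continue                                            # layouts already agree
        if produced.kind == "partial" and expected.kind == "replicated":
            ops.append(f"all_reduce after {prev.name}")
        elif produced.kind == "sharded" and expected.kind == "replicated":
            ops.append(f"all_gather(dim={produced.dim}) after {prev.name}")
        else:
            raise NotImplementedError(f"{produced} -> {expected}")
    return ops

# A column-parallel projection feeding a row-parallel projection: the second
# matmul produces partial sums, so an all-reduce is inserted before the
# replicated layer norm that follows it.
pipeline = [
    ModuleSpec("qkv_proj",  Placement("replicated"),      Placement("sharded", dim=-1)),
    ModuleSpec("out_proj",  Placement("sharded", dim=-1), Placement("partial")),
    ModuleSpec("layernorm", Placement("replicated"),      Placement("replicated")),
]
print(plan_collectives(pipeline))   # -> ['all_reduce after out_proj']
```

The point is that developers only declare layouts; the decision to insert an all-reduce or all-gather is derived mechanically from mismatches between them.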

The scheduler makes a structural separation between a control plane and an execution plane. The control plane is implemented in C++ as a finite state machine that leans on the type system to enforce safe resource management – including KV cache state transitions and reuse – at compile time rather than at runtime. Request lifecycles, KV cache resources, and overlapped execution are expressed through explicit FSM transitions and ownership semantics, so correctness is enforced by a verifiable control structure rather than by convention. By encoding these constraints in the type system instead of letting them surface at runtime, bugs in KV cache management – one of the most common sources of errors in LLM serving – are caught earlier. The execution plane is written in Python to preserve development velocity, allowing rapid feature iteration and low cognitive load for developers.
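The real control plane is C++, where illegal transitions can be made unrepresentable at compile time. The Python sketch below – with all states and names invented for illustration – only conveys the finite-state-machine idea of making every step of the KV cache lifecycle an explicit, checked transition:

```python
# Conceptual sketch only: in the C++ control plane these rules live in the
# type system and violations are compile errors; here they surface as
# immediate runtime errors instead of silent cache corruption.
from enum import Enum, auto

class KVState(Enum):
    ALLOCATED = auto()   # blocks reserved, nothing written yet
    FILLING   = auto()   # prefill writing into the blocks
    ACTIVE    = auto()   # decode reading/appending
    CACHED    = auto()   # request finished, prefix kept for reuse
    FREED     = auto()   # blocks returned to the pool

# Each state lists the only states it may move to.
_ALLOWED = {
    KVState.ALLOCATED: {KVState.FILLING, KVState.FREED},
    KVState.FILLING:   {KVState.ACTIVE, KVState.FREED},
    KVState.ACTIVE:    {KVState.CACHED, KVState.FREED},
    KVState.CACHED:    {KVState.ACTIVE, KVState.FREED},   # prefix reuse
    KVState.FREED:     set(),
}

class KVCacheHandle:
    def __init__(self) -> None:
        self.state = KVState.ALLOCATED

    def transition(self, new_state: KVState) -> None:
        if new_state not in _ALLOWED[self.state]:
            raise RuntimeError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state

handle = KVCacheHandle()
handle.transition(KVState.FILLING)
handle.transition(KVState.ACTIVE)
handle.transition(KVState.CACHED)    # request finished, prefix kept warm
handle.transition(KVState.ACTIVE)    # follow-up turn hits the cached prefix
handle.transition(KVState.FREED)
# handle.transition(KVState.ACTIVE)  # would raise: FREED has no successors
```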

The kernel layer treats GPU kernels as a first-class modular subsystem rather than baking them into the engine core. It provides a portable public API, a centralized registration and opt-in model, and an extensible plugin approach for supporting different accelerators – meaning the engine is not locked to NVIDIA hardware. The team has also developed a highly optimized MLA (Multi-head Latent Attention) kernel for agentic workloads on NVIDIA Blackwell. In the decode kernel, q_seqlen and num_heads are folded together to make full use of Tensor Cores, since num_heads is small in these use cases. The prefill kernel includes a carefully tuned softmax implementation. Notably, the TokenSpeed MLA kernels have already been adopted by vLLM.
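A minimal sketch – hypothetical, not TokenSpeed’s real registration API – of what a centralized, opt-in kernel registry with per-backend plugins can look like:

```python
# Kernels register themselves under an (operation, backend) key; the engine
# core resolves them by key, so new accelerators plug in without touching core code.
from typing import Callable, Dict, Tuple

_KERNELS: Dict[Tuple[str, str], Callable] = {}

def register_kernel(op: str, backend: str):
    """Decorator: opt a kernel implementation in for a given op/backend pair."""
    def decorator(fn: Callable) -> Callable:
        _KERNELS[(op, backend)] = fn
        return fn
    return decorator

def resolve_kernel(op: str, backend: str) -> Callable:
    """The engine core looks kernels up by key; missing entries fail loudly."""
    try:
        return _KERNELS[(op, backend)]
    except KeyError:
        raise NotImplementedError(f"no '{op}' kernel registered for backend '{backend}'")

@register_kernel("mla_decode", "cuda")
def mla_decode_blackwell(q, kv_cache):
    ...  # would call into the Blackwell-tuned decode kernel

@register_kernel("mla_decode", "cpu")
def mla_decode_reference(q, kv_cache):
    ...  # slow but portable reference path, handy for tests

kernel = resolve_kernel("mla_decode", "cuda")
print(kernel.__name__)   # -> mla_decode_blackwell
```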

Finally, TokenSpeed integrates SMG – a native PyTorch component – as the CPU-side execution entry point, reducing handoff costs between CPU orchestration and GPU execution.

Benchmark Results Against TensorRT-LLM on NVIDIA B200

It’s important to note up front that these benchmarks cover only a single, non-disaggregated deployment. Support for PD (prefill/decode) disaggregation is under development and may be covered in a dedicated follow-up from the TokenSpeed team.

In collaboration with the EvalScope team, TokenSpeed was tested on an SWE-smith trace, which is representative of production coding-agent traffic, and benchmarked against TensorRT-LLM – the current state of the art on NVIDIA Blackwell. The test model was Kimi K2.5.

For coding agents that need more than 70 TPS per user, the best configuration is Attention TP4 + MoE TP4, where TokenSpeed dominates TensorRT-LLM across the entire Pareto frontier: about 9% faster in the min-latency case (batch size 1), and about 11% higher throughput at 100 TPS per user. TP4 here means tensor parallelism across 4 GPUs, which shards the model weights across multiple devices to reduce each device’s memory pressure and latency.
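For readers unfamiliar with tensor parallelism, here is a minimal PyTorch illustration of column-wise weight sharding; shapes are invented, and a real engine would replace the final concatenation with a collective between ranks:

```python
import torch

# Toy shapes; in a real TP deployment each shard lives on a different GPU and
# the concatenation below is an all-gather (or the next layer consumes shards directly).
hidden, ffn, tp = 256, 1024, 4
x = torch.randn(2, hidden)                    # (batch, hidden) activations, replicated on every rank
w = torch.randn(hidden, ffn)                  # full weight; no single rank ever stores all of it

shards = torch.chunk(w, tp, dim=1)            # each rank holds a (hidden, ffn // tp) column slice
partials = [x @ shard for shard in shards]    # one smaller matmul per rank
y_tp = torch.cat(partials, dim=1)             # stitch the output slices back together

assert torch.allclose(y_tp, x @ w, atol=1e-4) # matches the unsharded computation
```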

In the MLA kernels, the advantages are most evident in the decode phase. The decode kernel folds the query-sequence axis into the head axis to better fill BMM1’s M tile, improving Tensor Core utilization. The prefill kernel uses a carefully optimized softmax implementation and outperforms TensorRT-LLM’s MLA kernels on all five common coding-agent prefill workloads with long KV prefix caches. Combined with other optimizations, this roughly halves latency relative to TensorRT-LLM on a typical decoding workload with speculative decoding at batch sizes 4, 8, and 16 and a long KV prefix cache.
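A conceptual PyTorch sketch of that folding trick (shapes invented; the real kernel works directly on MLA’s latent KV layout and Tensor Core MMA tiles): with only a handful of query rows per head, merging the query-sequence and head axes turns many thin matmuls into one taller one.

```python
import torch

# q_seqlen > 1 at decode time comes from e.g. speculative-decoding draft tokens.
num_heads, q_seqlen, head_dim, kv_len = 16, 4, 128, 8192
q  = torch.randn(q_seqlen, num_heads, head_dim)   # decode-step queries
kv = torch.randn(kv_len, head_dim)                # simplified shared latent KV

# Naive layout: one matmul per head with M = q_seqlen = 4, far smaller than an MMA tile.
scores_naive = torch.einsum('qhd,kd->hqk', q, kv)

# Folded layout: merge (q_seqlen, num_heads) into one axis, giving M = 64 in a single matmul.
q_folded = q.reshape(q_seqlen * num_heads, head_dim)
scores_folded = (q_folded @ kv.T).reshape(q_seqlen, num_heads, kv_len).permute(1, 0, 2)

assert torch.allclose(scores_naive, scores_folded, atol=1e-3)  # same math, better-shaped matmul
```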

Key Takeaways

  • TokenSpeed is a new MIT-licensed, open-source LLM inference engine from the LightSeek Foundation, built specifically for agentic workloads (currently available in preview).
  • Its scheduler uses a C++ finite-state machine to enforce KV cache safety at compile time, while keeping the execution plane in Python for usability.
  • On NVIDIA B200, TokenSpeed outperforms TensorRT-LLM by ~9% in min-latency and ~11% in throughput at 100 TPS/user on Kimi K2.5.
  • TokenSpeed’s MLA kernels roughly halve decode latency compared to TensorRT-LLM under speculative decoding, and they have already been adopted by vLLM.

Check out the technical details and the GitHub repo.
