Cutting-edge research on model optimization, inference systems, and AI efficiency. Published at top venues and deployed in production.
We present FlashInfer, a novel approach to KV-cache management that enables 1M+ token contexts with minimal quality degradation. Our method dynamically compresses attention keys and values based on importance scores computed during inference, achieving 10x context length extension with less than 0.5% quality loss on standard benchmarks. FlashInfer is now deployed in production at InferGrove, enabling every model we serve to support significantly longer contexts without additional hardware.
Authors: Elena Chen, Yuki Tanaka, Sophia Park, et al.
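For intuition, here is a minimal sketch of importance-scored KV-cache pruning, assuming a simple rule that keeps the cached positions with the highest accumulated attention mass; the shapes, the `compress_kv` helper, and the scoring rule are illustrative stand-ins, not the FlashInfer implementation.

```python
# Minimal sketch of importance-scored KV-cache compression (hypothetical
# shapes and scoring rule; not the FlashInfer implementation).
import numpy as np

def compress_kv(keys, values, attn_weights, keep_ratio=0.1):
    """Keep only the cached positions with the highest accumulated attention.

    keys, values : (seq_len, head_dim) cached tensors for one head
    attn_weights : (num_queries, seq_len) attention probabilities observed
                   during decoding; their column sums act as importance scores
    keep_ratio   : fraction of positions to retain (10x compression at 0.1)
    """
    importance = attn_weights.sum(axis=0)            # per-position score
    k = max(1, int(keys.shape[0] * keep_ratio))
    keep = np.sort(np.argsort(importance)[-k:])      # top-k, original order
    return keys[keep], values[keep], keep

# Toy usage: 1,000 cached tokens compressed to 100.
rng = np.random.default_rng(0)
K = rng.normal(size=(1000, 64))
V = rng.normal(size=(1000, 64))
A = rng.random(size=(8, 1000))
K_c, V_c, kept = compress_kv(K, V, A)
print(K_c.shape, V_c.shape)  # (100, 64) (100, 64)
```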
We introduce SpecServe, a production system for speculative decoding that dynamically selects draft models based on prompt characteristics. Our system achieves 2.8x average speedup across diverse workloads by learning which draft model best matches each request's domain and complexity.
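A rough sketch of the selection idea, assuming prompt features are reduced to coarse buckets and acceptance rates are tracked per (bucket, draft model) pair; the `DraftSelector` class, its bucketing rule, and the draft model names are hypothetical, not the SpecServe policy.

```python
# Minimal sketch of prompt-conditioned draft-model selection (hypothetical
# feature buckets and acceptance-rate statistics; not the SpecServe system).
from collections import defaultdict

class DraftSelector:
    def __init__(self, draft_models):
        self.draft_models = draft_models
        # Running acceptance-rate estimates per (prompt bucket, draft model).
        self.stats = defaultdict(lambda: {"accepted": 1.0, "proposed": 2.0})

    def bucket(self, prompt: str) -> str:
        # Crude stand-in for learned prompt features: domain hint + length tier.
        domain = "code" if "def " in prompt or "{" in prompt else "text"
        size = "long" if len(prompt) > 2000 else "short"
        return f"{domain}/{size}"

    def choose(self, prompt: str) -> str:
        # Pick the draft with the highest estimated acceptance rate for this bucket.
        b = self.bucket(prompt)
        rate = lambda m: self.stats[(b, m)]["accepted"] / self.stats[(b, m)]["proposed"]
        return max(self.draft_models, key=rate)

    def update(self, prompt: str, draft: str, accepted: int, proposed: int):
        s = self.stats[(self.bucket(prompt), draft)]
        s["accepted"] += accepted
        s["proposed"] += proposed

selector = DraftSelector(["draft-160m", "draft-1b-code"])
selector.update("def fib(n): ...", "draft-1b-code", accepted=14, proposed=16)
print(selector.choose("def quicksort(arr):"))  # draft-1b-code
```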
QuantForge is a calibration-free quantization method that achieves FP16-equivalent quality at INT4 precision. By modeling weight distributions and activation patterns analytically, we eliminate the need for calibration datasets while maintaining accuracy across 50+ model architectures.
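As a baseline illustration, symmetric per-channel INT4 quantization can be derived from the weights alone; the absmax scaling below is a simple stand-in for QuantForge's analytical rule, and `quantize_int4` is a hypothetical helper.

```python
# Minimal sketch of calibration-free, per-channel INT4 weight quantization
# (absmax scaling as a stand-in for the analytical rule; not the QuantForge method).
import numpy as np

def quantize_int4(weight):
    """Symmetric per-output-channel INT4 quantization derived from the weights alone.

    weight : (out_features, in_features) floating-point weight matrix
    returns: int8-stored codes in [-8, 7] plus per-channel FP scales
    """
    scale = np.abs(weight).max(axis=1, keepdims=True) / 7.0   # one scale per row
    scale = np.where(scale == 0, 1.0, scale)                  # avoid divide-by-zero
    q = np.clip(np.round(weight / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

W = np.random.default_rng(0).normal(scale=0.02, size=(4, 16)).astype(np.float32)
q, s = quantize_int4(W)
err = np.abs(dequantize(q, s) - W).max()
print(q.dtype, err)  # int8 codes, small reconstruction error
```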
Building on the original PagedAttention work, we present v2 with hierarchical page tables, copy-on-write semantics for prompt caching, and NUMA-aware allocation. These improvements reduce memory waste by 40% and enable 2x more concurrent requests per GPU.
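The copy-on-write idea can be pictured with a tiny block pool and a per-sequence block table; the `BlockPool` and `Sequence` structures below are illustrative only and do not cover the hierarchical page tables or NUMA-aware allocation described above.

```python
# Minimal sketch of copy-on-write KV-cache paging (hypothetical block-table
# structures; not the PagedAttention v2 implementation).

class BlockPool:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = [0] * num_blocks

    def alloc(self):
        b = self.free.pop()
        self.refcount[b] = 1
        return b

    def share(self, b):
        self.refcount[b] += 1

    def release(self, b):
        self.refcount[b] -= 1
        if self.refcount[b] == 0:
            self.free.append(b)

class Sequence:
    def __init__(self, pool):
        self.pool = pool
        self.block_table = []          # logical page -> physical block

    def fork(self):
        """Share every block with a child sequence (prompt caching)."""
        child = Sequence(self.pool)
        child.block_table = list(self.block_table)
        for b in self.block_table:
            self.pool.share(b)
        return child

    def write(self, page):
        """Copy-on-write: a block shared with another sequence is copied first."""
        b = self.block_table[page]
        if self.pool.refcount[b] > 1:
            self.pool.release(b)
            self.block_table[page] = self.pool.alloc()   # private copy

pool = BlockPool(num_blocks=8)
parent = Sequence(pool)
parent.block_table = [pool.alloc(), pool.alloc()]
child = parent.fork()                 # shares both blocks with the parent
child.write(1)                        # triggers a private copy of page 1 only
print(parent.block_table, child.block_table)
```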
A deep dive into the engineering behind serving Meta's 400B parameter model at production scale. We cover our tensor parallelism strategy, custom attention kernels, and the scheduling algorithms that enable consistent low-latency inference.
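For background on the tensor parallelism piece, the core trick is splitting each weight matrix across GPUs and summing partial results; the numpy sketch below shows a column-/row-parallel MLP split with the all-reduce simulated as a plain sum, and is not our production kernel code.

```python
# Minimal sketch of the column-/row-parallel split behind tensor parallelism
# (numpy stand-ins for per-GPU shards and the all-reduce; not production kernels).
import numpy as np

def tensor_parallel_mlp(x, w1, w2, tp=4):
    """Split a ReLU MLP across `tp` ranks: w1 by columns, w2 by rows, then all-reduce."""
    w1_shards = np.split(w1, tp, axis=1)      # each rank holds a column slice of w1
    w2_shards = np.split(w2, tp, axis=0)      # and the matching row slice of w2
    partials = [np.maximum(x @ a, 0) @ b      # per-rank partial output
                for a, b in zip(w1_shards, w2_shards)]
    return sum(partials)                      # all-reduce: sum the partial results

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 64))
w1 = rng.normal(size=(64, 256))
w2 = rng.normal(size=(256, 64))
ref = np.maximum(x @ w1, 0) @ w2
out = tensor_parallel_mlp(x, w1, w2)
print(np.allclose(ref, out))                  # True: sharded result matches single-device
```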
How we built a globally distributed inference network that routes requests to the optimal cluster based on model availability, load, and latency. Includes our approach to model replication, failover, and consistency.
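A minimal sketch of the routing decision, assuming each candidate cluster is scored by a weighted combination of latency and load; the `Cluster` record, the weights, and the `route` helper are hypothetical, not the production scoring function.

```python
# Minimal sketch of availability/load/latency-based cluster selection
# (hypothetical cluster records and weights; not the production router).
from dataclasses import dataclass

@dataclass
class Cluster:
    name: str
    has_model: bool       # is a replica of the requested model loaded here?
    utilization: float    # 0.0 (idle) .. 1.0 (saturated)
    rtt_ms: float         # network latency from the requesting region

def route(clusters, latency_weight=1.0, load_weight=200.0):
    """Pick the reachable cluster with the lowest combined latency/load cost."""
    candidates = [c for c in clusters if c.has_model and c.utilization < 0.95]
    if not candidates:
        raise RuntimeError("no cluster can serve this model; trigger replication/failover")
    return min(candidates,
               key=lambda c: latency_weight * c.rtt_ms + load_weight * c.utilization)

clusters = [
    Cluster("us-east", has_model=True, utilization=0.90, rtt_ms=12),
    Cluster("eu-west", has_model=True, utilization=0.40, rtt_ms=70),
    Cluster("ap-south", has_model=False, utilization=0.10, rtt_ms=140),
]
print(route(clusters).name)   # eu-west: slower link, but far less loaded
```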
An explanation of our continuous batching implementation and why iteration-level scheduling is critical for maximizing GPU utilization. We show how this approach reduces queuing time by 5x compared to request-level batching.
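A toy scheduler loop illustrating iteration-level scheduling: finished requests free their slot at the end of every decode step and queued requests are admitted immediately, rather than waiting for the whole batch to drain. The `Request` fields and step counts are made up for illustration, not the production scheduler.

```python
# Minimal sketch of iteration-level (continuous) batching: new requests join the
# running batch at every decode step instead of waiting for the batch to drain.
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    rid: int
    remaining_tokens: int   # decode steps left before this request finishes

def continuous_batching(queue, max_batch=4):
    running, step = [], 0
    while queue or running:
        # Iteration-level scheduling: top up the batch from the queue every step.
        while queue and len(running) < max_batch:
            r = queue.popleft()
            print(f"step {step}: admit request {r.rid}")
            running.append(r)
        for r in running:
            r.remaining_tokens -= 1          # one decode iteration for every request
        finished = [r for r in running if r.remaining_tokens == 0]
        running = [r for r in running if r.remaining_tokens > 0]
        for r in finished:
            print(f"step {step}: request {r.rid} done, slot freed immediately")
        step += 1

queue = deque([Request(0, 3), Request(1, 8), Request(2, 2), Request(3, 5), Request(4, 1)])
continuous_batching(queue)
```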
We propose AdaLoRA, a method that automatically determines the optimal LoRA rank for each layer during fine-tuning. This reduces training compute by 30% while matching or exceeding fixed-rank LoRA quality across diverse tasks.
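One way to picture per-layer rank allocation is to score singular directions globally and keep only the top ones within a shared budget; the sketch below uses raw singular-value magnitude as the importance score and a hypothetical `allocate_ranks` helper, which is a simplification of the full method.

```python
# Minimal sketch of importance-driven per-layer rank allocation (hypothetical
# scoring based on singular-value magnitude; not the full AdaLoRA procedure).
import numpy as np

def allocate_ranks(lora_deltas, total_rank_budget):
    """Distribute a global rank budget across layers by singular-value importance.

    lora_deltas       : {layer_name: low-rank update matrix B @ A}
    total_rank_budget : total number of singular directions to keep across layers
    """
    # Score every (layer, singular direction) pair by its singular value.
    scored = []
    for name, delta in lora_deltas.items():
        s = np.linalg.svd(delta, compute_uv=False)
        scored.extend((sv, name) for sv in s)
    # Keep the globally most important directions; count survivors per layer.
    scored.sort(reverse=True)
    ranks = {name: 0 for name in lora_deltas}
    for _, name in scored[:total_rank_budget]:
        ranks[name] += 1
    return ranks

rng = np.random.default_rng(0)
deltas = {
    "attn.q_proj": rng.normal(size=(64, 8)) @ rng.normal(size=(8, 64)),   # rank-8 update
    "mlp.up_proj": rng.normal(size=(64, 2)) @ rng.normal(size=(2, 64)),   # rank-2 update
}
print(allocate_ranks(deltas, total_rank_budget=8))  # per-layer ranks summing to the budget
```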
We contribute back to the community with production-grade tools and libraries.
High-performance inference engine with custom CUDA kernels, continuous batching, and speculative decoding. Apache 2.0 licensed.
⭐ 12.4K stars · Python/CUDA · Apache 2.0
View on GitHub →
Calibration-free quantization toolkit. Quantize any model to INT4/INT8 without calibration data while maintaining quality.
⭐ 4.2K stars · Python · MIT License
View on GitHub →
Adaptive KV-cache compression library. Enables 10x longer contexts with minimal quality loss on standard hardware.
⭐ 3.8K stars · C++/CUDA · Apache 2.0
View on GitHub →
We're hiring researchers working on inference optimization, model compression, and serving systems.