Featured March 2026

FlashInfer: Adaptive KV-Cache Compression for 10x Longer Contexts

We present FlashInfer, a novel approach to KV-cache management that enables 1M+ token contexts with minimal quality degradation. Our method dynamically compresses attention keys and values based on importance scores computed during inference, achieving 10x context length extension with less than 0.5% quality loss on standard benchmarks. FlashInfer is now deployed in production at InferGrove, enabling every model we serve to support significantly longer contexts without additional hardware.

Authors: Elena Chen, Yuki Tanaka, Sophia Park, et al.

Inference Optimization NeurIPS 2026
March 2026 · Paper
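
To make the idea concrete, here is a minimal NumPy sketch of importance-based KV-cache compression in the spirit of the abstract, not our implementation: the scoring rule (total attention received from recent queries) and the 10% keep ratio are illustrative assumptions.

import numpy as np

def compress_kv_cache(keys, values, attn_weights, keep_ratio=0.1):
    """Keep only the cached positions that recent queries actually attend to.

    keys, values : (seq_len, head_dim) cached K/V for one attention head
    attn_weights : (num_recent_queries, seq_len) attention paid to each position
    keep_ratio   : fraction of positions to keep (illustrative)
    """
    importance = attn_weights.sum(axis=0)          # total attention received
    k = max(1, int(len(importance) * keep_ratio))
    keep = np.sort(np.argsort(importance)[-k:])    # top-k positions, original order
    return keys[keep], values[keep], keep

# Example: 1,000 cached tokens, keep the 10% with the highest importance scores.
rng = np.random.default_rng(0)
K, V = rng.normal(size=(1000, 64)), rng.normal(size=(1000, 64))
A = rng.random(size=(8, 1000))
K_small, V_small, kept = compress_kv_cache(K, V, A)
print(K_small.shape, V_small.shape)                # (100, 64) (100, 64)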

SpecServe: Speculative Decoding at Scale with Dynamic Draft Models

We introduce SpecServe, a production system for speculative decoding that dynamically selects draft models based on prompt characteristics. Our system achieves 2.8x average speedup across diverse workloads by learning which draft model best matches each request's domain and complexity.

Systems MLSys 2026
January 2026 · Paper
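
As a rough illustration of dynamic draft selection, not the SpecServe policy, the sketch below treats the choice as a small online bandit: for each (prompt domain, draft model) pair it tracks the running token-accept rate observed during speculative decoding and picks the current best, with a little exploration. The model names, the keyword-based domain classifier, and the epsilon value are assumptions for the example.

import random
from collections import defaultdict

DRAFTS = ["draft-code-1b", "draft-chat-1b", "draft-general-3b"]   # hypothetical names

accept_rate = defaultdict(lambda: 0.5)   # optimistic prior per (domain, draft)
counts = defaultdict(int)

def classify_domain(prompt: str) -> str:
    """Cheap stand-in for a learned prompt classifier."""
    return "code" if ("def " in prompt or "import " in prompt) else "chat"

def select_draft(prompt: str, epsilon: float = 0.1) -> str:
    """Pick the draft with the best observed accept rate for this domain."""
    domain = classify_domain(prompt)
    if random.random() < epsilon:                       # occasional exploration
        return random.choice(DRAFTS)
    return max(DRAFTS, key=lambda m: accept_rate[(domain, m)])

def record_feedback(prompt: str, draft: str, accepted: int, proposed: int) -> None:
    """Fold one speculative-decoding step's accept rate into the running mean."""
    key = (classify_domain(prompt), draft)
    counts[key] += 1
    rate = accepted / max(proposed, 1)
    accept_rate[key] += (rate - accept_rate[key]) / counts[key]

# Usage: choose a draft, run speculative decoding with it (not shown), then
# report how many drafted tokens the target model accepted.
prompt = "import numpy as np\ndef solve(a):"
draft = select_draft(prompt)
record_feedback(prompt, draft, accepted=6, proposed=8)
print(draft, dict(accept_rate))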

QuantForge: Mixed-Precision Quantization Without Calibration Data

QuantForge is a calibration-free quantization method that achieves FP16-equivalent quality at INT4 precision. By characterizing weight distributions and activation patterns analytically, we eliminate the need for calibration datasets while maintaining accuracy across 50+ model architectures.

Quantization ICML 2026
November 2025 · Paper
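
A minimal sketch of calibration-free weight quantization in this spirit, assuming per-output-channel symmetric INT4 with a clipping range derived from each channel's own standard deviation; the 3.5-sigma clip is an illustrative choice, not QuantForge's rule.

import numpy as np

def quantize_int4(w, clip_sigmas=3.5):
    """Per-output-channel symmetric INT4 quantization with an analytic clip.

    The clipping range comes from the weight distribution itself (clip_sigmas
    standard deviations per channel), so no calibration data is required.
    """
    qmax = 7                                                        # signed INT4: [-8, 7]
    scale = np.maximum(clip_sigmas * w.std(axis=1, keepdims=True), 1e-8) / qmax
    q = np.clip(np.round(w / scale), -8, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

W = np.random.randn(4096, 4096).astype(np.float32)
q, s = quantize_int4(W)
print("mean abs error:", float(np.abs(dequantize(q, s) - W).mean()))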

PagedAttention v2: Efficient Memory Management for Variable-Length Sequences

Building on the original PagedAttention work, we present v2 with hierarchical page tables, copy-on-write semantics for prompt caching, and NUMA-aware allocation. These improvements reduce memory waste by 40% and enable 2x more concurrent requests per GPU.

Memory OSDI 2025
September 2025 · Paper
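
The copy-on-write bookkeeping can be illustrated with a toy block allocator: blocks are reference-counted, a prompt-cache hit bumps the count, and a write forks the block only when another sequence still holds it. The method names and allocator API are assumptions for the example, not the PagedAttention v2 interface.

class BlockAllocator:
    """Toy KV-block allocator with reference counting and copy-on-write."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = [0] * num_blocks

    def alloc(self):
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block):
        """Prompt-cache hit: another sequence maps the same physical block."""
        self.refcount[block] += 1
        return block

    def write(self, block):
        """Copy-on-write: fork the block only if someone else still holds it."""
        if self.refcount[block] == 1:
            return block                 # exclusive owner, write in place
        self.refcount[block] -= 1
        return self.alloc()              # private copy for this sequence

allocator = BlockAllocator(num_blocks=1024)
prompt_blocks = [allocator.alloc() for _ in range(4)]       # cached shared prompt
seq_a = [allocator.share(b) for b in prompt_blocks]         # two requests reuse it
seq_b = [allocator.share(b) for b in prompt_blocks]
seq_a[-1] = allocator.write(seq_a[-1])                      # A decodes: last block forks
print(seq_a[-1] != seq_b[-1])                               # True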

How We Serve Llama 4 Maverick at 120 Tokens/Second

A deep dive into the engineering behind serving Meta's 400B parameter model at production scale. We cover our tensor parallelism strategy, custom attention kernels, and the scheduling algorithms that enable consistent low-latency inference.

Engineering Deep Dive
February 2026 · Blog
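
For readers new to tensor parallelism, the sketch below shows its basic building block, a column-parallel linear layer, simulated with NumPy shards; a real deployment runs one shard per GPU and gathers the outputs with a collective. The shapes and the degree-8 split are illustrative and do not describe our actual sharding strategy.

import numpy as np

tp_degree = 8
d_model, d_ff = 1024, 4096                 # illustrative layer sizes

W = np.random.randn(d_model, d_ff).astype(np.float32)
shards = np.split(W, tp_degree, axis=1)    # each rank owns d_ff / tp_degree columns

x = np.random.randn(1, d_model).astype(np.float32)
partials = [x @ w for w in shards]         # computed in parallel, one shard per GPU
y = np.concatenate(partials, axis=1)       # all-gather of the column-sharded outputs

assert np.allclose(y, x @ W, atol=1e-3)    # matches the unsharded layer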

Building a Global Inference Network: Lessons from 12 Regions

How we built a globally distributed inference network that routes requests to the optimal cluster based on model availability, load, and latency. Includes our approach to model replication, failover, and consistency.

Infrastructure Architecture
December 2025 · Blog
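
As a toy illustration of availability-, load-, and latency-aware routing, not our production policy, the sketch below filters regions by model availability and scores the rest by client RTT plus a load penalty; the region names, fields, and weight are assumptions.

REGIONS = [   # illustrative snapshot of per-region state
    {"name": "us-east",  "models": {"llama-4-maverick"}, "load": 0.95, "rtt_ms": 18},
    {"name": "eu-west",  "models": {"llama-4-maverick"}, "load": 0.30, "rtt_ms": 70},
    {"name": "ap-south", "models": set(),                "load": 0.10, "rtt_ms": 140},
]

def route(model: str, regions=REGIONS, load_weight=100.0):
    """Return the best region: model must be available, then lowest RTT-plus-load score."""
    candidates = [r for r in regions if model in r["models"]]
    if not candidates:
        raise LookupError(f"{model} is not replicated in any region")
    return min(candidates, key=lambda r: r["rtt_ms"] + load_weight * r["load"])

print(route("llama-4-maverick")["name"])   # eu-west: us-east is closer but running hot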

Continuous Batching: Why Iteration-Level Scheduling Matters

An explanation of our continuous batching implementation and why iteration-level scheduling is critical for maximizing GPU utilization. We show how this approach reduces queuing time by a factor of five compared to request-level batching.

Systems Tutorial
October 2025 · Blog
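
A minimal simulation of the difference: with iteration-level scheduling the batch is rebuilt at every decode step, so finished sequences free their slots immediately and queued requests join mid-flight. The batch size and the toy one-token-per-step model are illustrative assumptions.

import random
from collections import deque

def serve(requests, max_batch=4):
    """Simulate decode iterations under iteration-level (continuous) batching."""
    queue = deque(requests)              # each request: [request_id, tokens_left]
    running, steps = [], 0
    while queue or running:
        # Admit waiting requests into any free slots at every iteration;
        # request-level batching cannot do this until the whole batch drains.
        while queue and len(running) < max_batch:
            running.append(list(queue.popleft()))
        steps += 1                       # one forward pass emits one token per sequence
        for seq in running:
            seq[1] -= 1
        running = [seq for seq in running if seq[1] > 0]   # retire finished sequences now
    return steps

requests = [(i, random.randint(5, 50)) for i in range(16)]
print("decode iterations:", serve(requests))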

Efficient Fine-tuning with Adaptive LoRA Rank Selection

We propose AdaLoRA, a method that automatically determines the optimal LoRA rank for each layer during fine-tuning. This reduces training compute by 30% while matching or exceeding fixed-rank LoRA quality across diverse tasks.

Fine-tuning EMNLP 2025
August 2025 · Paper
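
As a rough sketch of per-layer rank allocation, not the AdaLoRA criterion, the example below distributes a global rank budget in proportion to a per-layer importance score; the gradient-norm-style scores, the budget, and the rank bounds are illustrative assumptions.

import numpy as np

def allocate_ranks(importance, rank_budget, r_min=2, r_max=64):
    """Split a global LoRA rank budget across layers by relative importance."""
    importance = np.asarray(importance, dtype=np.float64)
    share = importance / importance.sum()
    return np.clip(np.round(share * rank_budget), r_min, r_max).astype(int)

# e.g. per-layer importance from accumulated gradient norms on a probe batch
layer_importance = [0.4, 2.1, 1.3, 5.0, 0.2, 3.0]
print(allocate_ranks(layer_importance, rank_budget=192))   # [ 6 34 21 64  3 48]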

Our open-source projects

We contribute back to the community with production-grade tools and libraries.

InferEngine

High-performance inference engine with custom CUDA kernels, continuous batching, and speculative decoding. Apache 2.0 licensed.

⭐ 12.4K stars · Python/CUDA · Apache 2.0

View on GitHub →

QuantForge

Calibration-free quantization toolkit. Quantize any model to INT4/INT8 without calibration data while maintaining quality.

⭐ 4.2K stars · Python · MIT License

View on GitHub →

FlashInfer

Adaptive KV-cache compression library. Enables 10x longer contexts with minimal quality loss on standard hardware.

⭐ 3.8K stars · C++/CUDA · Apache 2.0

View on GitHub →

Join our research team

We're hiring researchers working on inference optimization, model compression, and serving systems.