All Models

Popular models

Llama 4 Maverick 400B (New)

Meta's flagship 400B-parameter model with a mixture-of-experts architecture. Best-in-class reasoning and instruction following.

400B params · 1M context · 120 tok/s
Llama 4 Scout 109B (Popular)

Excellent balance of quality and speed: 109B parameters with a 512K-token context window. Ideal for production workloads.

109B params · 512K context · 340 tok/s
DeepSeek V3 685B (MoE)

DeepSeek's largest model, with 685B total parameters. Exceptional at math, coding, and complex reasoning tasks.

685B params · 128K context · 95 tok/s
Mixtral 8x22B v0.3 (Fast)

Mistral's mixture-of-experts model: 8 experts with 22B parameters each. Excellent throughput for general tasks.

176B total · 64K context · 480 tok/s
Qwen 3 72B Instruct (Multilingual)

Alibaba's 72B model with excellent multilingual capabilities. Strong performance across 30+ languages.

72B params · 128K context · 410 tok/s
Gemma 3 27B (Efficient)

Google's efficient 27B model. Punches above its weight class, with excellent instruction following and reasoning.

27B params · 32K context · 620 tok/s
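Hosted inference platforms commonly expose these models behind an OpenAI-compatible chat completions endpoint. The sketch below builds and sends such a request; note that the base URL, the model slug, and the `INFERGROVE_API_KEY` environment variable are all illustrative assumptions, not documented values — check your dashboard for the real ones.

```python
import json
import os
import urllib.request

# Hypothetical base URL and model slug -- assumptions for illustration,
# not documented endpoints. Replace with the values from your dashboard.
BASE_URL = "https://api.infergrove.example/v1"
MODEL = "llama-4-scout-109b"

def build_chat_request(prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def send(payload: dict) -> dict:
    """POST the payload; requires a valid key in INFERGROVE_API_KEY."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ['INFERGROVE_API_KEY']}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_chat_request("Summarize mixture-of-experts in one sentence.")
# send(payload) would perform the actual network call.
```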

Code generation models

DeepSeek Coder V3 (Code)

State-of-the-art code generation. Supports 100+ programming languages with fill-in-the-middle capability.

33B params · 128K context · 520 tok/s
StarCoder 3 15B (Code)

BigCode's latest model trained on The Stack v3. Excellent for code completion and generation tasks.

15B params · 64K context · 780 tok/s
CodeLlama 2 70B (Code)

Meta's code-specialized Llama variant. Excellent at code review, debugging, and complex refactoring tasks.

70B params · 100K context · 310 tok/s
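Fill-in-the-middle prompting works by wrapping the code before and after the hole in sentinel tokens, so the model generates the missing middle. As a sketch, the helper below uses the StarCoder-family sentinel convention; the exact tokens vary by model, so treat them as placeholders and check the target model's tokenizer config.

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a fill-in-the-middle prompt in prefix-suffix-middle order.

    Sentinel tokens follow the StarCoder convention; other models use
    different sentinels, so check the tokenizer config before relying
    on these literals.
    """
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

prompt = build_fim_prompt(
    prefix="def mean(xs):\n    return ",
    suffix=" / len(xs)\n",
)
# The model is expected to complete the hole, e.g. with "sum(xs)".
```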

Image generation models

Stable Diffusion 4 (Image)

Stability AI's latest diffusion model. Photorealistic image generation with excellent prompt adherence.

1024x1024 · 1.2s/image · $0.003/img
FLUX.2 Pro (Image)

Black Forest Labs' flagship model. Exceptional text rendering and compositional understanding.

Up to 2048x2048 · 0.8s/image · $0.005/img
SDXL Turbo (Fast)

Real-time image generation in a single step. Perfect for interactive applications and rapid prototyping.

512x512 · 0.1s/image · $0.001/img

Audio & speech models

Whisper v4 (Audio)

OpenAI's latest speech recognition model. Supports 100+ languages with near-human accuracy.

100+ languages · Real-time · $0.006/min
Bark v2 (TTS)

High-quality text-to-speech with emotion control. Generate natural-sounding speech in multiple voices.

50+ voices · Real-time · $0.015/min
MusicGen Pro (Music)

Generate music from text descriptions. Create background music, jingles, and soundscapes programmatically.

Up to 30s · Multiple genres · $0.05/gen
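Since the audio models bill per minute of audio, job cost is a simple duration calculation. The helper below uses the per-minute prices listed above; the model slugs in the dict are illustrative keys, not documented identifiers.

```python
# Per-minute prices as listed in the catalog above. The slugs are
# illustrative keys, not documented model identifiers.
PRICE_PER_MIN = {
    "whisper-v4": 0.006,   # speech-to-text
    "bark-v2": 0.015,      # text-to-speech
}

def audio_cost(model: str, seconds: float) -> float:
    """Estimated cost in dollars for a job of the given audio duration."""
    return round(PRICE_PER_MIN[model] * seconds / 60.0, 6)

# Transcribing a 90-minute podcast:
audio_cost("whisper-v4", 90 * 60)   # -> 0.54
```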

Embedding models

BGE-M3 (Embed)

Multilingual, multi-granularity embedding model. Supports dense, sparse, and multi-vector retrieval.

1024 dims · 8K context · 10K tok/s
E5-Mistral-7B (Embed)

Instruction-tuned embedding model based on Mistral 7B. Excellent for semantic search and RAG applications.

4096 dims · 32K context · 5K tok/s
Nomic Embed v2 (Embed)

Lightweight, high-quality embeddings with Matryoshka representation learning. Flexible dimensionality.

768 dims · 8K context · 15K tok/s
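Two operations come up constantly with these models: comparing embeddings by cosine similarity, and (for Matryoshka-trained models like Nomic Embed v2) truncating a vector to fewer dimensions and re-normalizing to trade accuracy for storage. A minimal stdlib-only sketch of both:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def truncate(vec, dims):
    """Matryoshka-style truncation: keep the first `dims` components
    and re-normalize the result to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# Toy 4-dim "embeddings" standing in for real model output:
a = [0.1, 0.3, -0.2, 0.9]
b = [0.2, 0.1, -0.1, 0.8]
full = cosine(a, b)                             # full dimensionality
small = cosine(truncate(a, 2), truncate(b, 2))  # after truncating to 2 dims
```

Truncation only preserves ranking quality when the model was trained with Matryoshka representation learning; for fixed-dimension models like E5-Mistral-7B, use the full vector.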

Optimized for maximum throughput

Every model on InferGrove is optimized with custom CUDA kernels, quantization, and speculative decoding for best-in-class performance.
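Speculative decoding, one of the optimizations mentioned above, has a cheap draft model propose several tokens ahead and the large target model verify the whole run at once, falling back to the target's own token at the first disagreement. The toy sketch below illustrates only the accept/reject loop: both "models" are stand-in functions over integer token lists, not real networks.

```python
# Toy sketch of the speculative-decoding accept/reject loop. The two
# "models" are stand-in functions over integer tokens, not real networks.

def draft_model(context):
    # Stand-in draft: always proposes the next integer.
    return context[-1] + 1

def target_model(context):
    # Stand-in target: agrees with the draft except on multiples of 4.
    nxt = context[-1] + 1
    return nxt if nxt % 4 != 0 else nxt + 1

def speculative_step(context, k=4):
    """Propose k draft tokens, keep the prefix the target agrees with,
    then append one corrected token from the target model."""
    proposed, ctx = [], list(context)
    for _ in range(k):
        tok = draft_model(ctx)
        proposed.append(tok)
        ctx.append(tok)
    accepted, ctx = [], list(context)
    for tok in proposed:
        if target_model(ctx) == tok:   # target verifies the draft token
            accepted.append(tok)
            ctx.append(tok)
        else:
            break                      # first disagreement: stop accepting
    accepted.append(target_model(ctx)) # one guaranteed target token per step
    return context + accepted

speculative_step([1])  # -> [1, 2, 3, 5]: two drafts accepted, one corrected
```

The output is identical to decoding with the target model alone; the speedup comes from verifying a whole draft run in one target-model pass instead of one pass per token.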

Tokens per Second — Output Generation

Llama 4 Scout 17B: 890 tok/s
Gemma 3 27B: 620 tok/s
Mixtral 8x22B: 480 tok/s
Qwen 3 72B: 410 tok/s
Llama 4 Scout 109B: 340 tok/s
Llama 4 Maverick 400B: 120 tok/s
DeepSeek V3 685B: 95 tok/s

Benchmarks measured on InferGrove's production infrastructure with standard prompts. Actual performance may vary based on prompt length and concurrency.
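The published throughput numbers translate directly into a rough lower bound on generation time for a response of a given length. The sketch below uses figures from the chart above; it ignores time-to-first-token and queueing, and the model slugs are illustrative keys rather than documented identifiers.

```python
# Throughput figures from the chart above; slugs are illustrative keys.
THROUGHPUT_TOK_S = {
    "gemma-3-27b": 620,
    "llama-4-scout-109b": 340,
    "llama-4-maverick-400b": 120,
    "deepseek-v3-685b": 95,
}

def generation_seconds(model: str, output_tokens: int) -> float:
    """Lower-bound wall time to generate `output_tokens` tokens,
    ignoring time-to-first-token and queueing."""
    return output_tokens / THROUGHPUT_TOK_S[model]

# A 1,000-token answer:
generation_seconds("gemma-3-27b", 1000)            # ~1.6 s
generation_seconds("llama-4-maverick-400b", 1000)  # ~8.3 s
```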

Can't find your model?

We add new models every week. Request a model or bring your own custom model to deploy on our infrastructure.