Start building in minutes

🚀

Quickstart Guide

Get your first API call working in under 5 minutes. Install the SDK, get an API key, and run inference.

Read guide →
📖

API Reference

Complete reference for all API endpoints including chat completions, embeddings, images, and batch processing.

View reference →
💡

Examples & Tutorials

Step-by-step tutorials for common use cases: chatbots, RAG, agents, content generation, and more.

Browse examples →

Popular guides

Migration from OpenAI

Switch from OpenAI to InferGrove in minutes. Our API is 100% compatible — just change the base URL and API key.

Read migration guide →

Function Calling

Let models call your functions. Define tools with JSON schemas and build powerful agentic workflows.

Learn function calling →

Structured Output

Get guaranteed JSON output matching your schema. No more parsing errors or retry loops.

Learn structured output →

Building RAG Systems

Combine embeddings with chat completions to build retrieval-augmented generation pipelines.

Build RAG system →

Fine-tuning Guide

Customize any model on your data. Prepare datasets, configure training, and deploy custom models.

Start fine-tuning →

AI Agents

Build autonomous agents with tool calling, memory, and multi-step reasoning capabilities.

Build agents →

Introduction

InferGrove provides a fast, reliable API for running AI model inference. Our platform is compatible with OpenAI's API format, making migration seamless for existing applications.

Base URL

All API requests should be made to:

URL
https://api.infergrove.com/v1

Authentication

Authenticate your requests using an API key passed in the Authorization header:

curl
curl https://api.infergrove.com/v1/chat/completions \
  -H "Authorization: Bearer ig-your-api-key" \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-4-Scout-109B", "messages": [{"role": "user", "content": "Hello!"}]}'
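
If you're not using an SDK, the same authenticated request can be assembled with Python's standard library. This sketch only builds the request; `urllib.request.urlopen(req)` would send it:

```python
import json
import urllib.request

API_KEY = "ig-your-api-key"  # replace with your real key

payload = {
    "model": "meta-llama/Llama-4-Scout-109B",
    "messages": [{"role": "user", "content": "Hello!"}],
}

# Build the request with the Bearer token in the Authorization header
req = urllib.request.Request(
    "https://api.infergrove.com/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
    method="POST",
)

# Send with: response = urllib.request.urlopen(req)
```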

Install the SDK

We provide official SDKs for Python, TypeScript, Go, Rust, and Java:

bash
# Python
pip install infergrove

# TypeScript / Node.js
npm install @infergrove/sdk

# Go
go get github.com/infergrove/infergrove-go

# Rust
cargo add infergrove

Your First Request

Here's a complete example of making a chat completion request:

Python
from infergrove import InferGrove

client = InferGrove(api_key="ig-your-api-key")

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-109B",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is quantum computing?"}
    ],
    max_tokens=512,
    temperature=0.7
)

print(response.choices[0].message.content)

Streaming Responses

For real-time applications, use streaming to receive tokens as they're generated:

Python
stream = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-109B",
    messages=[{"role": "user", "content": "Write a poem"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
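
The SDK handles the wire format for you. Under the hood, OpenAI-compatible APIs typically stream server-sent events: one `data:` line per chunk, terminated by `data: [DONE]`. A minimal parser sketch under that assumption:

```python
import json

def parse_sse_line(line: str):
    """Parse one streamed line into a chunk dict, or None for keep-alives / [DONE]."""
    line = line.strip()
    if not line.startswith("data:"):
        return None          # comments and blank keep-alive lines
    data = line[len("data:"):].strip()
    if data == "[DONE]":
        return None          # end-of-stream sentinel
    return json.loads(data)

chunk = parse_sse_line('data: {"choices":[{"delta":{"content":"Hi"}}]}')
```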

Next Steps

  • Explore the full API reference for all endpoints
  • Learn about function calling and structured output
  • Set up fine-tuning for custom models
  • Configure batch processing for large-scale workloads
  • Build AI agents with tool calling and memory

Error Handling

InferGrove uses standard HTTP status codes. Here are the most common errors:

Error Codes
400 Bad Request    — Invalid parameters
401 Unauthorized   — Invalid or missing API key
403 Forbidden      — Insufficient permissions
404 Not Found      — Model or resource not found
429 Rate Limited   — Too many requests
500 Server Error   — Internal error (retry)
503 Unavailable    — Service temporarily down
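
A simple way to act on these codes is a small dispatch: retry the transient ones (429, 500, 503) and surface the rest to the caller. An illustrative sketch:

```python
def classify_status(status: int) -> str:
    """Map an HTTP status from the API to a handling strategy (illustrative)."""
    if status in (429, 500, 503):
        return "retry"        # transient: back off and try again
    if status == 401:
        return "check-key"    # invalid or missing API key
    if status in (400, 403, 404):
        return "fix-request"  # caller error: retrying as-is won't help
    return "unknown"
```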

Rate Limits

Rate limits vary by plan. When you hit a rate limit, the API returns a 429 status code with a Retry-After header indicating when you can retry.

Python — Handling Rate Limits
from infergrove import InferGrove, RateLimitError
import time

client = InferGrove(
    api_key="ig-...",
    max_retries=3,  # Auto-retry on rate limits
    timeout=30.0
)

try:
    response = client.chat.completions.create(
        model="meta-llama/Llama-4-Scout-109B",
        messages=[{"role": "user", "content": "Hi"}]
    )
except RateLimitError as e:
    # Raised only once the SDK's automatic retries are exhausted
    print(f"Rate limited. Retry after: {e.retry_after}s")
    time.sleep(e.retry_after)