API Reference
Inference Gateway
OpenAI-compatible inference endpoints
The Inference Gateway provides high-performance, streaming-capable endpoints for LLM inference and is fully compatible with the OpenAI API specification.
Base URL
http://localhost:8001/v1
Authentication
All requests require a Bearer token (an API key generated from the Dashboard), passed in the Authorization header:
Authorization: Bearer sk-inferia-...
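Because the gateway follows the OpenAI API specification, any OpenAI-compatible client can be pointed at it. Below is a minimal sketch using the official openai Python package and the base URL above; the INFERIA_API_KEY environment variable name is only an example, not something the gateway defines:
import os
from openai import OpenAI

# Point the OpenAI SDK at the Inference Gateway (openai >= 1.0 assumed).
client = OpenAI(
    base_url="http://localhost:8001/v1",      # gateway base URL
    api_key=os.environ["INFERIA_API_KEY"],    # API key from the Dashboard (env var name is illustrative)
)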
Endpoints
Chat Completions
POST /v1/chat/completions
The main endpoint for chat-based inference.
Request Body
| Field | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Deployment name (e.g., llama-3-8b) |
| messages | array | Yes | Conversation messages |
| stream | boolean | No | Enable SSE streaming (default: false) |
| temperature | float | No | Sampling temperature |
| max_tokens | integer | No | Maximum tokens to generate |
Example Request
curl http://localhost:8001/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $API_KEY" \
-d '{
"model": "llama-3-8b",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": true
}'Response (Non-Streaming)
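The same request can be issued with the client configured in the Authentication section. The sketch below uses the openai Python package; with stream omitted (or false), the full JSON response shown next is returned in one piece:
# Non-streaming chat completion via the OpenAI SDK (a sketch, not the only option).
response = client.chat.completions.create(
    model="llama-3-8b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)  # assistant reply text
print(response.usage.total_tokens)          # token accounting from the usage block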
Response (Non-Streaming)
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "llama-3-8b",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Hello! How can I help you today?"
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 15,
    "total_tokens": 25
  }
}
Response (Streaming)
When stream is true, the response is delivered as Server-Sent Events (SSE):
data: {"id":"...","choices":[{"delta":{"content":"Hello"}}]}
data: {"id":"...","choices":[{"delta":{"content":"!"}}]}
data: [DONE]
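With stream set to true, chunks arrive as the deltas shown above until the [DONE] sentinel. A sketch of consuming the stream with the same Python client; the chunk objects mirror the SSE payloads:
# Streaming chat completion: print content deltas as they arrive (sketch).
stream = client.chat.completions.create(
    model="llama-3-8b",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()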
Request Lifecycle
- Authentication: Validate API key
- Context Resolution: Resolve deployment config from Filtration Gateway
- Rate Limiting: Check request/token limits
- Input Guardrails: Scan for PII, toxicity, prompt injection
- Prompt Processing: Apply templates, inject RAG context
- Upstream Call: Route to LLM provider
- Output Guardrails: Scan response (async or blocking)
- Logging: Record inference metadata
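Ordering matters here: input guardrails run before the upstream call and output guardrails after it, so a blocked prompt never reaches the provider. The sketch below is purely illustrative of that flow; every function name in it is hypothetical and is not part of the gateway's actual code or API:
# Hypothetical sketch of the request lifecycle; all helper names are invented
# for illustration and do not exist in the gateway.
def handle_chat_completion(request):
    deployment = authenticate_and_resolve(request)        # API key check + deployment config lookup
    enforce_rate_limits(deployment, request)               # request/token limits
    scan_input(request.messages, deployment.policy)        # PII, toxicity, prompt injection
    prompt = build_prompt(request, deployment)             # templates + RAG context
    response = call_upstream(deployment.provider, prompt)  # route to the LLM provider
    scan_output(response, deployment.policy)               # async or blocking output scan
    log_inference(request, response)                       # record inference metadata
    return response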
Automatic Features
The following are applied automatically based on deployment configuration:
- Prompt Templates: System prompts and variable injection
- RAG Context: Retrieved documents from Knowledge Base
- Guardrails: Safety scanning based on deployment policy
- Rate Limits: Per-deployment token/request limits
