InferiaLLM
API Reference

Inference Gateway

OpenAI-compatible inference endpoints

The Inference Gateway provides high-performance, streaming-capable endpoints for LLM inference. It is fully compatible with the OpenAI API specification.

Base URL

http://localhost:8001/v1

Authentication

All requests require a Bearer token: an API key generated from the Dashboard.

Authorization: Bearer sk-inferia-...
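
Because the gateway is OpenAI-compatible, the official OpenAI Python client can be pointed at it directly. A minimal sketch (the key value is a placeholder; use a key generated from the Dashboard):

from openai import OpenAI

# Reuse the standard OpenAI client against the local gateway.
# The api_key below is a placeholder, not a real key.
client = OpenAI(
    base_url="http://localhost:8001/v1",
    api_key="sk-inferia-...",
)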

Endpoints

Chat Completions

POST /v1/chat/completions

The main endpoint for chat-based inference.

Request Body

Field         Type      Required   Description
model         string    Yes        Deployment name (e.g., llama-3-8b)
messages      array     Yes        Conversation messages
stream        boolean   No         Enable SSE streaming (default: false)
temperature   float     No         Sampling temperature
max_tokens    integer   No         Maximum tokens to generate

Example Request

curl http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "llama-3-8b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'

Response (Non-Streaming)

{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "llama-3-8b",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Hello! How can I help you today?"
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 15,
    "total_tokens": 25
  }
}
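
The same response shape can be consumed from Python. A sketch using the client configured under Authentication (model name taken from the example request above):

response = client.chat.completions.create(
    model="llama-3-8b",
    messages=[{"role": "user", "content": "Hello!"}],
)

print(response.choices[0].message.content)   # assistant reply
print(response.usage.total_tokens)           # token accounting from the usage block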

Response (Streaming)

Server-Sent Events (SSE) format:

data: {"id":"...","choices":[{"delta":{"content":"Hello"}}]}

data: {"id":"...","choices":[{"delta":{"content":"!"}}]}

data: [DONE]
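
Clients rarely need to parse the SSE frames by hand; the OpenAI client does it, including the final [DONE] sentinel. A sketch with the client configured under Authentication:

stream = client.chat.completions.create(
    model="llama-3-8b",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)

for chunk in stream:
    # Each event carries a delta; content can be empty on role/finish chunks.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)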

Request Lifecycle

  1. Authentication: Validate API key
  2. Context Resolution: Resolve deployment config from Filtration Gateway
  3. Rate Limiting: Check request/token limits
  4. Input Guardrails: Scan for PII, toxicity, prompt injection
  5. Prompt Processing: Apply templates, inject RAG context
  6. Upstream Call: Route to LLM provider
  7. Output Guardrails: Scan response (async or blocking)
  8. Logging: Record inference metadata
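
In code, the lifecycle amounts to a sequential pipeline. The sketch below is purely illustrative: every function and field name is hypothetical, not InferiaLLM's actual internals.

def authenticate(headers):
    # 1. Authentication: validate the Bearer token
    if not headers.get("Authorization", "").startswith("Bearer "):
        raise PermissionError("invalid API key")

def resolve_deployment(model):
    # 2. Context resolution: stub for the Filtration Gateway lookup
    return {"provider": "upstream", "system_prompt": "You are helpful.", "policy": {}}

def check_rate_limits(deployment): ...   # 3. Request/token limits (stubbed)
def scan(text, policy): ...              # 4/7. PII, toxicity, prompt injection (stubbed)

def call_upstream(provider, messages):
    # 6. Route to the LLM provider (stubbed)
    return {"role": "assistant", "content": "Hello!"}

def handle_chat_completion(request):
    authenticate(request["headers"])
    dep = resolve_deployment(request["body"]["model"])
    check_rate_limits(dep)
    scan(str(request["body"]["messages"]), dep["policy"])
    messages = [{"role": "system", "content": dep["system_prompt"]},
                *request["body"]["messages"]]          # 5. Prompt processing
    reply = call_upstream(dep["provider"], messages)
    scan(reply["content"], dep["policy"])              # 7. Output guardrails
    print("log:", request["body"]["model"])            # 8. Logging (stubbed)
    return reply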

Automatic Features

The following are applied automatically based on deployment configuration:

  • Prompt Templates: System prompts and variable injection
  • RAG Context: Retrieved documents from Knowledge Base
  • Guardrails: Safety scanning based on deployment policy
  • Rate Limits: Per-deployment token/request limits
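
When a per-deployment limit is exceeded, an OpenAI-compatible gateway conventionally answers with HTTP 429, which the OpenAI client surfaces as RateLimitError. A client-side backoff sketch, assuming the gateway follows that convention:

import time
from openai import RateLimitError

def complete_with_retry(client, attempts=3, **kwargs):
    for attempt in range(attempts):
        try:
            return client.chat.completions.create(**kwargs)
        except RateLimitError:
            time.sleep(2 ** attempt)   # exponential backoff: 1s, 2s, 4s
    raise RuntimeError("rate limit retries exhausted")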
