API Reference
Inference Gateway
OpenAI-compatible inference endpoints
The Inference Gateway provides high-performance, streaming-capable endpoints for LLM inference and is fully compatible with the OpenAI API specification.
Base URL
http://localhost:8001/v1
Authentication
All requests require a Bearer token (an API key generated from the Dashboard), passed in the Authorization header:
Authorization: Bearer sk-inferia-...
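Because the gateway follows the OpenAI API specification, any OpenAI-compatible client can be pointed at it. Below is a minimal sketch using the official openai Python package and the base URL above; the INFERIA_API_KEY environment variable name is only an example, not something the gateway defines:
import os
from openai import OpenAI

# Point the OpenAI SDK at the Inference Gateway (openai >= 1.0 assumed).
client = OpenAI(
    base_url="http://localhost:8001/v1",      # gateway base URL
    api_key=os.environ["INFERIA_API_KEY"],    # API key from the Dashboard (env var name is illustrative)
)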
Endpoints
Chat Completions
POST /v1/chat/completions
The main endpoint for chat-based inference.
Request Body
| Field | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Deployment name (e.g., llama-3-8b) |
| messages | array | Yes | Conversation messages |
| stream | boolean | No | Enable SSE streaming (default: false) |
| temperature | float | No | Sampling temperature |
| max_tokens | integer | No | Maximum tokens to generate |
Example Request
curl http://localhost:8001/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $API_KEY" \
-d '{
"model": "llama-3-8b",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": true
}'Response (Non-Streaming)
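The same request can be issued with the client configured in the Authentication section. The sketch below uses the openai Python package; with stream omitted (or false), the full JSON response shown next is returned in one piece:
# Non-streaming chat completion via the OpenAI SDK (a sketch, not the only option).
response = client.chat.completions.create(
    model="llama-3-8b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)  # assistant reply text
print(response.usage.total_tokens)          # token accounting from the usage block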
Response (Non-Streaming)
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "llama-3-8b",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Hello! How can I help you today?"
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 15,
    "total_tokens": 25
  }
}
Response (Streaming)
When stream is true, the response is delivered as Server-Sent Events (SSE):
data: {"id":"...","choices":[{"delta":{"content":"Hello"}}]}
data: {"id":"...","choices":[{"delta":{"content":"!"}}]}
data: [DONE]
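With stream set to true, chunks arrive as the deltas shown above until the [DONE] sentinel. A sketch of consuming the stream with the same Python client; the chunk objects mirror the SSE payloads:
# Streaming chat completion: print content deltas as they arrive (sketch).
stream = client.chat.completions.create(
    model="llama-3-8b",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()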
Request Lifecycle
- Authentication: Validate API key
- Context Resolution: Resolve deployment config from Filtration Gateway
- Rate Limiting: Check request/token limits
- Input Guardrails: Scan for PII, toxicity, prompt injection
- Prompt Processing: Apply templates, inject RAG context
- Upstream Call: Route to LLM provider
- Output Guardrails: Scan response (async or blocking)
- Logging: Record inference metadata
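Ordering matters here: input guardrails run before the upstream call and output guardrails after it, so a blocked prompt never reaches the provider. The sketch below is purely illustrative of that flow; every function name in it is hypothetical and is not part of the gateway's actual code or API:
# Hypothetical sketch of the request lifecycle; all helper names are invented
# for illustration and do not exist in the gateway.
def handle_chat_completion(request):
    deployment = authenticate_and_resolve(request)        # API key check + deployment config lookup
    enforce_rate_limits(deployment, request)               # request/token limits
    scan_input(request.messages, deployment.policy)        # PII, toxicity, prompt injection
    prompt = build_prompt(request, deployment)             # templates + RAG context
    response = call_upstream(deployment.provider, prompt)  # route to the LLM provider
    scan_output(response, deployment.policy)               # async or blocking output scan
    log_inference(request, response)                       # record inference metadata
    return response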
Automatic Features
The following are applied automatically based on deployment configuration:
- Prompt Templates: System prompts and variable injection
- RAG Context: Retrieved documents from Knowledge Base
- Guardrails: Safety scanning based on deployment policy
- Rate Limits: Per-deployment token/request limits
