
Features

Core capabilities of InferiaLLM

InferiaLLM provides a comprehensive set of features for managing LLM infrastructure in production.

Inference

OpenAI-Compatible API

A drop-in replacement for the OpenAI API, with full streaming support.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8001/v1",
    api_key="sk-inferia-..."
)

response = client.chat.completions.create(
    model="llama-3-8b",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True
)

for chunk in response:
    print(chunk.choices[0].delta.content or "", end="")

Automatic Features

  • Prompt Templates: System prompts injected automatically
  • RAG Context: Retrieved documents from Knowledge Base
  • Guardrails: Safety scanning per deployment config

Compute Orchestration

Multi-Provider Support

Provider     Type            Description
Kubernetes   On-Prem/Cloud   Standard GPU clusters
SkyPilot     Multi-Cloud     AWS, GCP, Azure VMs
Nosana       DePIN           Decentralized GPU network
Akash        DePIN           Decentralized compute
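As an illustration of how a deployment might target one of these providers, here is a minimal spec sketch. The field names are hypothetical, not InferiaLLM's actual configuration schema:

```python
# Hypothetical deployment spec; field names are illustrative only,
# not InferiaLLM's actual configuration schema.
deployment_spec = {
    "name": "llama-3-8b-prod",
    "model": "llama-3-8b",
    "provider": "kubernetes",   # or "skypilot", "nosana", "akash"
    "gpu_type": "A100-40GB",
    "replicas": 2,
}

def validate_provider(spec: dict) -> bool:
    """Check that the spec targets one of the supported providers."""
    return spec.get("provider") in {"kubernetes", "skypilot", "nosana", "akash"}

print(validate_provider(deployment_spec))  # True
```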

Deployment Management

  • Create, start, stop, terminate deployments
  • Monitor replica status and health
  • View inference and terminal logs
  • Configure rate limits per deployment
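The create/start/stop/terminate actions above form a simple lifecycle. As a sketch (the state and action names are assumptions, not InferiaLLM internals), the valid transitions can be modeled like this:

```python
# Illustrative deployment lifecycle; states and transitions are
# assumptions based on the actions listed above, not InferiaLLM internals.
TRANSITIONS = {
    "created":    {"start"},
    "running":    {"stop", "terminate"},
    "stopped":    {"start", "terminate"},
    "terminated": set(),
}

STATE_AFTER = {"start": "running", "stop": "stopped", "terminate": "terminated"}

def apply(state: str, action: str) -> str:
    """Return the next state, or raise if the action is invalid in this state."""
    if action not in TRANSITIONS[state]:
        raise ValueError(f"cannot {action} a {state} deployment")
    return STATE_AFTER[action]
```

For example, `apply("created", "start")` yields `"running"`, while starting a terminated deployment raises an error.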

Security & Access Control

Authentication

  • JWT-based authentication
  • TOTP two-factor authentication (2FA)
  • Invitation-based onboarding
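The TOTP codes used for 2FA are defined by RFC 6238. For illustration only (InferiaLLM verifies codes server-side), a minimal stdlib implementation of the SHA-1 variant looks like:

```python
import hashlib
import hmac
import struct

def totp(secret: bytes, unix_time: int, digits: int = 6, step: int = 30) -> str:
    """Compute an RFC 6238 time-based one-time password (SHA-1 variant)."""
    counter = struct.pack(">Q", unix_time // step)  # 8-byte big-endian time step
    digest = hmac.new(secret, counter, hashlib.sha1).digest()
    offset = digest[-1] & 0x0F                      # dynamic truncation (RFC 4226)
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)
```

With the RFC 6238 test key `b"12345678901234567890"` at time 59, this produces the reference 8-digit code `94287082`.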

RBAC (Role-Based Access Control)

Role        Capabilities
Admin       Full access, user management
Developer   Deployments, API keys, configs
User        Read access, limited API keys
Guest       View only
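A role check against this table reduces to a set lookup. The sketch below mirrors the table; the capability strings are illustrative, not InferiaLLM's actual permission names:

```python
# Role → capability mapping mirroring the RBAC table above; the
# capability strings are illustrative, not InferiaLLM's permission names.
ROLE_CAPS = {
    "admin":     {"read", "deploy", "api_keys", "configs", "manage_users"},
    "developer": {"read", "deploy", "api_keys", "configs"},
    "user":      {"read", "limited_api_keys"},
    "guest":     {"read"},
}

def can(role: str, capability: str) -> bool:
    """Return True if the role grants the capability; unknown roles get nothing."""
    return capability in ROLE_CAPS.get(role, set())
```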

API Keys

  • Generate scoped API keys per deployment
  • Automatic rotation support
  • Usage tracking and quotas
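One common pattern for scoped keys is to embed the scope in the key and store only a hash server-side. A sketch under that assumption (the generation scheme is illustrative, not InferiaLLM's; only the `sk-inferia-` prefix comes from the example earlier):

```python
import hashlib
import secrets

def issue_api_key(deployment_id: str) -> tuple[str, str]:
    """Generate a deployment-scoped key and the digest a server would store.

    Illustrative sketch: the "sk-inferia-" prefix matches the earlier
    example key, but the generation scheme is an assumption.
    """
    key = f"sk-inferia-{deployment_id}-{secrets.token_urlsafe(24)}"
    return key, hashlib.sha256(key.encode()).hexdigest()
```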

Guardrails

Safety Scanners

  • PII Detection: Redact emails, phone numbers, SSNs
  • Toxicity Filter: Block harmful content
  • Prompt Injection: Detect jailbreak attempts
  • Secret Detection: Prevent API key leakage
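To make the PII scanner concrete, here is a minimal redaction sketch covering two of the patterns above (emails and US SSNs). Production scanners such as LLM Guard use far more robust detectors; these regexes are deliberately simple:

```python
import re

# Minimal PII redaction sketch; real scanners use stronger detectors.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def redact_pii(text: str) -> str:
    """Replace email addresses and SSN-shaped strings with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return SSN.sub("[SSN]", text)

print(redact_pii("Mail bob@example.com, SSN 123-45-6789"))
# Mail [EMAIL], SSN [SSN]
```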

Providers

  • LLM Guard (Local)
  • Llama Guard (Groq API)
  • Lakera Guard (API)

Knowledge Base (RAG)

Document Management

  • Upload PDF, DOCX, TXT files
  • Automatic chunking and embedding
  • ChromaDB vector storage
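The "automatic chunking" step splits documents into overlapping windows before embedding. A character-based sketch (the sizes are illustrative; InferiaLLM's actual chunker is configured server-side):

```python
# Sketch of the chunking step: fixed-size, overlapping character windows.
# Sizes are illustrative; the real chunker and embedder are server-side.
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of `size` chars, each overlapping the next by `overlap`."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

For a 1000-character document with the defaults, this yields three chunks, with each adjacent pair sharing 50 characters of overlap so that no sentence is lost at a boundary.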

Deployment Integration

  • Link collections to deployments
  • Automatic context retrieval
  • Configurable chunk count

Prompt Templates

Features

  • Jinja2 templating engine
  • Variable injection
  • Version management
  • Per-deployment assignment

Variables

  • {{user_message}} - Current user input
  • {{context}} - RAG-retrieved content
  • Custom variables via API
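Putting the variables together, rendering a template with Jinja2 looks like the following. The template wording is an example, not a built-in:

```python
from jinja2 import Template

# Example template using the two built-in variables; the system-prompt
# wording here is illustrative.
template = Template(
    "Use the following context to answer.\n"
    "Context: {{context}}\n"
    "User: {{user_message}}"
)

rendered = template.render(
    context="InferiaLLM supports RAG.",
    user_message="What is RAG?",
)
print(rendered)
```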

Observability

Inference Logs

  • Request/response pairs
  • Token usage and latency
  • Model and deployment metadata

Audit Logs

  • All administrative actions
  • Security events
  • Immutable history

Metrics

  • Prometheus-compatible export
  • Request latency (p50, p95, p99)
  • Token throughput
  • Error rates
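As a sketch of how the latency percentiles above are derived from raw samples, the stdlib `statistics.quantiles` can compute p50/p95/p99 (the sample data here is made up):

```python
import statistics

# Derive p50/p95/p99 from raw request latencies (milliseconds).
# The sample data is made up for illustration.
latencies_ms = list(range(1, 101))  # pretend: 100 requests, 1..100 ms

cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
p50, p95, p99 = cuts[49], cuts[94], cuts[98]
print(p50, p95, p99)
```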
