FleetLM
A simple LLM app quickly grows into a stack of services just to stay alive.
FleetLM collapses that stack into one layer that scales with you.
Keep long LLM conversations fast and cheap with built-in context compaction.
Frontend (React hook)
// Drop-in React hook
const { messages, sendMessage } = useFleetlmChat({
  userId: "alice",
  agentId: "my-llm",
});
// Send message; streaming handled automatically
await sendMessage({ text: "Hello!" });

Backend (Next.js API route)
import { streamText } from "ai";
import { openai } from "@ai-sdk/openai";

// handleLLM comes from FleetLM's SDK; it passes your handler the session's messages.
export const POST = handleLLM(async ({ messages }) => {
  const result = await streamText({
    model: openai("gpt-4o-mini"),
    messages,
  });
  return result.toUIMessageStream();
});

FleetLM handles the rest: state, ordering, streaming, replay, and compaction.
7.5k msg/s (3 nodes) · p99 < 150 ms · Raft + Postgres · Apache 2.0 open source

1. Simple LLM calls → works great
2. Add persistence → need database
3. Multi-device sync → need pub/sub
4. Context limits hit → need compaction
5. Debug sessions → need event log

Five systems. Distributed nightmare.
FleetLM does all of this out of the box.
Write stateless REST. Deploy with Docker Compose. We handle state, ordering, replay, and compaction.
How it works:
FleetLM manages session state on top of Postgres using Raft consensus. You don't query the database directly or tune it for performance; FleetLM keeps it out of the hot path so you can focus on your LLM logic. FleetLM is not a vector database or RAG system.
Your backend handles webhooks. Keep your stack (FastAPI, Express, Go).
Raft consensus, at-least-once delivery, zero data loss.
WebSockets stream every message, REST for polling.
Leader election handles crashes (~150ms recovery).
Add nodes to handle more concurrent sessions and throughput.
Docker Compose for local dev. Kubernetes or FleetLM Cloud for production.
| | Without compaction | With compaction |
|---|---|---|
| Tokens per request | 100,000+ | ~5,000 |
| Cost per conversation | $50+ | $2.50 (95% reduction) |
| Inference | Slow (long context) | Fast |
| Conversation length | Hits context window limits | Infinite |
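The arithmetic is proportional: per-request cost scales roughly with prompt tokens, so a smaller prompt means a proportionally smaller bill. A back-of-the-envelope sketch using the illustrative figures above (not a benchmark):

// Back-of-the-envelope only: cost scales roughly linearly with prompt tokens.
const tokensWithoutCompaction = 100_000;
const tokensWithCompaction = 5_000;
const costWithoutCompaction = 50; // $ per conversation (illustrative figure)

const costWithCompaction =
  costWithoutCompaction * (tokensWithCompaction / tokensWithoutCompaction);

console.log(costWithCompaction.toFixed(2)); // "2.50" → ~95% cost reduction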
You implement the compression strategy:
export const POST = handleCompaction(async ({ messages }) => {
  // FleetLM calls this when context grows too large
  const summary = await summarize(messages);
  return { compacted: summary };
});

FleetLM detects when sessions exceed your token threshold, calls your compaction webhook, and stores the result. You control the summarization strategy; FleetLM handles the orchestration.
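The summarize helper above is yours to write. Here is a minimal sketch using the same AI SDK as the earlier backend example; the prompt, model choice, and message shape are assumptions for illustration, not FleetLM requirements:

import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

// One possible strategy: collapse older turns into a single compact summary
// that preserves names, decisions, and open questions.
// NOTE: the { role, content } message shape is assumed for this sketch;
// adapt the mapping to whatever your webhook actually receives.
async function summarize(messages: Array<{ role: string; content: string }>) {
  const transcript = messages.map((m) => `${m.role}: ${m.content}`).join("\n");

  const { text } = await generateText({
    model: openai("gpt-4o-mini"),
    prompt:
      "Summarize this conversation so it can replace the original messages. " +
      "Keep names, decisions, and unresolved questions.\n\n" + transcript,
  });

  return text;
}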
Self-host ready: Production-capable, Apache 2.0 open source
Managed cloud: Launching Q1 2026 — join waitlist for early access
The same open-source core, managed for reliability and scale.
The FleetLM core is Apache 2.0 licensed and will always be free to self-host.
FleetLM Cloud is for professionals who want reliability handled for them: uptime, scaling, and observability.
We run the same core for you so you can ship product, not babysit infrastructure. All data is tenant-isolated and encrypted at rest. We never train on your data.
No hidden tiers or feature gates: you always get the best version.
| Feature | Cloud Pro | Self-Host | 
|---|---|---|
| Price | $79/mo | Free | 
| Management | We handle it | You manage | 
| Updates | Automatic | Manual | 
| Support | Email, founder access | Community | 
| Control | Managed | Full custom | 
Enterprise: For more than 1 million messages per month, private VPC, or custom SLAs → Contact Us
7.5k msg/s sustained throughput (3-node cluster)
<150 ms p99 latency, including Raft quorum
0% message drops with Raft consensus
The LLM slows down before FleetLM does.
Detailed benchmark methodology and results are available in the documentation.
Is FleetLM a vector database or RAG system?
No. FleetLM is session state infrastructure for LLM chat. It handles message persistence, multi-device sync, and context compaction. Use it alongside your existing vector DB and RAG setup.

How does context compaction work?
You implement a handleCompaction webhook that receives old messages and returns a summary. FleetLM detects when sessions exceed your token threshold, calls your webhook, and stores the compressed result. You control the summarization strategy.

Which LLM providers does FleetLM support?
FleetLM is provider-agnostic. Your webhook can call OpenAI, Anthropic, local models, or any LLM API. FleetLM only handles session state; your LLM logic stays in your code.

How do I run FleetLM in production?
Self-host on Kubernetes with the open-source Apache 2.0 version, or use FleetLM Cloud (launching Q1 2026). Docker Compose is for local development only, not production.

What happens if a node crashes?
FleetLM uses Raft consensus for automatic failover. If a node crashes, a new leader is elected within ~150 ms. Every committed message is replicated to at least 2 of 3 nodes, so no data is lost. For Cloud customers, we handle uptime and scaling.

Who owns my data when I self-host?
You do. When self-hosting, you bring your own Postgres database and infrastructure. FleetLM runs in your environment, and you retain full ownership and control of all session data.

Do you train on my data?
No. FleetLM never trains on your data. For Cloud customers, all data is tenant-isolated and encrypted at rest.
Run it once, and stop thinking about gnarly chat infra.