FleetLM
A simple LLM app quickly grows into a stack of services just to stay alive.
FleetLM collapses that stack into one layer that scales with you.
Keep long LLM conversations fast and cheap with built-in context compaction.
Frontend (React hook)
// Drop-in React hook
const { messages, sendMessage } = useFleetlmChat({
  userId: "alice",
  agentId: "my-llm",
});
// Send message; streaming handled automatically
await sendMessage({ text: "Hello!" });

Backend (Next.js API route)
import { streamText } from "ai";
import { openai } from "@ai-sdk/openai";

// handleLLM comes from FleetLM's SDK; it passes your handler the session's messages.
export const POST = handleLLM(async ({ messages }) => {
  const result = await streamText({
    model: openai("gpt-4o-mini"),
    messages,
  });
  return result.toUIMessageStream();
});

FleetLM handles the rest: state, ordering, streaming, replay, and compaction.
7.5k msg/s (3 nodes) · p99 < 150 ms · Raft + Postgres · Apache 2.0 open source

1. Simple LLM calls → works great
2. Add persistence → need database
3. Multi-device sync → need pub/sub
4. Context limits hit → need compaction
5. Debug sessions → need event log

Five systems. Distributed nightmare.
FleetLM does all of this out of the box.
Write stateless REST. Deploy with Docker Compose. We handle state, ordering, replay, and compaction.
How it works:
FleetLM manages session state on top of Postgres using Raft consensus. You don't query the database directly or tune it for performance; FleetLM keeps it out of the hot path so you can focus on your LLM logic. FleetLM is not a vector database or RAG system.
Your backend handles webhooks. Keep your stack (FastAPI, Express, Go).
Raft consensus, at-least-once delivery, zero data loss.
WebSockets stream every message, REST for polling.
Leader election handles crashes (~150ms recovery).
Add nodes to handle more concurrent sessions and throughput.
Docker Compose for local dev. Kubernetes or FleetLM Cloud for production.
| | Without compaction | With compaction |
|---|---|---|
| Tokens per request | 100,000+ | ~5,000 |
| Cost per conversation | $50+ | $2.50 (95% reduction) |
| Inference | Slow (long context) | Fast |
| Conversation length | Hits context window limits | Infinite |
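The arithmetic is proportional: per-request cost scales roughly with prompt tokens, so a smaller prompt means a proportionally smaller bill. A back-of-the-envelope sketch using the illustrative figures above (not a benchmark):

// Back-of-the-envelope only: cost scales roughly linearly with prompt tokens.
const tokensWithoutCompaction = 100_000;
const tokensWithCompaction = 5_000;
const costWithoutCompaction = 50; // $ per conversation (illustrative figure)

const costWithCompaction =
  costWithoutCompaction * (tokensWithCompaction / tokensWithoutCompaction);

console.log(costWithCompaction.toFixed(2)); // "2.50" → ~95% cost reduction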
You implement the compression strategy:
export const POST = handleCompaction(async ({ messages }) => {
  // FleetLM calls this when context grows too large
  const summary = await summarize(messages);
  return { compacted: summary };
});

FleetLM detects when sessions exceed your token threshold, calls your compaction webhook, and stores the result. You control the summarization strategy; FleetLM handles the orchestration.
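The summarize helper above is yours to write. Here is a minimal sketch using the same AI SDK as the earlier backend example; the prompt, model choice, and message shape are assumptions for illustration, not FleetLM requirements:

import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

// One possible strategy: collapse older turns into a single compact summary
// that preserves names, decisions, and open questions.
// NOTE: the { role, content } message shape is assumed for this sketch;
// adapt the mapping to whatever your webhook actually receives.
async function summarize(messages: Array<{ role: string; content: string }>) {
  const transcript = messages.map((m) => `${m.role}: ${m.content}`).join("\n");

  const { text } = await generateText({
    model: openai("gpt-4o-mini"),
    prompt:
      "Summarize this conversation so it can replace the original messages. " +
      "Keep names, decisions, and unresolved questions.\n\n" + transcript,
  });

  return text;
}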
Self-host ready: Production-capable, Apache 2.0 open source
Managed cloud: Launching Q1 2026 — join waitlist for early access
The same open-source core, managed for reliability and scale.
The FleetLM core is Apache 2.0 licensed and will always be free to self-host.
FleetLM Cloud is for professionals who want reliability handled for them: uptime, scaling, and observability.
We run the same core for you so you can ship product, not babysit infrastructure. All data is tenant-isolated and encrypted at rest. We never train on your data.
No hidden tiers or feature gates: you always get the best version.
| Feature | Cloud Pro | Self-Host | 
|---|---|---|
| Price | $79/mo | Free | 
| Management | We handle it | You manage | 
| Updates | Automatic | Manual | 
| Support | Email, founder access | Community | 
| Control | Managed | Full custom | 
Enterprise: For more than 1 million messages per month, private VPC, or custom SLAs → Contact Us
7.5k msg/s sustained throughput (3-node cluster)
<150 ms p99 latency, including Raft quorum
0% message drops with Raft consensus
The LLM slows down before FleetLM does.
Detailed benchmark methodology and results are available in the documentation.
Is FleetLM a vector database or RAG system?
No. FleetLM is session state infrastructure for LLM chat. It handles message persistence, multi-device sync, and context compaction. Use it alongside your existing vector DB and RAG setup.

How does context compaction work?
You implement a handleCompaction webhook that receives old messages and returns a summary. FleetLM detects when sessions exceed your token threshold, calls your webhook, and stores the compressed result. You control the summarization strategy.

Which LLM providers does FleetLM support?
FleetLM is provider-agnostic. Your webhook can call OpenAI, Anthropic, local models, or any LLM API. FleetLM only handles session state; your LLM logic stays in your code.

How do I run FleetLM in production?
Self-host on Kubernetes with the open-source Apache 2.0 version, or use FleetLM Cloud (launching Q1 2026). Docker Compose is for local development only, not production.

What happens if a node crashes?
FleetLM uses Raft consensus for automatic failover. If a node crashes, a new leader is elected within ~150 ms. Every committed message is replicated to at least 2 of 3 nodes, so no data is lost. For Cloud customers, we handle uptime and scaling.

Who owns my data when I self-host?
You do. When self-hosting, you bring your own Postgres database and infrastructure. FleetLM runs in your environment, and you retain full ownership and control of all session data.

Do you train on my data?
No. FleetLM never trains on your data. For Cloud customers, all data is tenant-isolated and encrypted at rest.
Run it once, and stop thinking about gnarly chat infra.