Introduction
This is part of a series of brief posts about Golem 1.5, releasing at the end of April 2026. The piece assumes reader familiarity with Golem and references other related posts for additional context. Check the other Golem-related posts for more information.
Quotas
Modern applications rely on third-party services, particularly AI agents that interact with external systems and LLM providers. These services typically impose costs and usage limits. Golem 1.5 introduces quota management to help developers control the parallel running agents and make sure we don’t over-use the limited resources.
The feature allows developers to define resources with limited availability and enforce reservations through quota tokens. Tokens can be split and passed between agents via RPC calls.
Setting Up Resources
Resources are defined per environment in the application manifest. Three limit types exist:
- Rate: Refillable pools that replenish by a specified value within a period
- Capacity: Fixed tokens that never refill once consumed
- Concurrency: Fixed pools where agents can temporarily reserve tokens
Enforcement actions include reject (return error), throttle (suspend agent), or terminate (kill agent).
resourceDefaults:
prod:
api-calls:
limit:
type: Rate
value: 100
period: minute
max: 1000
enforcementAction: reject
unit: request
units: requests
storage:
limit:
type: Capacity
value: 1073741824
enforcementAction: reject
unit: byte
units: bytes
connections:
limit:
type: Concurrency
value: 50
enforcementAction: throttle
unit: connection
units: connections
Dynamic Management
Resource limits can be modified via CLI commands or REST API calls, affecting running agents immediately without redeployment.
Token acquisition
import { acquireQuotaToken } from "golem-ts-sdk";
const token = acquireQuotaToken("api-calls", 1n);
use golem_rust::quota::QuotaToken;
let token = QuotaToken::new("api-calls", 1);
import golem.host.QuotaApi._
val token = QuotaToken("api-calls", BigInt(1))
let token = @quota.QuotaToken::new("api-calls", 1UL)
Simple rate limiting with withReservation
import { withReservation } from "golem-ts-sdk";
const result = await withReservation(token, 1n, async (reservation) => {
const response = await callSimpleApi();
return { used: BigInt(1), value: response };
});
use golem_rust::quota::with_reservation;
let result = with_reservation(&token, 1, |_reservation| {
let response = call_simple_api();
(1, response)
});
val result = withReservation(token, BigInt(1)) { reservation =>
callSimpleApi().map { response =>
(BigInt(1), response)
}
}
let result = @quota.with_reservation(token, 1UL, fn(reservation) {
let response = callSimpleApi()
(1, response)
})
LLM rate limiting based on token consumption
Token splitting and merging
const childToken: QuotaToken = token.split(200n);
const childAgent = await SummarizerAgent.newPhantom();
const summary = childAgent.summarize(text, childToken);
let child_token: QuotaToken = token.split(200);
let child_agent = SummarizerAgent::new_phantom().await;
let summary = child_agent.summarize(text, child_token).await;
val childToken: QuotaToken = token.split(BigInt(200))
for {
childAgent <- SummarizerAgent.newPhantom()
summary <- childAgent.summarize(text, childToken)
} yield summary
let child_token: QuotaToken = self.token.split(200UL)
let child_agent = SummarizerAgent::new_phantom()
child_agent.summarize(text, child_token)