Scaling & concurrency
Worker bounds, concurrent inputs, and request batching.
Worker bounds
min_workers sets how many containers stay warm, so requests served by those containers skip the cold start. max_workers caps fan-out. The defaults are min_workers=0 and max_workers=100. Setting min_workers=1 eliminates the first-request cold start for latency-sensitive endpoints.
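
How the two bounds interact can be illustrated with a small clamp function. This is a sketch under assumptions, not the platform's scaling policy: the name `target_workers`, the `inputs_per_worker` parameter, and the "one worker per pending input" rule are hypothetical and only show that the worker count never leaves the range between min_workers and max_workers.

```python
import math


def target_workers(pending_inputs: int, inputs_per_worker: int = 1,
                   min_workers: int = 0, max_workers: int = 100) -> int:
    """Hypothetical rule: enough workers for the backlog, clamped to the bounds."""
    if min_workers > max_workers:
        raise ValueError("min_workers must not exceed max_workers")
    wanted = math.ceil(pending_inputs / inputs_per_worker)
    return max(min_workers, min(wanted, max_workers))


print(target_workers(0))                    # idle with default bounds -> 0
print(target_workers(0, min_workers=1))     # idle, one warm worker kept -> 1
print(target_workers(500, max_workers=20))  # backlog larger than the cap -> 20
```
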
```python
@app.task(min_workers=0, max_workers=20)
async def run(): ...
```
Concurrent inputs
Let one worker handle several inputs simultaneously. Best for I/O-bound class tasks (embedding servers, API proxies) where GPU utilisation would otherwise be low.
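
To make the idea concrete, here is a minimal sketch in plain asyncio of one worker overlapping several I/O-bound inputs. It is not the library's implementation: `run_with_limit` and `handle` are hypothetical names, and the semaphore simply stands in for a per-worker input limit such as max_inputs.

```python
import asyncio


async def run_with_limit(n_inputs: int, max_inputs: int) -> int:
    """Process n_inputs on one simulated worker; return the peak number in flight."""
    slots = asyncio.Semaphore(max_inputs)
    in_flight = 0
    peak = 0

    async def handle(i: int) -> None:
        nonlocal in_flight, peak
        async with slots:              # at most max_inputs bodies run at once
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for an I/O wait
            in_flight -= 1

    await asyncio.gather(*(handle(i) for i in range(n_inputs)))
    return peak


peak = asyncio.run(run_with_limit(n_inputs=20, max_inputs=8))
print(peak)  # 8: the limit is reached but never exceeded
```

Because each input spends most of its time waiting, the 20 inputs finish in roughly three waits rather than twenty.
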
```python
@app.cls(gpu=gw.Gpu("A100"))
@gw.concurrent(max_inputs=8)
class Server: ...
```
Batching
Coalesce inputs that arrive within wait_ms of each other into a single GPU call. max_batch_size caps the list length. The method receives list[input] and must return list[output].
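
The coalescing behaviour can be sketched with a toy batcher built on asyncio. This is an illustration under stated assumptions, not how the library does it: the `Batcher` class is hypothetical, the batch function is synchronous, and the wait window is assumed to start when the first input of a batch arrives.

```python
import asyncio


class Batcher:
    """Toy batcher: groups inputs arriving within wait_ms, up to max_batch_size."""

    def __init__(self, fn, max_batch_size: int, wait_ms: int):
        self.fn = fn                      # takes list[input], returns list[output]
        self.max_batch_size = max_batch_size
        self.wait_s = wait_ms / 1000
        self.pending = []                 # (input, future) pairs awaiting a flush
        self.timer = None

    async def submit(self, item):
        loop = asyncio.get_running_loop()
        fut = loop.create_future()
        self.pending.append((item, fut))
        if len(self.pending) >= self.max_batch_size:
            self._flush()                 # batch is full: run it now
        elif self.timer is None:
            self.timer = loop.call_later(self.wait_s, self._flush)
        return await fut

    def _flush(self):
        if self.timer is not None:
            self.timer.cancel()
            self.timer = None
        batch, self.pending = self.pending, []
        if not batch:
            return
        outputs = self.fn([item for item, _ in batch])
        if len(outputs) != len(batch):
            raise ValueError("batch function must return one output per input")
        for (_, fut), out in zip(batch, outputs):
            fut.set_result(out)


calls = []  # records the size of each batch the function received


def lengths(texts):
    calls.append(len(texts))
    return [len(t) for t in texts]


async def main():
    batcher = Batcher(lengths, max_batch_size=2, wait_ms=20)
    texts = ["a", "bb", "ccc", "dddd", "eeeee"]
    return await asyncio.gather(*(batcher.submit(t) for t in texts))


results = asyncio.run(main())
print(results)  # [1, 2, 3, 4, 5]
print(calls)    # [2, 2, 1]: two full batches, then one flushed by the timer
```
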
```python
@gw.method()
@gw.batched(max_batch_size=16, wait_ms=50)
async def embed(self, texts: list[str]) -> list[list[float]]:
    return self.model.encode(texts).tolist()
```
Fan-out with map
Dispatch the same task against a list of inputs in parallel. .map() returns an iterator of results in submission order.
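
"Submission order" means results line up with the inputs even when later inputs finish first. The standard library's `ThreadPoolExecutor.map` has the same ordering guarantee, so it serves as a local analogy here; the `process` function and its delays are hypothetical stand-ins, and nothing below calls the library itself.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Earlier items sleep longer, so they are expected to finish last.
DELAYS = {"item-1": 0.04, "item-2": 0.03, "item-3": 0.02, "item-4": 0.01}
finished = []  # completion order, recorded for comparison


def process(item: str) -> str:
    time.sleep(DELAYS[item])
    finished.append(item)
    return item.upper()


items = ["item-1", "item-2", "item-3", "item-4"]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process, items))  # parallel, ordered by input

print(results)   # ['ITEM-1', 'ITEM-2', 'ITEM-3', 'ITEM-4']
print(finished)  # completion order, typically the reverse of the input order
```
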
```python
items = ["item-1", "item-2", "item-3", "item-4"]
results = list(process.map(items))  # parallel dispatch
print(results)
```
Spawn and await
.spawn() submits the task and returns a handle without blocking. Call .get() on the handle to collect the result later.
```python
handle = slow_task.spawn(x=42)
# do other work ...
result = handle.get()  # blocks until done
```
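
The same submit-now, collect-later pattern exists in the standard library as a future, which makes for a runnable local analogy. This sketch assumes nothing about the library's internals: `pool.submit` plays the role of .spawn(), `Future.result()` plays the role of .get(), and `slow_task` here is a hypothetical local function.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_task(x: int) -> int:
    time.sleep(0.05)  # stand-in for slow remote work
    return x * 2


pool = ThreadPoolExecutor(max_workers=1)
handle = pool.submit(slow_task, x=42)  # returns immediately, like .spawn()
other_work = sum(range(1000))          # runs while slow_task is in progress
result = handle.result()               # blocks until done, like .get()
pool.shutdown()

print(other_work, result)  # 499500 84
```
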