Introducing TokenRateLimitPolicy for LLM API Protection
We're excited to announce a new `TokenRateLimitPolicy` API in Kuadrant, released as `v1alpha1` and designed specifically for protecting Large Language Model (LLM) APIs through token-based rate limiting.
Rate-limiting and LLMs
Traditional rate limiting (for example, our `RateLimitPolicy`) counts requests, but LLM API requests have widely varying computational costs depending on how many tokens each inference consumes. Platform engineers and SRE teams who make models available for internal or external use (whether in-cluster or hosted elsewhere) will likely want to protect these expensive resources from overconsumption.
TokenRateLimitPolicy
The `TokenRateLimitPolicy` introduces rate limiting that:
- Counts tokens consumed - automatically extracts token usage from OpenAI-style LLM responses
- Enables tiered access - combined with an `AuthPolicy`, enables different token limits for different user groups
- Works with OpenAI-compatible APIs - supports standard token usage response formats
How It Works
Here is a simple example that sets different daily token limits for user tiers:
```yaml
apiVersion: kuadrant.io/v1alpha1
kind: TokenRateLimitPolicy
metadata:
  name: llm-token-limits
spec:
  targetRef:
    group: gateway.networking.k8s.io
    kind: Gateway
    name: api-gateway
  limits:
    free:
      rates:
        - limit: 20000
          window: 24h
      when:
        - predicate: request.path == "/v1/chat/completions"
        - predicate: >
            request.auth.identity.metadata.annotations["kuadrant.io/groups"].split(",").exists(g, g == "free")
      counters:
        - expression: auth.identity.userid
    gold:
      rates:
        - limit: 200000
          window: 24h
      when:
        - predicate: request.path == "/v1/chat/completions"
        - predicate: >
            request.auth.identity.metadata.annotations["kuadrant.io/groups"].split(",").exists(g, g == "gold")
      counters:
        - expression: auth.identity.userid
```
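The predicates and counters above key off a `kuadrant.io/groups` annotation and an `auth.identity.userid` attribute, both of which come from whatever authentication your `AuthPolicy` establishes. As a rough sketch only (this follows the API-key pattern from Kuadrant's authenticated rate limiting guides; names such as `llm-api-key-auth`, the `APIKEY` prefix, and the annotation values are placeholders, and exact field names may vary between Kuadrant versions), a companion `AuthPolicy` and per-user Secret could look something like this:

```yaml
apiVersion: kuadrant.io/v1
kind: AuthPolicy
metadata:
  name: llm-api-key-auth
spec:
  targetRef:
    group: gateway.networking.k8s.io
    kind: Gateway
    name: api-gateway
  rules:
    authentication:
      "api-key-users":
        apiKey:
          selector:
            matchLabels:
              app: llm-gateway        # match the labels on your API-key Secrets
        credentials:
          authorizationHeader:
            prefix: APIKEY            # clients send "Authorization: APIKEY <key>"
    response:
      success:
        filters:
          "identity":
            json:
              properties:
                "userid":
                  # expose a stable per-user id so the counter expression
                  # auth.identity.userid has something to count against
                  selector: auth.identity.metadata.annotations.secret\.kuadrant\.io/user-id
---
apiVersion: v1
kind: Secret
metadata:
  name: alice-api-key
  labels:
    authorino.kuadrant.io/managed-by: authorino
    app: llm-gateway
  annotations:
    secret.kuadrant.io/user-id: alice
    kuadrant.io/groups: free          # drives the group predicate in the policy above
stringData:
  api_key: iamafreeuser
type: Opaque
```

With something like that in place, requests authenticated with Alice's key count against the `free` tier's 20,000-token daily budget, keyed by her user id.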
The policy automatically extracts the `usage.total_tokens` field from responses returned by OpenAI-style LLM APIs. Currently, only non-streaming responses are supported (where `stream: false` is set or the field is omitted in the request).
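A non-streaming request routed through the gateway might look like the following; the hostname and model name are illustrative, and the `APIKEY` credential matches the hedged `AuthPolicy` sketch above rather than anything prescribed by the policy itself:

```sh
curl -sS https://llm.example.com/v1/chat/completions \
  -H 'Authorization: APIKEY iamafreeuser' \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "gpt-3.5-turbo",
        "stream": false,
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```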
For example, a typical response from `/v1/chat/completions`:
```json
{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "created": 1677652288,
  "model": "gpt-3.5-turbo-0613",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Hello! How can I assist you today?"
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 85,
    "total_tokens": 100
  }
}
```
The `TokenRateLimitPolicy` reads the `usage.total_tokens` value (100 in this example) and counts it against the user's configured limit.
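To put that in context: against the free tier's 20,000-token daily limit, this response leaves 19,900 tokens in the current 24-hour window, so a free user could make roughly 200 similar requests per day. Once the accumulated total exceeds the limit, further matching requests are rejected (typically with an HTTP 429) until the window resets.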
Getting Started
Check out our comprehensive TokenRateLimitPolicy tutorial.
What's Next
This is our first alpha release of `TokenRateLimitPolicy`. We're actively working on:
- Support for streaming responses (currently only non-streaming responses are supported)
- Custom token extraction paths to support more inference endpoints
- Token budget alerts and notifications
- Observability and metrics
We'd love your feedback! Try out the new policy and let us know your thoughts through our community channels.