Introducing TokenRateLimitPolicy for LLM API Protection

We're excited to announce TokenRateLimitPolicy, a new v1alpha1 API in Kuadrant designed specifically to protect Large Language Model (LLM) APIs with token-based rate limiting.

Rate limiting and LLMs

Traditional rate limiting (for example, our RateLimitPolicy) counts requests, but the cost of serving an LLM API call varies with the number of tokens consumed per inference request: a short prompt with a brief completion is far cheaper than a request that generates thousands of tokens. Platform engineers and SRE teams making models available for internal or external use, whether in-cluster or hosted elsewhere, will likely want to protect these expensive resources from over-consumption.

TokenRateLimitPolicy

The TokenRateLimitPolicy introduces rate limiting that:

- counts tokens consumed, rather than requests made, against each limit
- attaches to Gateway API resources, such as a Gateway, via a targetRef
- scopes each named limit with CEL predicates, so different paths and user tiers can get different limits
- tracks usage per user through counter expressions such as auth.identity.userid

How It Works

Here is a simple example that sets different daily token limits for two user tiers, free and gold:

apiVersion: kuadrant.io/v1alpha1
kind: TokenRateLimitPolicy
metadata:
  name: llm-token-limits
spec:
  targetRef:
    group: gateway.networking.k8s.io
    kind: Gateway
    name: api-gateway
  limits:
    free:
      rates:
        - limit: 20000
          window: 24h
      when:
        - predicate: request.path == "/v1/chat/completions"
        - predicate: >
            auth.identity.metadata.annotations["kuadrant.io/groups"].split(",").exists(g, g == "free")
      counters:
        - expression: auth.identity.userid
    gold:
      rates:
        - limit: 200000
          window: 24h
      when:
        - predicate: request.path == "/v1/chat/completions"
        - predicate: >
            auth.identity.metadata.annotations["kuadrant.io/groups"].split(",").exists(g, g == "gold")
      counters:
        - expression: auth.identity.userid
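
The when predicates above assume an authentication layer in front of this policy, such as a Kuadrant AuthPolicy using API key authentication, that resolves the caller's identity and exposes its group membership as a kuadrant.io/groups annotation. As a minimal, hypothetical sketch (the Secret name, label, and key value are illustrative, and we assume the AuthPolicy surfaces the user-id annotation as auth.identity.userid), an API key Secret for a free-tier user might look like this:

apiVersion: v1
kind: Secret
metadata:
  name: api-key-alice                 # illustrative name
  labels:
    authorino.kuadrant.io/managed-by: authorino  # lets Authorino discover the key
  annotations:
    kuadrant.io/groups: free          # matched by the free-tier predicate above
    secret.kuadrant.io/user-id: alice # assumed to be surfaced as auth.identity.userid
stringData:
  api_key: iamafreeuser               # illustrative key material
type: Opaque

With an identity like this in place, requests authenticated as alice match the free limit and are counted against her own 20,000-token daily budget.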

The policy automatically extracts the usage.total_tokens field from OpenAI-compatible LLM API responses. Currently, only non-streaming responses are supported (where stream is false or omitted in the request).

For example, a typical response from /v1/chat/completions:

{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "created": 1677652288,
  "model": "gpt-3.5-turbo-0613",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Hello! How can I assist you today?"
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 85,
    "total_tokens": 100
  }
}

The TokenRateLimitPolicy reads the usage.total_tokens value (100 in this example) and counts it against the user's configured limit.
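
To make that concrete: under the free limit above (20,000 tokens per 24h window), a user could make roughly 200 requests of this size (200 x 100 = 20,000 tokens) before exhausting the counter, at which point further requests in the same window are rejected with 429 Too Many Requests until the window resets.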

Getting Started

Check out our comprehensive TokenRateLimitPolicy tutorial.

What's Next

This is our first alpha release of TokenRateLimitPolicy, and the API may evolve as we gather feedback. We're actively working on expanding its capabilities, including support for streaming responses, which are not yet counted (see the limitation noted above).

We'd love your feedback! Try out the new policy and let us know your thoughts through our community channels.

Learn More