GenAI Cost Runaway? Observable Token Quota Control with Azure AI Gateway

Background

In this note, I’ll continue exploring one of the core capabilities of Azure AI Gateway: token-based rate limiting.

The hands-on setup is straightforward but very practical: I deploy a DeepSeek-R1 model on Microsoft Foundry, then integrate its online inference endpoint into Azure API Management (APIM). With APIM policies, we can enforce rate limits based on token consumption, so the “GenAI bill” doesn’t spiral out of control. Along the way, this is also a great way to revisit classic APIM concepts and apply them to modern GenAI workloads.

Why token-based rate limiting matters more for GenAI

For GenAI applications, rate limiting is not just about “getting an occasional 429 and retrying.” It directly affects:

User experience: users can feel the latency or failures immediately when limits kick in
Cost control: cost is strongly correlated with token usage; uncontrolled growth becomes expensive quickly
System stability: token throughput spikes amplify pressure on upstream/downstream components (model capacity, gateway, logging pipeline, cache, etc.)

Compared to traditional APIs — where we typically limit by request rate (RPS/RPM) and burst traffic — GenAI rate limiting is often more nuanced:

Platforms may limit both request count and token throughput
Many platforms track prompt tokens and completion tokens separately
Some platforms apply separate quotas for different modalities (text vs image vs audio)

That leads to an important reality: two requests are not equal in GenAI. The longer the context, the more content you stuff from RAG, the more complex your tool/function schema, and the longer the model’s answer, the faster tokens can explode. As a result:

You might be under the request limit but still get throttled because tokens-per-minute is exhausted, or the other way around.

Demo environment and services

To demonstrate token-based throttling, I set up a small lab environment with these Azure services:

Azure API Management (APIM): exposes a unified API endpoint and hosts the throttling policies
Azure Foundry: deploys DeepSeek-R1 and provides an online inference endpoint
Log Analytics: the storage layer for token-related logs/metrics
Application Insights: built on top of the Log Analytics workspace for querying and visualization

Architecture: the rate-limit design

The overall flow looks like this:

Azure Foundry: deploy DeepSeek-R1 and expose its online inference endpoint
APIM: create a backend that points to the DeepSeek-R1 inference endpoint; create an api that wraps the backend, and expose it to end users and AI agents; implement and enforce rate-limit policies at the gateway layer
Log Analytics: create a workspace to store token consumption data
Application Insights: create an App Insights instance backed by the workspace for querying and dashboards

Deploying the DeepSeek-R1 model

Here is the DeepSeek-R1 deployment in Foundry:

Integrating APIM

After DeepSeek-R1 is deployed to Foundry, it exposes an inference endpoint. In APIM, I wrap it as a backend named foundry1:

Then I create an API named Inference API:

And bind it to the backend with the following policy snippet:

<set-backend-service backend-id="foundry1" />

This lab environment has quite a few moving parts, so I won’t expand every detail here. If you want the full working project, you can check my GitHub repo: azure-ai-gateway-labs.

Next, let’s focus on the key part: the rate-limit policy.

Rate-limit policies

The policy configuration looks like this:

In APIM, the two policies most relevant to token rate limiting are:

llm-token-limit: enforce token-based throttling
llm-emit-token-metric: emit token usage metrics (for logs/observability)

llm-token-limit: key configuration points

Here are the settings worth paying attention to:

counter-key: the dimension used for token accounting. In this lab, I use subscription.Id. subscription is an APIM concept for managing API access; clients can be required to include a subscription key (also commonly used as a lightweight security gate). By extracting subscription info from the request, we can account and throttle per subscription. Alternatively, you can throttle by client IP, e.g. Request.IpAddress
tokens-per-minute: the core threshold, i.e. how many tokens are allowed per minute. you can also use token-quota + token-quota-period for more flexible quota periods (hour/day/month/year)
estimate-prompt-tokens: whether to estimate tokens (prediction) or rely on the model’s usage returned from responses. When enabled, throttling may kick in more aggressively as you approach the limit. When disabled, it tends to be more conservative and accounts for actual consumption
remaining-tokens-header-name, tokens-consumed-header-name: write remaining/consumed tokens into response headers for debugging (the demo makes this very obvious)

llm-emit-token-metric: observability and dimensions

llm-emit-token-metric collects token usage and sends it into an Application Insights custom namespace called llm.

Common metrics include:

prompt tokens: input token count
completion tokens: output token count
total tokens: total consumption (sum of the two)

In addition, I attach two custom dimensions (tags) to each metric event: Client IP and API ID. This enables deeper filtering and analysis, for example:

Demo

Now let’s run a quick demo. The script below calls the APIM-wrapped DeepSeek-R1 inference endpoint 10 times and prints key response info so we can observe throttling behavior.

Note: variables like apim_resource_gateway_url, inference_api_path, inference_api_version, models_config, and apim_subscriptions come from your environment configuration. I keep them as-is.

import json
import requests

def print_debug_headers(response):
    interesting_headers = [
        "x-apim-retry-after",
        "x-apim-remaining-tokens",
        "x-apim-tokens-consumed",
    ]
    debug_headers = {
        name: response.headers.get(name)
        for name in interesting_headers
        if response.headers.get(name) is not None
    }
    print("headers:", debug_headers if debug_headers else "<none>")

url = f"{apim_resource_gateway_url}/{inference_api_path}/models/chat/completions?api-version={inference_api_version}"
messages = {
    "messages": [
        {"role": "system", "content": "You are a sarcastic, unhelpful assistant."},
        {"role": "user", "content": "Can you tell me the time, please?"},
    ],
    "model": models_config[0]["name"],
}
api_runs = []
for i in range(10):
    response = requests.post(
        url, headers={"api-key": apim_subscriptions[0]["key"]}, json=messages
    )
    if response.status_code == 200:
        data = json.loads(response.text)
        input_tokens = data.get("usage").get("prompt_tokens")
        output_tokens = data.get("usage").get("completion_tokens")
        total_tokens = data.get("usage").get("total_tokens")
        print(
            "▶️ Run: ",
            i + 1,
            "status code: ",
            response.status_code,
            "✅",
            "input tokens: ",
            input_tokens,
            "output tokens: ",
            output_tokens,
            "total tokens: ",
            total_tokens,
        )
        print_debug_headers(response)
        # print("💬 ", data.get("choices")[0].get("message").get("content"))
    else:
        print("▶️ Run: ", i + 1, "status code: ", response.status_code, "⛔")
        print_debug_headers(response)
        print(response.text)
        total_tokens = 0
    api_runs.append((total_tokens, response.status_code))

The output below is the key observation. Since the limit is set to tokens-per-minute: 300:

Run #1 consumes 258 tokens, leaving 42 remaining
Run #2 still succeeds, but the remaining tokens drop to 0
Subsequent runs get throttled with HTTP 429, and x-apim-retry-after tells you how long to wait before retrying

▶️ Run: 1 status code: 200 ✅ input tokens: 25 output tokens: 233 total tokens: 258 headers: {'x-apim-remaining-tokens': '42', 'x-apim-tokens-consumed': '258'}
▶️ Run: 2 status code: 200 ✅ input tokens: 25 output tokens: 184 total tokens: 209 headers: {'x-apim-remaining-tokens': '0', 'x-apim-tokens-consumed': '209'}
▶️ Run: 3 status code: 429 ⛔ headers: {'x-apim-retry-after': '28', 'x-apim-remaining-tokens': '0'} { "statusCode": 429, "message": "Token limit is exceeded. Try again in 28 seconds." }
▶️ Run: 4 status code: 429 ⛔ headers: {'x-apim-retry-after': '27', 'x-apim-remaining-tokens': '0'} { "statusCode": 429, "message": "Token limit is exceeded. Try again in 27 seconds." }
▶️ Run: 5 status code: 429 ⛔ headers: {'x-apim-retry-after': '25', 'x-apim-remaining-tokens': '0'} { "statusCode": 429, "message": "Token limit is exceeded. Try again in 25 seconds." }
▶️ Run: 6 status code: 429 ⛔ headers: {'x-apim-retry-after': '24', 'x-apim-remaining-tokens': '0'} { "statusCode": 429, "message": "Token limit is exceeded. Try again in 24 seconds." }
▶️ Run: 7 status code: 429 ⛔ headers: {'x-apim-retry-after': '22', 'x-apim-remaining-tokens': '0'} { "statusCode": 429, "message": "Token limit is exceeded. Try again in 22 seconds." }
▶️ Run: 8 status code: 429 ⛔ headers: {'x-apim-retry-after': '21', 'x-apim-remaining-tokens': '0'} { "statusCode": 429, "message": "Token limit is exceeded. Try again in 21 seconds." }
▶️ Run: 9 status code: 429 ⛔ headers: {'x-apim-retry-after': '19', 'x-apim-remaining-tokens': '0'} { "statusCode": 429, "message": "Token limit is exceeded. Try again in 19 seconds." }
▶️ Run: 10 status code: 429 ⛔ headers: {'x-apim-retry-after': '18', 'x-apim-remaining-tokens': '0'} { "statusCode": 429, "message": "Token limit is exceeded. Try again in 18 seconds." }

We can also validate this from logs:

Summary

In GenAI systems, token-based throttling is often more aligned with real resource consumption than request-count throttling. By using APIM’s llm-token-limit for quota enforcement and pairing it with llm-emit-token-metric to push token metrics into Application Insights / Log Analytics, we can manage cost, user experience, and stability under a single observable control plane.

In coming notes, I’d like to dig further into practical topics such as: designing quota tiers per subscription, handling burst traffic more gracefully, and building dashboards/alerts based on token usage. Great, right?

I’m Chris Bao. I focus on the Azure AI platform as a Microsoft-certified trainer, and I work a lot on Azure AI services and agent development.
If you’d like to collaborate or need training/consulting, feel free to reach out: [email protected].

GenAI Cost Runaway? Observable Token Quota Control with Azure AI Gateway was originally published in Level Up Coding on Medium, where people are continuing the conversation by highlighting and responding to this story.

GenAI Cost Runaway? Observable Token Quota Control with Azure AI Gateway

Background

Why token-based rate limiting matters more for GenAI

Demo environment and services

Architecture: the rate-limit design

Deploying the DeepSeek-R1 model

Integrating APIM

Rate-limit policies

llm-token-limit: key configuration points

llm-emit-token-metric: observability and dimensions

Demo

Summary

Looking for a crypto payment gateway?

NexaPay — Accept Card Payments, Receive Crypto

GenAI Cost Runaway? Observable Token Quota Control with Azure AI Gateway

Background

Why token-based rate limiting matters more for GenAI

Demo environment and services

Architecture: the rate-limit design

Deploying the DeepSeek-R1 model

Integrating APIM

Rate-limit policies

llm-token-limit: key configuration points

llm-emit-token-metric: observability and dimensions

Demo

Summary

Looking for a crypto payment gateway?

NexaPay — Accept Card Payments, Receive Crypto

Related Articles

Elk Grove Village property seeks $850M in junk bonds for CoreWeave data center

Structure your Python AI Agent Apps like this

Generating Random Numbers in Go

Senator Bernie Sanders proposes 50% equity tax on AI labs to fund American AI Sovereign Wealth Fund

Nvidia enters the personal computer market with a new AI chip, challenging Intel and AMD

Coinbase enables direct rupee bank transfers in India via IMPS after FIU registration