GenAI Cost Runaway? Observable Token Quota Control with Azure AI Gateway

By Chris Bao · Published June 1, 2026 · 7 min read · Source: Level Up Coding

Background

In this note, I’ll continue exploring one of the core capabilities of Azure AI Gateway: token-based rate limiting.

The hands-on setup is straightforward but very practical: I deploy a DeepSeek-R1 model on Microsoft Foundry, then integrate its online inference endpoint into Azure API Management (APIM). With APIM policies, we can enforce rate limits based on token consumption, so the “GenAI bill” doesn’t spiral out of control. Along the way, this is also a great way to revisit classic APIM concepts and apply them to modern GenAI workloads.

Why token-based rate limiting matters more for GenAI

For GenAI applications, rate limiting is not just about “getting an occasional 429 and retrying.” It directly affects:

Compared to traditional APIs — where we typically limit by request rate (RPS/RPM) and burst traffic — GenAI rate limiting is often more nuanced:

That leads to an important reality: two requests are not equal in GenAI. The longer the context, the more content you stuff from RAG, the more complex your tool/function schema, and the longer the model’s answer, the faster tokens can explode. As a result:

Demo environment and services

To demonstrate token-based throttling, I set up a small lab environment with these Azure services:

Architecture: the rate-limit design

The overall flow looks like this:

Deploying the DeepSeek-R1 model

Here is the DeepSeek-R1 deployment in Foundry:

Integrating APIM

After DeepSeek-R1 is deployed to Foundry, it exposes an inference endpoint. In APIM, I wrap it as a backend named foundry1:

Then I create an API named Inference API:

And bind it to the backend with the following policy snippet:

<set-backend-service backend-id="foundry1" />

This lab environment has quite a few moving parts, so I won’t expand every detail here. If you want the full working project, you can check my GitHub repo: azure-ai-gateway-labs.

Next, let’s focus on the key part: the rate-limit policy.

Rate-limit policies

The policy configuration looks like this:

In APIM, the two policies most relevant to token rate limiting are:

llm-token-limit: key configuration points

Here are the settings worth paying attention to:

llm-emit-token-metric: observability and dimensions

llm-emit-token-metric collects token usage and sends it into an Application Insights custom namespace called llm.

Common metrics include:

In addition, I attach two custom dimensions (tags) to each metric event: Client IP and API ID. This enables deeper filtering and analysis, for example:
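As one illustration of that kind of slicing, here is a minimal sketch that groups token usage by a chosen dimension. It runs on in-memory sample records rather than against Application Insights. The record shape, the metric name "Total Tokens", the IP addresses, and the API ID value are all assumptions made for the example, not the exported schema.

```python
from collections import defaultdict


def tokens_by_dimension(events, dimension, metric_name="Total Tokens"):
    """Sum metric values per distinct value of one custom dimension."""
    totals = defaultdict(int)
    for event in events:
        if event["name"] != metric_name:
            continue  # skip other metrics, e.g. prompt-only counts
        key = event["dimensions"].get(dimension, "<unknown>")
        totals[key] += event["value"]
    return dict(totals)


# Hypothetical metric events; names and values are made up for illustration.
sample_events = [
    {"name": "Total Tokens", "value": 258,
     "dimensions": {"Client IP": "203.0.113.10", "API ID": "inference-api"}},
    {"name": "Total Tokens", "value": 209,
     "dimensions": {"Client IP": "203.0.113.10", "API ID": "inference-api"}},
    {"name": "Total Tokens", "value": 120,
     "dimensions": {"Client IP": "198.51.100.7", "API ID": "inference-api"}},
    {"name": "Prompt Tokens", "value": 25,
     "dimensions": {"Client IP": "198.51.100.7", "API ID": "inference-api"}},
]

for dimension in ("Client IP", "API ID"):
    print(dimension, tokens_by_dimension(sample_events, dimension))
```

The same grouping could be expressed as a query over the real telemetry; the point here is only that each extra dimension gives one more axis to break token spend down by.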

Demo

Now let’s run a quick demo. The script below calls the APIM-wrapped DeepSeek-R1 inference endpoint 10 times and prints key response info so we can observe throttling behavior.

Note: variables like apim_resource_gateway_url, inference_api_path, inference_api_version, models_config, and apim_subscriptions come from your environment configuration. I keep them as-is.

import requests


def print_debug_headers(response):
    """Print whichever APIM throttling headers are present on the response."""
    interesting_headers = [
        "x-apim-retry-after",
        "x-apim-remaining-tokens",
        "x-apim-tokens-consumed",
    ]
    debug_headers = {
        name: response.headers.get(name)
        for name in interesting_headers
        if response.headers.get(name) is not None
    }
    print("headers:", debug_headers if debug_headers else "<none>")


url = f"{apim_resource_gateway_url}/{inference_api_path}/models/chat/completions?api-version={inference_api_version}"
messages = {
    "messages": [
        {"role": "system", "content": "You are a sarcastic, unhelpful assistant."},
        {"role": "user", "content": "Can you tell me the time, please?"},
    ],
    "model": models_config[0]["name"],
}
api_runs = []  # one (total_tokens, status_code) tuple per call
for i in range(10):
    response = requests.post(
        url, headers={"api-key": apim_subscriptions[0]["key"]}, json=messages
    )
    if response.status_code == 200:
        data = response.json()
        usage = data.get("usage") or {}  # tolerate a response without usage info
        input_tokens = usage.get("prompt_tokens")
        output_tokens = usage.get("completion_tokens")
        total_tokens = usage.get("total_tokens")
        print(
            "▶️ Run: ",
            i + 1,
            "status code: ",
            response.status_code,
            "✅",
            "input tokens: ",
            input_tokens,
            "output tokens: ",
            output_tokens,
            "total tokens: ",
            total_tokens,
        )
        print_debug_headers(response)
        # print("💬 ", data.get("choices")[0].get("message").get("content"))
    else:
        # Throttled or failed call: show the status, the headers, and the error body.
        print("▶️ Run: ", i + 1, "status code: ", response.status_code, "⛔")
        print_debug_headers(response)
        print(response.text)
        total_tokens = 0
    api_runs.append((total_tokens, response.status_code))
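
The loop above fires its calls back to back and only prints what comes back, which is what we want for a demo. A real client would more likely wait before retrying. Below is a minimal sketch of that idea, assuming the gateway returns the wait time in seconds in the x-apim-retry-after header that the script prints. The helper names, the default delay, and the attempt cap are my own choices for illustration.

```python
import time


def retry_delay_seconds(headers, header_name="x-apim-retry-after", default=1.0):
    """Return the wait time from the throttling header, or a default if absent."""
    value = headers.get(header_name)
    if value is None:
        return default
    try:
        return max(0.0, float(value))
    except ValueError:
        return default  # header present but not a plain number of seconds


def call_with_retry(send, max_attempts=3, sleep=time.sleep):
    """Call send() until it returns a non-429 response or attempts run out.

    send is any zero-argument callable that returns an object with
    .status_code and .headers, for example:
        lambda: requests.post(url, headers=..., json=messages)
    """
    response = send()
    for _ in range(max_attempts - 1):
        if response.status_code != 429:
            break
        sleep(retry_delay_seconds(response.headers))
        response = send()
    return response
```

Passing sleep as a parameter keeps the helper easy to test without real waiting. In production you would probably also add jitter and an upper bound on the total wait.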

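The output below is the key observation. With the limit set to tokens-per-minute: 300, the first two calls succeed and use up the budget, and the remaining eight are rejected with 429 until the window resets: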

▶️ Run: 1 status code: 200 ✅ input tokens: 25 output tokens: 233 total tokens: 258
headers: {'x-apim-remaining-tokens': '42', 'x-apim-tokens-consumed': '258'}
▶️ Run: 2 status code: 200 ✅ input tokens: 25 output tokens: 184 total tokens: 209
headers: {'x-apim-remaining-tokens': '0', 'x-apim-tokens-consumed': '209'}
▶️ Run: 3 status code: 429 ⛔
headers: {'x-apim-retry-after': '28', 'x-apim-remaining-tokens': '0'}
{ "statusCode": 429, "message": "Token limit is exceeded. Try again in 28 seconds." }
▶️ Run: 4 status code: 429 ⛔
headers: {'x-apim-retry-after': '27', 'x-apim-remaining-tokens': '0'}
{ "statusCode": 429, "message": "Token limit is exceeded. Try again in 27 seconds." }
▶️ Run: 5 status code: 429 ⛔
headers: {'x-apim-retry-after': '25', 'x-apim-remaining-tokens': '0'}
{ "statusCode": 429, "message": "Token limit is exceeded. Try again in 25 seconds." }
▶️ Run: 6 status code: 429 ⛔
headers: {'x-apim-retry-after': '24', 'x-apim-remaining-tokens': '0'}
{ "statusCode": 429, "message": "Token limit is exceeded. Try again in 24 seconds." }
▶️ Run: 7 status code: 429 ⛔
headers: {'x-apim-retry-after': '22', 'x-apim-remaining-tokens': '0'}
{ "statusCode": 429, "message": "Token limit is exceeded. Try again in 22 seconds." }
▶️ Run: 8 status code: 429 ⛔
headers: {'x-apim-retry-after': '21', 'x-apim-remaining-tokens': '0'}
{ "statusCode": 429, "message": "Token limit is exceeded. Try again in 21 seconds." }
▶️ Run: 9 status code: 429 ⛔
headers: {'x-apim-retry-after': '19', 'x-apim-remaining-tokens': '0'}
{ "statusCode": 429, "message": "Token limit is exceeded. Try again in 19 seconds." }
▶️ Run: 10 status code: 429 ⛔
headers: {'x-apim-retry-after': '18', 'x-apim-remaining-tokens': '0'}
{ "statusCode": 429, "message": "Token limit is exceeded. Try again in 18 seconds." }
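
One detail in this output is worth a closer look: Run 2 was admitted with only 42 tokens left, yet it consumed 209. A plausible reading is that the size of a completion is not known when the request arrives, so usage gets charged after the response and the budget can be overshot once before the gate closes. The toy model below replays that behavior. It is a simplified fixed-window counter written for illustration, not APIM's actual implementation, and the timestamps and the 150-token requests are invented.

```python
class TokenWindow:
    """Toy fixed-window token counter (illustrative only, not APIM's algorithm)."""

    def __init__(self, tokens_per_minute, window_seconds=60):
        self.limit = tokens_per_minute
        self.window_seconds = window_seconds
        self.window_start = 0.0
        self.consumed = 0

    def _roll(self, now):
        # Start a fresh window once the current one has elapsed.
        if now - self.window_start >= self.window_seconds:
            self.window_start = now
            self.consumed = 0

    def remaining(self, now):
        self._roll(now)
        return max(0, self.limit - self.consumed)

    def try_admit(self, now):
        # Admit while any budget is left; the real cost is not known yet.
        return self.remaining(now) > 0

    def record_usage(self, now, total_tokens):
        # Charge the actual usage once the model has responded.
        self._roll(now)
        self.consumed += total_tokens


def replay(window, calls):
    """calls is a list of (timestamp_seconds, total_tokens) pairs."""
    results = []
    for now, total_tokens in calls:
        if window.try_admit(now):
            window.record_usage(now, total_tokens)
            results.append((200, window.remaining(now)))
        else:
            results.append((429, window.remaining(now)))
    return results


# First two entries use the token counts from the demo; the rest are hypothetical.
calls = [(0, 258), (30, 209), (32, 150), (61, 150)]
for status, remaining in replay(TokenWindow(tokens_per_minute=300), calls):
    print(status, "remaining:", remaining)
```

In this model the first two calls pass with 42 and then 0 tokens remaining, the third is rejected, and the fourth passes again after the window rolls over, which lines up with the pattern in the demo output.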

We can also validate this from logs:

Summary

In GenAI systems, token-based throttling is often more aligned with real resource consumption than request-count throttling. By using APIM’s llm-token-limit for quota enforcement and pairing it with llm-emit-token-metric to push token metrics into Application Insights / Log Analytics, we can manage cost, user experience, and stability under a single observable control plane.

In coming notes, I’d like to dig further into practical topics such as: designing quota tiers per subscription, handling burst traffic more gracefully, and building dashboards/alerts based on token usage. Great, right?

I’m Chris Bao. I focus on the Azure AI platform as a Microsoft-certified trainer, and I work a lot on Azure AI services and agent development.
If you’d like to collaborate or need training/consulting, feel free to reach out: [email protected].



This article was originally published on Level Up Coding and is republished here under RSS syndication for informational purposes. All rights and intellectual property remain with the original author. If you are the author and wish to have this article removed, please contact us at [email protected].
