Priority-with-retry: Advanced APIM Policy for Azure OpenAI

The Priority-with-retry policy is the essential routing engine when using SimpleL7Proxy with Azure API Management. It ensures that priority signals, affinity tokens, and throttling backpressure are correctly communicated between the proxy and backend models.

Implementation Note: SimpleL7Proxy should always be paired with this policy (or one derived from it). Using generic routing policies will result in a loss of priority queuing capabilities and observability.

Beyond proxy integration, this policy provides enterprise-grade routing, throttling, and resiliency for any Azure OpenAI workload. Furthermore, since most other LLM providers mimic OpenAI's API structure and headers (for streaming, rate limits, etc.), this policy will likely work with other model backends with minimal adaptation.

When to Use This Policy

Use this policy when you need:

Tiered Service Levels: Guarantee capacity for critical apps while throttling background tasks.
Cost Optimization: Maximize PTU utilization and restrict PayGo usage to specific priorities.
High Availability: Automatically fail over between regions and backend types.
Throttling Management: Provide specific signals to clients (like SimpleL7Proxy) to queue requests during high load instead of failing immediately.

Key Capabilities

Smart Priority Routing: Routes requests based on priority (High/Medium/Low), backend health, and deployment type (PTU vs. PayGo).
Concurrency Control: Enforces per-backend concurrency limits (LimitConcurrency) to prevent overloading.
Resiliency: Intelligent circuit breaking and retry logic that avoids throttled backends.
Cost Efficiency: Prioritizes pre-paid PTU capacity before spilling over to PayGo endpoints.
Backend Affinity: Supports sticky routing via affinity headers to maximize cache hits on OpenAI backends.
Streaming Support: Fully compatible with streaming responses.
Observability: Detailed execution logs via the backendLog response header.

Control Flow

Configuration Guide

1. Define Backends

Locate the listBackends variable initialization in the <inbound> region. Add your Azure OpenAI endpoints:

<set-variable name="listBackends" value="@{
    JArray backends = new JArray();
    backends.Add(new JObject()
    {
        { "url", "https://your-ptu-endpoint.openai.azure.com/" },
        { "priority", 1 },               // 1=High, 2=Medium, 3=Low
        { "ModelType", "PTU" },          // Informational label for logging
        { "acceptablePriorities", new JArray(1, 2, 3) }, // Priorities this backend can handle
        { "LimitConcurrency", "high" },  // high (100), medium (50), low (10), or off
        { "BufferResponse", false },     // Set to false for Streaming; true to buffer full response
        { "Timeout", 120 },              // Backend timeout in seconds
        { "api-key", "your-api-key" }    // Leave blank if using Managed Identity
    });
    // Add more backends...
    return backends;
}" />
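To make the backend fields concrete, here is a minimal Python sketch of how a list like this could be filtered and ordered for an incoming request. It is not the policy's actual code (that lives in C# expressions inside the policy XML). The function name `candidate_backends`, the second (PayGo) entry with its placeholder URL, and the rule "lower `priority` value is tried first" are all assumptions made for illustration.

```python
from typing import Any

# Hypothetical mirror of two listBackends entries; both URLs are placeholders.
BACKENDS: list[dict[str, Any]] = [
    {
        "url": "https://your-ptu-endpoint.openai.azure.com/",
        "priority": 1,
        "ModelType": "PTU",
        "acceptablePriorities": [1, 2, 3],
    },
    {
        "url": "https://your-paygo-endpoint.openai.azure.com/",
        "priority": 2,
        "ModelType": "PayGo",
        "acceptablePriorities": [1],  # PayGo reserved for high-priority traffic
    },
]


def candidate_backends(
    backends: list[dict[str, Any]], request_priority: int
) -> list[dict[str, Any]]:
    """Return the backends allowed to serve this priority, preferred first.

    Assumption: a backend is eligible when the request priority appears in
    its acceptablePriorities, and lower backend priority values are tried first.
    """
    eligible = [b for b in backends if request_priority in b["acceptablePriorities"]]
    return sorted(eligible, key=lambda b: b["priority"])
```

Under these assumptions, a priority 1 request may use PTU first and then spill over to PayGo, while a priority 3 request is confined to the PTU backend.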

2. Configure Priority Rules

Adjust retry behavior per priority level using the priorityCfg variable:

<set-variable name="priorityCfg" value="@{
    JObject cfg = new JObject();
    // Priority 1: Aggressive retries
    cfg["1"] = new JObject { { "retryCount", 5 }, { "requeue", true } }; 
    // Priority 3: Fail fast
    cfg["3"] = new JObject { { "retryCount", 1 }, { "requeue", false } };
    return cfg;
}" />
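The effect of these two settings can be sketched in Python as follows. This is an illustrative model only: it assumes `retryCount` is the maximum number of attempts and that `requeue` decides whether a requeue signal is raised once those attempts are exhausted. The function name `run_with_retries` and the `FALLBACK` rule for priorities without an entry are hypothetical; the policy's real default is not described here.

```python
from typing import Callable

# Mirror of the priorityCfg example above (keys are strings, as in the JObject).
PRIORITY_CFG = {
    "1": {"retryCount": 5, "requeue": True},
    "3": {"retryCount": 1, "requeue": False},
}

# Hypothetical rule for priorities with no entry; an assumption, not policy behavior.
FALLBACK = {"retryCount": 1, "requeue": False}


def run_with_retries(priority: int, attempt: Callable[[int], bool]) -> dict:
    """Call attempt(n) up to retryCount times; report whether to requeue on failure."""
    rules = PRIORITY_CFG.get(str(priority), FALLBACK)
    for n in range(1, rules["retryCount"] + 1):
        if attempt(n):
            return {"ok": True, "attempts": n, "requeue": False}
    # All attempts failed: surface the configured requeue preference.
    return {"ok": False, "attempts": rules["retryCount"], "requeue": rules["requeue"]}
```

With this model, a priority 1 request that keeps failing is retried five times and then flagged for requeueing, while a priority 3 request fails after a single attempt.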

3. Configure Headers

You can customize the header names used for control logic by modifying the variables at the top of the <inbound> block:

<set-variable name="priorityHeaderName" value="llm_proxy_priority" /> <!-- Header for priority (1, 2, 3) -->
<set-variable name="PolicyCycleCounterHeaderName" value="x-PolicyCycleCounter" /> <!-- Tracks retry attempts count -->
<set-variable name="AffinityHeaderName" value="x-backend-affinity" /> <!-- Sticky session support -->

Standalone Usage & Client Headers

While this policy is optimized for SimpleL7Proxy, it also works as a standalone routing engine. To use features like sticky sessions and priority handling without the proxy, your client application should manage the following headers:

Input Headers (Client -> APIM)

| Header | Default name | Purpose |
| --- | --- | --- |
| Priority | `llm_proxy_priority` | Set to `1` (High), `2` (Medium), or `3` (Low) to determine routing tier. Defaults to `3` if missing. |
| Affinity | `x-backend-affinity` | Send the hash received from a previous response to route to the same backend node (Session Stickiness/Cache Optimization). |
| Cycle Counter | `x-PolicyCycleCounter` | (Optional) If implementing a client-side retry loop, pass the last received value to maintain a cumulative attempt count for diagnostics. |
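A standalone client could assemble these input headers with a small helper like the one below. This is a sketch using the default header names from the table; the function name `build_request_headers` and its parameters are hypothetical, and the header names must be changed if you customized them in the policy.

```python
from typing import Optional


def build_request_headers(
    priority: int = 3,
    affinity: Optional[str] = None,
    cycle_counter: Optional[int] = None,
) -> dict[str, str]:
    """Build the routing headers a client sends to APIM (default header names)."""
    if priority not in (1, 2, 3):
        raise ValueError("priority must be 1 (High), 2 (Medium), or 3 (Low)")
    headers = {"llm_proxy_priority": str(priority)}
    if affinity:
        # Hash returned by a previous response in the same session.
        headers["x-backend-affinity"] = affinity
    if cycle_counter is not None:
        # Last counter value received, so attempts accumulate across client retries.
        headers["x-PolicyCycleCounter"] = str(cycle_counter)
    return headers
```

Merge the returned dictionary into the headers of whatever HTTP client you use, alongside your usual authentication and content-type headers.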

Output Headers (APIM -> Client)

| Header | Default name | Purpose |
| --- | --- | --- |
| Affinity | `x-backend-affinity` | Returns the hash of the backend used. The client should store this for subsequent requests in the same session. |
| Requeue Signal | `S7PREQUEUE` | If `true` on a `429` response, it indicates a soft throttle (capacity full). SimpleL7Proxy queues these; standalone clients should sleep for `retry-after-ms`. |
| Retry Delay | `retry-after-ms` | Recommended backoff time, in milliseconds, before the next attempt. |
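The output headers can be interpreted on the client side roughly as follows. This is a minimal sketch, not part of the policy: the function name `next_action`, the action labels, and the 1000 ms fallback delay are assumptions, and treating a `429` without `S7PREQUEUE: true` as a failure to surface to the caller is a design choice made here for illustration.

```python
from typing import Mapping, Optional, Tuple


def next_action(
    status: int,
    headers: Mapping[str, str],
    default_delay_ms: int = 1000,  # hypothetical fallback when retry-after-ms is absent
) -> Tuple[str, float, Optional[str]]:
    """Return (action, sleep_seconds, affinity_hash) for an APIM response."""
    h = {k.lower(): v for k, v in headers.items()}  # header names are case-insensitive
    affinity = h.get("x-backend-affinity")  # store for later requests in this session
    if status != 429:
        return ("return_response", 0.0, affinity)
    soft_throttle = h.get("s7prequeue", "").strip().lower() == "true"
    if not soft_throttle:
        # Assumption: without the requeue signal, surface the 429 to the caller.
        return ("fail", 0.0, affinity)
    try:
        delay_ms = int(h.get("retry-after-ms", default_delay_ms))
    except ValueError:
        delay_ms = default_delay_ms
    return ("sleep_and_retry", delay_ms / 1000.0, affinity)
```

A caller would sleep for the returned number of seconds when the action is `sleep_and_retry`, then resend the request with the stored affinity hash.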

Example Scenarios

The Priority-with-retry policy can be applied to various business scenarios with different optimization goals:

Financial Services Scenario - Prioritizing performance for critical trading operations while managing costs for lower-priority workloads.
Cost Optimization Scenario - Focusing on minimizing Azure OpenAI costs while maintaining acceptable performance for all workloads.
High Availability Scenario - Ensuring maximum service availability across multiple regions and deployment types.

Each scenario demonstrates how to configure the policy to meet different business requirements.

FAQ

How do I install this? Paste the contents of Priority-with-retry.xml into your API policy editor in the Azure Portal.

How does authentication work? The policy supports both API Keys (defined in listBackends) and Azure Managed Identity (recommended).

How do I debug? Set the header S7PDEBUG: true in your request. Inspect the backendLog header in the response for execution traces.
