NVIDIA Dynamo Planner: LLM Inference Optimization on Azure Kubernetes Service

In January 2026, Microsoft and NVIDIA released the second iteration of the NVIDIA Dynamo Planner, a tool for optimizing large language model (LLM) inference on Azure Kubernetes Service (AKS). The collaboration addresses one of the hardest aspects of production AI: scaling GPU resources efficiently while balancing cost, latency, and throughput. This guide covers Dynamo Planner's architecture, deployment patterns, and configuration strategies for enterprise LLM workloads.

The LLM Inference Challenge

Running LLMs in production presents unique operational challenges that traditional auto-scaling cannot address:

  • GPU memory constraints: Models like LLaMA 70B require 140GB+ of GPU memory at FP16 precision
  • Variable request latency: Token generation time varies with sequence length
  • Batching complexity: Optimal batch size depends on request mix and model size
  • Cold start overhead: Model loading takes 30-60 seconds per GPU
  • Cost pressure: A100/H100 GPUs cost $2-10/hour each

Dynamo Planner solves these challenges with AI-driven resource planning that understands LLM-specific workload patterns.

Dynamo Planner Architecture

The planner sits between incoming inference traffic, Azure Monitor metrics, and the GPU node pools in the cluster, as the following diagram shows:

graph TB
    subgraph AKS ["Azure Kubernetes Service"]
        subgraph Dynamo ["Dynamo Planner"]
            Predictor["Workload Predictor"]
            Optimizer["Resource Optimizer"]
            Scheduler["GPU Scheduler"]
        end
        
        subgraph InferencePool ["Inference Pool"]
            GPU1["Node: 4x A100"]
            GPU2["Node: 4x A100"]
            GPU3["Node: 8x H100"]
        end
        
        subgraph Models ["Model Deployments"]
            LLM1["LLaMA 70B"]
            LLM2["Mistral 8x7B"]
            Embed["Embedding Model"]
        end
    end
    
    subgraph External ["External"]
        Metrics["Azure Monitor"]
        Traffic["Inference Requests"]
    end
    
    Traffic --> Scheduler
    Metrics --> Predictor
    Predictor --> Optimizer
    Optimizer --> Scheduler
    Scheduler --> GPU1
    Scheduler --> GPU2
    Scheduler --> GPU3
    GPU1 --> LLM1
    GPU2 --> LLM2
    GPU3 --> Embed
    
    style Dynamo fill:#E8F5E9,stroke:#2E7D32
    style InferencePool fill:#E3F2FD,stroke:#1565C0

Core Components

Component             | Function                           | Key Features
Workload Predictor    | Forecasts inference demand         | Time-series ML, pattern recognition
Resource Optimizer    | Calculates optimal GPU allocation  | Cost-aware, SLO-driven
GPU Scheduler         | Places workloads on nodes          | Tensor parallelism aware, memory packing
Autoscaler Controller | Manages node pool scaling          | Predictive scale-up, graceful drain

Deploying Dynamo Planner on AKS

Prerequisites

# Create an AKS cluster with GPU node pool
az aks create \
  --resource-group rg-llm-production \
  --name aks-llm-cluster \
  --node-count 2 \
  --node-vm-size Standard_D4s_v3 \
  --generate-ssh-keys

# Add GPU node pool with A100 GPUs
az aks nodepool add \
  --resource-group rg-llm-production \
  --cluster-name aks-llm-cluster \
  --name gpupool \
  --node-count 0 \
  --node-vm-size Standard_NC96ads_A100_v4 \
  --enable-cluster-autoscaler \
  --min-count 0 \
  --max-count 10 \
  --node-taints "nvidia.com/gpu=true:NoSchedule"

# Install NVIDIA GPU Operator
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace
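
Before installing the planner, it is worth confirming that the GPU node pool and the GPU Operator are healthy. The commands below are a quick sanity check; pod names vary with the operator version, and agentpool is the standard AKS node-pool label:

# Confirm the GPU node pool exists and autoscaling is configured
az aks nodepool show \
  --resource-group rg-llm-production \
  --cluster-name aks-llm-cluster \
  --name gpupool \
  --query "{count:count, autoscale:enableAutoScaling, min:minCount, max:maxCount}"

# Verify the GPU Operator pods are running
kubectl get pods -n gpu-operator

# Once a GPU node has been provisioned, confirm GPUs are advertised to Kubernetes
kubectl describe nodes -l agentpool=gpupool | grep -i "nvidia.com/gpu"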

Installing Dynamo Planner

# Add the NVIDIA Dynamo Helm repository
helm repo add nvidia-dynamo https://helm.ngc.nvidia.com/nvidia/dynamo

# Install Dynamo Planner with Azure integration
helm install dynamo-planner nvidia-dynamo/dynamo-planner \
  --namespace dynamo-system \
  --create-namespace \
  --set cloud.provider=azure \
  --set cloud.azure.subscriptionId=$AZURE_SUBSCRIPTION_ID \
  --set cloud.azure.resourceGroup=rg-llm-production \
  --set cloud.azure.aksCluster=aks-llm-cluster \
  --set metrics.azureMonitor.enabled=true \
  --set metrics.azureMonitor.workspaceId=$LOG_ANALYTICS_WORKSPACE_ID \
  --set optimizer.costOptimization.enabled=true \
  --set optimizer.costOptimization.maxHourlyCost=500
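
Once the Helm release is installed, check that the planner pods are running and that its custom resource definitions were registered. The grep below assumes the CRDs live under the dynamo.nvidia.com API group used in the manifests later in this post:

# Check the Helm release and planner pods
helm status dynamo-planner -n dynamo-system
kubectl get pods -n dynamo-system

# Confirm the planner CRDs were registered
kubectl get crds | grep dynamo.nvidia.com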

Configuring Model Deployments

Dynamo Planner uses custom resources to define LLM deployments with inference-specific requirements:

apiVersion: dynamo.nvidia.com/v1
kind: InferenceDeployment
metadata:
  name: llama-70b-chat
  namespace: llm-inference
spec:
  model:
    name: meta-llama/Llama-3-70B-Instruct
    source: huggingface
    quantization: awq-int4  # 4-bit quantization for memory efficiency
    
  serving:
    engine: vllm  # or tensorrt-llm, triton
    maxConcurrentRequests: 256
    maxSequenceLength: 8192
    
  resources:
    gpu:
      type: nvidia-a100-80gb
      count: 4  # Tensor parallel across 4 GPUs
      memoryFraction: 0.9
    
  scaling:
    minReplicas: 1
    maxReplicas: 8
    targetLatencyP99: 2000ms
    targetThroughput: 100  # tokens/second per replica
    
  slo:
    availability: 99.9
    latencyP50: 500ms
    latencyP99: 2000ms
    
  cost:
    maxHourlyCost: 100
    preferSpotInstances: true
    spotFallbackToOnDemand: true
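
Assuming the manifest above is saved as llama-70b-chat.yaml, it can be applied and inspected like any other custom resource. The lowercase plural name inferencedeployments is an assumption about how the CRD is registered; kubectl api-resources will show the real name:

# Create the namespace and apply the deployment
kubectl create namespace llm-inference
kubectl apply -f llama-70b-chat.yaml

# Inspect the custom resource and the pods the planner creates for it
kubectl api-resources --api-group=dynamo.nvidia.com
kubectl get inferencedeployments -n llm-inference
kubectl describe inferencedeployment llama-70b-chat -n llm-inference
kubectl get pods -n llm-inference -o wide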

Multi-Model Deployment

apiVersion: dynamo.nvidia.com/v1
kind: InferencePool
metadata:
  name: production-llm-pool
spec:
  deployments:
  - name: llama-70b-chat
    weight: 60  # 60% of traffic
    priority: high
    
  - name: mistral-8x7b-instruct
    weight: 30  # 30% of traffic
    priority: medium
    
  - name: embedding-model
    weight: 10  # 10% of traffic
    priority: low
    canShareGPU: true  # Allow co-location with other models
    
  routing:
    strategy: latency-aware  # or round-robin, cost-optimized
    stickySession: false
    
  sharedResources:
    nodePool: gpupool
    maxNodes: 10
    enablePacking: true  # Pack small models on same GPU
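
A simple way to see whether GPU packing is taking effect is to check which nodes the model pods share and how much GPU each node has allocated. These are generic Kubernetes checks, not Dynamo-specific commands:

# Show which node each model pod landed on (co-located pods share a NODE value)
kubectl get pods -n llm-inference -o wide

# Review per-node GPU allocation on the GPU pool
kubectl describe nodes -l agentpool=gpupool | grep -E "Name:|nvidia.com/gpu"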

Advanced Optimization Features

Predictive Scaling

Dynamo Planner uses ML to predict traffic 15 minutes ahead, pre-warming GPU nodes before demand spikes:

apiVersion: dynamo.nvidia.com/v1
kind: ScalingPolicy
metadata:
  name: predictive-scaling
spec:
  targetRef:
    kind: InferenceDeployment
    name: llama-70b-chat
    
  predictive:
    enabled: true
    lookAheadMinutes: 15
    confidenceThreshold: 0.8
    historicalDataDays: 30
    
  schedules:
    # Pre-warm for known traffic patterns
    - name: business-hours
      cron: "0 8 * * 1-5"  # 8 AM weekdays
      minReplicas: 4
      
    - name: weekend-reduction
      cron: "0 0 * * 0,6"  # Midnight Saturday/Sunday
      maxReplicas: 2
      
  reactive:
    scaleUpThreshold: 70  # CPU/GPU utilization %
    scaleDownThreshold: 30
    scaleUpCooldown: 60s
    scaleDownCooldown: 300s
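
To watch the policy in action, follow the replica count on the target deployment around the scheduled windows and during traffic spikes. As before, the custom resource names are assumptions derived from the manifests above:

# Watch replica changes on the target deployment
kubectl get inferencedeployment llama-70b-chat -n llm-inference -w

# Review recent scaling events in the namespace
kubectl get events -n llm-inference --sort-by=.lastTimestamp | tail -n 20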

Cost Optimization

apiVersion: dynamo.nvidia.com/v1
kind: CostPolicy
metadata:
  name: production-cost-controls
spec:
  budget:
    hourly: 500
    daily: 10000
    monthly: 200000
    
  strategies:
    - name: spot-instances
      enabled: true
      maxSpotPercentage: 70
      fallbackToOnDemand: true
      interruptionHandling: graceful-drain
      
    - name: gpu-consolidation
      enabled: true
      consolidationWindow: 5m
      minUtilizationForConsolidation: 30
      
    - name: model-offloading
      enabled: true
      offloadIdleModelsAfter: 10m
      offloadTarget: cpu  # or disk
      
  alerts:
    - threshold: 80  # % of budget
      action: notify
      channels: ["slack", "pagerduty"]
      
    - threshold: 95
      action: scale-down-non-critical
      
    - threshold: 100
      action: reject-new-requests
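
The spot-instances strategy can be verified from the node list. AKS labels spot nodes with kubernetes.azure.com/scalesetpriority=spot, so the current spot/on-demand mix is visible directly with kubectl:

# Show spot vs. on-demand nodes (the column is empty for on-demand nodes)
kubectl get nodes -L kubernetes.azure.com/scalesetpriority

# Count spot nodes to compare against maxSpotPercentage
kubectl get nodes -l kubernetes.azure.com/scalesetpriority=spot --no-headers | wc -l
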
💡 Cost Saving Tip

Enable model-offloading for development and staging environments. Dynamo Planner can offload models to CPU RAM or NVMe during idle periods, reducing GPU costs by up to 80% for non-production workloads.

Monitoring and Observability

Dynamo Planner integrates with Azure Monitor for comprehensive observability:

apiVersion: dynamo.nvidia.com/v1
kind: ObservabilityConfig
metadata:
  name: production-observability
spec:
  metrics:
    azureMonitor:
      enabled: true
      customMetrics:
        - name: llm_tokens_per_second
        - name: llm_time_to_first_token
        - name: llm_queue_depth
        - name: gpu_memory_utilization
        
  tracing:
    enabled: true
    samplingRate: 0.1  # 10% of requests
    exporter: azure-monitor
    
  logging:
    level: info
    includePrompts: false  # Privacy: don't log prompts
    includeTokenCounts: true
    
  dashboards:
    grafana:
      enabled: true
      autoProvision: true
    azureDashboard:
      enabled: true
      resourceGroup: rg-llm-production
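
If the auto-provisioned Grafana dashboards are enabled, they can be reached with a standard port-forward. The service name below is a guess at what the chart creates, so list the services in dynamo-system first:

# Find the Grafana service created by the chart, then forward it locally
kubectl get svc -n dynamo-system
kubectl port-forward svc/dynamo-planner-grafana -n dynamo-system 3000:80   # service name is hypothetical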

Key Metrics to Monitor

Metric                     | Description                              | Target
Time to First Token (TTFT) | Latency before the first token generated | <500ms
Tokens Per Second (TPS)    | Generation throughput                    | >50/request
Queue Depth                | Pending requests                         | <100
GPU Memory Utilization     | VRAM usage percentage                    | 80-90%
Cost Per 1K Tokens         | Inference cost efficiency                | <$0.01
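
These metrics can also be queried from the Log Analytics workspace configured earlier. A minimal sketch, assuming the custom metrics end up in the InsightsMetrics table (the exact table depends on how Container Insights or managed Prometheus is wired up):

# Average tokens/second over the last hour, bucketed into 5-minute windows
az monitor log-analytics query \
  --workspace $LOG_ANALYTICS_WORKSPACE_ID \
  --analytics-query "InsightsMetrics | where Name == 'llm_tokens_per_second' and TimeGenerated > ago(1h) | summarize avg(Val) by bin(TimeGenerated, 5m)"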

Integration with Azure AI Services

Dynamo-managed endpoints can be called directly from application code. The C# example below uses the Azure AI Inference client library against the llama-70b-chat deployment defined earlier:

// C# client using Dynamo-managed endpoint
using Azure.AI.Inference;

var client = new ChatCompletionsClient(
    new Uri("https://aks-llm-cluster.eastus.inference.ml.azure.com"),
    new AzureKeyCredential(Environment.GetEnvironmentVariable("DYNAMO_API_KEY"))
);

// Time the request so the latency line below has a defined stopwatch
var stopwatch = System.Diagnostics.Stopwatch.StartNew();
var response = await client.CompleteAsync(new ChatCompletionsOptions
{
    DeploymentName = "llama-70b-chat",  // Maps to InferenceDeployment
    Messages =
    {
        new ChatRequestSystemMessage("You are a helpful assistant."),
        new ChatRequestUserMessage("Explain Kubernetes pod scheduling.")
    },
    MaxTokens = 1000,
    Temperature = 0.7f
});
stopwatch.Stop();

Console.WriteLine(response.Value.Choices[0].Message.Content);
Console.WriteLine($"Tokens used: {response.Value.Usage.TotalTokens}");
Console.WriteLine($"Latency: {response.Value.Usage.CompletionTokens} tokens in {stopwatch.ElapsedMilliseconds}ms");

Key Takeaways

  • NVIDIA Dynamo Planner provides AI-driven resource optimization specifically designed for LLM inference workloads on Kubernetes.
  • Predictive scaling pre-warms GPU nodes before traffic spikes, eliminating cold start delays.
  • Cost controls enable budget limits, spot instance utilization, and automatic model offloading for idle workloads.
  • Multi-model deployments with GPU packing maximize utilization across heterogeneous model sizes.
  • Azure integration provides native monitoring, managed identity authentication, and AKS node pool management.

Conclusion

NVIDIA Dynamo Planner addresses the operational complexity of running LLMs in production on Kubernetes. By combining predictive scaling, cost optimization, and GPU-aware scheduling, it enables enterprises to deploy large language models with confidence. The tight integration with Azure Kubernetes Service and Azure Monitor makes it particularly attractive for organizations already invested in the Microsoft ecosystem. For teams struggling with GPU utilization, inference latency, or cloud costs, Dynamo Planner represents a significant step toward production-grade AI infrastructure.
