Model Denial of Service: The Resource Exhaustion Attack You're Not Monitoring

March 3, 2026 • 8 min read • AI Security

Your threat model accounts for prompt injection, data poisoning, and jailbreaks. But are you monitoring for adversarial inputs designed to burn through compute budgets? Model denial of service attacks exploit the economic reality of LLM inference: some queries cost 100x more than others — and attackers know it.

The Compute Cost Asymmetry

Traditional denial of service attacks overwhelm systems with volume. Model DoS attacks exploit a different asymmetry: the variable computational cost of inference. A carefully crafted prompt can force an LLM to consume orders of magnitude more resources than a benign query — without triggering rate limits, WAF rules, or traditional DDoS defenses.

The attack surface is threefold:

  • Sponge examples: Inputs engineered to maximize intermediate computations
  • Context stuffing: Exploiting maximum token limits to inflate processing time
  • Repetition amplification: Triggering exponential token generation in output

Unlike infrastructure DoS, model DoS attacks succeed with low request volume. A single malicious query can cost more than 10,000 legitimate ones.

Attack Taxonomy

1. Sponge Examples

First documented in academic research (Shumailov et al., 2021), sponge examples are inputs specifically crafted to maximize model computation. For transformer architectures, this means exploiting attention mechanism complexity.

The attack works by forcing the model to attend to every token in the input with equal weight, preventing attention head optimization. Normal queries have sparse attention patterns; sponge examples make attention dense.

Example: Adversarial sponge prompt (simplified)
Analyze the relationship between each word and every other word in this sentence:
[10,000 semantically unrelated tokens designed to prevent attention pruning]

Now provide a summary where you reference each token at least once.

Inference cost for this query can be 50-100x higher than for semantically equivalent queries, while appearing superficially benign to content filters.
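
The quadratic term behind this amplification can be sketched with a back-of-envelope FLOP estimate; the model dimensions below are illustrative, not tied to any particular model:

```python
# Self-attention cost grows with the square of sequence length, which is
# the asymmetry sponge inputs exploit. Dimensions are illustrative.
def attention_flops(seq_len: int, d_model: int = 4096, n_layers: int = 32) -> int:
    # QK^T score matrix plus attention-weighted values:
    # roughly 2 * seq_len^2 * d_model per layer
    return 2 * (seq_len ** 2) * d_model * n_layers

baseline = attention_flops(500)     # typical chat prompt
sponge = attention_flops(10_000)    # dense 10K-token sponge input
print(f"Attention cost ratio: {sponge // baseline}x")
# → Attention cost ratio: 400x
```

Because feed-forward layers scale linearly with sequence length, the end-to-end multiplier lands below this attention-only ratio, which is consistent with the 50-100x figures observed in practice.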

2. Context Stuffing

Most production LLMs support context windows from 8K to 200K tokens. Attackers can stuff the context window with malicious padding (MITRE ATLAS tracks this attack class as AML.T0029, Denial of ML Service):

Context stuffing pattern
System: You are a helpful assistant.

User: [196,000 tokens of Lorem Ipsum or Wikipedia dumps]

Given the above context, answer this simple question: What is 2+2?

The actual question is trivial, but processing cost scales with total context length. At GPT-4 Turbo-era input pricing of $0.01/1K tokens, a single 196K-token stuffed query costs roughly $2. Multiply by request volume during business hours.

3. Generation Amplification

Output token generation is typically 3-10x more expensive than input processing. Attackers exploit this by crafting prompts that trigger maximum-length responses:

Amplification via repetition trigger
List every three-digit number from 100 to 999, with a detailed explanation 
of why each number is mathematically interesting. Format each as:

Number: XXX
Mathematical properties: [detailed analysis]
Historical significance: [detailed history]
[repeat for all 900 numbers]

The model attempts to comply, burning through output token budgets. Even with max_tokens limits, the attacker forces consumption of the maximum allowed.
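
The asymmetry is easy to quantify. Using illustrative GPT-4 Turbo-era prices ($0.01/1K input, $0.03/1K output), a short amplification prompt pays almost nothing on input while the response burns the full output budget:

```python
# A short amplification prompt is cheap for the attacker to send but
# forces a maximum-length, output-priced response.
in_tokens, out_tokens = 120, 4_096   # short prompt, max-length reply
price_in, price_out = 0.01, 0.03     # $/1K tokens (illustrative)

input_cost = in_tokens * price_in / 1000
output_cost = out_tokens * price_out / 1000
print(f"input ${input_cost:.4f} vs output ${output_cost:.4f} "
      f"({output_cost / input_cost:.0f}x)")
# → input $0.0012 vs output $0.1229 (102x)
```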

Detection Strategies

Traditional security tooling won't catch model DoS attacks. WAFs see benign HTTP POST requests. SIEMs see normal application logs. You need inference-aware detection.

Architecture: Cost Anomaly Detection

Effective detection requires instrumentation at the inference layer. The detection architecture must capture per-request resource consumption and flag statistical outliers.

[Figure: cost anomaly detection architecture. A client (potentially malicious) sends POST /chat through an API gateway with a cost meter that logs token_in and token_out. The instrumented LLM inference runtime (GPT-4, Claude, etc.) emits inference metrics to an anomaly detector that maintains a rolling cost baseline and per-user behavior profiles, alerting on 3σ outliers and blocking on 5σ outliers combined with volume. Cost logs feed a SIEM/DataDog cost dashboard with budget alerts.]

Metric Collection

Instrument your inference layer to capture:

  • Input token count: Actual tokens processed (not just Content-Length)
  • Output token count: Tokens generated in response
  • Inference latency: Wall-clock time from request to first token
  • Total cost: Calculated from provider pricing (input + output)
  • User/API key: Attribution for cost aggregation
  • Request fingerprint: Hash of prompt structure (not content)
Inference instrumentation (Python/FastAPI)
import time
import logging
from prometheus_client import Histogram, Counter

logger = logging.getLogger(__name__)

llm_cost_histogram = Histogram('llm_inference_cost_usd', 
                               'LLM inference cost per request',
                               buckets=[0.001, 0.01, 0.1, 1.0, 10.0])
llm_token_counter = Counter('llm_tokens_total', 
                            'Total tokens processed',
                            ['direction', 'user_id'])

# Per-1K-token pricing (input, output); extend for each model you serve
PRICING = {"gpt-4-turbo": (0.01, 0.03)}

async def metered_inference(prompt: str, user_id: str, model: str):
    start = time.time()
    
    # Call LLM (llm_client is your configured provider SDK client)
    response = await llm_client.chat(model=model, prompt=prompt)
    
    # Extract token counts from response usage metadata
    input_tokens = response.usage.prompt_tokens
    output_tokens = response.usage.completion_tokens
    
    # Calculate cost from the pricing table; unknown models fall back to
    # zero so total_cost is always defined
    price_in, price_out = PRICING.get(model, (0.0, 0.0))
    total_cost = (input_tokens * price_in + output_tokens * price_out) / 1000
    
    # Record metrics
    llm_cost_histogram.observe(total_cost)
    llm_token_counter.labels(direction='input', user_id=user_id).inc(input_tokens)
    llm_token_counter.labels(direction='output', user_id=user_id).inc(output_tokens)
    
    # Log structured event for SIEM
    logger.info("llm_inference", extra={
        "user_id": user_id,
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "cost_usd": total_cost,
        "latency_ms": (time.time() - start) * 1000
    })
    
    return response

Anomaly Detection Algorithm

Use statistical outlier detection on a sliding window. The key insight: model DoS attacks produce cost outliers, not just volume spikes.

Z-score based cost anomaly detector
import logging
from collections import deque

import numpy as np
from fastapi import HTTPException

logger = logging.getLogger(__name__)

class CostAnomalyDetector:
    def __init__(self, window_size=1000, alert_threshold=3.0, block_threshold=5.0):
        self.window = deque(maxlen=window_size)
        self.alert_threshold = alert_threshold
        self.block_threshold = block_threshold
    
    def check(self, cost: float, user_id: str, record: bool = True) -> tuple[str, float]:
        """
        Returns: (action, z_score)
        action: 'allow' | 'alert' | 'block'
        Pass record=False for pre-inference estimates so they don't
        skew the baseline.
        """
        if len(self.window) < 100:  # Warmup period
            if record:
                self.window.append(cost)
            return ('allow', 0.0)
        
        # Calculate baseline statistics
        costs = np.array(self.window)
        mean = np.mean(costs)
        std = np.std(costs)
        
        if std == 0:  # Edge case: all costs identical
            std = 0.001
        
        # Compute Z-score for current request
        z_score = (cost - mean) / std
        
        # Decision logic
        if z_score >= self.block_threshold:
            action = 'block'
            logger.warning(f"BLOCK: user={user_id} cost=${cost:.4f} z={z_score:.2f}")
        elif z_score >= self.alert_threshold:
            action = 'alert'
            logger.info(f"ALERT: user={user_id} cost=${cost:.4f} z={z_score:.2f}")
        else:
            action = 'allow'
        
        # Update rolling window only for recorded, non-blocked requests
        if record and action != 'block':
            self.window.append(cost)
        
        return (action, z_score)

# Usage in inference path
detector = CostAnomalyDetector()

async def protected_inference(prompt: str, user_id: str, model: str = "gpt-4-turbo"):
    # Estimate cost before full inference (rough heuristic); the estimate
    # is not recorded in the baseline to avoid double counting
    estimated_cost = estimate_cost(prompt)
    
    action, z_score = detector.check(estimated_cost, user_id, record=False)
    
    if action == 'block':
        raise HTTPException(429, "Request blocked: resource exhaustion pattern detected")
    
    response = await metered_inference(prompt, user_id, model)
    
    # Check actual cost post-inference
    actual_cost = calculate_cost(response)
    action_actual, z_actual = detector.check(actual_cost, user_id)
    
    if action_actual == 'alert':
        send_alert(user_id, actual_cost, z_actual)
    
    return response

Per-User Budget Enforcement

Complement statistical detection with hard budget caps. This prevents slow-burn attacks that stay just below anomaly thresholds but accumulate massive costs over time.

Redis-backed budget tracker
import redis
from datetime import datetime, timedelta, timezone

class BudgetTracker:
    def __init__(self, redis_client, daily_limit_usd=100.0):
        self.redis = redis_client
        self.daily_limit = daily_limit_usd
    
    def get_key(self, user_id: str) -> str:
        today = datetime.now(timezone.utc).strftime("%Y-%m-%d")
        return f"llm_budget:{user_id}:{today}"
    
    def check_and_deduct(self, user_id: str, cost: float) -> bool:
        """
        Returns True if budget allows, False if exceeded.
        Atomically deducts cost from daily budget.
        """
        key = self.get_key(user_id)
        
        # INCRBYFLOAT is atomic
        spent = float(self.redis.incrbyfloat(key, cost))
        
        # Set expiry if the key has none yet (24h + buffer); checking TTL
        # is safer than comparing floats to detect the first spend
        if self.redis.ttl(key) == -1:
            self.redis.expire(key, int(timedelta(hours=26).total_seconds()))
        
        if spent > self.daily_limit:
            logger.warning(f"Budget exceeded: user={user_id} spent=${spent:.2f}")
            return False
        
        return True
    
    def get_remaining(self, user_id: str) -> float:
        key = self.get_key(user_id)
        spent = float(self.redis.get(key) or 0)
        return max(0, self.daily_limit - spent)

# Integration
budget_tracker = BudgetTracker(redis_client)

async def budget_protected_inference(prompt: str, user_id: str):
    estimated_cost = estimate_cost(prompt)
    
    if not budget_tracker.check_and_deduct(user_id, estimated_cost):
        remaining = budget_tracker.get_remaining(user_id)
        raise HTTPException(429, f"Daily budget exceeded. Remaining: ${remaining:.2f}")
    
    return await protected_inference(prompt, user_id)

Mitigation Architecture

Defense in depth requires controls at multiple layers. No single mitigation stops all model DoS variants.

Layer 1: Input Validation

Reject obviously malicious inputs before inference:

  • Token count limits: Hard caps on input length (e.g., 8K tokens for chat, 32K for document analysis)
  • Repetition detection: Flag inputs with >30% repeated n-grams
  • Entropy checks: Block low-entropy padding (e.g., 10K tokens of "Lorem ipsum")
Pre-inference input validation
import math
import tiktoken
from collections import Counter

def validate_input(prompt: str, max_tokens=8000) -> tuple[bool, str]:
    """Returns (is_valid, error_message)"""
    
    if not prompt:
        return (False, "Empty input")
    
    # Token count check
    encoding = tiktoken.encoding_for_model("gpt-4")
    tokens = encoding.encode(prompt)
    
    if len(tokens) > max_tokens:
        return (False, f"Input exceeds {max_tokens} token limit")
    
    # Repetition check (trigram analysis)
    words = prompt.lower().split()
    if len(words) > 100:
        trigrams = [tuple(words[i:i+3]) for i in range(len(words)-2)]
        trigram_counts = Counter(trigrams)
        most_common_freq = trigram_counts.most_common(1)[0][1] if trigrams else 0
        repetition_ratio = most_common_freq / len(trigrams) if trigrams else 0
        
        if repetition_ratio > 0.3:
            return (False, "Excessive repetition detected")
    
    # Shannon entropy check over characters (simplified)
    char_counts = Counter(prompt)
    entropy = -sum((count/len(prompt)) * math.log2(count/len(prompt)) 
                   for count in char_counts.values())
    
    if entropy < 2.0:  # Very low entropy (e.g., "aaaaaa...")
        return (False, "Input appears to be padding/noise")
    
    return (True, "")

Layer 2: Controlled Generation

Limit output token generation to prevent amplification attacks:

  • max_tokens enforcement: Set conservative defaults (e.g., 500 tokens for chat, 2000 for summarization)
  • Temperature capping: Lower temperature reduces randomness and can prevent runaway generation
  • Stop sequences: Define stop tokens to prevent infinite lists/repetition
Generation parameter hardening
# Safe defaults for production
SAFE_GENERATION_PARAMS = {
    "max_tokens": 500,          # Conservative limit
    "temperature": 0.7,         # Balanced creativity/control
    "top_p": 0.9,               # Nucleus sampling
    "frequency_penalty": 0.3,   # Reduce repetition
    "presence_penalty": 0.2,    # Encourage topic diversity
    "stop": ["\n\n\n", "---"],  # Halt on excessive whitespace
}

def safe_llm_call(prompt: str, user_id: str, **overrides):
    """
    LLM call with safety defaults. User overrides are clamped.
    """
    params = SAFE_GENERATION_PARAMS.copy()
    
    # Allow user to increase max_tokens, but cap at 2000
    if "max_tokens" in overrides:
        params["max_tokens"] = min(overrides["max_tokens"], 2000)
    
    # Prevent temperature > 1.0 (can cause instability)
    if "temperature" in overrides:
        params["temperature"] = min(max(overrides["temperature"], 0.0), 1.0)
    
    return llm_client.chat(prompt=prompt, **params)

Layer 3: Rate Limiting

Even with per-request cost controls, you need temporal rate limits:

  • Request rate: Max requests per user per minute (e.g., 20/min)
  • Token throughput: Max input+output tokens per user per hour (e.g., 100K/hour)
  • Concurrent requests: Limit active inference calls per user (e.g., 3 concurrent)

Implement using Redis INCR with TTL for distributed rate limiting, or leverage API gateway features like AWS API Gateway throttling.
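
The three limits above can be sketched in one helper, assuming a redis-py-style client (anything exposing atomic incrby and expire); key names and limits mirror the examples:

```python
import asyncio

class InferenceRateLimiter:
    """Fixed-window rate limits via atomic Redis INCR + TTL, plus a
    per-user concurrency cap. Limits below mirror the examples above."""

    def __init__(self, redis_client, req_per_min=20,
                 tokens_per_hour=100_000, max_concurrent=3):
        self.redis = redis_client
        self.req_per_min = req_per_min
        self.tokens_per_hour = tokens_per_hour
        self.max_concurrent = max_concurrent
        self._semaphores = {}  # user_id -> asyncio.Semaphore

    def _bump(self, key: str, amount: int, window_s: int) -> int:
        count = self.redis.incrby(key, amount)  # atomic increment
        if count == amount:                     # first hit in this window
            self.redis.expire(key, window_s)
        return count

    def allow_request(self, user_id: str) -> bool:
        return self._bump(f"rl:req:{user_id}", 1, 60) <= self.req_per_min

    def allow_tokens(self, user_id: str, tokens: int) -> bool:
        return self._bump(f"rl:tok:{user_id}", tokens, 3600) <= self.tokens_per_hour

    def concurrency_gate(self, user_id: str) -> asyncio.Semaphore:
        # Use as: async with limiter.concurrency_gate(user_id): ...
        return self._semaphores.setdefault(
            user_id, asyncio.Semaphore(self.max_concurrent))
```

Fixed windows are slightly bursty at window edges; a sliding-window or token-bucket variant tightens that at the cost of extra Redis round trips.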

Layer 4: Monitoring & Alerting

Real-time visibility into cost trends is critical. Key metrics:

  • Cost per request (p50, p95, p99): Detect distribution shifts
  • Top spenders by user: Identify compromised accounts or abusive users
  • Hourly cost burn rate: Alert on unexpected budget acceleration
  • Inference latency: Sponge examples often correlate with high latency
DataDog dashboard query (example)
# Anomaly detection on cost per request
avg:llm.inference.cost{env:prod} by {user_id}.anomalies(
  algorithm='agile',
  deviations=3,
  direction='above'
)

# Budget burn rate alert
sum:llm.inference.cost{env:prod}.as_rate() > 10  # $10/hour threshold

Incident Response Playbook

When a model DoS attack is detected:

Immediate Actions (T+0 to T+5 min)

  1. Identify the attack vector: Check logs for user_id, API key, source IP
  2. Block the attacker: Revoke API key or add IP to blocklist
  3. Assess blast radius: How much budget was consumed? How many requests?
  4. Enable emergency rate limits: Temporarily reduce global limits if needed

Containment (T+5 to T+30 min)

  1. Review attack pattern: Was it sponge examples, context stuffing, or amplification?
  2. Update detection rules: Add new signatures to input validation
  3. Notify stakeholders: Alert finance/leadership if budget impact is significant
  4. Check for lateral movement: Did attacker compromise multiple accounts?

Recovery (T+30 min to T+24 hours)

  1. Restore normal service: Remove emergency limits once threat is mitigated
  2. Implement permanent fixes: Deploy updated validation rules to production
  3. Conduct post-mortem: Document attack timeline, cost impact, lessons learned
  4. Review budget allocation: Adjust daily limits if needed to prevent future incidents

Real-World Attack Scenarios

Scenario 1: Compromised API Key

An attacker obtains a legitimate user's API key (via phishing, GitHub leak, or insecure storage). They automate context-stuffed queries to maximize cost before detection.

Defense: Per-key budget limits + anomaly detection. Even if the key is valid, sudden cost spikes trigger alerts and automatic suspension.

Scenario 2: Free Tier Abuse

A service offers a free tier with "unlimited" requests but no cost caps. Attacker creates multiple accounts and uses sponge examples to drain compute resources.

Defense: Implement token throughput limits (not just request counts) on free tiers. Example: 10K input tokens/day max for free users.

Scenario 3: Competitive Attack

A competitor deliberately targets your LLM service during peak hours to degrade performance for legitimate users while inflating your operational costs.

Defense: Concurrent request limits + CAPTCHA challenges on high-cost queries. Prioritize known good users (OAuth-authenticated) over anonymous API traffic.

Cost Impact Analysis

To quantify the risk, model your exposure:

  • Average query cost: Baseline cost for legitimate traffic (e.g., $0.02/request)
  • Sponge example multiplier: Cost amplification from adversarial input (50-100x)
  • Attack duration: Time until detection and mitigation (5-30 minutes)
  • Request rate: Attacker's query frequency (e.g., 10/sec if rate limits allow)
Cost exposure calculation
# Example calculation
baseline_cost = 0.02          # $/request for normal traffic
amplification = 75            # Sponge example multiplier
attack_duration_min = 15
requests_per_min = 20

total_requests = attack_duration_min * requests_per_min
attack_cost_per_request = baseline_cost * amplification
total_cost = total_requests * attack_cost_per_request

print(f"Potential attack cost: ${total_cost:.2f}")
# Output: Potential attack cost: $450.00

# With budget cap of $100/day, attacker is blocked at $100
# Without cap, attack runs until manual intervention

For a production service handling 10K requests/day at $0.02 each (daily cost: $200), a successful 15-minute model DoS attack can cost 2.25x your normal daily budget — without any infrastructure scaling.

Integration with Existing Security Stack

WAF Integration

While WAFs can't detect sponge examples (content-agnostic), they can enforce baseline input limits:

  • Body size limits: Cap HTTP POST body at 1MB to prevent context stuffing
  • Rate limiting: AWS WAF rate-based rules (e.g., 100 requests/5 min per IP)
  • Geo-blocking: Restrict API access to expected regions if applicable

Example: AWS WAF rate-based rule configuration.
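
As a sketch, such a rule takes the shape expected in the Rules list of a boto3 wafv2 create_web_acl/update_web_acl call; the name, limit, and metric name here are illustrative placeholders:

```python
# AWS WAF rate-based rule sketch: block any source IP exceeding the
# request limit within WAF's 5-minute evaluation window.
RATE_LIMIT_RULE = {
    "Name": "llm-api-rate-limit",
    "Priority": 1,
    "Statement": {
        "RateBasedStatement": {
            "Limit": 100,              # max requests per 5-minute window
            "AggregateKeyType": "IP",  # count per source IP
        }
    },
    "Action": {"Block": {}},
    "VisibilityConfig": {
        "SampledRequestsEnabled": True,
        "CloudWatchMetricsEnabled": True,
        "MetricName": "llmApiRateLimit",
    },
}
```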

SIEM Correlation

Feed LLM cost events into your SIEM (Splunk, Sentinel, Sumo Logic) for correlation with other security signals:

  • High cost + failed authentication attempts = compromised account
  • High cost + unusual geolocation = potential attacker
  • High cost + API key created <24h ago = suspicious new user
Splunk query for correlation (example)
index=llm_logs event_type="inference" 
| stats avg(cost_usd) as avg_cost, sum(cost_usd) as total_cost by user_id 
| where avg_cost > 1.0 OR total_cost > 50 
| join user_id 
  [search index=auth_logs event_type="failed_login" 
   | stats count as failed_logins by user_id] 
| where failed_logins > 5 
| table user_id, avg_cost, total_cost, failed_logins 
| sort - total_cost

Zero Trust Architecture

Apply NIST SP 800-207 zero trust principles to LLM access:

  • Least privilege: Grant minimum token limits required for user's role
  • Continuous verification: Re-authenticate for high-cost queries
  • Assume breach: Design detection to catch insider threats, not just external attackers
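
The continuous-verification point can be made concrete with a small gate: requests whose estimated cost crosses a threshold require a recent re-authentication. The threshold, the session timestamp field, and how the cost estimate is produced are assumptions for illustration:

```python
import time

STEP_UP_COST_USD = 1.00   # illustrative high-cost threshold
MAX_AUTH_AGE_S = 300      # re-auth must be newer than 5 minutes

def requires_step_up(estimated_cost: float, last_auth_ts: float) -> bool:
    """Return True if this request should be challenged for fresh
    authentication before inference proceeds."""
    if estimated_cost < STEP_UP_COST_USD:
        return False
    return (time.time() - last_auth_ts) > MAX_AUTH_AGE_S
```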

Regulatory and Compliance Considerations

Model DoS attacks have compliance implications:

  • SOC 2 Type II: Availability criteria and system monitoring controls (e.g., CC7.2) expect DoS mitigation and detection
  • ISO 27001: A.12.1.3 (Capacity management) covers protection against resource exhaustion
  • PCI DSS: Availability and secure software development requirements apply to payment-adjacent LLM services
  • Cloud provider SLAs: Excessive cost spikes may trigger usage reviews or account suspension

Document your model DoS controls in security policies and runbooks. Auditors increasingly ask about AI-specific threats.

Future Threat Landscape

As LLMs scale and proliferate, expect evolution in model DoS tactics:

  • Multi-modal attacks: Adversarial images/audio that maximize vision/speech model compute
  • Distributed campaigns: Botnets issuing coordinated low-volume sponge queries
  • Model-specific exploits: Attacks tailored to architecture quirks (e.g., GPT vs. Claude vs. Gemini)
  • Supply chain targeting: Poisoning third-party RAG corpuses to inject expensive retrieval queries

Stay current with OWASP Top 10 for LLMs and MITRE ATLAS for emerging attack patterns.

Key Takeaways

  • Traditional DoS defenses don't work: Model DoS succeeds with low request volume and bypasses WAFs
  • Instrumentation is mandatory: You can't defend what you can't measure — log token counts and costs per request
  • Defense in depth: Combine input validation, generation controls, budget caps, and anomaly detection
  • Budget limits are non-negotiable: Every API key must have a daily/monthly cost cap
  • Monitor cost distributions: Focus on p95/p99 metrics, not just averages
  • Prepare incident response: Have a playbook ready before the first attack hits

References

  • Shumailov, I., Zhao, Y., Bates, D., Papernot, N., Mullins, R., and Anderson, R. (2021). "Sponge Examples: Energy-Latency Attacks on Neural Networks." IEEE European Symposium on Security and Privacy (EuroS&P).
  • OWASP Top 10 for Large Language Model Applications. OWASP Foundation.
  • MITRE ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems). MITRE Corporation.
  • NIST SP 800-207, "Zero Trust Architecture." National Institute of Standards and Technology.