Model Denial of Service: The Resource Exhaustion Attack You're Not Monitoring
Your threat model accounts for prompt injection, data poisoning, and jailbreaks. But are you monitoring for adversarial inputs designed to burn through compute budgets? Model denial of service attacks exploit the economic reality of LLM inference: some queries cost 100x more than others — and attackers know it.
Detection Strategies
Traditional security tooling won't catch model DoS attacks. WAFs see benign HTTP POST requests. SIEMs see normal application logs. You need inference-aware detection.
Architecture: Cost Anomaly Detection
Effective detection requires instrumentation at the inference layer. The detection architecture must capture per-request resource consumption and flag statistical outliers.
Metric Collection
Instrument your inference layer to capture:
- Input token count: Actual tokens processed (not just Content-Length)
- Output token count: Tokens generated in response
- Inference latency: Wall-clock time per request (time to first token if streaming, total completion time otherwise)
- Total cost: Calculated from provider pricing (input + output)
- User/API key: Attribution for cost aggregation
- Request fingerprint: Hash of prompt structure (not content)
import time
import logging
from prometheus_client import Histogram, Counter

logger = logging.getLogger(__name__)

llm_cost_histogram = Histogram(
    'llm_inference_cost_usd',
    'LLM inference cost per request',
    buckets=[0.001, 0.01, 0.1, 1.0, 10.0])
llm_token_counter = Counter(
    'llm_tokens_total',
    'Total tokens processed',
    ['direction', 'user_id'])

# Per-1K-token pricing (example: GPT-4 Turbo); extend per model
PRICING = {"gpt-4-turbo": (0.01, 0.03)}  # ($/1K input, $/1K output)

async def metered_inference(prompt: str, user_id: str, model: str):
    start = time.time()
    # Call LLM (llm_client is the application's provider client)
    response = await llm_client.chat(model=model, prompt=prompt)
    # Extract token counts from usage metadata
    input_tokens = response.usage.prompt_tokens
    output_tokens = response.usage.completion_tokens
    # Calculate cost from provider pricing; unknown models record $0
    price_in, price_out = PRICING.get(model, (0.0, 0.0))
    total_cost = (input_tokens * price_in + output_tokens * price_out) / 1000
    # Record metrics
    llm_cost_histogram.observe(total_cost)
    llm_token_counter.labels(direction='input', user_id=user_id).inc(input_tokens)
    llm_token_counter.labels(direction='output', user_id=user_id).inc(output_tokens)
    # Log structured event for SIEM
    logger.info("llm_inference", extra={
        "user_id": user_id,
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "cost_usd": total_cost,
        "latency_ms": (time.time() - start) * 1000,
    })
    return response
Anomaly Detection Algorithm
Use statistical outlier detection on a sliding window. The key insight: model DoS attacks produce cost outliers, not just volume spikes.
import logging
import numpy as np
from collections import deque

logger = logging.getLogger(__name__)

class CostAnomalyDetector:
    def __init__(self, window_size=1000, alert_threshold=3.0, block_threshold=5.0):
        self.window = deque(maxlen=window_size)
        self.alert_threshold = alert_threshold
        self.block_threshold = block_threshold

    def check(self, cost: float, user_id: str) -> tuple[str, float]:
        """
        Returns: (action, z_score)
        action: 'allow' | 'alert' | 'block'
        """
        if len(self.window) < 100:  # Warmup period
            self.window.append(cost)
            return ('allow', 0.0)
        # Calculate baseline statistics
        costs = np.array(self.window)
        mean = np.mean(costs)
        std = np.std(costs)
        if std == 0:  # Edge case: all costs identical
            std = 0.001
        # Compute Z-score for current request
        z_score = (cost - mean) / std
        # Decision logic
        if z_score >= self.block_threshold:
            action = 'block'
            logger.warning(f"BLOCK: user={user_id} cost=${cost:.4f} z={z_score:.2f}")
        elif z_score >= self.alert_threshold:
            action = 'alert'
            logger.info(f"ALERT: user={user_id} cost=${cost:.4f} z={z_score:.2f}")
        else:
            action = 'allow'
        # Update rolling window only if not blocked,
        # so attack traffic cannot inflate the baseline
        if action != 'block':
            self.window.append(cost)
        return (action, z_score)
# Usage in inference path (assumes FastAPI for HTTPException)
from fastapi import HTTPException

detector = CostAnomalyDetector()

async def protected_inference(prompt: str, user_id: str, model: str = "gpt-4-turbo"):
    # Estimate cost before full inference (rough heuristic)
    estimated_cost = estimate_cost(prompt)
    action, z_score = detector.check(estimated_cost, user_id)
    if action == 'block':
        raise HTTPException(429, "Request blocked: resource exhaustion pattern detected")
    response = await metered_inference(prompt, user_id, model)
    # Check actual cost post-inference
    actual_cost = calculate_cost(response)
    action_actual, z_actual = detector.check(actual_cost, user_id)
    if action_actual == 'alert':
        send_alert(user_id, actual_cost, z_actual)
    return response
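The estimate_cost helper is referenced above but not defined. A minimal sketch is shown below; the four-characters-per-token ratio, the assumed 500-token output, and the constant names are our own heuristics, not a provider API:

```python
# Hypothetical pre-inference cost estimator. The ~4 chars/token ratio is a
# common rough rule of thumb for English text; the assumed output length
# matches the conservative max_tokens cap discussed later in this article.
PRICE_INPUT_PER_1K = 0.01    # $/1K input tokens (GPT-4 Turbo example)
PRICE_OUTPUT_PER_1K = 0.03   # $/1K output tokens
ASSUMED_OUTPUT_TOKENS = 500  # worst-case generation assumed for estimation

def estimate_cost(prompt: str) -> float:
    """Rough upper-bound cost estimate before calling the model."""
    est_input_tokens = len(prompt) / 4  # ~4 characters per token
    return (est_input_tokens * PRICE_INPUT_PER_1K
            + ASSUMED_OUTPUT_TOKENS * PRICE_OUTPUT_PER_1K) / 1000
```

Because the estimate runs before inference, it can only bound input cost; the post-inference check against actual cost remains essential.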
Per-User Budget Enforcement
Complement statistical detection with hard budget caps. This prevents slow-burn attacks that stay just below anomaly thresholds but accumulate massive costs over time.
import logging
import redis
from datetime import datetime, timedelta

logger = logging.getLogger(__name__)

class BudgetTracker:
    def __init__(self, redis_client, daily_limit_usd=100.0):
        self.redis = redis_client
        self.daily_limit = daily_limit_usd

    def get_key(self, user_id: str) -> str:
        today = datetime.utcnow().strftime("%Y-%m-%d")
        return f"llm_budget:{user_id}:{today}"

    def check_and_deduct(self, user_id: str, cost: float) -> bool:
        """
        Returns True if budget allows, False if exceeded.
        Atomically deducts cost from daily budget.
        """
        key = self.get_key(user_id)
        # INCRBYFLOAT is atomic
        spent = float(self.redis.incrbyfloat(key, cost))
        # Set expiry on first write (24h + buffer)
        if spent == cost:  # First spend of the day
            self.redis.expire(key, int(timedelta(hours=26).total_seconds()))
        if spent > self.daily_limit:
            logger.warning(f"Budget exceeded: user={user_id} spent=${spent:.2f}")
            return False
        return True

    def get_remaining(self, user_id: str) -> float:
        key = self.get_key(user_id)
        spent = float(self.redis.get(key) or 0)
        return max(0, self.daily_limit - spent)
# Integration
budget_tracker = BudgetTracker(redis_client)

async def budget_protected_inference(prompt: str, user_id: str):
    estimated_cost = estimate_cost(prompt)
    if not budget_tracker.check_and_deduct(user_id, estimated_cost):
        remaining = budget_tracker.get_remaining(user_id)
        raise HTTPException(429, f"Daily budget exceeded. Remaining: ${remaining:.2f}")
    return await protected_inference(prompt, user_id)
Mitigation Architecture
Defense in depth requires controls at multiple layers. No single mitigation stops all model DoS variants.
Layer 1: Input Validation
Reject obviously malicious inputs before inference:
- Token count limits: Hard caps on input length (e.g., 8K tokens for chat, 32K for document analysis)
- Repetition detection: Flag inputs with >30% repeated n-grams
- Entropy checks: Block low-entropy padding (e.g., 10K tokens of "Lorem ipsum")
import math
import tiktoken
from collections import Counter

def validate_input(prompt: str, max_tokens=8000) -> tuple[bool, str]:
    """Returns (is_valid, error_message)"""
    if not prompt:
        return (False, "Empty input")
    # Token count check
    encoding = tiktoken.encoding_for_model("gpt-4")
    tokens = encoding.encode(prompt)
    if len(tokens) > max_tokens:
        return (False, f"Input exceeds {max_tokens} token limit")
    # Repetition check (trigram analysis)
    words = prompt.lower().split()
    if len(words) > 100:
        trigrams = [tuple(words[i:i+3]) for i in range(len(words) - 2)]
        trigram_counts = Counter(trigrams)
        most_common_freq = trigram_counts.most_common(1)[0][1]
        repetition_ratio = most_common_freq / len(trigrams)
        if repetition_ratio > 0.3:
            return (False, "Excessive repetition detected")
    # Character-level entropy check (simplified)
    char_counts = Counter(prompt)
    entropy = -sum((count / len(prompt)) * math.log2(count / len(prompt))
                   for count in char_counts.values())
    if entropy < 2.0:  # Very low entropy (e.g., "aaaaaa...")
        return (False, "Input appears to be padding/noise")
    return (True, "")
Layer 2: Controlled Generation
Limit output token generation to prevent amplification attacks:
- max_tokens enforcement: Set conservative defaults (e.g., 500 tokens for chat, 2000 for summarization)
- Temperature capping: Lower temperature reduces randomness and can prevent runaway generation
- Stop sequences: Define stop tokens to prevent infinite lists/repetition
# Safe defaults for production
SAFE_GENERATION_PARAMS = {
    "max_tokens": 500,          # Conservative limit
    "temperature": 0.7,         # Balanced creativity/control
    "top_p": 0.9,               # Nucleus sampling
    "frequency_penalty": 0.3,   # Reduce repetition
    "presence_penalty": 0.2,    # Encourage topic diversity
    "stop": ["\n\n\n", "---"],  # Halt on excessive whitespace
}

def safe_llm_call(prompt: str, user_id: str, **overrides):
    """
    LLM call with safety defaults. User overrides are clamped.
    """
    params = SAFE_GENERATION_PARAMS.copy()
    # Allow user to increase max_tokens, but cap at 2000
    if "max_tokens" in overrides:
        params["max_tokens"] = min(overrides["max_tokens"], 2000)
    # Clamp temperature to [0.0, 1.0]; higher values can cause instability
    if "temperature" in overrides:
        params["temperature"] = min(max(overrides["temperature"], 0.0), 1.0)
    return llm_client.chat(prompt=prompt, **params)
Layer 3: Rate Limiting
Even with per-request cost controls, you need temporal rate limits:
- Request rate: Max requests per user per minute (e.g., 20/min)
- Token throughput: Max input+output tokens per user per hour (e.g., 100K/hour)
- Concurrent requests: Limit active inference calls per user (e.g., 3 concurrent)
Implement using Redis INCR with TTL for distributed rate limiting, or leverage API gateway features like AWS API Gateway throttling.
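The first two limits above share one fixed-window counter pattern. The sketch below (class name and structure are our own) uses an in-memory dictionary for illustration; in a distributed deployment, replace the dictionary with Redis INCRBY plus EXPIRE on the first increment, as the text suggests:

```python
import time

class FixedWindowLimiter:
    """Fixed-window rate limiter sketch. Swap the dict for Redis
    INCRBY + EXPIRE when running multiple API instances."""

    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window = window_seconds
        self.counts: dict[tuple[str, int], int] = {}

    def allow(self, user_id: str, units: int = 1) -> bool:
        # Key by (user, current window index); stale windows are ignored
        bucket = (user_id, int(time.time()) // self.window)
        count = self.counts.get(bucket, 0) + units
        self.counts[bucket] = count
        return count <= self.limit

# Limits matching the examples above: 20 req/min, 100K tokens/hour
request_limiter = FixedWindowLimiter(limit=20, window_seconds=60)
token_limiter = FixedWindowLimiter(limit=100_000, window_seconds=3600)
```

The same class covers request rate (units=1) and token throughput (units=token count). Concurrent-request limits need a semaphore instead, since they track in-flight work rather than events per window.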
Layer 4: Monitoring & Alerting
Real-time visibility into cost trends is critical. Key metrics:
- Cost per request (p50, p95, p99): Detect distribution shifts
- Top spenders by user: Identify compromised accounts or abusive users
- Hourly cost burn rate: Alert on unexpected budget acceleration
- Inference latency: Sponge examples often correlate with high latency
# Datadog-style monitor: anomaly detection on cost per request
avg:llm.inference.cost{env:prod} by {user_id}.anomalies(
    algorithm='agile',
    deviations=3,
    direction='above'
)

# Datadog-style monitor: budget burn rate alert (~$10/hour threshold;
# units depend on the monitor's rollup interval)
sum:llm.inference.cost{env:prod}.as_rate() > 10
Incident Response Playbook
When a model DoS attack is detected:
Immediate Actions (T+0 to T+5 min)
- Identify the attack vector: Check logs for user_id, API key, source IP
- Block the attacker: Revoke API key or add IP to blocklist
- Assess blast radius: How much budget was consumed? How many requests?
- Enable emergency rate limits: Temporarily reduce global limits if needed
Containment (T+5 to T+30 min)
- Review attack pattern: Was it sponge examples, context stuffing, or amplification?
- Update detection rules: Add new signatures to input validation
- Notify stakeholders: Alert finance/leadership if budget impact is significant
- Check for lateral movement: Did attacker compromise multiple accounts?
Recovery (T+30 min to T+24 hours)
- Restore normal service: Remove emergency limits once threat is mitigated
- Implement permanent fixes: Deploy updated validation rules to production
- Conduct post-mortem: Document attack timeline, cost impact, lessons learned
- Review budget allocation: Adjust daily limits if needed to prevent future incidents
Real-World Attack Scenarios
Scenario 1: Compromised API Key
An attacker obtains a legitimate user's API key (via phishing, GitHub leak, or insecure storage). They automate context-stuffed queries to maximize cost before detection.
Defense: Per-key budget limits + anomaly detection. Even if the key is valid, sudden cost spikes trigger alerts and automatic suspension.
Scenario 2: Free Tier Abuse
A service offers a free tier with "unlimited" requests but no cost caps. Attacker creates multiple accounts and uses sponge examples to drain compute resources.
Defense: Implement token throughput limits (not just request counts) on free tiers. Example: 10K input tokens/day max for free users.
Scenario 3: Competitive Attack
A competitor deliberately targets your LLM service during peak hours to degrade performance for legitimate users while inflating your operational costs.
Defense: Concurrent request limits + CAPTCHA challenges on high-cost queries. Prioritize known good users (OAuth-authenticated) over anonymous API traffic.
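The per-user concurrency cap in that defense is straightforward with an asyncio semaphore. The helper below is a minimal sketch (names and the reject-rather-than-queue policy are our own assumptions) for a service running on an async framework:

```python
import asyncio
from collections import defaultdict

# One semaphore per user; 3 concurrent inference calls max (assumed policy)
MAX_CONCURRENT = 3
_semaphores: dict[str, asyncio.Semaphore] = defaultdict(
    lambda: asyncio.Semaphore(MAX_CONCURRENT))

async def with_concurrency_cap(user_id: str, coro):
    """Run coro under the user's slot limit; reject immediately
    (rather than queue) when all slots are in use."""
    sem = _semaphores[user_id]
    if sem.locked():  # All slots taken
        raise RuntimeError("Too many concurrent requests")
    async with sem:
        return await coro

# Usage sketch: await with_concurrency_cap(user_id, llm_client.chat(...))
```

Rejecting instead of queueing matters here: queued sponge requests would still consume compute once admitted, so failing fast keeps the attacker from building a backlog.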
Cost Impact Analysis
To quantify the risk, model your exposure:
- Average query cost: Baseline cost for legitimate traffic (e.g., $0.02/request)
- Sponge example multiplier: Cost amplification from adversarial input (50-100x)
- Attack duration: Time until detection and mitigation (5-30 minutes)
- Request rate: Attacker's query frequency (e.g., 10/sec if rate limits allow)
# Example calculation
baseline_cost = 0.02 # $/request for normal traffic
amplification = 75 # Sponge example multiplier
attack_duration_min = 15
requests_per_min = 20
total_requests = attack_duration_min * requests_per_min
attack_cost_per_request = baseline_cost * amplification
total_cost = total_requests * attack_cost_per_request
print(f"Potential attack cost: ${total_cost:.2f}")
# Output: Potential attack cost: $450.00
# With budget cap of $100/day, attacker is blocked at $100
# Without cap, attack runs until manual intervention
For a production service handling 10K requests/day at $0.02 each (daily cost: $200), a successful 15-minute model DoS attack can cost 2.25x your normal daily budget — without any infrastructure scaling.
Integration with Existing Security Stack
WAF Integration
While WAFs can't detect sponge examples (content-agnostic), they can enforce baseline input limits:
- Body size limits: Cap HTTP POST body at 1MB to prevent context stuffing
- Rate limiting: AWS WAF rate-based rules (e.g., 100 requests/5 min per IP)
- Geo-blocking: Restrict API access to expected regions if applicable
For example, an AWS WAF rate-based rule counts requests per source IP over a trailing window (five minutes by default) and blocks once the configured limit is exceeded.
SIEM Correlation
Feed LLM cost events into your SIEM (Splunk, Sentinel, Sumo Logic) for correlation with other security signals:
- High cost + failed authentication attempts = compromised account
- High cost + unusual geolocation = potential attacker
- High cost + API key created <24h ago = suspicious new user
index=llm_logs event_type="inference"
| stats avg(cost_usd) as avg_cost, sum(cost_usd) as total_cost by user_id
| where avg_cost > 1.0 OR total_cost > 50
| join user_id
[search index=auth_logs event_type="failed_login"
| stats count as failed_logins by user_id]
| where failed_logins > 5
| table user_id, avg_cost, total_cost, failed_logins
| sort - total_cost
Zero Trust Architecture
Apply NIST SP 800-207 zero trust principles to LLM access:
- Least privilege: Grant minimum token limits required for user's role
- Continuous verification: Re-authenticate for high-cost queries
- Assume breach: Design detection to catch insider threats, not just external attackers
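As a sketch of the continuous-verification point, a gateway can require step-up authentication once a request's estimated cost crosses a threshold on a stale session. The threshold, the 15-minute session window, and the function name below are illustrative assumptions, not a standard API:

```python
# Assumed policy values for illustration
STEP_UP_COST_THRESHOLD = 0.50  # USD
MAX_SESSION_AGE_MIN = 15       # minutes since last authentication

def requires_step_up(estimated_cost: float, session_age_minutes: float) -> bool:
    """High-cost queries on stale sessions trigger re-authentication."""
    return (estimated_cost >= STEP_UP_COST_THRESHOLD
            and session_age_minutes > MAX_SESSION_AGE_MIN)
```

Routine cheap queries stay frictionless, while expensive ones on old sessions force the attacker to re-prove possession of credentials.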
Regulatory and Compliance Considerations
Model DoS attacks have compliance implications:
- SOC 2 Type II: Availability controls require DoS mitigation (CC6.1, CC7.2)
- ISO 27001: A.12.2.1 (Controls against malware) extends to resource exhaustion
- PCI DSS: Availability and secure software development requirements extend to payment-adjacent LLMs
- Cloud provider SLAs: Excessive cost spikes may trigger usage reviews or account suspension
Document your model DoS controls in security policies and runbooks. Auditors increasingly ask about AI-specific threats.
Future Threat Landscape
As LLMs scale and proliferate, expect evolution in model DoS tactics:
- Multi-modal attacks: Adversarial images/audio that maximize vision/speech model compute
- Distributed campaigns: Botnets issuing coordinated low-volume sponge queries
- Model-specific exploits: Attacks tailored to architecture quirks (e.g., GPT vs. Claude vs. Gemini)
- Supply chain targeting: Poisoning third-party RAG corpuses to inject expensive retrieval queries
Stay current with OWASP Top 10 for LLMs and MITRE ATLAS for emerging attack patterns.
Key Takeaways
- Traditional DoS defenses don't work: Model DoS succeeds with low request volume and bypasses WAFs
- Instrumentation is mandatory: You can't defend what you can't measure — log token counts and costs per request
- Defense in depth: Combine input validation, generation controls, budget caps, and anomaly detection
- Budget limits are non-negotiable: Every API key must have a daily/monthly cost cap
- Monitor cost distributions: Focus on p95/p99 metrics, not just averages
- Prepare incident response: Have a playbook ready before the first attack hits
References
- Shumailov et al. (2021) - Sponge Examples: Energy-Latency Attacks on Neural Networks
- OWASP Top 10 for Large Language Model Applications
- MITRE ATLAS AML.T0029 - Denial of ML Service
- NIST SP 800-207 - Zero Trust Architecture
- NIST AI Risk Management Framework
- AWS API Gateway Throttling Documentation
- Redis INCR Command for Rate Limiting