
Auditing Your AI Stack for Single-Provider Lock-In After Anthropic’s Access Cut
Why Anthropic’s access cut is a lock-in test, not just a news item
The report says Anthropic cut access to its AI models after a U.S. national security order. I am not getting into the politics of that decision here. I care about the failure shape, because it is the same shape you have to defend against in your own stack.
If a provider can deny access for policy, legal, regional, billing, or account reasons, then it is not just an API you call. It is a hard dependency in your production graph. For many AI products, that dependency reaches beyond the chat endpoint. It can include embeddings, moderation, reranking, batch evaluation jobs, and the tool-use layer between your app and the model.
That is why this kind of event is a lock-in test. It asks a blunt question: if one model vendor disappears from your allowed path today, what actually breaks?
What changed in practice and why model access is part of your production dependency graph
In a classic web stack, external services are expected to fail. Payment processors go down. Email gets delayed. Feature flags misbehave. The answer is retries, fallbacks, queues, and clear error handling.
A lot of AI stacks are not built that way yet. They assume the model provider will always be reachable, always allowed, and always returning the same output shape. That is a risky assumption because access is not only a network issue. It is also a policy issue.
From a systems view, model access sits beside any other third-party dependency:
- your app sends requests to it
- your app makes business decisions from its responses
- your app may transform those responses into downstream actions
- your app may batch critical jobs through it overnight
- your app may cache responses in formats that assume one vendor’s tokenizer, schema, or stop behavior
So when access is cut, the blast radius can include live customer traffic, background processing, and internal workflows.
The failure mode to watch for: hidden assumptions that one provider will always stay available
The real risk is not that one provider fails once. The risk is that your architecture bakes in assumptions that only hold while that provider is reachable.
I usually see this in three places:
- the application assumes one model name and one response format forever
- the router treats a “primary” as if it were permanent
- downstream code assumes provider-specific features are universal, like tool calling, citations, or structured JSON
Once those assumptions exist, failover gets harder. You are no longer swapping a backend. You are rewriting expectations.
Inventory every place your stack depends on a single model provider
Before you design fallback, you need an inventory. I would not start with prompts. I would start with traffic paths.
Map direct SDK calls, proxy layers, and agent frameworks separately
Do not treat “we use Anthropic” as one dependency. Split it into layers.
- Direct SDK calls from app services
- A proxy or router service that fans out to one or more providers
- Agent frameworks that hide provider calls behind abstractions
- Internal batch jobs and cron tasks
- Client-side calls, if any, that reach a provider directly
This matters because each layer fails differently. A direct SDK call is easy to find but hard to reroute. A router is easier to control but can become its own single point of failure. An agent framework can make failover look easy while burying provider-specific coupling in the prompt or tool schema.
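A simple inventory table helps. I like to track it like this. The rows are illustrative: fill the Provider column with whichever vendor actually serves each workload (at the time of writing, Anthropic does not offer a first-party embeddings model, so that row would normally name another vendor):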
| Workload | Provider | Call path | Output contract | Fallback exists? | Notes |
|---|---|---|---|---|---|
| Chat | Anthropic | app server → SDK | markdown + citations | partial | parser expects XML tags |
| Embeddings | Anthropic | batch worker → SDK | vector[] | no | cached by model name |
| Moderation | Anthropic | API gateway → router | boolean + score | yes | threshold differs by provider |
| Reranking | Anthropic | search service → SDK | ranked list | no | latency sensitive |
| Eval jobs | Anthropic | CI worker → SDK | JSON metrics | yes | non-prod only |
The table is not the point. The point is to show where “one provider” really means five separate workloads.
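If the inventory lives in code or config rather than a wiki page, you can query it. Here is a minimal sketch, assuming a hypothetical `Workload` record and an `exposed_workloads` helper (neither comes from any library); the rows simply mirror the illustrative table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    provider: str
    call_path: str
    fallback: str  # "yes", "partial", or "no"


# Illustrative rows copied from the table; replace them with your own inventory.
INVENTORY = [
    Workload("chat", "anthropic", "app server -> SDK", "partial"),
    Workload("embeddings", "anthropic", "batch worker -> SDK", "no"),
    Workload("moderation", "anthropic", "API gateway -> router", "yes"),
    Workload("reranking", "anthropic", "search service -> SDK", "no"),
    Workload("eval jobs", "anthropic", "CI worker -> SDK", "yes"),
]


def exposed_workloads(inventory, provider):
    """Return names of workloads on `provider` that lack a full fallback."""
    return [
        w.name
        for w in inventory
        if w.provider == provider and w.fallback != "yes"
    ]


if __name__ == "__main__":
    print(exposed_workloads(INVENTORY, "anthropic"))
```

Treating "partial" as exposed is a deliberate choice here: a fallback that only covers some paths should still show up in the audit.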
Trace where provider-specific features leak into prompts, tools, and response parsing
Lock-in usually sneaks in through details that look harmless.
For example:
- prompts that tell the model to use a vendor-specific citation format
- tool definitions that match one provider’s schema quirks
- response parsers that expect a particular JSON wrapper or tag style
- stop sequences tuned for a single model family
- retries that assume one provider’s timeout behavior
If you switch providers later, these are the first places to break. I have seen teams think they are portable because they call a generic generate() function. Then they discover the prompt says “return output in Claude XML blocks,” the parser strips those blocks, and the whole thing falls apart when another model returns clean JSON.
Build a dependency table for chat, embeddings, moderation, reranking, and eval jobs
I recommend grouping AI workloads by function, not by vendor. A dependency table should make that obvious:
| Function | Business criticality | Portability | Typical hidden coupling |
|---|---|---|---|
| Chat | High | Medium | prompt style, citations, tool calls |
| Embeddings | High | Medium | vector dimension, normalization |
| Moderation | High | High | threshold tuning, labels |
| Reranking | Medium | Medium | ranking format, latency budget |
| Eval jobs | Low | High | output schema, cost assumptions |
This gives you a migration order. Moderation and eval jobs are often easier to move first. Chat is usually the hardest because it carries the most user-visible behavior and the most prompt coupling.
Identify the brittle seams in a modern AI architecture
Once you know where the dependency lives, look for the seams that make it brittle.
Model name pinning and version drift
A lot of stacks pin to a model name because it feels stable. But the name is not the same thing as behavior stability.
The usual problems are:
- the model name changes but the prompt stays the same
- a fallback model has a similar name but different context limits
- the tokenizer changes token counts and truncation behavior
- output style drifts enough to break downstream parsers
If your code assumes a specific context window, token limit, or response prefix, you are not really depending on an LLM. You are depending on one model version.
I would test for this by comparing:
- maximum context length
- tool-call formatting
- newline and whitespace sensitivity
- refusal behavior
- citation or source formatting
- JSON validity rate under the same prompt set
Tool-call schema coupling and structured-output assumptions
Tool use is where many portable AI stacks stop being portable.
One provider may return a nested tool request object. Another may encode the same idea differently. Your application may quietly rely on details like:
- function name casing
- argument ordering
- whether arguments are strings or parsed objects
- how the model signals “I need another step”
- whether multiple tool calls can happen in one turn
The fix is not to hope the vendor abstraction handles everything. The fix is to define your own tool contract and validate every model response against it.
A good internal contract is stricter than any single provider’s API:

    {
      "type": "object",
      "required": ["action", "arguments"],
      "properties": {
        "action": { "type": "string" },
        "arguments": { "type": "object" }
      },
      "additionalProperties": false
    }

If the model cannot satisfy that schema, reject the response and retry through a safer path. Do not let malformed tool output leak into production state.
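To make "reject the response" concrete, here is a small sketch that enforces the contract above by hand using only the standard library. `ContractError` and `parse_tool_call` are hypothetical names of my own; in a real codebase you might prefer a JSON Schema validator library instead.

```python
import json


class ContractError(ValueError):
    """Raised when model output does not satisfy the internal tool contract."""


ALLOWED_KEYS = {"action", "arguments"}


def parse_tool_call(raw: str) -> dict:
    """Parse model output and validate it against the internal contract."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ContractError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ContractError("top-level value must be an object")
    missing = ALLOWED_KEYS - set(data)
    if missing:
        raise ContractError(f"missing required keys: {sorted(missing)}")
    extra = set(data) - ALLOWED_KEYS
    if extra:
        # Mirrors "additionalProperties": false in the schema.
        raise ContractError(f"unexpected keys: {sorted(extra)}")
    if not isinstance(data["action"], str):
        raise ContractError("'action' must be a string")
    if not isinstance(data["arguments"], dict):
        # Catches providers that return arguments as a JSON-encoded string.
        raise ContractError("'arguments' must be an object")
    return data
```

The caller catches `ContractError` and decides whether to retry, fail over, or return a controlled error; nothing unvalidated reaches tool execution.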
Latency budgets, retry logic, and rate-limit behavior that only fit one vendor
Provider-specific latency can lock you in just as hard as prompt syntax.
Look for:
- retry windows tuned to one provider’s typical p95
- backoff intervals that are too aggressive for another provider
- streaming assumptions that rely on one vendor’s chunk cadence
- concurrency limits that only work under one quota policy
- budget logic that treats one vendor as “cheap enough” for everything
If your retry policy is tied to a specific error mix, switching providers can create self-inflicted overload. A different provider might return fewer 429s but more schema failures. If your retry logic treats all errors the same, you will amplify the wrong problem.
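One way to avoid treating all errors the same is to classify each failure before deciding what to do next. The sketch below is an assumption-heavy illustration: the `Action` names, the status-code groupings, and the backoff defaults are mine, and real providers may signal these conditions differently.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    ACCEPT = "accept"
    RETRY_SAME = "retry_same_provider_with_backoff"
    FAIL_OVER = "fail_over_to_next_provider"
    REPAIR_OUTPUT = "retry_with_narrower_schema"


def next_action(status: Optional[int], schema_valid: bool = True) -> Action:
    """Map one provider response to a next step.

    `status` is the HTTP status code, or None for a timeout or network error.
    """
    if status is None or status == 429 or status >= 500:
        return Action.RETRY_SAME      # transient: back off and try again
    if status in (401, 403):
        return Action.FAIL_OVER       # access problem: retrying will not help
    if status == 200 and not schema_valid:
        return Action.REPAIR_OUTPUT   # provider is fine, the output is not
    if status == 200:
        return Action.ACCEPT
    return Action.FAIL_OVER           # other 4xx: treated as non-retryable here


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 8.0) -> float:
    """Capped exponential backoff in seconds; add jitter in real use."""
    return min(cap, base * (2 ** attempt))
```

The point of the split is that a schema failure never consumes the same retry budget as a rate limit, so switching providers does not quietly change your load pattern.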
State, caching, and conversation memory tied to provider-specific formats
Conversation state is another common trap.
Teams often cache:
- raw model messages
- tool calls
- chain-of-thought-like metadata they should not persist
- provider-specific message envelopes
- embeddings keyed to a single model version
That works until you change providers. Then your cache keys are wrong, your stored messages are unreadable, or your state replay logic depends on fields another model never returns.
A safer pattern is to store normalized conversation state in your own format, then adapt it at the edge of the provider call.
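A rough sketch of that pattern follows. `Turn` is a hypothetical internal record, and the two adapter functions show simplified request shapes (one with the system prompt as a separate field, one with system turns inline); they are illustrations, not exact payloads for any specific vendor API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    role: str  # "system", "user", or "assistant"
    text: str


def to_system_field_style(turns):
    """Adapter for APIs that take the system prompt as a separate field."""
    system = "\n".join(t.text for t in turns if t.role == "system")
    messages = [
        {"role": t.role, "content": t.text}
        for t in turns
        if t.role != "system"
    ]
    return {"system": system, "messages": messages}


def to_inline_system_style(turns):
    """Adapter for APIs that accept system turns inside the message list."""
    return {"messages": [{"role": t.role, "content": t.text} for t in turns]}


if __name__ == "__main__":
    history = [Turn("system", "Be brief."), Turn("user", "Hi")]
    print(to_system_field_style(history))
    print(to_inline_system_style(history))
```

Only `Turn` records get persisted. The provider envelope is rebuilt on every call, so a provider change means writing one new adapter rather than migrating stored conversations.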
Audit your routing layer the way you would audit a payment gateway
If model access affects revenue or operations, your routing layer deserves the same care as payments.
Compare direct-to-provider calls with a centralized router or abstraction layer
Direct calls are simpler at first. A centralized router is more work, but it gives you control.
| Approach | Pros | Cons |
|---|---|---|
| Direct-to-provider | simple, low latency, easy to ship | hard to fail over, duplicated logic |
| Central router | policy control, observability, failover | more moving parts, needs testing |
| Managed abstraction | fast to adopt, less code | hidden behavior, vendor-shaped limits |
The router is often the right answer if you treat it like infrastructure and not just a convenience wrapper. It should know why a request goes to a given provider, not just where to send it by default.
Check whether failover is real, partial, or only enabled in non-production paths
I have seen teams claim they have failover because a test flag exists in staging. That is not failover.
Verify these paths separately:
- production chat traffic
- background batch jobs
- embedding generation
- moderation checks
- manual admin workflows
- incident-mode overrides
If failover only works for one path, document that clearly. Partial failover is still useful, but pretending it is full failover gives you false confidence.
Verify that routing decisions are based on policy, not hardcoded defaults
A hardcoded default is lock-in with nicer naming.
Your routing layer should be able to say:
- send high-context requests to provider A
- send structured JSON tasks to provider B
- route batch jobs to the cheapest compatible provider
- stop routing to a provider when policy or access changes
- preserve a manual override for incident response
A simple policy object might look like this:

    routes:
      chat:
        primary: anthropic
        fallback: openai
        policy: capability_based
      embeddings:
        primary: openai
        fallback: local
        policy: cost_and_latency
      moderation:
        primary: anthropic
        fallback: local_rules
        policy: safety_first

The key is that the decision is explicit. You want to be able to explain why a request went somewhere else.
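To show what "explicit" can mean in code, here is a sketch of a router that returns both a provider and the reason it was chosen. The `ROUTES` dict restates the policy object above as a Python literal; `choose_provider`, the `blocked` set, and the `override` argument are hypothetical names for illustration.

```python
ROUTES = {
    "chat": {"primary": "anthropic", "fallback": "openai",
             "policy": "capability_based"},
    "embeddings": {"primary": "openai", "fallback": "local",
                   "policy": "cost_and_latency"},
    "moderation": {"primary": "anthropic", "fallback": "local_rules",
                   "policy": "safety_first"},
}


def choose_provider(workload, blocked=frozenset(), override=None):
    """Return (provider, reason) so every routing decision is explainable."""
    route = ROUTES[workload]
    if override is not None:
        # Manual override for incident response wins over policy.
        return override, "manual_override"
    for slot in ("primary", "fallback"):
        provider = route[slot]
        if provider not in blocked:
            return provider, f"{slot}:{route['policy']}"
    raise RuntimeError(f"no allowed provider for workload {workload!r}")


if __name__ == "__main__":
    print(choose_provider("chat"))
    print(choose_provider("chat", blocked={"anthropic"}))
```

The `blocked` set is where an access change lands: when a provider stops being allowed, you update one set instead of hunting for hardcoded defaults.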
Design fallback paths that survive a provider outage or access cutoff
Fallback is not one thing. Different workloads need different patterns.
Primary-secondary routing versus parallel hot standby
Primary-secondary routing is the usual starting point. One provider handles traffic, another is ready when the first fails.
Parallel hot standby is stronger. You keep both paths warm enough that switching is fast and predictable.
The tradeoff is cost and complexity. Hot standby means more testing, more integration work, and more operational discipline. But it is often worth it for customer-facing paths.
A practical rule:
- use primary-secondary for low-risk or internal workloads
- use parallel standby for critical user-facing interactions
- use queue-based fallback for batch jobs that can wait
Capability-based fallback instead of one-to-one model replacement
Do not think in terms of a “Claude replacement,” a “GPT replacement,” or an exact model swap.
Think in terms of capabilities:
- long context
- tool use
- structured JSON
- low latency
- low cost
- safe moderation
- batch throughput
A fallback provider does not need to match everything. It needs to match the capability the task actually needs. That is usually a narrower requirement than teams assume.
For example:
- if the task is summarization, a smaller model may be good enough
- if the task is moderation, rule-based or local classification may be enough
- if the task is tool orchestration, the fallback must preserve schema discipline
- if the task is document reasoning, context length matters more than raw model size
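Capability matching can be sketched as a set comparison. Everything in this example is made up for illustration: the provider names, the capability tags, and the task requirements are placeholders you would replace with your own measurements.

```python
# Hypothetical capability tags per provider.
PROVIDERS = {
    "provider_a": {"long_context", "tool_use", "structured_json"},
    "provider_b": {"structured_json", "low_latency", "low_cost"},
    "local_rules": {"safe_moderation", "low_latency"},
}

# What each task actually needs, which is usually a narrow set.
TASK_NEEDS = {
    "summarization": {"low_cost"},
    "moderation": {"safe_moderation"},
    "tool_orchestration": {"tool_use", "structured_json"},
    "document_reasoning": {"long_context"},
}


def candidates(task, unavailable=frozenset()):
    """List providers that cover every capability the task requires."""
    needs = TASK_NEEDS[task]
    return [
        name
        for name, caps in PROVIDERS.items()
        if name not in unavailable and needs <= caps
    ]


if __name__ == "__main__":
    print(candidates("summarization"))
    print(candidates("tool_orchestration", unavailable={"provider_a"}))
```

An empty result is useful information. It tells you that losing one provider leaves that task with no compatible fallback, which is exactly what the audit is trying to surface.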
Graceful degradation for long-context, tool use, and structured generation
You need separate degraded modes.
For long-context tasks:
- shrink the context window
- retrieve fewer documents
- summarize prior turns before retrying
- return a partial answer with a clear notice if needed
For tool use:
- disable high-risk side effects
- fall back to read-only tool paths
- queue the action for human review if the tool contract cannot be honored
For structured generation:
- validate output strictly
- retry with a narrower schema
- return a controlled error if the schema cannot be met
A graceful degraded mode is better than silent corruption. A wrong JSON object can be much worse than a visible failure.
When to return a controlled error instead of silently switching models
Automatic failover is not always the right call.
There are cases where silent fallback is the wrong move:
- legal or policy-sensitive responses
- customer-facing outputs where consistency matters more than availability
- actions with financial or security impact
- workflows where a lower-capability model would create misleading output
In those cases, the safest response may be: “This path is temporarily unavailable.” That is not a product failure. That is an honest system boundary.
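Here is a minimal sketch of that boundary. The exception names, the `NO_SILENT_FALLBACK` set, and the `complete` function are all hypothetical; the request classes are examples, not a recommended taxonomy.

```python
class ProviderDenied(Exception):
    """The primary provider refused the request (policy, account, or region)."""


class PathUnavailable(Exception):
    """Controlled error surfaced to the caller instead of a silent model swap."""


# Hypothetical request classes where a silent swap is not acceptable.
NO_SILENT_FALLBACK = {"legal", "financial_action", "security_action"}


def complete(request_class, prompt, primary, fallback):
    """Call `primary`; fall back only when the request class allows it."""
    try:
        return primary(prompt)
    except ProviderDenied:
        if request_class in NO_SILENT_FALLBACK:
            raise PathUnavailable(
                "This path is temporarily unavailable."
            ) from None
        return fallback(prompt)
```

Raising with `from None` keeps the provider-level error out of the exception chain that callers see, while the router can still log it internally before re-raising.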
Test the system with failure injection and contract checks
If you do not test failure, you do not know whether fallback exists.
Simulate hard failures, quota exhaustion, denied access, and policy blocks
Do not just simulate timeouts. Simulate the failure mode that matters here: access denial.
Test:
- HTTP 403 or equivalent access denied
- account suspended
- quota exhausted
- policy block on a specific request class
- model removed from allowed list
- region-restricted access
- malformed provider response after a partial success
You want to know what happens at each layer:
- Does the router detect the error?
- Does it retry the same provider unnecessarily?
- Does it switch providers?
- Does it preserve the user-visible contract?
- Does it leak internal provider details to the client?
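Those questions can be turned into a test with a fake provider that raises on demand. This is a sketch under my own assumptions: `AccessDenied`, `FakeProvider`, and `route` are hypothetical stand-ins for your real client and router.

```python
class AccessDenied(Exception):
    """Simulated 403-style denial from a provider."""


class FakeProvider:
    """Test double that either answers or raises a configured error."""

    def __init__(self, name, error=None):
        self.name = name
        self.error = error
        self.calls = 0

    def complete(self, prompt):
        self.calls += 1
        if self.error is not None:
            raise self.error
        return f"[{self.name}] ok"


def route(prompt, providers):
    """Try providers in order; never retry one that denied access."""
    for provider in providers:
        try:
            return provider.complete(prompt)
        except AccessDenied:
            continue  # denial is not transient, so move on immediately
    # Generic message: do not leak which provider failed or why.
    return "This feature is temporarily unavailable."


if __name__ == "__main__":
    primary = FakeProvider("primary", error=AccessDenied("403"))
    secondary = FakeProvider("secondary")
    print(route("hello", [primary, secondary]))
    print(primary.calls)  # the denied provider was called exactly once
```

The call counter answers the "does it retry the same provider unnecessarily?" question directly, and the final message answers the leak question.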
Use golden prompts to compare output shape across providers
Golden prompts are useful when you care about output shape more than exact wording.
I keep a small suite of prompts that exercise:
- short answer
- long-context answer
- tool call
- JSON response
- refusal behavior
- citation formatting
Run them through each provider and compare:
- schema validity
- token length
- completion time
- truncation behavior
- safety or refusal differences
- downstream parser success
You are not trying to prove outputs are identical. You are trying to prove your system can survive the differences.
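A golden-prompt harness can be quite small if it only checks shape. The sketch below assumes each provider is wrapped as a plain callable that takes a prompt and returns text; `check_json_shape` and `run_golden_suite` are hypothetical helpers, and the stub providers stand in for real API calls.

```python
import json


def check_json_shape(output, required_keys):
    """True if `output` is a JSON object containing all required keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and set(required_keys) <= set(data)


def run_golden_suite(providers, cases):
    """Return the pass rate per provider.

    providers: name -> callable(prompt) -> str
    cases: list of (prompt, required_keys) pairs
    """
    report = {}
    for name, call in providers.items():
        passed = sum(
            check_json_shape(call(prompt), keys) for prompt, keys in cases
        )
        report[name] = passed / len(cases)
    return report


if __name__ == "__main__":
    cases = [("Return JSON with key 'answer'.", {"answer"})]
    stubs = {
        "clean_json": lambda prompt: '{"answer": "42"}',
        "chatty": lambda prompt: 'Sure! {"answer": "42"}',
    }
    print(run_golden_suite(stubs, cases))
```

The "chatty" stub shows the kind of difference that matters: the content is right, but the wrapper text would break a strict downstream parser.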
Assert invariants for JSON schema, citation format, tool arguments, and safety filters
The parser should enforce invariants at the boundary.
Examples:
- JSON must parse on the first pass or be rejected
- citations must match an expected format
- tool arguments must match a schema before execution
- unsafe content must be filtered before any side effect
These checks belong in automated tests, not just runtime code. A contract test should fail when the model output changes enough to break your integration.
Measure what breaks first: accuracy, latency, or downstream parser assumptions
When you test fallback, watch the first failure.
Usually it is one of three things:
- accuracy drops but the pipeline still works
- latency crosses a user-facing threshold
- the parser fails because the new provider does not speak the same output dialect
That ordering tells you where to invest. If parsers fail first, simplify the contract. If latency fails first, adjust budgets and queueing. If accuracy fails first, separate tasks by capability.
Rework prompts and tool contracts so they are portable
Portability is easier to achieve in prompt design than in post hoc migration.
Remove provider-specific wording that nudges one model into special behavior
A lot of prompts quietly encode vendor behavior.
Examples include instructions like:
- “use the provider’s XML format”
- “follow the model’s default citation style”
- “respond with the exact assistant tag”
- “rely on the model’s native function call format”
Those phrases create brittle expectations. Rewrite prompts so they describe your desired output, not the vendor’s.
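A cheap first pass is to scan prompt templates for wording that names a vendor or a vendor-shaped format. The pattern list below is a hypothetical starting point, not a complete detector; extend it with phrases from your own prompts.

```python
import re

# Hypothetical patterns that hint at provider-specific wording.
VENDOR_PATTERNS = [
    re.compile(r"\bclaude\b", re.IGNORECASE),
    re.compile(r"\bgpt\b", re.IGNORECASE),
    re.compile(r"\bxml (blocks?|tags?)\b", re.IGNORECASE),
    re.compile(r"native function call", re.IGNORECASE),
]


def find_vendor_coupling(prompt: str) -> list[str]:
    """Return matched substrings that suggest provider-specific wording."""
    hits = []
    for pattern in VENDOR_PATTERNS:
        match = pattern.search(prompt)
        if match:
            hits.append(match.group(0))
    return hits


if __name__ == "__main__":
    print(find_vendor_coupling("Return output in Claude XML blocks."))
    print(find_vendor_coupling("Return a JSON object with an 'answer' key."))
```

A check like this fits well in CI, so a new prompt that mentions a vendor by name gets flagged at review time rather than during a migration.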
Keep system prompts modular so routing can swap policies without rewriting everything
I like to split prompts into modules:
- role and policy
- task instructions
- output schema
- safety constraints
- tool-use rules
That makes it easier to swap providers or routes without reauthoring the whole prompt. It also makes testing cleaner because you can compare modules separately.
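As a sketch of what modular can look like, the prompt below is assembled from named parts, and a route can override one part without touching the rest. The module texts, `DEFAULT_ORDER`, and `build_system_prompt` are illustrative assumptions.

```python
# Hypothetical module texts; each one can be tested and versioned separately.
PROMPT_MODULES = {
    "role": "You are a support assistant for an internal tool.",
    "task": "Answer the user's question using only the supplied context.",
    "output_schema": 'Respond with a JSON object: {"answer": string}.',
    "safety": "Decline requests for credentials or personal data.",
    "tool_rules": "Request at most one tool call per turn.",
}

DEFAULT_ORDER = ["role", "task", "output_schema", "safety", "tool_rules"]


def build_system_prompt(modules, order=DEFAULT_ORDER, overrides=None):
    """Join prompt modules in order, applying per-route overrides."""
    merged = {**modules, **(overrides or {})}
    return "\n\n".join(merged[name] for name in order)


if __name__ == "__main__":
    # A route that cannot rely on JSON output swaps only the schema module.
    print(build_system_prompt(
        PROMPT_MODULES,
        overrides={"output_schema": "Respond in plain text."},
    ))
```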
Define strict output schemas and reject malformed responses early
If you want portability, strictness helps.
A strict schema gives you:
- predictable downstream parsing
- easier provider comparison
- clearer failure handling
- less accidental vendor coupling
The tradeoff is that you may reject more outputs at first. That is fine. Rejection at the boundary is cheaper than corrupted state later.
Harden observability so lock-in shows up before an outage does
You cannot manage provider concentration if you cannot see it.
Log provider, model, route decision, retry count, and fallback reason on every request
Every AI request should carry a trace of how it was handled:
- provider
- model
- route policy
- selected fallback
- retry count
- error code
- fallback reason
- request class
- user or workload type
That log is what lets you answer: “How much of our traffic depends on one provider?” and “What happened when access changed?”
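One way to keep that trace consistent is to refuse to log a request unless every field is present. This sketch uses the standard `logging` and `json` modules; `REQUIRED_FIELDS` and `log_ai_request` are hypothetical names, and the field names are my rendering of the list above.

```python
import json
import logging

REQUIRED_FIELDS = {
    "provider", "model", "route_policy", "selected_fallback",
    "retry_count", "error_code", "fallback_reason",
    "request_class", "workload",
}


def log_ai_request(logger: logging.Logger, **fields) -> str:
    """Emit one JSON log line per request; fail loudly on missing fields."""
    missing = REQUIRED_FIELDS - set(fields)
    if missing:
        raise ValueError(f"missing trace fields: {sorted(missing)}")
    line = json.dumps(fields, sort_keys=True)
    logger.info(line)
    return line


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    log_ai_request(
        logging.getLogger("ai.requests"),
        provider="anthropic", model="example-model",
        route_policy="capability_based", selected_fallback=None,
        retry_count=0, error_code=None, fallback_reason=None,
        request_class="chat", workload="support",
    )
```

Fields that do not apply are logged as `null` rather than omitted, so downstream queries never have to guess whether a missing key means "no fallback" or "not recorded".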
Build dashboards for vendor concentration, error spikes, and fallback frequency
Useful dashboards include:
- percentage of traffic by provider
- fallback rate by workload
- access-denied rate by provider
- parser failure rate by model
- p95 latency by route
- batch job completion delay during fallback
If you only monitor uptime, you will miss policy blocks and subtle degradation.
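The first two dashboard numbers fall straight out of per-request log records. Here is a sketch that assumes each record is a dict with `provider`, `workload`, and `fallback_reason` keys; the function names are hypothetical.

```python
from collections import Counter, defaultdict


def provider_share(records):
    """Fraction of requests handled by each provider."""
    counts = Counter(r["provider"] for r in records)
    total = sum(counts.values())
    return {p: n / total for p, n in counts.items()} if total else {}


def fallback_rate_by_workload(records):
    """Fraction of requests per workload that used a fallback path."""
    totals, fallbacks = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["workload"]] += 1
        if r.get("fallback_reason"):
            fallbacks[r["workload"]] += 1
    return {w: fallbacks[w] / totals[w] for w in totals}


if __name__ == "__main__":
    sample = [
        {"provider": "anthropic", "workload": "chat", "fallback_reason": None},
        {"provider": "anthropic", "workload": "chat", "fallback_reason": None},
        {"provider": "anthropic", "workload": "embeddings",
         "fallback_reason": None},
        {"provider": "openai", "workload": "chat",
         "fallback_reason": "access_denied"},
    ]
    print(provider_share(sample))
    print(fallback_rate_by_workload(sample))
```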
Add alerts for access-denied responses that look like policy or account changes rather than outages
A policy block often looks different from an outage in telemetry. It may show up as:
- immediate authorization failure
- repeated denial on all requests from one account
- region-specific rejection
- model access revoked while the network is healthy
Alert on those patterns separately. You want to know whether a provider is down or whether your access path has changed.
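A rough way to separate the two cases is to look at the mix of recent status codes for one provider. The thresholds and labels in this sketch are assumptions of mine and would need tuning against real traffic; `classify_incident` is a hypothetical helper.

```python
def classify_incident(statuses, denied_threshold=0.9, outage_threshold=0.5):
    """Label a window of recent responses from one provider.

    statuses: HTTP status codes, with None meaning no response at all.
    """
    if not statuses:
        return "no_data"
    total = len(statuses)
    denied = sum(1 for s in statuses if s in (401, 403))
    unreachable = sum(1 for s in statuses if s is None or s >= 500)
    if denied / total >= denied_threshold:
        # The network answers, but nearly everything is refused.
        return "access_change"
    if unreachable / total >= outage_threshold:
        return "outage"
    if denied or unreachable:
        return "degraded"
    return "healthy"


if __name__ == "__main__":
    print(classify_incident([403] * 10))
    print(classify_incident([None] * 6 + [200] * 4))
```

An `access_change` label should page a different person than an `outage` label: the first is usually an account, legal, or policy conversation, while the second is a wait-and-retry situation.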
Create a migration plan for reducing concentration risk over time
You do not reduce lock-in by declaring it. You reduce it by staging the work.
Classify workloads by portability: low-risk, medium-risk, and provider-entangled
Start with a simple classification:
- low-risk: moderation, eval jobs, internal experiments
- medium-risk: embeddings, summarization, reranking
- provider-entangled: complex chat, tool orchestration, customer-facing generation
Move the low-risk workloads first. They build operational confidence without threatening the core product.
Separate experimental features from customer-facing critical paths
This matters more than teams admit.
If experimental AI features share the same provider, prompts, and contract as production paths, a migration can put the whole product at risk. Keep the experiment stack isolated enough that you can swap it or lose it without affecting the critical path.
Keep an exit checklist for switching embeddings, chat, and batch workflows independently
The exit plan should be concrete. For each workload, record:
- what data must be migrated
- what schema changes are needed
- what new tests must pass
- what cache invalidation is required
- what observability changes are needed
- what manual approvals are needed before cutover
You want independent exits. If embeddings can move while chat stays put, that is a real reduction in concentration risk.
What a resilient AI stack looks like after the cut
The lesson from Anthropic’s access cut is not “never use a major provider.” The lesson is that access is a policy surface, not a promise.
A practical checklist for architecture, testing, and vendor governance
A resilient stack usually has these properties:
- a centralized routing policy
- normalized internal prompt and tool contracts
- provider-agnostic state formats
- strict schema validation at the boundary
- tested fallback for critical workloads
- failure injection for access denial and quota exhaustion
- dashboards for concentration and fallback use
- an exit plan by workload, not by vendor logo
If your architecture can answer “what happens if this provider disappears?” without hand-waving, you are in much better shape.
The minimum controls I would expect before trusting one provider in production
If I were reviewing an AI stack for production readiness, I would want at least this:
- An inventory of every provider dependency.
- A routing layer with explicit policy decisions.
- Tested fallback for user-facing and batch workloads.
- Strict parsing and schema validation.
- Monitoring for access-denied and policy-block events.
- Separate handling for critical and experimental paths.
- A migration plan for each major workload class.
Without those controls, “we can switch later” is mostly a story we tell ourselves.
Further Reading: official provider docs, incident reports, and routing-layer references
References for this kind of audit sit on three different layers: provider behavior, AI security risk, and routing abstractions. If you are auditing for lock-in, you need all three.


