Testing Anthropic's Claude Fable 5 for Automated Detection Rule Generation

On June 14, public reporting said that Anthropic was expanding its AI cybersecurity models with Claude Fable 5 and Mythos 5. The headline is worth a look, but the narrower question is the one that matters for defenders: can a model turn messy incident notes into detection content that survives linting, replay, and analyst review?

That is the test I care about. Not whether the model sounds smart, and not whether it can write a convincing paragraph about threat hunting. The bar is much higher: can it draft a rule that maps to real telemetry, avoid inventing fields, keep false positives under control, and give a human enough context to ship it safely?

What Anthropic’s June 14 announcement changes for detection engineering

What the reporting actually says about Claude Fable 5 and Mythos 5

The source reporting is sparse, and that matters. It says Anthropic expanded its AI cybersecurity models with Claude Fable 5 and Mythos 5. It does not include a full benchmark suite, a public red-team review, or a detailed breakdown of where each model works best.

So treat the announcement as a signal, not proof. A model family can be positioned for cybersecurity and still miss the practical test that detection teams run every day: turning noisy observations into a deployable rule.

For detection engineering, the only output that matters is something you can validate against:

known-good logs
known-bad logs
backend schema
your SIEM’s syntax
your analysts’ patience

Why model announcements matter only if the output survives validation

I have seen too many AI-generated detections that look fine on the page and fail everywhere that counts.

The usual failure chain looks like this:

The model invents a field name that does not exist in your source.
The rule passes casual review because the logic sounds right.
The query fails to compile on any backend.
Someone “fixes” it by weakening the selectors.
The rule now alerts on half the environment.

That is why model announcements only matter if the output survives validation. A good model can shorten the first draft. It cannot remove the need for a compiler, a replay harness, and a human who knows which telemetry sources are real.

Test setup: defining the rule-generation problem before touching the model

Choosing a target format: Sigma, Splunk SPL, KQL, or plain logic

I usually start by forcing the model to draft in a format that is easy to inspect and easy to translate. For a cross-platform workflow, Sigma is usually the best first stop because it keeps the detection logic readable and fairly portable.

Here is how I think about the options:

Format	Best use	Risk
Sigma	Portable draft detection logic	Backend field mappings can still break it
Splunk SPL	Direct use in Splunk-heavy shops	Easy to overfit to Splunk-specific field names
KQL	Microsoft-centric telemetry and hunting	Query looks good even when normalization is weak
Plain logic	Early-stage reasoning and analyst discussion	Not deployable until translated

My rule is simple: ask the model for one canonical format, then translate only after the logic has been reviewed.

If you ask for three backends at once, the model often smooths away details that should stay visible. That makes validation harder later.

Building a safe benchmark corpus with benign, noisy, and adversarial examples

A model for detection rule generation should not be judged only on obvious attack patterns. That test is too easy. I prefer a small corpus with three classes:

benign examples: normal admin behavior, software deployment, scheduled maintenance
noisy examples: legitimate but unusual actions that often trigger detections
adversarial examples: the behaviors the rule is supposed to catch

The key is to keep the examples safe and scoped. You do not need live payloads or destructive commands. You need enough structure to test whether the model understands the pattern.

A practical benchmark file might include:

Case type	Example behavior	Expected outcome
Benign	An endpoint management tool launches a shell for updates	No alert or a filtered alert
Noisy	An admin runs a script with long command-line arguments	Alert only if other suspicious fields line up
Adversarial	A user shell spawns a second shell with encoded arguments	Alert with medium or high confidence

The corpus should also include edge cases:

alternate parent processes
renamed binaries
different field naming conventions
missing command-line fields
null values
partially normalized events

If the model’s rule only works when every field is perfect, it is not a production rule. It is a lab demo.
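One way to keep a corpus like this reviewable is to store each case as a small labeled record. The sketch below is only an illustration: the `Case` class, the `expected` values, and the field names (`Image`, `ParentImage`, `CommandLine`) are assumptions, and the tool path is a placeholder.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Case:
    name: str
    label: str      # "benign", "noisy", or "adversarial"
    event: dict     # one normalized event; field names are assumptions
    expected: str   # "alert", "no_alert", or "insufficient_telemetry"


PS = r"C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe"
CMD = r"C:\Windows\System32\cmd.exe"

CORPUS = [
    Case("patch-tool-shell", "benign",
         {"Image": CMD, "ParentImage": r"C:\Tools\your-endpoint-tool.exe",
          "CommandLine": "cmd.exe /c update.bat"}, "no_alert"),
    Case("admin-long-args", "noisy",
         {"Image": PS, "ParentImage": r"C:\Windows\explorer.exe",
          "CommandLine": "powershell.exe -File inventory.ps1 -Verbose"},
         "no_alert"),
    Case("shell-spawns-encoded-shell", "adversarial",
         {"Image": PS, "ParentImage": CMD,
          "CommandLine": "powershell.exe -enc <placeholder>"}, "alert"),
    # Edge case: command-line logging is missing on this host.
    Case("missing-command-line", "adversarial",
         {"Image": PS, "ParentImage": CMD, "CommandLine": None},
         "insufficient_telemetry"),
]


def corpus_summary(cases):
    """Count cases per label so coverage gaps are easy to spot."""
    return Counter(case.label for case in cases)
```

Note that the payload is a placeholder string. The corpus only needs the shape of the behavior, not a working command.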

Establishing scoring criteria for precision, recall, and analyst effort

A lot of teams score detections too loosely. “It found the thing” is not enough.

I like to score rule candidates on three axes:

Metric	What it means	What failure looks like
Precision	How many alerts are worth analyst time	Alert flood from normal admin work
Recall	How many relevant cases the rule catches	Obvious variants slip through
Analyst effort	How much manual cleanup is required	Every alert needs a paragraph of explanation

That third metric is the one people forget. A rule can have decent recall and still be useless if every match forces an analyst to reconstruct the context from scratch.

For AI-generated content, I also add a fourth check: telemetry confidence. If the model cannot explain which log source it expects, I downgrade the result immediately.
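The first two metrics reduce to simple arithmetic over replay counts, and analyst effort can be approximated by triage time on a sample of alerts. This is a minimal sketch; the function name and the choice of median triage minutes as the effort proxy are my assumptions, not a standard.

```python
from statistics import median


def score_rule(tp, fp, fn, triage_minutes):
    """Summarize one replay run.

    tp, fp, fn: counts of true positives, false positives, and misses.
    triage_minutes: minutes an analyst spent on each sampled alert.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    effort = median(triage_minutes) if triage_minutes else None
    return {"precision": precision, "recall": recall,
            "median_triage_minutes": effort}
```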

Prompt design for automated detection rule generation

Asking for structured output instead of free-form advice

The first prompt mistake is obvious once you see it: people ask the model to “suggest detections” and then wonder why they got a vague checklist.

I prefer a structured prompt that forces the model to output fields I can validate. Something like this:

You are writing a detection rule draft for a SIEM.

Return:
1. rule_title
2. target_format
3. detection_logic
4. required_fields
5. assumptions
6. exclusions
7. response_notes

Rules:
- Do not invent telemetry sources.
- If a required field is unknown, say so.
- Keep the logic specific enough to compile.
- Separate detection logic from response guidance.

That structure helps in two ways. First, it keeps the model from drifting into generic advice. Second, it makes the failure modes visible. If it cannot name the required fields, you know the draft is still too hand-wavy.
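A structured prompt pays off only if the response is checked mechanically. The sketch below assumes the model was also told to return its answer as a JSON object using the key names from the prompt above; that JSON requirement and the `check_draft` helper are my additions for illustration.

```python
import json

REQUIRED_KEYS = (
    "rule_title", "target_format", "detection_logic", "required_fields",
    "assumptions", "exclusions", "response_notes",
)


def check_draft(raw: str) -> list[str]:
    """Return a list of structural problems; an empty list means acceptable."""
    try:
        draft = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(draft, dict):
        return ["top-level value is not an object"]
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in draft]
    # A draft that names no required fields is still too hand-wavy to review.
    if "required_fields" in draft and not draft["required_fields"]:
        problems.append("required_fields is empty")
    return problems
```

A non-empty result is a reason to send the draft back, not to patch it by hand.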

Forcing the model to name assumptions, fields, and required telemetry

This is the most useful constraint I add.

A rule is only as good as the telemetry underneath it. If the model says “look for suspicious shell execution” but never specifies whether it needs process creation logs, command-line logging, or parent-child process relationships, then you do not have a rule. You have a guess.

I want the model to answer these questions directly:

Which event source does this rely on?
Which fields must be present?
Which fields are optional?
What normalization assumptions are being made?
What should happen when a field is missing?

That last question matters a lot. Missing fields are common in real environments. A useful draft should degrade gracefully instead of quietly becoming meaningless.
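One way to picture graceful degradation is a check that returns three outcomes instead of two, so a missing field is reported rather than silently treated as "no match". This is a sketch under assumptions: the `Verdict` enum, the required field names, and the toy matching condition are illustrative only.

```python
from enum import Enum


class Verdict(Enum):
    MATCH = "match"
    NO_MATCH = "no_match"
    INSUFFICIENT = "insufficient_telemetry"


REQUIRED = ("Image", "CommandLine")  # assumed field names


def evaluate(event: dict) -> Verdict:
    """Evaluate one event, surfacing missing telemetry explicitly."""
    if any(not event.get(field) for field in REQUIRED):
        return Verdict.INSUFFICIENT
    image = event["Image"].lower()
    command_line = event["CommandLine"].lower()
    # Toy condition: PowerShell launched with an encoded-command switch.
    if image.endswith("\\powershell.exe") and " -enc " in command_line:
        return Verdict.MATCH
    return Verdict.NO_MATCH
```

Counting the `INSUFFICIENT` results per host is a cheap way to find where command-line logging is not actually enabled.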

Separating detection logic from response guidance

Detection content and response content are not the same thing.

The model might be good at writing:

a selector for suspicious command-line behavior
a threshold for repeated failures
a correlation window across events

It might also write response notes like:

isolate the host
check for persistence
review adjacent authentication events

Those are useful, but they should not be mixed into the logic itself. I usually keep them in separate sections so a reviewer can change one without accidentally changing the other.

That separation also cuts down on prompt sludge. If every prompt asks for both the rule and the incident response plan, the model starts blending them together and the detection gets weaker.

First-pass generation: what the model should be good at

Repeated patterns and obvious abuse chains

If Claude Fable 5 or Mythos 5 is going to be useful for detection generation, the first thing it should handle well is pattern compression.

That means it should recognize recurring structures like:

shell spawned from an office app
script interpreter launched with suspicious arguments
repeated authentication failures followed by a successful login
unusual privilege escalation after lateral movement indicators

These are not clever detections. They are the bread and butter of triage. A model should be able to draft them quickly without overcomplicating the logic.

The good sign is not novelty. It is consistency.

Mapping natural-language tactics to concrete field conditions

This is where a model can save time if it behaves.

Given a sentence like “detect suspicious PowerShell usage,” a good draft should translate that into field conditions such as:

process image or executable path
command-line content
parent process
user context
maybe child process behavior if available

A weak draft stays at the tactic level:

“look for malicious PowerShell”
“flag encoded commands”
“watch for suspicious execution”

That sounds fine until you try to build a real query. Then you realize the model never committed to a field or a log source.

Producing draft rules that are readable by humans first

I want the first-pass output to be understandable without a decoder ring.

A strong draft should read like a senior analyst wrote it in a hurry, not like a marketing copy generator was asked to improvise telemetry. The human review stage still needs to answer:

what is the trigger?
why is it suspicious?
what is excluded?
how noisy will this be?

If the rule is readable, the rest of the workflow gets easier. If it is not readable, every downstream step turns into a translation exercise.

Validation workflow: how to test whether a rule is actually useful

Static checks against schema, required fields, and syntax

Before I replay anything, I run static validation.

At minimum, the candidate rule should pass:

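sigma check rule.yml
sigma convert -t splunk -p splunk_windows rule.yml
sigma convert -t kusto -p microsoft_xdr rule.yml

The exact tool names are less important than the workflow. These commands assume sigma-cli with the Splunk and Kusto backend plugins installed; target and pipeline names vary by plugin and version, so confirm them with sigma list targets and sigma list pipelines. The draft has to satisfy a schema and compile into the target backend without manual repair.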

I also check for missing required fields in the rule itself:

title
status
log source
detection block
false positive notes
level
tags or metadata required by your pipeline

If a model cannot produce a rule that satisfies those constraints, it is not ready for unattended generation.
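The pipeline-specific part of that check is easy to script. The sketch below assumes the rule has already been parsed from YAML into a dictionary. The required-key list mirrors my pipeline's policy from the list above rather than the Sigma specification's minimum, while the status and level values are the ones Sigma defines.

```python
REQUIRED_KEYS = ("title", "status", "logsource", "detection",
                 "falsepositives", "level")
VALID_STATUS = {"stable", "test", "experimental", "deprecated", "unsupported"}
VALID_LEVEL = {"informational", "low", "medium", "high", "critical"}


def lint_rule(rule: dict) -> list[str]:
    """Return policy violations for one parsed rule; empty means it passes."""
    problems = [f"missing: {key}" for key in REQUIRED_KEYS if key not in rule]
    if "status" in rule and rule["status"] not in VALID_STATUS:
        problems.append(f"invalid status: {rule['status']}")
    if "level" in rule and rule["level"] not in VALID_LEVEL:
        problems.append(f"invalid level: {rule['level']}")
    detection = rule.get("detection")
    if isinstance(detection, dict) and "condition" not in detection:
        problems.append("detection has no condition")
    return problems
```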

Replay testing against historical logs and known benign traffic

Static validation only proves the rule is syntactically valid. It does not prove it is useful.

I replay the candidate against:

historical incident data
a small set of confirmed benign logs
a small set of known relevant events

That gives you an early read on precision and recall.

A simple evaluation loop looks like this:

for each candidate rule:
  run static validation
  replay against benign corpus
  replay against labeled incident corpus
  count true positives, false positives, and misses
  record analyst review notes

What you want to see is not perfection. You want a rule that is directionally correct and cheap to refine.
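The loop above can be sketched in a few lines if you accept a deliberately tiny matcher. The code below handles only one selection block, one optional exclusion block, and the `endswith` and `contains` modifiers, with case-insensitive comparison. It is a stand-in for illustration, not a Sigma implementation, and all function names are hypothetical.

```python
def field_matches(event, key, patterns):
    """Match one 'Field|modifier' key against an event, case-insensitively."""
    field, _, modifier = key.partition("|")
    value = event.get(field)
    if not isinstance(value, str):
        return False  # missing or null field never matches
    value = value.lower()
    patterns = [patterns] if isinstance(patterns, str) else patterns
    if modifier == "endswith":
        return any(value.endswith(p.lower()) for p in patterns)
    if modifier == "contains":
        return any(p.lower() in value for p in patterns)
    return any(value == p.lower() for p in patterns)


def block_matches(event, block):
    """All fields in a block must match (values within a field are OR'ed)."""
    return all(field_matches(event, k, v) for k, v in block.items())


def rule_fires(event, selection, exclusion=None):
    if not block_matches(event, selection):
        return False
    return not (exclusion and block_matches(event, exclusion))


def replay(selection, exclusion, labeled_events):
    """labeled_events: iterable of (event, is_malicious) pairs."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for event, is_malicious in labeled_events:
        fired = rule_fires(event, selection, exclusion)
        if fired:
            counts["tp" if is_malicious else "fp"] += 1
        else:
            counts["fn" if is_malicious else "tn"] += 1
    return counts
```

In a real pipeline the matching would be done by the SIEM itself against replayed logs; the value of the sketch is the bookkeeping, which stays the same.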

Finding overbroad matches, brittle selectors, and missing context

Most AI-generated rules fail in one of three ways:

Overbroad matches: the selector is so generic that it catches normal admin traffic.
Brittle selectors: the rule depends on one exact binary path or one exact string fragment.
Missing context: the rule finds suspicious events, but not enough supporting data to make a triage decision.

A useful replay test should show you which of those problems you have.

If a rule fires on every software deployment, it is probably too broad. If it only fires when the executable name matches one exact path, it is brittle. If the alert has no parent process, user, or host context, analysts will spend too much time chasing it.
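Those three symptoms can be turned into a rough triage of the replay results. In this sketch the context field names and every threshold are placeholders I picked for illustration; tune them against your own baseline before trusting the labels.

```python
CONTEXT_FIELDS = ("ParentImage", "User", "Computer")  # assumed field names


def missing_context_rate(alerts):
    """Fraction of alerts lacking at least one triage-context field."""
    if not alerts:
        return 0.0
    incomplete = sum(
        1 for alert in alerts
        if any(not alert.get(field) for field in CONTEXT_FIELDS)
    )
    return incomplete / len(alerts)


def diagnose(benign_fire_rate, variant_recall, alerts,
             max_benign=0.05, min_variant_recall=0.8, max_missing=0.2):
    """Label the failure modes a replay run suggests (placeholder thresholds)."""
    problems = []
    if benign_fire_rate > max_benign:
        problems.append("overbroad")        # fires on normal admin activity
    if variant_recall < min_variant_recall:
        problems.append("brittle")          # misses renamed or relocated variants
    if missing_context_rate(alerts) > max_missing:
        problems.append("missing context")  # alerts are hard to triage
    return problems
```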

Tuning the output without turning it into prompt sludge

Tightening match conditions with environment-specific fields

This is where many teams make the prompt worse instead of the rule better.

If the model returns something too generic, the instinct is to pile on more instructions. That often produces prompt sludge. The better move is to give the model environment-specific telemetry facts:

exact field names from your SIEM
which sources are normalized
which logs are missing on some hosts
which processes or tools are common in your environment

That lets the model tighten the logic without guessing.

For example, “watch for PowerShell” is weak. “Use process creation logs, include parent process, and exclude the signed endpoint management tool we know launches scripts during patching” is much more useful.

Adding exclusions, thresholds, and correlation windows

A lot of useful detections are not single-event rules. They need thresholds or correlations.

Examples:

multiple failed logins within a short window
repeated script launches from the same host
suspicious process creation followed by network activity
a parent-child chain that only matters if it repeats across accounts

The model should be asked to justify each threshold, not just invent one. If it suggests three attempts in five minutes, I want to know whether that comes from source behavior, known service noise, or a guess.

The same goes for exclusions. Good exclusions are narrow and documented. Bad exclusions quietly carve out the interesting cases.
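To make the threshold discussion concrete, here is a minimal sliding-window counter for failed logins per account. The defaults reuse the "three attempts in five minutes" example from the text purely as a placeholder, not as a recommendation, and the event tuple layout is an assumption.

```python
from collections import defaultdict, deque
from datetime import timedelta


def failed_login_bursts(events, threshold=3, window=timedelta(minutes=5)):
    """Return (account, timestamp) for each failure that completes a burst.

    events: iterable of (timestamp, account, outcome), sorted by timestamp,
    where outcome is "failure" or "success".
    """
    recent = defaultdict(deque)  # account -> timestamps of recent failures
    alerts = []
    for timestamp, account, outcome in events:
        if outcome != "failure":
            continue
        queue = recent[account]
        queue.append(timestamp)
        # Drop failures that have aged out of the window.
        while timestamp - queue[0] > window:
            queue.popleft()
        if len(queue) >= threshold:
            alerts.append((account, timestamp))
    return alerts
```

Making `threshold` and `window` explicit parameters forces the justification question: each value has to come from somewhere, and the rule's documentation should say where.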

Deciding when the model should stop guessing and ask for more data

This is one of the best signs of maturity.

A model that can say “I need the process creation source and the command-line field names before I can draft this safely” is more useful than a model that confidently invents a query.

I treat that as a feature, not a failure.

Warning: if the model keeps guessing field names, it is safer to stop and provide the telemetry schema than to let it fabricate a working-looking rule.

That restraint matters in security work. A wrong answer with high confidence is worse than a partial answer that admits uncertainty.

Common failure modes when using a model for detection content

Hallucinated fields, nonexistent event sources, and fake certainty

This is the classic LLM failure, and it is especially dangerous in detection engineering.

The model may produce field names that look standard when they are not. It may reference a log source your environment does not collect. It may imply that a backend supports a function it does not.

The fake certainty is the real problem. The draft looks authoritative, so reviewers waste time debugging the wrong layer.

Defensive response:

require schema-aware prompts
lint against known field dictionaries
reject rules that mention unsupported sources
force the model to list assumptions explicitly
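Linting against a field dictionary is the easiest of those defenses to automate. The sketch below assumes the rule is already parsed into a dictionary and that you maintain a per-category list of fields your environment really collects; the `KNOWN_FIELDS` contents here are illustrative, not a canonical schema.

```python
KNOWN_FIELDS = {
    # Illustrative only: replace with the fields your sources actually emit.
    "process_creation": {"Image", "CommandLine", "ParentImage",
                         "ParentCommandLine", "User"},
}


def unknown_fields(rule: dict, known=KNOWN_FIELDS) -> list[str]:
    """List fields or log sources in a rule that the environment lacks."""
    category = rule.get("logsource", {}).get("category")
    allowed = known.get(category)
    if allowed is None:
        return [f"unsupported log source category: {category!r}"]
    found = set()
    for name, block in rule.get("detection", {}).items():
        if name == "condition":
            continue
        # A detection item may be a single map or a list of maps.
        for item in (block if isinstance(block, list) else [block]):
            if isinstance(item, dict):
                found.update(key.split("|", 1)[0] for key in item)
    return sorted(f"unknown field: {field}" for field in found - allowed)
```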

Rules that look clever but cannot be deployed

Some AI-generated detections are technically interesting and operationally useless.

Common reasons:

they depend on a field you do not normalize
they combine too many conditions and never fire
they require correlation data you do not retain
they assume full process ancestry where only partial ancestry exists

These are the clever but dead rules. They usually mean the model learned detection language but not your environment.

False positives from generic behavior and low-fidelity telemetry

A rule can also be too generic because the underlying telemetry is weak.

If all you have is process name and host, the model may draft a broad rule that catches half the help desk. If you add command-line logging, parent process, and user context, the same threat pattern becomes much easier to isolate.

This is why I always separate model quality from telemetry quality. A bad model can make good telemetry look messy. A weak log source can make a good model look bad.

How to integrate AI-generated rules into a defensive workflow

Analyst review gates and change control

AI-generated detection content should go through the same review discipline as any other production rule.

My minimum gates are:

source prompt is saved
draft is reviewed by an analyst
syntax validation passes
replay test results are recorded
approval is captured before deployment

That creates a paper trail and keeps “helpful” edits from bypassing review.

Versioning, comments, and traceability back to the source prompt

If a model helped draft the rule, I want that traceable.

I usually store:

the prompt summary
the generated draft
the final edited rule
a short note on what was changed and why

That helps later when the rule drifts or starts generating noise. You can see whether the issue came from the model, the environment, or the manual edits.

Versioning also helps when the same pattern shows up with a new log source. You can reuse the reasoning without blindly copying the old output.
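A traceability record does not need to be elaborate. This sketch stores hashes of the draft and the final rule next to the prompt summary and change note, so a later reviewer can confirm which text was generated and which was edited. The `RuleProvenance` class and its field names are hypothetical.

```python
import hashlib
from dataclasses import dataclass


def digest(text: str) -> str:
    """SHA-256 of the text, used to pin exact draft and final versions."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass
class RuleProvenance:
    rule_id: str
    prompt_summary: str
    draft_sha256: str
    final_sha256: str
    change_note: str
    ai_generated: bool = True

    @property
    def edited(self) -> bool:
        """True when a human changed the generated draft before shipping."""
        return self.draft_sha256 != self.final_sha256


def make_record(rule_id, prompt_summary, draft, final, change_note):
    return RuleProvenance(rule_id, prompt_summary,
                          digest(draft), digest(final), change_note)
```

Committing this record alongside the rule in the detection-as-code repository keeps the history in one place.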

Safe rollout patterns in a SIEM or detection-as-code pipeline

A sensible rollout is staged:

start in monitor-only mode
ship to a low-volume environment first
compare alert volume to baseline
review a sample of matches
only then promote to broader coverage

If your pipeline supports it, tag AI-generated drafts as such until they are fully reviewed. That makes it easier to track whether the generation workflow is improving or just adding noise.

What good looks like in practice

Example of a model-generated draft and the corrections it needs

A first-pass draft for suspicious shell execution might look something like this:

title: Suspicious Shell Launch
status: experimental
logsource:
  category: process_creation
detection:
  selection:
    Image|endswith:
      - '\powershell.exe'
      - '\cmd.exe'
    CommandLine|contains:
      - '-enc'
      - 'IEX'
  condition: selection
falsepositives:
  - Admin scripts
level: medium

This is not terrible as a sketch. It is also not ready.

The usual corrections are:

add the alternate shell binary you actually see in your environment
narrow the command-line checks to reduce accidental matches
exclude known software deployment tools
require a parent process or user context if available
document the telemetry assumptions

Example of a tuned rule that becomes deployable

After tuning, the same logic may look more like this:

title: Suspicious Shell Execution with Encoded Arguments
status: test
# Assumes process creation telemetry that includes command-line and parent image fields.
logsource:
  category: process_creation
  product: windows
detection:
  selection:
    Image|endswith:
      - '\powershell.exe'
      - '\pwsh.exe'
    CommandLine|contains:
      - ' -enc '
      - 'FromBase64String'
  # Placeholder names: replace with the management tools that launch scripts in your environment.
  filter_admin_tools:
    ParentImage|endswith:
      - '\your-endpoint-tool.exe'
      - '\your-deployment-agent.exe'
  condition: selection and not filter_admin_tools
falsepositives:
  - Scripted admin activity
  - Endpoint management actions
level: high
The point is not that this is the perfect rule. The point is that it is more honest about assumptions, easier to review, and less likely to flood the queue.

Defensive takeaways from the Claude Fable 5 test

Where automation helps and where human review stays mandatory

If Anthropic’s Claude Fable 5 and Mythos 5 are genuinely useful for security work, I expect them to help most with the first draft:

mapping a tactic to likely fields
proposing a candidate Sigma rule
listing exclusions and assumptions
suggesting response notes

That saves analyst time.

But human review stays mandatory anywhere the model can fail quietly:

field mapping
backend compatibility
false positive tuning
deployment scope
incident severity

The best outcome is not “the model writes the rules.” The best outcome is “the model reduces the blank-page problem.”

What to measure after deployment so drift does not go unnoticed

A detection rule should not be treated as finished once it ships.

After deployment, I watch:

alert volume over time
percent of alerts closed as benign
time to triage
percentage of alerts missing context
how often the rule needs backend-specific fixes
whether a new software rollout changes the noise profile

That is where drift shows up. A rule that was clean in week one can become useless after a logging change or a new admin tool rollout.

If the model helped draft the rule, those metrics also tell you whether the drafting workflow is learning anything. If every AI-generated rule needs the same kind of cleanup, the prompt is not improving.
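Two of those signals, alert volume and the share of alerts closed as benign, are simple to watch automatically. The sketch below compares each later week to a baseline built from the first few weeks. The input layout, the four-week baseline, and both thresholds are assumptions chosen for illustration.

```python
from statistics import mean


def drift_flags(weekly, baseline_weeks=4, volume_factor=2.0, benign_limit=0.8):
    """Flag weeks whose volume or benign-close rate suggests drift.

    weekly: list of {"alerts": int, "closed_benign": int}, oldest first.
    Returns (week_index, reason) tuples for weeks after the baseline period.
    """
    if len(weekly) <= baseline_weeks:
        return []
    baseline = mean(week["alerts"] for week in weekly[:baseline_weeks])
    flags = []
    for index, week in enumerate(weekly[baseline_weeks:], start=baseline_weeks):
        alerts = week["alerts"]
        benign_rate = week["closed_benign"] / alerts if alerts else 0.0
        if baseline and alerts > volume_factor * baseline:
            flags.append((index, "volume"))
        if benign_rate > benign_limit:
            flags.append((index, "benign_rate"))
    return flags
```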

Conclusion: using the model as a drafting aid, not an oracle

The June 14 announcement is interesting because it suggests Anthropic is treating cybersecurity as a first-class use case for its newer models. That may matter for detection engineering, but only if the model can do the boring work: structured drafts, honest assumptions, and rules that survive validation.

For me, the right test is simple:

can it produce a rule in a real format?
can it name the telemetry it needs?
can it avoid invented fields?
can it pass static checks?
can it replay cleanly against real logs?
can an analyst understand why it exists?

If the answer is yes, the model is useful. If the answer is no, it is just another confident generator of plausible security text.

Short checklist for teams evaluating AI rule generation

Pick one canonical output format first.
Build a small corpus of benign, noisy, and relevant examples.
Force the model to name assumptions and required fields.
Run static validation before any replay testing.
Compare alert volume against historical baselines.
Require analyst review and versioned traceability.
Measure drift after deployment, not just draft quality.

That workflow is the real evaluation. Everything else is branding.