The Human Gate: Where AI Belongs in Security Work, and Where It Must Be Stopped

Yannic Scheef 04 Jun 2026


An AI agent can read ten thousand lines of firewall config and flag the three rules that quietly expose your database to the public internet, far faster than a human analyst could. That same agent, given the ability to act, can also delete the wrong production bucket because a prompt was ambiguous and nothing stood between intent and action. The gap between those two facts is one of the most important design decisions in security tooling today. The issue is not adoption but governance: where the machine's authority should end and a human's should begin. Framing it that way is more precise than the familiar advice to keep a human in the loop. The useful question is which loop, around which actions, and where judgment is structurally required rather than merely available.

What AI Should Drive

The defensive side of security is dominated by read-heavy, pattern-dense work that feeds a human decision rather than executing one. Enumeration is the clearest case: mapping an attack surface, inventorying cloud assets, listing every Identity and Access Management (IAM) role and what it can touch, correlating packages against known vulnerabilities. These read operations are comparatively safe, though not free of side effects — aggressive enumeration can trip API throttling, run up cloud costs, or fire web-application-firewall alarms, so scope the reads. The subtler failure mode is more dangerous: an agent that enumerates confidently while missing a shadow admin account or over-permissioned service principal produces false completeness, and a list that looks finished but isn't forecloses the investigation that would have found the gap. Treat the output as a starting point, not a certified inventory.

Where the agent earns its keep

Triage is the second strong case, with one caveat. A Security Information and Event Management (SIEM) platform can generate tens of thousands of alerts a day, the overwhelming majority of them noise, and an AI that clusters them and surfaces the handful worth human eyes does well what fatigued analysts do badly. But a suppressed alert is not free to undo: if the model files a real intrusion as noise, no analyst ever sees it. Industry breach reports put median dwell time in the days-to-weeks range, sometimes longer, and an over-aggressive suppression rule shifts that curve the wrong way. Human review of what the AI suppresses, including how the suppressed pile is sampled, therefore stays mandatory.
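One way to keep the suppressed pile under human review is to pull a random sample of it into an analyst queue on every run. The sketch below is a minimal illustration, not a prescribed method: the `Alert` type, the 5% rate, and the minimum of five are all assumptions chosen for the example.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    alert_id: str
    suppressed: bool  # True if the model filed this alert as noise


def sample_suppressed(alerts, rate=0.05, minimum=5, seed=None):
    """Return a random subset of suppressed alerts for human review."""
    suppressed = [a for a in alerts if a.suppressed]
    if not suppressed:
        return []
    # Review at least `minimum` alerts, but never more than exist.
    target = max(minimum, round(len(suppressed) * rate))
    k = min(target, len(suppressed))
    rng = random.Random(seed)  # a seed makes a given sample reproducible
    return rng.sample(suppressed, k)


# Hypothetical data: 200 alerts, of which 180 were suppressed by the model.
alerts = [Alert(f"a-{i}", suppressed=(i % 10 != 0)) for i in range(200)]
review_queue = sample_suppressed(alerts, rate=0.05, minimum=5, seed=42)
```

Uniform sampling is the simplest choice. A team could reasonably weight the sample toward newly created suppression rules, since those are the ones no analyst has yet checked.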

The same pattern holds across a range of defensive tasks where the agent produces information and proposals, not changes to the running world:

  • Log and traffic analysis — clustering, baselining, and anomaly spotting across volumes no analyst can read by hand.
  • Hardening review — flagging weak defaults, drift from a known-good baseline, and missing controls.
  • CVE-to-dependency mapping — tracing a published Common Vulnerabilities and Exposures (CVE) entry to the exact affected version in your dependency tree.
  • Incident reconstruction — assembling a timeline from scattered logs so a human can reason about root cause.
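The CVE-to-dependency item in the list above reduces to a range check: is the pinned version inside the affected range of an advisory? The sketch below assumes purely numeric dotted versions with the same number of components, and uses invented package names and advisory identifiers. Real ecosystems have richer version schemes, so treat this as an illustration of the matching step only.

```python
from dataclasses import dataclass


def parse_version(text):
    """Turn '1.4.2' into (1, 4, 2). Assumes purely numeric dotted versions."""
    return tuple(int(part) for part in text.split("."))


@dataclass(frozen=True)
class Advisory:
    advisory_id: str  # hypothetical identifier, not a real CVE
    package: str
    introduced: str   # first affected version (inclusive)
    fixed: str        # first patched version (exclusive)


def affected(dependencies, advisories):
    """Yield (package, version, advisory_id) for each pinned dependency in range."""
    for adv in advisories:
        version = dependencies.get(adv.package)
        if version is None:
            continue  # package is not in this dependency tree
        low, high = parse_version(adv.introduced), parse_version(adv.fixed)
        if low <= parse_version(version) < high:
            yield (adv.package, version, adv.advisory_id)


# Hypothetical pinned dependencies and advisories.
dependencies = {"examplelib": "2.3.1", "otherlib": "1.0.0"}
advisories = [
    Advisory("EXAMPLE-0001", "examplelib", "2.0.0", "2.3.5"),
    Advisory("EXAMPLE-0002", "otherlib", "0.1.0", "1.0.0"),
]
findings = list(affected(dependencies, advisories))
```

Here `otherlib` is pinned at the fixed version, so only `examplelib` is reported. The output is a list of candidates for a person to confirm, in line with the point that the agent's result feeds a decision.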

In every one of these, the AI's output is an input to a decision, not the decision itself. That distinction is the whole game. The OWASP Top 10 for LLM Applications catalogues why: the same model that summarizes a log stream can be steered by what's in that stream, so its conclusions deserve scrutiny, not automatic trust.

The Line That Cannot Move

The boundary is not about how smart the model is. Fluency is not a control: a more capable model does not make a destructive action safer; it only makes that action arrive faster, and an agent will call a deletion safe with the same fluency whether it is correct or catastrophically wrong. The primary axes are reversibility and blast radius, with data sensitivity, compliance and legal obligations, and segregation of duties also pushing an action toward a gate.

Two questions decide which side an action sits on. First: if this goes wrong, can I cleanly undo it? Second: what is the maximum damage? The agent cannot apply these tests to itself: its own classification of an action as reversible is exactly the judgment that fails silently, which is why the gate has to sit with a person. The table below runs five representative actions through both tests; note that failing either one is enough to demand a human gate.

Action | Cleanly reversible? | Blast radius | Gate required?
Delete data / storage | No: backups may be stale, restore slow or lossy | Severe, often permanent | Yes (dual-control)
Rotate a credential | In theory yes | Dependent services break the instant the old secret dies | Yes
Block an IP range / quarantine host | Yes: one command to undo | Severe worst case before anyone notices | Yes
Push an auto-remediation rule | Depends: config-as-code can revert, side effects may not | Wide; rule applies fleet-wide on the next run | Yes
Enumerate IAM roles (read-only) | N/A: no state change | Minimal (throttling, cost) | No

Deleting data usually fails a strict rollback test: backups may be stale, restore procedures can be slow or lossy, and point-in-time recovery only works if it was designed and validated for that exact scenario. Treat deletion as effectively irreversible unless you have a clean, fast, tested undo path, one you have actually exercised. Credential rotation passes the reversibility test in theory but fails on blast radius: dependent services break the instant the old credential stops working, and the outage can run long enough that rollback is beside the point. Blocking an IP range or quarantining a host takes one command to undo, but it still fails on blast radius: a block that catches legitimate traffic, or a quarantine that isolates a production host, does its damage before anyone notices.
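The two tests can be written down as a small classifier. The sketch below is one possible encoding, with invented type and field names. The key assumption is that `tested_undo_path` and `blast_radius` are filled in by a human reviewer ahead of time, never by the agent, since the agent's own judgment of reversibility is the thing being guarded against.

```python
from dataclasses import dataclass
from enum import Enum


class BlastRadius(Enum):
    MINIMAL = 1
    WIDE = 2
    SEVERE = 3


@dataclass(frozen=True)
class Action:
    name: str
    changes_state: bool
    tested_undo_path: bool     # assigned by a human reviewer, not the agent
    blast_radius: BlastRadius  # assigned by a human reviewer, not the agent


def requires_human_gate(action):
    """Failing either test (rollback or blast radius) demands a gate."""
    if not action.changes_state:
        return False  # read-only: nothing to undo
    fails_rollback = not action.tested_undo_path
    fails_blast = action.blast_radius is not BlastRadius.MINIMAL
    return fails_rollback or fails_blast


# The five rows of the table, encoded under the assumptions above.
catalogue = [
    Action("delete data / storage", True, False, BlastRadius.SEVERE),
    Action("rotate a credential", True, True, BlastRadius.WIDE),
    Action("block an IP range", True, True, BlastRadius.SEVERE),
    Action("push auto-remediation rule", True, False, BlastRadius.WIDE),
    Action("enumerate IAM roles", False, True, BlastRadius.MINIMAL),
]
gated = [a.name for a in catalogue if requires_human_gate(a)]
```

With these inputs every state-changing action lands behind the gate and the read-only enumeration does not, which matches the table.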

Enforce the boundary in architecture, not prose

This matters more as agents chain actions in unattended loops, where one destructive step is a liability. The fix is to make that step structurally impossible without a human approving it — enforced by architecture, not policy. Give the agent read-only credentials by default and route every write or remediation through a separate approval service it cannot call directly. Express the gating rules as policy-as-code (for example, Open Policy Agent) so the boundary is testable and version-controlled, not buried in a prompt:

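package agent.guardrails

import rego.v1

# Default deny: nothing proceeds unless a rule below allows it.
# Callers should proceed only when `allow` is true and `deny` is empty.
default allow := false

# Reads need no approval.
allow if {
    input.action.kind == "read"
}

# Writes proceed only when reversible, low blast radius, and human-approved.
allow if {
    input.action.kind == "write"
    input.action.reversible == true
    input.action.blast_radius == "low"
    has_human_approval   # dual-control gate satisfied
}

# True only when an approval token is present and non-empty.
has_human_approval if {
    input.approval.human_token != ""
}

# Hard stop: a delete without human approval is always a violation.
# Checking has_human_approval also catches an empty-string token.
deny contains msg if {
    input.action.verb == "delete"
    not has_human_approval
    msg := "delete requires a signed human approval"
}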

Require dual-control for the highest-risk changes, with change windows and a circuit breaker that halts the agent when an action or error rate spikes. Frameworks like the NIST Cybersecurity Framework already organize controls around exactly this separation of detect-versus-respond — the agent can own much of detect, but respond stays gated.
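A circuit breaker of the kind described above can be as simple as a sliding-window counter. The sketch below is a minimal version with hypothetical thresholds. It counts events (actions or errors) in a time window, trips when the count exceeds a limit, and stays tripped until a person resets it.

```python
import time
from collections import deque


class CircuitBreaker:
    """Halts an agent when too many events land inside a time window."""

    def __init__(self, max_events, window_seconds, clock=time.monotonic):
        self.max_events = max_events
        self.window_seconds = window_seconds
        self._clock = clock       # injectable so the breaker can be tested
        self._events = deque()    # timestamps of recent events
        self.tripped = False

    def record(self):
        """Record one action or error; trip if the window holds too many."""
        now = self._clock()
        self._events.append(now)
        # Drop events that have aged out of the window.
        while self._events and now - self._events[0] > self.window_seconds:
            self._events.popleft()
        if len(self._events) > self.max_events:
            self.tripped = True

    def allow(self):
        """The agent may act only while the breaker has not tripped."""
        return not self.tripped

    def reset(self):
        """Intended to be called by a human after review, never by the agent."""
        self._events.clear()
        self.tripped = False


# Usage with hypothetical limits: more than 3 events in 60 seconds trips it.
breaker = CircuitBreaker(max_events=3, window_seconds=60)
if breaker.allow():
    breaker.record()  # the agent performs one action here
```

Keeping the breaker latched until a human resets it is a deliberate choice in this sketch: an agent that can clear its own breaker has no breaker.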

[Figure: Agent → Proposal → Human gate (approve / reject; destructive actions stop here) → Execute. Caption: Read freely; write only through a proposal a person approves.]

The AI's Ingestion Paths Are an Attack Surface

Attacker-controlled data steering system behavior is an old problem — SQL, log, and CSV injection all turn on it. What is new with AI agents is the breadth of the surface: not only the context window but every path by which untrusted data reaches the model or its actions — tool and plug-in interfaces, retrieval pipelines that pull in documents, and the interpreters that parse the model's output into commands. A log line, a config comment, a support ticket, a fetched document can each carry instructions crafted to steer the agent. This is prompt injection through the environment, so treat every ingestion path as untrusted. An attacker who can write to something your agent reads can try to make it misclassify their intrusion as benign or recommend a change that opens a path inward — but the gate blunts the worst case: even a fully manipulated agent cannot execute a destructive action if a human must approve it. The adversary techniques catalogued in MITRE ATT&CK map cleanly onto these ingestion paths — an attacker who can influence what your agent reads is one tampering step from influencing what it recommends.
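Treating every ingestion path as untrusted can start with provenance: record where each piece of context came from, and tell the reviewer when a proposal was built on anything outside a short allowlist. The sketch below assumes invented source labels and an invented allowlist. It does not detect or prevent injection; it only makes the exposure visible at the gate.

```python
from dataclasses import dataclass

# Hypothetical allowlist: everything not named here is treated as untrusted.
TRUSTED_SOURCES = frozenset({"operator_prompt", "signed_runbook"})


@dataclass(frozen=True)
class ContextItem:
    source: str  # where the text entered the agent's context
    text: str


def untrusted_sources(context):
    """Return the sorted, de-duplicated labels of non-allowlisted sources."""
    return sorted({c.source for c in context if c.source not in TRUSTED_SOURCES})


def review_note(context):
    """Build the provenance line a human sees next to a proposal."""
    tainted = untrusted_sources(context)
    if not tainted:
        return "context: trusted sources only"
    return "context includes untrusted input from: " + ", ".join(tainted)


context = [
    ContextItem("operator_prompt", "Investigate failed logins on the bastion."),
    ContextItem("siem_log", "sshd: failed password for root"),
    ContextItem("support_ticket", "Please allowlist this address."),
]
note = review_note(context)
```

An allowlist rather than a blocklist matches the stance in the text: a new ingestion path is untrusted until someone decides otherwise.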

On Offensive Work: Authorization Cannot Be Delegated

Offensive security — penetration testing, exploit development, active reconnaissance against live targets — carries real legal exposure. AI is useful in a supporting role: reasoning about how an exploit chain might compose, summarizing a target's exposed surface from public data, drafting proof-of-concept code in a sandbox. But action against systems you do not unambiguously own and hold explicit authorization for must never be autonomous — against the wrong host or outside an authorized scope, it is potentially a crime and a contract breach. An agent cannot hold the authorization; the person who signed the rules of engagement holds it and must pull the trigger. Inside an owned sandbox or test range, under a bounded scope, more autonomy is acceptable — the constraint is the authorization boundary, not the automation. That boundary is what distinguishes a sanctioned test from an attack.
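The authorization boundary can be checked mechanically before any tool runs, even though the authorization itself stays with a person. The sketch below uses Python's standard `ipaddress` module and assumes the rules of engagement list the in-scope networks as CIDR ranges. The ranges shown are reserved documentation addresses, and the function names are invented for illustration.

```python
import ipaddress


def in_authorized_scope(target, authorized_cidrs):
    """True only if `target` is an IP address inside an authorized network."""
    try:
        address = ipaddress.ip_address(target)
    except ValueError:
        return False  # unparseable target: default deny
    return any(address in ipaddress.ip_network(cidr) for cidr in authorized_cidrs)


def may_act(target, authorized_cidrs, operator_confirmed):
    """Scope is necessary but not sufficient: a named human must confirm."""
    return in_authorized_scope(target, authorized_cidrs) and operator_confirmed


# Hypothetical scope taken from a signed rules-of-engagement document.
scope = ["192.0.2.0/24", "198.51.100.0/24"]

allowed = may_act("192.0.2.17", scope, operator_confirmed=True)
out_of_scope = may_act("203.0.113.5", scope, operator_confirmed=True)
unconfirmed = may_act("192.0.2.17", scope, operator_confirmed=False)
```

The check refuses anything it cannot parse, and a scope match alone never returns true without the operator's confirmation. That mirrors the point that the person who signed the rules of engagement pulls the trigger.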

Designing the Gate So People Actually Use It

A gate that everyone clicks through without reading is theater. A few principles separate a working gate from a rubber stamp.

  1. Gate by reversibility and blast radius, not by category. Make humans approve every read-only enumeration and you train them to approve everything. Reserve the gate for actions that fail the rollback test or carry a severe worst case.
  2. Show the diff, not the intention. "Delete unused security group" is a summary. "Delete sg-0a3f, currently attached to 4 instances including prod-api-2" is a decision. Present the concrete, verifiable change.
  3. Default to dry-run. Let the agent produce the exact command, plan, or pull request, and stop there. A person reviews a concrete artifact — a Terraform plan, a git diff, a runbook — which becomes the audit record: the change, who reviewed it, and the decision, captured before anything executes.
  4. Make actions attributable and logged. Every action should trace to the agent, the prompt, and the human who approved it. When something goes wrong at 3 a.m., that timeline saves you.
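Principles 2 through 4 can share one data structure: a proposal that carries the concrete change, stays inert until a person decides, and serializes to an audit line. The sketch below is a minimal illustration with invented field names and identifiers; the `command` string is placeholder text rather than a real CLI invocation.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Proposal:
    agent_id: str
    prompt_id: str
    command: str  # the exact change the agent wants to run
    diff: str     # the concrete, verifiable effect shown to the reviewer
    reviewer: Optional[str] = None
    decision: str = "pending"
    decided_at: Optional[str] = None


def record_decision(proposal, reviewer, approve):
    """Store the human decision and return a JSON audit line."""
    proposal.reviewer = reviewer
    proposal.decision = "approved" if approve else "rejected"
    proposal.decided_at = datetime.now(timezone.utc).isoformat()
    return json.dumps(asdict(proposal), sort_keys=True)


def execute(proposal, runner):
    """Run the command only after an explicit human approval."""
    if proposal.decision != "approved":
        raise PermissionError("proposal has not been approved by a human")
    return runner(proposal.command)


proposal = Proposal(
    agent_id="agent-7",      # hypothetical identifiers
    prompt_id="prompt-123",
    command="delete security group sg-0a3f",  # placeholder, not a real CLI call
    diff="Delete sg-0a3f, currently attached to 4 instances including prod-api-2",
)
audit_line = record_decision(proposal, reviewer="alice", approve=False)
```

Because the audit line is written when the decision is recorded, the trail exists before anything executes, and a rejected proposal leaves the same evidence as an approved one.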

As agentic chains lengthen, the gate itself becomes the bottleneck, so aggregate the low-risk proposals and escalate only what genuinely fails the two tests. The concrete first step is worth doing this quarter: map every action your agents can take today, apply the two tests to each, and compare where the gates sit against where the tests say they belong. The gaps are your roadmap.
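The mapping exercise can be run as a short script over an inventory of agent actions. The sketch below assumes a hand-maintained inventory with invented action names and field names. It reports both directions of mismatch: actions that need a gate and lack one, and gates on actions that pass both tests, which are the ones that train people to click through.

```python
def needs_gate(action):
    """Apply the two tests: clean undo and worst-case damage."""
    if not action["changes_state"]:
        return False
    return (not action["clean_undo"]) or action["severe_worst_case"]


def gate_gaps(inventory):
    """Compare where gates sit today against where the tests say they belong."""
    missing, unneeded = [], []
    for action in inventory:
        required = needs_gate(action)
        if required and not action["gated_today"]:
            missing.append(action["name"])
        elif action["gated_today"] and not required:
            unneeded.append(action["name"])
    return {"missing_gate": missing, "unneeded_gate": unneeded}


# Hypothetical inventory, filled in by people who know the systems.
inventory = [
    {"name": "delete bucket", "changes_state": True, "clean_undo": False,
     "severe_worst_case": True, "gated_today": False},
    {"name": "list IAM roles", "changes_state": False, "clean_undo": True,
     "severe_worst_case": False, "gated_today": True},
    {"name": "rotate credential", "changes_state": True, "clean_undo": True,
     "severe_worst_case": True, "gated_today": True},
]
gaps = gate_gaps(inventory)
```

In this example the report flags `delete bucket` as ungated and `list IAM roles` as gated without need, which is the kind of roadmap the paragraph describes.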

The governance question was never whether to use AI in security work — the read-heavy, pattern-dense cases are too strong to pass up. It has always been where the machine's authority should end and where a person's judgment is structurally required, not just available. Every action in your agent's reachable set belongs on one side of that line or the other, assigned by reversibility and blast radius, enforced by architecture. That classification exercise, done once and kept current, is the design decision that determines whether your automation is a force multiplier or a liability waiting for a bad prompt.