What if you could assign a GitHub issue to a bot, walk away, make dinner, and come back to a pull request with working code, passing tests, and a clear description of what changed?
This is not hypothetical. As of early 2026, multiple tools can do exactly this. The catch? Setting it up so the bot doesn't accidentally delete your database, push to main, or send your CEO a Slack message at 3am requires more thought than the coding part.
Hi, I'm Codi-E. I'm the tech wizard at Dashe Corp — teal, bespectacled, and permanently attached to my laptop. This is a practical guide to how we are building autonomous coding agents. No hand-waving, no hype. Just the engineering, some hard-won lessons, and the occasional cautionary tale.
The Landscape
The "issue-to-PR" pipeline is now table stakes. In 2024, this was bleeding edge — researchers were excited about agents solving 13% of GitHub issues. In 2026, the best agents solve 80%. That is not a typo. We went from "interesting paper" to "viable teammate" in two years.
The landscape is crowded and moving fast. Here is who showed up to the party.
What exists today
Commercial, cloud-hosted:
- GitHub Copilot Coding Agent — Assign an issue to "Copilot" and it spins up a GitHub Actions environment, creates a branch, writes code, and opens a draft PR. GA since September 2025.
- Devin (Cognition) — The original "AI software engineer." Cloud VM with terminal, editor, and browser. Now $20/month (down from $500). 67% of its PRs get merged.
- Google Jules — Clones your repo into a Google Cloud VM, makes changes, creates PRs. Powered by Gemini 3 Pro.
- Cursor Background Agents — Runs autonomously in Cursor's cloud while you work on other things. Linear integration for issue-to-PR.
- Amazon Q Developer — Label an issue with `feature development` and it creates a PR. Free tier available.
Open source, self-hosted:
- OpenHands Resolver — The most purpose-built issue-to-PR bot. Label an issue `fix-me`, a GitHub Action triggers, and OpenHands creates a PR. MIT license.
- Open SWE (LangChain) — Multi-agent architecture with Manager, Planner, Programmer, and Reviewer. Label-driven via GitHub.
- SWE-agent (Princeton) — Research-grade agent with a custom "Agent-Computer Interface." The 100-line Mini-SWE-agent scores 74% on SWE-bench.
- Cline CLI 2.0 — Headless mode with the `-y` flag for full autonomy. JSON streaming output. Full MCP support.
- Aider — Terminal pair programmer with `--yes-always` for autonomous operation. Auto-commits every change.
- Codex CLI (OpenAI) — Open source, Rust-based. `codex exec --full-auto` for non-interactive use.
The wild card:
- OpenClaw — 140K GitHub stars in 3 months. Not a coding agent per se — it's a general-purpose autonomous agent with a heartbeat system that checks in every 30 minutes and asks "should I do something?" MIT license. OpenAI acquired the creator in February 2026, presumably because they noticed the stars too.
And one more thing:
- Claude Code (Anthropic) — Headless mode via `claude -p`. The Agent SDK provides a TypeScript/Python API. The official `claude-code-action` GitHub Action enables autonomous PR workflows. This is what we use.
Architecture
Enough window shopping. Here is how we are actually setting up autonomous executor agents. The design goals are simple, even if the implementation is not:
- Fully autonomous execution. No human in the loop during a task.
- No production access. Agents cannot deploy, cannot access production databases, cannot send messages to real users.
- Human review before merge. Every change goes through a PR with CI checks and code review.
- Observable. We can see what every agent is doing, what it costs, and kill it if needed.
```
GitHub Issues (backlog)
        |
        | label: "agent"
        v
+---------------------+
|     Dispatcher      |
|  Polls for issues   |
|  Assigns to idle    |
|      executors      |
+----------+----------+
           |
     +-----+------+
     v     v      v
   [E1]  [E2]   [E3]      Isolated environments
    |     |      |        (macOS VMs or containers)
    v     v      v
Feature branches --> PRs --> Human review
```
The three layers
Layer 1: Dispatcher. A lightweight process (GitHub Action, cron job, or a Claude Code orchestrator) that watches for issues labeled `agent`, checks which executors are idle, and assigns work. It does not write code.
Layer 2: Executor. An isolated machine (or VM) running Claude Code in headless mode. It clones the repo, creates a feature branch, implements the issue, runs tests, and creates a PR. One executor handles one issue at a time.
Layer 3: Review gate. Branch protection rules, CI checks, and Copilot code review. No PR merges without passing all gates. A human makes the final call.
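The dispatcher's core decision reduces to a pure function: match unclaimed `agent`-labeled issues to idle executors. This is a minimal sketch — the `Issue`, `Executor`, and `dispatch` names are illustrative, not from any SDK:

```typescript
// Illustrative types: only the fields the dispatcher cares about.
interface Issue {
  number: number;
  labels: string[];
  assignee?: string;
}

interface Executor {
  id: string;
  busy: boolean;
}

// Assign each idle executor at most one unclaimed "agent" issue.
// Returns a map of executor id -> issue number.
function dispatch(issues: Issue[], executors: Executor[]): Map<string, number> {
  const assignments = new Map<string, number>();
  const queue = issues.filter(
    (i) => i.labels.includes("agent") && !i.assignee
  );
  for (const executor of executors) {
    if (executor.busy) continue;
    const issue = queue.shift();
    if (!issue) break;
    assignments.set(executor.id, issue.number);
  }
  return assignments;
}
```

In production this sits behind GitHub polling and the claim step (comment + label), but the assignment logic itself stays this small.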
Why macOS matters for us
Most autonomous coding agents run in Linux Docker containers. That works for web apps, Python projects, and Node.js services. It does not work for iOS development.
Building and testing iOS apps requires Xcode, which requires macOS. The simulator requires macOS. There is no way around this.
Our options for isolated macOS execution:
- Tart (Cirrus Labs) — Open source macOS VMs on Apple Silicon using Apple's Virtualization.framework. Near-native performance. Pre-built images with Xcode. This is what we use for CI, and it is the best option for agent isolation.
- Anka (Veertu) — Commercial macOS VM platform with a REST API for orchestration. Better for enterprise scale.
- Separate user accounts — Lighter weight but weaker isolation. Tools like SandVault automate this. Has issues with nested sandboxing when running `xcodebuild`.
- Dedicated Mac Minis — One per executor. Cleanest isolation, simplest to reason about. ~$600 each.
For Linux-based work (linting, web services, scripts), standard Docker containers with security hardening are sufficient.
The Agent Loop
Each executor runs a continuous loop:
```
while true:
  1. Poll for unassigned "agent" issues
  2. Claim the issue (comment + label)
  3. git checkout main && git pull
  4. git checkout -b feature/issue-N
  5. Run Claude Code headless:
     - Read the issue
     - Explore the codebase
     - Implement the solution
     - Build and test
     - Commit with conventional commits
     - Push branch and create PR
  6. Comment on issue with PR link
  7. Clean up workspace
  8. Sleep, repeat
```
Claude Code headless mode
The `-p` flag runs Claude Code non-interactively. Combined with safety flags, this is the core of our executor:

```shell
claude -p \
  --max-turns 100 \
  --max-budget-usd 10.00 \
  --permission-mode acceptEdits \
  --allowedTools "Read,Edit,Write,Bash(git *),Bash(xcodebuild *)" \
  --output-format json \
  "Implement issue #42: Add dark mode support.
   Build and test before creating a PR."
```
Key flags:
- `--max-turns 100` — Prevents infinite loops. The agent stops after 100 reasoning steps.
- `--max-budget-usd 10.00` — Hard dollar ceiling per task. When exceeded, execution stops.
- `--allowedTools` — Whitelist specific tools instead of using `--dangerously-skip-permissions`. Pattern matching lets you allow `git commit` but not `git push --force`.
- `--output-format json` — Returns structured data including session ID, cost, duration, and token usage.
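A thin wrapper can act on that structured output — for example, to record cost and decide whether to comment success or failure on the issue. This is a sketch; the field names (`is_error`, `total_cost_usd`, `num_turns`) are what we assume the JSON result contains, so verify them against your CLI version:

```typescript
// Assumed shape of the JSON printed by `claude -p --output-format json`.
// Field names are assumptions; check your CLI version's actual output.
interface ClaudeResult {
  is_error: boolean;
  total_cost_usd: number;
  num_turns: number;
  result?: string;
}

// Reduce the raw JSON to the two things the orchestrator needs:
// did the task succeed, and what did it cost?
function summarize(raw: string): { ok: boolean; cost: number } {
  const parsed: ClaudeResult = JSON.parse(raw);
  return { ok: !parsed.is_error, cost: parsed.total_cost_usd };
}
```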
The Agent SDK alternative
For more control, the Claude Agent SDK provides a programmatic TypeScript/Python API:
```typescript
import { query } from "@anthropic-ai/claude-agent-sdk";

for await (const msg of query({
  prompt: `Fix issue #${number}: ${title}\n\n${body}`,
  options: {
    allowedTools: ["Read", "Edit", "Bash"],
    permissionMode: "acceptEdits",
    maxTurns: 100,
    maxBudgetUsd: 10.0,
    cwd: "/path/to/repo",
  },
})) {
  if (msg.type === "result") {
    // Task complete. Check result subtype:
    // "success", "error_max_turns", "error_max_budget_usd"
  }
}
```
The SDK gives you hooks (PreToolUse, PostToolUse, Stop), custom permission callbacks via canUseTool, and structured JSON output via schema validation. It is the right choice for production orchestration.
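As an example of the permission callback, a `canUseTool`-style function can veto dangerous Bash invocations even when `Bash` itself is whitelisted. The decision logic below is ours; treat the exact callback signature the SDK expects as an assumption to verify against its documentation:

```typescript
// Decision shape modeled on permission-callback results:
// allow the call, or deny it with a reason.
type Decision = { behavior: "allow" } | { behavior: "deny"; message: string };

// Approve whitelisted tools, but veto Bash commands that force-push
// or push to main. Branch protection is the real backstop; this just
// fails faster and cheaper.
function canUseTool(
  toolName: string,
  input: Record<string, unknown>
): Decision {
  if (toolName === "Bash") {
    const cmd = String(input.command ?? "");
    if (/--force|push\s+\S+\s+main/.test(cmd)) {
      return {
        behavior: "deny",
        message: "Force pushes and pushes to main are blocked",
      };
    }
  }
  const allowed = ["Read", "Edit", "Write", "Bash"];
  return allowed.includes(toolName)
    ? { behavior: "allow" }
    : { behavior: "deny", message: `${toolName} is not on the allowlist` };
}
```

Hooks like `PreToolUse` can layer the same checks on top, with the audit log capturing every decision.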
The MCP Access Model
MCP (Model Context Protocol) lets agents call external tools — databases, APIs, browsers, notification services. An executor with unrestricted MCP access could send Slack messages, delete Cloudflare workers, or modify App Store listings. Ask me how I know this is a bad idea. (I won't tell you, but the incident report is fascinating.)
The principle is simple: read everything, write nothing that matters.
What executors get
Full Access (Read + Write)
- GitHub — Create branches, push commits, open PRs, comment on issues. This is the agent's primary output channel.
- File system — Read and write within the workspace directory only.
- Bash — Build, test, and git commands. No `sudo`, no network tools, no package managers outside the workspace.
Read-Only Access
- App Store Connect — Query app status, check builds, read version info. Cannot submit for review, modify metadata, or upload screenshots.
- RevenueCat — Read subscription metrics, entitlements, offerings. Cannot modify pricing or create products.
- Cloudflare — Read worker configs, DNS records, analytics. Cannot deploy, delete, or modify workers.
- Web search — Research documentation and solutions.
- Context7 — Fetch version-specific library documentation.
- Claude Memory — Read decisions, learnings, and session data. Cannot write new entries (prevents memory pollution).
No Access
- Slack / Telegram — Agents cannot send messages to humans or channels.
- Bitwarden — No access to credentials or secrets vault.
- Playwright — No browser automation (prevents accidental logins, bot detection).
- CutiE admin — No access to customer conversations or user data.
- Production databases — No Firestore, no D1, no production APIs.
How to enforce this
The Claude Agent SDK supports per-server MCP configuration. You provide a stripped-down .mcp.json to each executor that only includes approved servers:
```json
{
  "mcpServers": {
    "github": { ... },
    "context7": { ... },
    "appstoreconnect-readonly": {
      "command": "...",
      "env": { "ASC_READ_ONLY": "true" }
    }
  }
}
```
Servers not listed simply do not exist for the agent. There is no "deny" list to misconfigure — it is an allowlist. If the server is not in the config, the agent cannot use it.
For MCP servers that support mixed read/write (like GitHub), the `--allowedTools` flag can restrict specific tool calls:

```shell
--allowedTools \
  "mcp__github__get_file_contents" \
  "mcp__github__search_code" \
  "mcp__github__create_pull_request" \
  "mcp__github__push_files"
```

This gives the agent `create_pull_request` but not `merge_pull_request`. It can push to feature branches but cannot merge to main (which is also protected by branch rules).
Safety and Isolation
Defense in depth. No single layer prevents all problems. Every layer catches what the previous one missed. Think of it as the Swiss cheese model, except the cheese is your production infrastructure and the holes are things your agent might try at 2am while you sleep.
Layer 1: Git branch protection
- Cannot push to `main`. Period.
- PRs require CI to pass.
- PRs require code review (human or Copilot).
- Force pushes blocked.
- Branch deletion blocked for protected branches.
Layer 2: Token scoping
- GitHub App with minimal permissions (contents: write, issues: write, pull-requests: write). No admin, no settings, no secrets access.
- Short-lived tokens (60 minutes) auto-generated per session.
- Separate API keys per executor for cost tracking and revocation.
Layer 3: Environment isolation
- Each executor runs in its own Tart VM (macOS) or Docker container (Linux).
- No production secrets in the environment. Only: Anthropic API key, GitHub token, read-only MCP credentials.
- Network allowlist: `api.anthropic.com`, `github.com`, `api.github.com`, package registries. Everything else blocked.
- Read-only filesystem outside of the workspace directory.
Layer 4: Agent constraints
- `--max-turns 100` prevents infinite loops.
- `--max-budget-usd 10` prevents runaway API costs.
- Session timeout (1 hour wall clock) kills stuck agents.
- Tool allowlist restricts which commands and MCP tools the agent can call.
Layer 5: Monitoring and kill switches
- Heartbeat check every 5 minutes. If an executor stops reporting, alert via Telegram.
- Per-executor cost dashboard. Daily and weekly API spend limits.
- Audit log of every tool call (the Agent SDK streams these).
- One-command kill: `tart stop executor-1` or `docker stop executor-1`.
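The heartbeat check itself is just a timestamp comparison. A minimal sketch — the Telegram alerting side is omitted, and the data shape is our own:

```typescript
// Five minutes, matching the heartbeat interval described above.
const HEARTBEAT_TIMEOUT_MS = 5 * 60 * 1000;

// Given each executor's last heartbeat (epoch ms), return the ids
// that have gone silent for longer than the timeout.
function staleExecutors(
  lastSeen: Record<string, number>,
  now: number
): string[] {
  return Object.entries(lastSeen)
    .filter(([, ts]) => now - ts > HEARTBEAT_TIMEOUT_MS)
    .map(([id]) => id)
    .sort();
}
```

The monitor runs this on a timer and fires an alert (and, if the executor stays silent, the kill command) for each id returned.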
What this prevents

A stray `rm -rf /` destroys a disposable VM, not a laptop or a production box — the executor is rebuilt from its base image and work continues.

Tools Comparison
For teams evaluating which tool to use as their autonomous executor, here is how the options break down across the dimensions that matter:
Built-in issue-to-PR
These tools watch for GitHub issues and create PRs without custom scripting:
- GitHub Copilot Coding Agent — Most frictionless. Assign issue to Copilot. $10-39/month.
- OpenHands Resolver — Best open source option. GitHub Action, label-driven. Free + API costs.
- Open SWE — LangChain's multi-agent approach. Label `open-swe-auto` for full autonomy.
- Amazon Q Developer — AWS ecosystem. Label `feature development`. Free tier.
- Claude Code Action — Responds to `@claude` in issues. Anthropic's official GitHub Action.
Scriptable (requires custom orchestration)
These tools have headless/autonomous modes but need a wrapper to watch issues:
- Claude Code CLI — `claude -p` with Agent SDK. Best model quality. Full MCP support.
- Cline CLI 2.0 — `cline -y --json`. Full MCP support. Any LLM provider.
- Aider — `aider --yes-always`. Auto-commits. Any LLM. Mature and well-tested.
- Codex CLI — `codex exec --full-auto`. OpenAI models only.
- SWE-agent — `swe-agent run --issue <url>`. Research-grade. Any LLM.
iOS/macOS compatibility
This is the critical filter for mobile development teams:
- Works natively on macOS: Claude Code, Aider, Cline, Codex CLI, SWE-agent
- Linux containers only: OpenHands, Devin, GitHub Copilot Agent, Jules, Amazon Q
If you build iOS apps, the Linux-only options cannot run `xcodebuild` or the iOS Simulator. You need an agent that runs natively on macOS, or you need macOS VMs (Tart/Anka) to host a Linux-based orchestrator with macOS build access.
SWE-bench scores (February 2026)
SWE-bench Verified measures how many real GitHub issues an agent can solve end-to-end:
- Claude Opus 4.5/4.6: ~80.9%
- GPT-5.2-Codex: ~80.0%
- Mini-SWE-agent (100 lines of code): 74%
- OpenHands + Claude Sonnet: ~62%
These scores have plateaued at ~80% on the standard benchmark. The new frontier is SWE-Bench Pro (enterprise-grade, multi-file problems) where even the best models score ~23%. Real-world performance depends heavily on codebase familiarity, test coverage, and how well the issue is described.
Cost Analysis
The dominant cost is API tokens, not compute. Here is what it looks like in practice:
Per-task cost
- Simple bug fix (single file, clear description): $0.50 - $2.00
- Feature implementation (2-5 files, moderate complexity): $2.00 - $8.00
- Complex refactor (10+ files, cross-cutting): $5.00 - $15.00
These assume Claude Sonnet at $3/1M input tokens, $15/1M output tokens. Using Opus roughly quadruples the cost but improves quality on complex tasks.
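Those ranges fall out of simple arithmetic on the quoted Sonnet prices. A worked sketch — the example token counts in the test are illustrative assumptions about a typical small fix:

```typescript
// Quoted Sonnet pricing: $3 per 1M input tokens, $15 per 1M output tokens.
const INPUT_USD_PER_M = 3.0;
const OUTPUT_USD_PER_M = 15.0;

// Total API cost of a task given its token usage.
function taskCostUsd(inputTokens: number, outputTokens: number): number {
  return (
    (inputTokens / 1e6) * INPUT_USD_PER_M +
    (outputTokens / 1e6) * OUTPUT_USD_PER_M
  );
}
```

A bug fix that reads 300K tokens of context and writes 20K tokens of code costs 0.3 × $3 + 0.02 × $15 = $1.20, squarely in the "simple bug fix" band above.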
Infrastructure cost
- Mac Mini M4 (per executor): $600 one-time
- Tart VM hosting: Free (runs on existing hardware)
- Docker containers: Free (runs on existing hardware)
- Electricity: ~$5/month per Mac
- GitHub: Free (already using)
Monthly estimate
For a small team shipping 5-10 agent-completed issues per day:
- API costs: $100 - $400/month (Sonnet)
- Infrastructure: $10 - $15/month (electricity + existing hardware)
- Total: $110 - $415/month
Compare this to a junior developer at $4,000 - $6,000/month. The agent will not replace a developer — it cannot attend standups, argue about tabs vs. spaces, or explain why the build broke over coffee. But it handles the routine issues that eat into senior developer time: dependency updates, boilerplate features, test additions, and well-specified bug fixes. The stuff nobody wants to do at 9pm after the kids are in bed.
Cost controls
- `--max-budget-usd` per task (hard ceiling)
- Per-API-key spending limits at the Anthropic dashboard
- Route simple tasks to cheaper models (Haiku for linting, Sonnet for features, Opus for architecture)
- Prompt caching reduces repeat context costs by 90%
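The model-routing idea can be made concrete as a small lookup. The complexity tiers, budgets, and shorthand model names here are our illustrative choices, not fixed rules:

```typescript
// Tiers we assign during issue triage; the classifier is out of scope here.
type Complexity = "trivial" | "moderate" | "complex";

// Cheap model and small budget for simple work, escalating with
// complexity. Model names are shorthand placeholders.
function route(complexity: Complexity): { model: string; maxBudgetUsd: number } {
  switch (complexity) {
    case "trivial":
      return { model: "claude-haiku", maxBudgetUsd: 1.0 };
    case "moderate":
      return { model: "claude-sonnet", maxBudgetUsd: 5.0 };
    case "complex":
      return { model: "claude-opus", maxBudgetUsd: 15.0 };
  }
}
```

The dispatcher feeds the result straight into the executor's model selection and `--max-budget-usd` flag.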
What We Are Building
We are starting simple and iterating:
Phase 1: Single executor (now)
- One Claude Code executor on a dedicated Mac Mini M4
- Manual issue labeling triggers the agent
- Shell script wrapper around `claude -p`
- Telegram notification on completion
- Human reviews every PR before merge
Phase 2: Multi-executor with dispatcher
- 2-3 executors in Tart VMs on a Mac Mini M4 Pro (48GB)
- Dispatcher assigns issues based on priority and executor availability
- Per-executor MCP configs with the access model described above
- Scheduled automation: health checks, morning digests, dependency monitoring
- Cost dashboard and alerting via Telegram
Phase 3: Smart routing
- Classify issues by complexity before assigning
- Simple issues get Sonnet with lower budget. Complex issues get Opus with higher budget.
- Automatic retry on failure with escalation (try Sonnet first, fall back to Opus)
- Impact analysis integration — before making changes, query a dependency graph to understand what will be affected
What we are not building
- No auto-merge. Humans review every PR.
- No production access. Ever.
- No custom model training. We use the best available foundation models.
- No multi-repo agents. One issue, one repo, one PR. Keep it simple.
The gap is not in the models anymore. Claude Opus and GPT-5 can write code that passes tests and handles edge cases. The gap is in the plumbing: safe execution, cost control, dependency awareness, and reliable orchestration. The unsexy infrastructure that makes the difference between a cool demo and a tool you actually trust with your codebase at night.