Every time a customer reports a bug, you lose twice. Once when the bug happens, and again when you drop everything to fix it. For a solo founder running a services business, that second cost is brutal.
We spent months watching this pattern repeat. A customer would message. Someone would read it. Track down the code. Write a fix. Create a PR. Merge it. Deploy it. Tell the customer it's sorted. Thirty to sixty minutes, gone.
When you're building AI for other people, you notice your own operational inefficiencies pretty quickly. So we built something to fix it. Not a full replacement for human judgment (a human still needs to approve any code change), but something that automates the tedious 90% so we can focus on the 10% that actually needs a human brain.
The Old Workflow
It looked like this:
- Customer message comes in via chat widget (usually 2-3 sentences, sometimes a screenshot)
- Someone reads it, interprets whether it's actually a bug or just a feature request
- If it's a bug, they jump to the codebase and start searching for the relevant code
- Root cause analysis — digging through logs, checking recent commits, running local tests
- Write the fix (could be one line or twenty, depending)
- Create a PR, write tests, get it approved
- Merge and deploy
- Tell the customer it's fixed
Most of these steps are mechanical. Parse the problem, find the code, apply the solution. A computer is genuinely better at this than a human. But it needs a human to verify the final output is sensible before it ships.
What If the AI Could Just... Do It?
The lightbulb moment was simple: AI is excellent at code. It's mediocre at judgment. So we built a system where AI does all the code work, and humans do all the judgment work.
Here's how it works now:
Step 1: Intake and Triage
Customer message hits the chat widget. An AI reads it and classifies it: is this a bug, a feature request, or just a question? For triage, we use a lightweight LLM call (Claude 3.5 Haiku, ~$0.001 per request). Accuracy is high because the categories are clear-cut.
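A minimal sketch of that triage step. The prompt wording, label set, and the `call_model` stub are illustrative (the real system makes a Claude 3.5 Haiku API call); the important part is normalizing the model's free-text answer and routing anything unexpected to a human:

```python
# Hypothetical triage sketch. call_model is a stand-in for the real LLM call.
TRIAGE_PROMPT = (
    "Classify the customer message as exactly one of: "
    "bug, feature_request, question.\n\nMessage: {message}\nLabel:"
)

VALID_LABELS = {"bug", "feature_request", "question"}

def call_model(prompt: str) -> str:
    # Stand-in for the API call; returns the raw completion text.
    return "bug"

def triage(message: str, classify=call_model) -> str:
    raw = classify(TRIAGE_PROMPT.format(message=message))
    label = raw.strip().lower().replace(" ", "_")
    # Anything the model can't cleanly label goes to manual review.
    return label if label in VALID_LABELS else "needs_human_review"

print(triage("Checkout fails with a 500 when I apply a coupon"))  # bug
```

Treating an unrecognized label as "needs human review" is what keeps the 2% of ambiguous messages from being silently misfiled.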
Step 2: Code Diagnosis
If it's a bug, the AI gets read access to our GitHub repository. It uses the GitHub API to pull down relevant files based on the error message or description. It then analyzes the code, runs a mental simulation, and identifies the likely root cause.
This is where we use Claude 3.5 Sonnet (more expensive, more accurate for complex reasoning). The AI doesn't just guess — it produces a detailed diagnosis: "The bug is in the payment webhook handler. It's checking for the presence of a transaction ID in the wrong place, so legitimate transactions are being marked as duplicates."
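The retrieval half of this step can be sketched without any API calls. The real system fetches files through the GitHub API and lets the model reason over them; here, a deliberately naive heuristic ranks candidate files by word overlap with the bug report (all names and file contents below are hypothetical):

```python
import re

def rank_files(report: str, files: dict, top_n: int = 3) -> list:
    """Rank repo files by how many words they share with the bug report.

    `files` maps path -> file contents, standing in for files fetched
    via the GitHub API. Real retrieval would be much smarter than this.
    """
    words = lambda text: set(re.findall(r"[a-z]{4,}", text.lower()))
    report_words = words(report)

    def score(item):
        _path, body = item
        return len(report_words & words(body))

    ranked = sorted(files.items(), key=score, reverse=True)
    return [path for path, _ in ranked[:top_n]]

repo = {
    "payments/webhook.py": "def handle_payment_webhook(txn):\n"
                           "    if txn.get('id'):\n"
                           "        mark_duplicate(txn)",
    "ui/button.py": "def render_button(color):\n    return color",
}
print(rank_files("payment webhook marks transaction as duplicate", repo, top_n=1))
```

Even this crude overlap score is enough to show why narrowing the context matters: the model diagnoses the webhook file, not the whole codebase.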
Step 3: Fix Generation
Once the AI has diagnosed the issue, it generates an exact code fix. Not pseudocode, not suggestions — actual, deployable code. It produces a unified diff showing exactly what lines changed.
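Producing that unified diff is mechanical once you have the before and after versions of a file. A sketch using Python's standard `difflib` (the file path and snippets are made up for illustration):

```python
import difflib

def make_diff(path: str, before: str, after: str) -> str:
    """Produce a unified diff for one file, like the one shown for approval."""
    lines = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(lines)

before = "if txn.get('id'):\n    mark_duplicate(txn)\n"
after = "if txn.get('transaction_id') is None:\n    mark_duplicate(txn)\n"
print(make_diff("payments/webhook.py", before, after))
```

The `a/`-`b/` prefixes match Git's diff convention, so the same text can be pasted straight into a PR description or applied with standard tooling.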
Step 4: Human Approval
The fix appears on our internal dashboard. We read it. We think "yep, that's right" or "wait, that's not actually the issue." The dashboard shows a side-by-side diff, so approval takes 30 seconds instead of 30 minutes of code review.
Step 5: Automatic Deployment
One click. The AI creates a Git branch, applies the changes, opens a PR automatically, merges it (because we trust it at this point), and deploys to production. We then notify the customer: "Fixed and live. Thanks for reporting that."
Total time from message to deployed fix: under 60 seconds of human time. The machine handles the other 29 minutes.
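The deployment click expands into a short sequence of GitHub REST API calls. A dry-run sketch that just builds the request plan (the repo name, branch scheme, and fix ID are hypothetical; the endpoint paths follow GitHub's REST API):

```python
def plan_deployment(fix_id: str, base_sha: str, title: str) -> list:
    """Build the GitHub API calls for one approved fix, without sending them."""
    branch = f"autofix/{fix_id}"
    return [
        {   # 1. Create a branch off the default branch's current commit.
            "method": "POST",
            "path": "/repos/acme/app/git/refs",
            "body": {"ref": f"refs/heads/{branch}", "sha": base_sha},
        },
        {   # 2. Open the pull request from that branch.
            "method": "POST",
            "path": "/repos/acme/app/pulls",
            "body": {"title": title, "head": branch, "base": "main"},
        },
    ]

for call in plan_deployment("1842", "4f2a9c1", "Fix duplicate-transaction check"):
    print(call["method"], call["path"])
```

Keeping the plan as data rather than firing requests inline makes the step easy to log, retry, and audit, which matters when a bot is opening PRs against production.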
The Architecture
The stack is simpler than you'd think:
- 7 database tables storing message history, diagnostic reports, fixes, and approvals
- 2 API integrations: GitHub (to read code and create PRs) and our internal deployment system
- 5 AI prompt templates for triage, diagnosis, fix generation, validation, and commit message writing
- 1 internal dashboard (single-page app in Vue) for approving or rejecting fixes
- 1 message queue (Redis) for handling async API calls so the chat widget never blocks
The whole thing runs on a single AWS EC2 instance. It costs about $140/month to operate. Our chat widget is 8 KB with zero dependencies: just vanilla JS posting JSON to this backend.
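The queue is what keeps the widget snappy. A toy version of that handoff, with an in-process deque standing in for Redis (the real system uses LPUSH/BRPOP so the endpoint returns before any AI work starts; the payload fields are illustrative):

```python
import json
from collections import deque

# Stand-in for the Redis list; swap for redis-py LPUSH/BRPOP in production.
queue = deque()

def handle_widget_post(message, customer_id):
    """What the chat endpoint does: enqueue the job and acknowledge. Nothing else."""
    queue.appendleft(json.dumps({"customer_id": customer_id, "message": message}))
    return {"status": "queued"}

def worker_step():
    """One iteration of the worker loop: pop the oldest job for triage."""
    if not queue:
        return None
    return json.loads(queue.pop())

handle_widget_post("Coupons throw a 500 at checkout", "cust_42")
print(worker_step())
```

Because the endpoint only serializes and enqueues, its latency is independent of how long diagnosis takes.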
What We Measure
Triage accuracy: The AI correctly classifies a message as a bug, feature request, or question about 98% of the time. The other 2% are edge cases (like "your docs are confusing" — is that a bug or feedback?). We manually review those.
Fix generation: When the AI produces code, it's deployable code in roughly 95% of cases. The other 5% are close but need a small human tweak. Still faster than writing from scratch.
Time to deploy: Median 90 seconds from customer message to "fixed in production." That includes time for the AI to think, for us to notice the dashboard notification, and for us to click approve.
Cost per fix: About $0.08 in API calls (AI models, GitHub API), and essentially zero human time beyond the approval click. Compare that to the old cost of 30-60 minutes of developer time.
What Was Actually Hard
The code part was easy. The hard part was the prompting.
Getting the AI to diagnose bugs accurately requires being incredibly specific about what information to pass it. If you just dump the entire codebase and say "fix the bug," it'll generate plausible-sounding code that's completely wrong. You need to teach it to ask clarifying questions, to search for specific error patterns, to trace code execution paths.
We spent weeks refining the diagnostic prompt. The current version is about 600 tokens of instructions, context, and examples. It explains things like: "If the error mentions a null reference, always check variable initialization. If it mentions a timeout, look for unoptimized queries."
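The shape of that prompt assembly can be sketched in a few lines. The two heuristics below are the ones quoted above; everything else (the wrapper text, the file-context format) is a simplified stand-in for the real ~600-token prompt:

```python
# Illustrative heuristics; the real prompt carries many more, plus examples.
HEURISTICS = [
    "If the error mentions a null reference, always check variable initialization.",
    "If it mentions a timeout, look for unoptimized queries.",
]

def build_diagnostic_prompt(report: str, files: dict) -> str:
    """Assemble instructions, heuristics, and code context into one prompt."""
    rules = "\n".join(f"- {h}" for h in HEURISTICS)
    context = "\n\n".join(f"# {path}\n{body}" for path, body in files.items())
    return (
        "You are diagnosing a bug. Follow these rules:\n"
        f"{rules}\n\n"
        f"Bug report:\n{report}\n\n"
        f"Relevant files:\n{context}\n\n"
        "Respond with the root cause and the file and line it lives in."
    )

print(build_diagnostic_prompt("timeout on /checkout", {"db.py": "SELECT * FROM orders"}))
```

Keeping heuristics in a list rather than buried in one string is what made weeks of prompt iteration tractable: each rule can be added, reordered, or removed independently.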
The second hard thing was handling false positives. If we deploy bad code once, customers notice immediately. So we built validation steps: the fix has to compile, it has to pass our test suite, and it has to not introduce obvious new problems. That validation is itself an AI step (Claude reviews the fix before we even show it to a human).
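Those validation gates chain naturally into one pass/fail check. A sketch for a Python codebase, where `run_tests` stands in for the real test suite and the banned-pattern list is a made-up example of an "obvious new problems" check (the AI review step is omitted here):

```python
def validate_fix(patched_source: str, run_tests=lambda: True) -> list:
    """Pre-approval gate (sketch): collect every reason to reject a fix."""
    problems = []
    try:
        # The fix must at least parse before anything else runs.
        compile(patched_source, "<fix>", "exec")
    except SyntaxError as e:
        problems.append(f"does not compile: {e.msg}")
    # Cheap screen for obviously dangerous additions (illustrative list).
    for banned in ("eval(", "os.system("):
        if banned in patched_source:
            problems.append(f"introduces banned call: {banned}")
    if not run_tests():
        problems.append("test suite failed")
    return problems  # empty list == safe to show a human

print(validate_fix("x = 1"))
```

Returning a list of problems instead of a bare boolean means the dashboard can show the human reviewer why a fix was held back.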
What Surprised Us
We thought we'd need to retrain or fine-tune a model. We didn't. Claude out-of-the-box handles code diagnosis and generation at a level we didn't expect. No custom training, no specialised model, just good prompting.
We also thought we'd need a human in the loop more often than we do. In practice, the AI's fixes are right often enough that approval feels routine. It's more like "I glanced at this and it's obviously correct" rather than "I'm carefully reviewing this code change."
What Comes Next
Right now, a human still explicitly approves every fix. We could probably auto-approve low-risk fixes (changing a text error message, adding a simple validation check). But there's a good reason to keep the approval gate: it forces us to actually understand what's broken instead of just trusting the AI.
The other direction is expanding it beyond bug fixes. Feature requests and performance improvements could follow the same pattern. "Customer wants dark mode" could become an AI PR that implements dark mode CSS, asks for approval, and ships.
We also want to expose this to customers as a service. Not every company can build this. But many could benefit from it. That's the kind of system we build for clients — not "AI makes your decisions for you," but "AI handles the tedious parts so humans can focus on the important parts."
The Real Insight
This whole thing only works because we kept humans in charge. The AI isn't the system — the AI plus the human approval process is the system. We could have tried to fully automate bug fixes and deploy directly. We didn't because judgment still matters.
That's actually the north star for how we think about AI in business. Not "how do we replace humans" but "how do we eliminate the tedious 90% so humans can focus on the judgment that actually matters."
Every business has support workflows that look like this. Someone reads a request, interprets it, finds the relevant information, makes a decision, and communicates the result. If your version takes 30-60 minutes and involves a lot of searching and context switching, it's probably ripe for AI.