Every time a customer reports a bug, you lose twice. Once when the bug happens, and again when you drop everything to fix it. For a solo founder running a services business, that second cost is brutal.
We spent months watching this pattern repeat. A customer would message. Someone would read it. Track down the code. Write a fix. Create a PR. Merge it. Deploy it. Tell the customer it's sorted. Thirty to sixty minutes, gone.
When you're building AI for other people, you notice your own operational inefficiencies pretty quickly. So we built something to fix it. Not a full replacement for human judgment (a human still needs to approve any code change), but something that automates the tedious 90% so we can focus on the 10% that actually needs a human brain.
The Old Workflow
It looked like this:
- Customer message comes in via chat widget (usually 2-3 sentences, sometimes a screenshot)
- Someone reads it, interprets whether it's actually a bug or just a feature request
- If it's a bug, they jump to the codebase and start searching for the relevant code
- Root cause analysis — digging through logs, checking recent commits, running local tests
- Write the fix (could be one line or twenty, depending)
- Create a PR, write tests, get it approved
- Merge and deploy
- Tell the customer it's fixed
Most of these steps are mechanical. Parse the problem, find the code, apply the solution. A computer is genuinely better at this than a human. But it needs a human to verify the final output is sensible before it ships.
What If the AI Could Just... Do It?
The lightbulb moment was simple: AI is excellent at code. It's mediocre at judgment. So we built a system where AI does all the code work, and humans do all the judgment work.
Here's how it works now:
Step 1: Intake and Triage
Customer message hits the chat widget. An AI reads it and classifies it: is this a bug, a feature request, or just a question? For triage, we use a lightweight LLM call (Claude 3.5 Haiku, ~$0.001 per request). Accuracy is high because the categories are clear-cut.
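A minimal sketch of that triage step. The prompt wording, label set, and the `call_model` stub are illustrative (the real system makes a Claude 3.5 Haiku API call); the important part is normalizing the model's free-text answer and routing anything unexpected to a human:

```python
# Hypothetical triage sketch. call_model is a stand-in for the real LLM call.
TRIAGE_PROMPT = (
    "Classify the customer message as exactly one of: "
    "bug, feature_request, question.\n\nMessage: {message}\nLabel:"
)

VALID_LABELS = {"bug", "feature_request", "question"}

def call_model(prompt: str) -> str:
    # Stand-in for the API call; returns the raw completion text.
    return "bug"

def triage(message: str, classify=call_model) -> str:
    raw = classify(TRIAGE_PROMPT.format(message=message))
    label = raw.strip().lower().replace(" ", "_")
    # Anything the model can't cleanly label goes to manual review.
    return label if label in VALID_LABELS else "needs_human_review"

print(triage("Checkout fails with a 500 when I apply a coupon"))  # bug
```

Treating an unrecognized label as "needs human review" is what keeps the 2% of ambiguous messages from being silently misfiled.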
Step 2: Code Diagnosis
If it's a bug, the AI gets read access to our GitHub repository. It uses the GitHub API to pull down relevant files based on the error message or description. It then analyzes the code, runs a mental simulation, and identifies the likely root cause.
This is where we use Claude 3.5 Sonnet (more expensive, more accurate for complex reasoning). The AI doesn't just guess — it produces a detailed diagnosis: "The bug is in the payment webhook handler. It's checking for the presence of a transaction ID in the wrong place, so legitimate transactions are being marked as duplicates."
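The retrieval half of this step can be sketched without any API calls. The real system fetches files through the GitHub API and lets the model reason over them; here, a deliberately naive heuristic ranks candidate files by word overlap with the bug report (all names and file contents below are hypothetical):

```python
import re

def rank_files(report: str, files: dict, top_n: int = 3) -> list:
    """Rank repo files by how many words they share with the bug report.

    `files` maps path -> file contents, standing in for files fetched
    via the GitHub API. Real retrieval would be much smarter than this.
    """
    words = lambda text: set(re.findall(r"[a-z]{4,}", text.lower()))
    report_words = words(report)

    def score(item):
        _path, body = item
        return len(report_words & words(body))

    ranked = sorted(files.items(), key=score, reverse=True)
    return [path for path, _ in ranked[:top_n]]

repo = {
    "payments/webhook.py": "def handle_payment_webhook(txn):\n"
                           "    if txn.get('id'):\n"
                           "        mark_duplicate(txn)",
    "ui/button.py": "def render_button(color):\n    return color",
}
print(rank_files("payment webhook marks transaction as duplicate", repo, top_n=1))
```

Even this crude overlap score is enough to show why narrowing the context matters: the model diagnoses the webhook file, not the whole codebase.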
Step 3: Fix Generation
Once the AI has diagnosed the issue, it generates an exact code fix. Not pseudocode, not suggestions — actual, deployable code. It produces a unified diff showing exactly what lines changed.
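Producing that unified diff is mechanical once you have the before and after versions of a file. A sketch using Python's standard `difflib` (the file path and snippets are made up for illustration):

```python
import difflib

def make_diff(path: str, before: str, after: str) -> str:
    """Produce a unified diff for one file, like the one shown for approval."""
    lines = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(lines)

before = "if txn.get('id'):\n    mark_duplicate(txn)\n"
after = "if txn.get('transaction_id') is None:\n    mark_duplicate(txn)\n"
print(make_diff("payments/webhook.py", before, after))
```

The `a/`-`b/` prefixes match Git's diff convention, so the same text can be pasted straight into a PR description or applied with standard tooling.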
Step 4: Human Approval
The fix appears on our internal dashboard. We read it. We think "yep, that's right" or "wait, that's not actually the issue." The dashboard shows a side-by-side diff, so approval takes 30 seconds instead of 30 minutes of code review.
Step 5: Automatic Deployment
One click. The AI creates a Git branch, applies the changes, opens a PR automatically, merges it (because we trust it at this point), and deploys to production. We then notify the customer: "Fixed and live. Thanks for reporting that."
Total time from message to deployed fix: under 60 seconds of human time. The machine handles the other 29 minutes.
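The deployment click expands into a short sequence of GitHub REST API calls. A dry-run sketch that just builds the request plan (the repo name, branch scheme, and fix ID are hypothetical; the endpoint paths follow GitHub's REST API):

```python
def plan_deployment(fix_id: str, base_sha: str, title: str) -> list:
    """Build the GitHub API calls for one approved fix, without sending them."""
    branch = f"autofix/{fix_id}"
    return [
        {   # 1. Create a branch off the default branch's current commit.
            "method": "POST",
            "path": "/repos/acme/app/git/refs",
            "body": {"ref": f"refs/heads/{branch}", "sha": base_sha},
        },
        {   # 2. Open the pull request from that branch.
            "method": "POST",
            "path": "/repos/acme/app/pulls",
            "body": {"title": title, "head": branch, "base": "main"},
        },
    ]

for call in plan_deployment("1842", "4f2a9c1", "Fix duplicate-transaction check"):
    print(call["method"], call["path"])
```

Keeping the plan as data rather than firing requests inline makes the step easy to log, retry, and audit, which matters when a bot is opening PRs against production.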
The Architecture
The stack is simpler than you'd think:
- 7 database tables storing message history, diagnostic reports, fixes, and approvals
- 2 API integrations: GitHub (to read code and create PRs) and our internal deployment system
- 5 AI prompt templates for triage, diagnosis, fix generation, validation, and commit message writing
- 1 internal dashboard (single-page app in Vue) for approving or rejecting fixes
- 1 message queue (Redis) for handling async API calls so the chat widget never blocks
The whole thing runs on a single AWS EC2 instance. It costs about $140/month to operate. Our chat widget is 8 KB with zero dependencies: just vanilla JS posting JSON to this backend.
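The queue is what keeps the widget snappy. A toy version of that handoff, with an in-process deque standing in for Redis (the real system uses LPUSH/BRPOP so the endpoint returns before any AI work starts; the payload fields are illustrative):

```python
import json
from collections import deque

# Stand-in for the Redis list; swap for redis-py LPUSH/BRPOP in production.
queue = deque()

def handle_widget_post(message, customer_id):
    """What the chat endpoint does: enqueue the job and acknowledge. Nothing else."""
    queue.appendleft(json.dumps({"customer_id": customer_id, "message": message}))
    return {"status": "queued"}

def worker_step():
    """One iteration of the worker loop: pop the oldest job for triage."""
    if not queue:
        return None
    return json.loads(queue.pop())

handle_widget_post("Coupons throw a 500 at checkout", "cust_42")
print(worker_step())
```

Because the endpoint only serializes and enqueues, its latency is independent of how long diagnosis takes.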
What We Measure
Triage accuracy: The AI correctly classifies a message as a bug, feature request, or question about 98% of the time. The other 2% are edge cases (like "your docs are confusing" — is that a bug or feedback?). We manually review those.
Fix generation: When the AI produces code, it's deployable code in roughly 95% of cases. The other 5% are close but need a small human tweak. Still faster than writing from scratch.
Time to deploy: Median 90 seconds from customer message to "fixed in production." That includes time for the AI to think, for us to notice the dashboard notification, and for us to click approve.
Cost per fix: About $0.08 in API calls (AI models, GitHub API), and essentially zero human time beyond the approval click. Compare that to the old cost of 30-60 minutes of developer time.
What Was Actually Hard
The code part was easy. The hard part was the prompting.
Getting the AI to diagnose bugs accurately requires being incredibly specific about what information to pass it. If you just dump the entire codebase and say "fix the bug," it'll generate plausible-sounding code that's completely wrong. You need to teach it to ask clarifying questions, to search for specific error patterns, to trace code execution paths.
We spent weeks refining the diagnostic prompt. The current version is about 600 tokens of instructions, context, and examples. It explains things like: "If the error mentions a null reference, always check variable initialization. If it mentions a timeout, look for unoptimized queries."
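The shape of that prompt assembly can be sketched in a few lines. The two heuristics below are the ones quoted above; everything else (the wrapper text, the file-context format) is a simplified stand-in for the real ~600-token prompt:

```python
# Illustrative heuristics; the real prompt carries many more, plus examples.
HEURISTICS = [
    "If the error mentions a null reference, always check variable initialization.",
    "If it mentions a timeout, look for unoptimized queries.",
]

def build_diagnostic_prompt(report: str, files: dict) -> str:
    """Assemble instructions, heuristics, and code context into one prompt."""
    rules = "\n".join(f"- {h}" for h in HEURISTICS)
    context = "\n\n".join(f"# {path}\n{body}" for path, body in files.items())
    return (
        "You are diagnosing a bug. Follow these rules:\n"
        f"{rules}\n\n"
        f"Bug report:\n{report}\n\n"
        f"Relevant files:\n{context}\n\n"
        "Respond with the root cause and the file and line it lives in."
    )

print(build_diagnostic_prompt("timeout on /checkout", {"db.py": "SELECT * FROM orders"}))
```

Keeping heuristics in a list rather than buried in one string is what made weeks of prompt iteration tractable: each rule can be added, reordered, or removed independently.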
The second hard thing was handling false positives. If we deploy bad code once, customers notice immediately. So we built validation steps: the fix has to compile, it has to pass our test suite, and it has to not introduce obvious new problems. That validation is itself an AI step (Claude reviews the fix before we even show it to a human).
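Those validation gates chain naturally into one pass/fail check. A sketch for a Python codebase, where `run_tests` stands in for the real test suite and the banned-pattern list is a made-up example of an "obvious new problems" check (the AI review step is omitted here):

```python
def validate_fix(patched_source: str, run_tests=lambda: True) -> list:
    """Pre-approval gate (sketch): collect every reason to reject a fix."""
    problems = []
    try:
        # The fix must at least parse before anything else runs.
        compile(patched_source, "<fix>", "exec")
    except SyntaxError as e:
        problems.append(f"does not compile: {e.msg}")
    # Cheap screen for obviously dangerous additions (illustrative list).
    for banned in ("eval(", "os.system("):
        if banned in patched_source:
            problems.append(f"introduces banned call: {banned}")
    if not run_tests():
        problems.append("test suite failed")
    return problems  # empty list == safe to show a human

print(validate_fix("x = 1"))
```

Returning a list of problems instead of a bare boolean means the dashboard can show the human reviewer why a fix was held back.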
What Surprised Us
We thought we'd need to retrain or fine-tune a model. We didn't. Claude out-of-the-box handles code diagnosis and generation at a level we didn't expect. No custom training, no specialised model, just good prompting.
We also thought we'd need a human in the loop more often than we do. In practice, the AI's fixes are right often enough that approval feels routine. It's more like "I glanced at this and it's obviously correct" rather than "I'm carefully reviewing this code change."
What Comes Next
Right now, a human still explicitly approves every fix. We could probably auto-approve low-risk fixes (changing a text error message, adding a simple validation check). But there's a good reason to keep the approval gate: it forces us to actually understand what's broken instead of just trusting the AI.
The other direction is expanding it beyond bug fixes. Feature requests and performance improvements could follow the same pattern. "Customer wants dark mode" could become an AI PR that implements dark mode CSS, asks for approval, and ships.
We also want to expose this to customers as a service. Not every company can build this. But many could benefit from it. That's the kind of system we build for clients — not "AI makes your decisions for you," but "AI handles the tedious parts so humans can focus on the important parts."
The Real Insight
This whole thing only works because we kept humans in charge. The AI isn't the system — the AI plus the human approval process is the system. We could have tried to fully automate bug fixes and deploy directly. We didn't because judgment still matters.
That's actually the north star for how we think about AI in business. Not "how do we replace humans" but "how do we eliminate the tedious 90% so humans can focus on the judgment that actually matters."
Every business has support workflows that look like this. Someone reads a request, interprets it, finds the relevant information, makes a decision, and communicates the result. If your version takes 30-60 minutes and involves a lot of searching and context switching, it's probably ripe for AI.