I Let AI Agents Upgrade My Production Code (Here's What Actually Happened)
Hi there,
Last night I tried something unusual: I let AI agents handle a production dependency upgrade from start to finish.
Not just the code changes—the entire process. Testing, QA, documentation, and deployment decisions.
What happened surprised me.
The Setup
I needed to upgrade 5 LangGraph implementations to version 1.0. These are running in production for Always Cool AI, handling compliance checks for kosher and FDA regulations.
32 tests needed to keep passing. Custom integrations needed to keep working. No downtime allowed.
I used Claude Code (Anthropic's CLI) with specialized subagents to see how well AI could handle this.
The First Surprise
The agent didn't immediately start making changes. It verified my actual codebase first.
Turns out I was already using 1.0-compatible patterns without realizing it. The agent saved me hours of unnecessary refactoring by checking BEFORE suggesting changes.
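For context, think of each service as a LangGraph StateGraph with nodes that ask a model for structured output validated by a Zod schema. Here's a stripped-down, hypothetical sketch of that shape (the names and logic are made up, not the production compliance code):

```ts
// Hypothetical sketch only. The real graphs have more nodes, retries, and checks.
import { StateGraph, Annotation, START, END } from "@langchain/langgraph";
import { ChatOpenAI } from "@langchain/openai";
import { z } from "zod";

// The structured result we want back from the model. A Zod schema like this is
// exactly the kind of thing the later Zod v3 vs v4 argument was about.
const ComplianceResult = z.object({
  compliant: z.boolean(),
  issues: z.array(z.string()),
});

// Graph state: what flows between nodes.
const GraphState = Annotation.Root({
  labelText: Annotation<string>(),
  result: Annotation<z.infer<typeof ComplianceResult> | null>(),
});

const model = new ChatOpenAI({ model: "gpt-4o-mini" });

export const graph = new StateGraph(GraphState)
  .addNode("check", async (state) => {
    // withStructuredOutput hands the Zod schema down to the OpenAI client,
    // which is where SDK/Zod version mismatches eventually surface.
    const grader = model.withStructuredOutput(ComplianceResult);
    const result = await grader.invoke(
      `Review this product label for compliance issues:\n${state.labelText}`
    );
    return { result };
  })
  .addEdge(START, "check")
  .addEdge("check", END)
  .compile();

// Usage: await graph.invoke({ labelText: "..." });
```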
Dependency Hell
Installing LangGraph 1.0 triggered a cascade:
Peer dependency conflict with @langchain/core
Which needed @langchain/openai 1.0
Which needed Zod v4
Which needed OpenAI SDK v6
The agent handled it iteratively. Update one. Hit error. Learn. Update next. Hit error. Learn. Each mistake taught it something.
Very human-like debugging.
When Agents Disagreed
Here's where it got interesting.
I asked the agent to validate the upgrade. It spun up TWO specialized subagents:
EvidenceQA (prove everything with tests):
Ran 32 tests. Built production bundle. Checked runtime behavior.
Verdict: ✅ Ship it.
code-reviewer (analyze code quality):
Reviewed all 27 changed files. Checked dependency safety.
Verdict: ❌ Critical issues. Revert to Zod v3.
They contradicted each other.
So I made them prove it. We dug into the peer dependencies and found TWO versions of the OpenAI SDK installed. The old one didn't support Zod v4. The new one did.
EvidenceQA was right. The runtime tests proved it.
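If you ever need to settle the same kind of argument, `npm ls <package> --all --json` dumps every resolved copy of a package in the install tree. Here's a small TypeScript sketch (my own illustration, not something the agents wrote) that flags duplicate installs:

```ts
// Walk `npm ls --json` output and report every distinct version of a package
// that ended up in node_modules. Two copies of openai was our smoking gun.
import { execSync } from "node:child_process";

function installedVersions(pkg: string): string[] {
  let raw: string;
  try {
    raw = execSync(`npm ls ${pkg} --all --json`, { encoding: "utf8" });
  } catch (err: any) {
    // npm ls exits non-zero when the tree has problems (e.g. peer conflicts),
    // which is exactly when you care. The JSON is still on stdout.
    raw = err.stdout;
  }

  const versions = new Set<string>();
  const walk = (node: { dependencies?: Record<string, any> }) => {
    for (const [name, dep] of Object.entries(node.dependencies ?? {})) {
      if (name === pkg && dep.version) versions.add(dep.version);
      walk(dep);
    }
  };
  walk(JSON.parse(raw));
  return [...versions];
}

console.log("openai:", installedVersions("openai")); // more than one entry = trouble
console.log("zod:", installedVersions("zod"));
```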
The Lesson
Having two agents disagree was actually VALUABLE. It forced verification with evidence rather than assumptions.
When AI agents contradict each other, make them show their work. The one with evidence wins.
Final Results
3 hours total
28 files changed
32/32 tests passing
Zero errors
Deployed successfully
Would I do this again? Absolutely.
But with caveats.
What AI Agents Did Well:
✅ Dependency resolution
✅ Bulk code changes (21 files)
✅ Testing and validation
✅ Systematic cleanup
✅ Documentation generation
What Needed Human Oversight:
⚠️ Deciding between conflicting advice
⚠️ Verifying custom integrations
⚠️ Understanding business impact
⚠️ Final deployment decisions
The Big Insight
AI agents are like incredibly productive junior engineers.
They can execute complex plans, debug iteratively, and handle systematic refactoring. But they need senior oversight.
They made mistakes. Missed one Zod error. One agent gave wrong advice. But their mistakes were RECOVERABLE because they could test their own fixes.
That's the key difference.
The Future
This isn't about AI replacing developers. It's about AI making developers more effective.
The agents handled systematic work (dependency updates, bulk refactoring, testing). I made judgment calls (which agent to trust, whether to deploy, how to handle custom integrations).
That's the collaboration model that works.
Full Story
I wrote a detailed technical post with all the code examples, agent tools used, and lessons learned: https://colinmcnamara.com/blog/ai-agents-langgraph-upgrade
Including the part where I made two AI agents prove their contradictory claims with evidence.
Worth a read if you're curious about practical AI agent use for real development work.
Best,
Colin
P.S. The agents used 40+ bash commands, 20+ file reads, 15+ edits, and launched two specialized subagents. All tracked with an AI-maintained to-do list. The workflow was methodical: Search → Read → Edit → Test → Repeat.