title: "Case Study: Team Migration from Python 2 to Python 3" last_updated: 2026-03-21 status: experimental difficulty: advanced tags: [team, migration, legacy]
Case Study: Team Migration from Python 2 to Python 3
Context
- Project type: Internal operations platform (inventory management, order processing, reporting)
- Team size: 4 developers
- Tools used: Claude Code, parallel worktrees, custom hooks, shared CLAUDE.md
- Duration: 3 weeks (projected 6 weeks without agent assistance)
- Codebase: ~50,000 lines of Python 2.7, 12 major modules, 340 source files, ~60% test coverage
The team — two senior developers and two mid-level — maintained an internal platform that had been running on Python 2.7 since 2016. The Python 2 end-of-life had been acknowledged for years, but the migration kept getting deprioritized. Now a critical dependency was dropping Python 2 support entirely, forcing the issue. The team had a 6-week window before the dependency update was mandatory.
The codebase was typical of long-lived internal tools: mostly well-structured, but with pockets of code that relied on Python 2 idioms — print statements instead of functions, unicode and str type handling, dictionary methods that returned lists instead of views, old-style class definitions, and scattered has_key() calls. More concerning were the dynamic typing patterns: functions that accepted both strings and byte sequences, implicit integer division in financial calculations, and except Exception, e syntax throughout.
No one on the team had done a Python 2 to 3 migration at this scale before.
The Challenge
A 50,000-line migration is not conceptually difficult — most Python 2 to 3 changes are mechanical. The challenge is volume and verification. Mechanical changes across 340 files must all be correct. The 40% of the codebase without test coverage could silently break. And several modules had subtle runtime behavior differences between Python 2 and 3 that automated tools like 2to3 handle incorrectly or not at all.
The team estimated 6 weeks for a manual migration: 2 weeks to convert, 2 weeks to test and fix, 2 weeks as buffer. They had exactly 6 weeks. No margin for error.
The Approach
Week 0: Preparation (2 days, before the migration clock started)
The team lead, David, spent two days before the migration officially began setting up the agentic workflow infrastructure.
Shared CLAUDE.md with migration rules. David wrote a CLAUDE.md that served as the migration rulebook. It was not a general project memory file — it was purpose-built for the migration. It contained:
- The migration goal (Python 2.7 to Python 3.11)
- A table of specific transformations:
printstatements to functions,unicode()tostr(),.has_key()toin,dict.iteritems()todict.items(), old-style classes to new-style,except Exception, etoexcept Exception as e - Rules for the tricky cases: how to handle mixed string/bytes functions, how to convert integer division, how to handle
__future__imports - A "do not touch" list: three modules that had been flagged for manual-only migration due to complex runtime behavior
- The test command and how to run tests for a specific module
- The commit message convention:
migrate(module-name): description
This file was 78 lines — longer than the typical project memory file, but justified by the specialized nature of the work. Every developer on the team and every agent session would read it.
Agent team structure. David defined three agent roles:
- Exploration agent — Maps module dependencies, identifies Python 2-specific patterns, and produces a migration difficulty report for each module
- Implementation agents — One per developer, each handling assigned modules. Performs the actual code migration.
- Review agent — Runs after each module is migrated. Checks for missed Python 2 patterns, runs tests, and flags issues.
Week 1: Exploration and Easy Modules
Day 1: Dependency mapping. David ran the exploration agent across the entire codebase:
Analyze this Python 2 codebase. For each module in src/, report:
1. Python 2-specific patterns found (list each type and count)
2. Dependencies on other internal modules
3. External library dependencies and their Python 3 compatibility
4. Estimated migration difficulty (low, medium, high) with reasoning
Output a structured summary I can use to plan the migration order.
The exploration agent spent about 20 minutes reading files and produced a structured report. It categorized the 12 modules:
- Low difficulty (5 modules): Mostly
printstatements and old-style string formatting. Minimal inter-module dependencies. Could be migrated independently. - Medium difficulty (4 modules): Mixed string/bytes handling, integer division in calculations, some dynamic typing patterns. Could be migrated independently but needed careful testing.
- High difficulty (3 modules): Complex runtime behavior, heavy use of
__metaclass__, monkey-patching, and dynamic attribute access. These were the "do not touch" modules flagged for manual migration.
Days 2-5: Parallel migration of low-difficulty modules. The team split into pairs. Each pair took 2-3 low-difficulty modules and used parallel worktrees so their agents could work without file conflicts.
The worktree setup was straightforward:
# Each developer creates their own worktree
git worktree add ../migration-alice -b migrate/alice-modules
git worktree add ../migration-bob -b migrate/bob-modules
git worktree add ../migration-carol -b migrate/carol-modules
git worktree add ../migration-david -b migrate/david-modules
Each developer ran their agent in their own worktree with the shared CLAUDE.md. The prompt pattern was consistent:
Migrate src/[module_name]/ from Python 2 to Python 3. Follow the
migration rules in CLAUDE.md exactly. After migration:
1. Run the module's tests with: pytest tests/[module_name]/ -v
2. Fix any test failures caused by the migration
3. Report what you changed and any issues you encountered
The five low-difficulty modules were migrated by end of Day 4. Each migration followed the same flow: agent converts the code, agent runs tests, agent fixes failures, developer reviews the diff and commits. The agents handled the mechanical transformations flawlessly — print statements, has_key(), old-style classes, and iteritems() were all converted correctly without exception.
Day 5 was spent on cross-module integration testing. The team merged all five migration branches and ran the full test suite. Three integration tests failed due to import-order changes — the agents had reordered imports to follow PEP 8 (an improvement, but one that affected test fixtures that depended on import side effects). These were fixed manually in under an hour.
Week 2: Medium-Difficulty Modules and the Review Agent
Days 6-8: Medium-difficulty modules. The four medium-difficulty modules required more guidance. The key difference was the string/bytes handling and integer division.
For these modules, the developers added module-specific context to their prompts:
Migrate src/order_processing/ from Python 2 to Python 3. Follow the
migration rules in CLAUDE.md.
Additional context for this module:
- Functions in invoice.py accept both str and unicode in Python 2.
In Python 3, convert these to accept only str (text strings).
Add explicit .encode()/.decode() calls where bytes are needed.
- The tax calculation in pricing.py uses integer division. In Python 3,
/ returns float. Use // for integer division where the original
behavior must be preserved. Add a comment "# integer division preserved
from py2" on each line you change.
- Run the full test suite, not just this module's tests, since
order_processing is imported by reporting and dashboard.
The agents handled most of the medium-difficulty work correctly. The integer division conversions were all correct. The string/bytes handling was mostly correct but needed manual review — in two cases, the agent converted a function to accept only str when it actually needed to handle bytes from a network socket. The developer caught these in review because the function names (read_socket_data, parse_binary_header) made the byte-handling requirement obvious to a human but not to the agent, which was following the general rule from the prompt.
Days 8-9: Review agent pass. After each module was migrated, a review agent scanned the converted code:
Review the Python 2 to 3 migration of src/[module_name]/. Check for:
1. Any remaining Python 2 syntax (print statements, except old syntax,
has_key, iteritems, etc.)
2. Potential runtime behavior changes (integer division, string/bytes,
dictionary ordering assumptions)
3. Missing __future__ imports where they should be present
4. Tests that pass but may not be testing the right behavior after
migration
Report findings as a checklist. Do not make changes.
The review agent caught four issues across the four modules:
- One file still had a
printstatement inside a comment that was actually dead code (the agent had converted theprintbut the line was unreachable) - Two tests were passing but testing Python 2 behavior — they asserted that
dict.keys()returned a list, which is true in Python 2 but returns a view in Python 3. The tests passed only because the test data had one key, making the comparison incidentally true. - One module had a
sys.maxintreference that should have beensys.maxsize
Without the review agent, the dict.keys() issue would have been a production bug discovered weeks later. It was the kind of subtle behavioral difference that passes tests but breaks under real data.
Week 3: High-Difficulty Modules and Final Integration
Days 10-12: Manual migration with agent assistance. The three high-difficulty modules were migrated by the senior developers with the agent in a supporting role. The agent handled the mechanical transformations (same as the easy modules), but the developers manually reviewed and often rewrote the complex sections: metaclass usage, monkey-patching patterns, and dynamic attribute construction.
The split was roughly: agent handled 40% of the changes in these modules (the straightforward syntactic conversions), developers handled 60% (the behavioral and architectural changes).
Day 13: Full integration testing. All modules merged into a single branch. Full test suite run. 347 out of 362 tests passed. The 15 failures were:
- 8 related to string encoding in the high-difficulty modules (fixed by developers)
- 4 related to dictionary ordering assumptions (Python 3.7+ guarantees insertion order, but the code assumed arbitrary order and used
sorted()— the agents had removed thesorted()calls since "dictionaries are ordered in Python 3," which was correct but changed behavior for test assertions) - 3 related to changes in exception chaining behavior (Python 3 exception chains revealed by
__cause__attributes the code was not expecting)
All 15 were fixed by Day 14.
Day 15: Cleanup and documentation. The final day was spent removing __future__ imports that were no longer needed, updating the project's CI configuration to use Python 3.11, updating the README, and writing a brief migration document for the team's records.
What Worked
Parallel worktrees for independent modules. This was the single biggest time saver. Four developers migrating modules simultaneously, each in their own worktree, eliminated the serialization bottleneck that would have made this a 6-week project. The worktree approach meant no branch-switching overhead and no file conflicts during the parallel work phase.
Shared CLAUDE.md as a migration rulebook. The 78-line migration-specific CLAUDE.md ensured all four developers' agents followed the same transformation rules. Without it, each agent would have made slightly different decisions about edge cases (integer division handling, string/bytes conversion style, import ordering), creating an inconsistent codebase that would have been harder to review and debug.
The exploration agent for planning. Spending Day 1 on automated codebase analysis produced a migration plan that was more thorough than a manual assessment would have been. The agent found Python 2 patterns in files the team had forgotten about, and its difficulty ratings were accurate for 11 out of 12 modules (it underestimated one medium-difficulty module that turned out to have more dynamic typing than its static analysis revealed).
Hooks for automated py2-to-py3 lint checks. The team set up a hook that ran pyupgrade --py3-plus after every agent edit. This caught several instances where the agent's migration was correct but not idiomatic Python 3 — for example, using dict() constructor calls instead of dict literals, or keeping object as an explicit base class. The hook flagged these automatically, and the agent fixed them in the same session.
The review agent as a second pair of eyes. The dedicated review pass caught four issues that the implementation agents and human reviewers missed. The dict.keys() behavioral difference was particularly valuable — it was a correctness bug hidden by incidental test data.
What Didn't Work
Agents made incorrect assumptions about runtime behavior. The most dangerous mistakes were not syntactic — 2to3 and the agents handled syntax reliably. The dangerous mistakes were behavioral. The agent assumed dict.keys() returning a view was always a drop-in replacement (it is, except when the result is mutated during iteration). The agent assumed str and bytes could be cleanly separated (they could, except in the network modules that genuinely needed both). These behavioral assumptions required human review to catch.
Dynamic typing edge cases needed human judgment. Functions that accepted multiple types in Python 2 (a common pattern in legacy Python code) could not be mechanically converted. The agent followed the rules in CLAUDE.md, but the rules could not cover every case. In the order_processing module alone, there were seven functions where the correct Python 3 type handling depended on understanding the calling code, the data flow, and the business logic — context that the agent did not have even with the project memory file.
The agents occasionally fought the "do not touch" list. Despite the CLAUDE.md explicitly listing three modules as manual-only, two of the implementation agents tried to migrate files in those modules when they encountered import chains that crossed into the forbidden zone. The agents were following a reasonable instinct — fixing imports that would break — but they were violating an explicit constraint. This happened twice and was caught in review, but it highlighted the need for stronger guardrails on constraint enforcement, possibly through hooks that reject edits to specific directories.
Test coverage gaps amplified risk. The 40% of the codebase without tests was the scariest part of the migration. The agents could convert the syntax, but without tests, there was no automated way to verify correctness. The team ended up writing manual test scripts for the untested modules, which consumed most of Day 13. If the codebase had better coverage going in, the entire migration would have been faster and safer.
Review agent missed some cross-module issues. The review agent checked each module in isolation. It did not catch issues that only manifested when multiple migrated modules interacted — like the 4 dictionary-ordering failures that appeared in integration testing. A review prompt scoped to cross-module interactions would have caught these earlier.
Metrics
| Metric | Value |
|---|---|
| Total calendar time | 3 weeks (15 working days) |
| Original estimate (manual) | 6 weeks |
| Lines of code migrated | ~50,000 |
| Source files modified | 312 out of 340 |
| Percentage automated by agents | ~70% |
| Percentage agent-assisted (human-reviewed and revised) | ~15% |
| Percentage fully manual | ~15% |
| Total API cost (Claude Code, 4 developers) | ~$200 |
| Worktrees used simultaneously (peak) | 4 |
| Tests passing after migration | 362/362 |
| Bugs caught by review agent | 4 |
| Post-migration production issues (first 2 weeks) | 1 (encoding edge case in a rarely-used report) |
| CLAUDE.md length (migration-specific) | 78 lines |
Key Takeaways
-
Shared project memory is the coordination mechanism for agent teams. When four developers run independent agent sessions, the CLAUDE.md file is the only thing that keeps them aligned. Invest time in making it specific, accurate, and comprehensive for the task at hand. For a migration, this means explicit transformation rules, not general guidance.
-
Mechanical changes are the agent's strength; behavioral changes are yours. Agents handle syntactic transformations (print statements, exception syntax, method renames) with near-perfect accuracy. They handle behavioral changes (type semantics, division behavior, iteration side effects) with dangerous confidence — they produce code that looks correct and usually is, but the exceptions are subtle and costly. Review every behavioral change manually.
-
Worktrees turn a serial migration into a parallel one. Without worktrees, four developers cannot work on the same codebase simultaneously without constant merge conflicts. With worktrees, each developer gets an isolated workspace, the agents cannot interfere with each other, and the merge step is clean because the modules are independent. This is the single biggest force multiplier for team-scale agent work.
-
A dedicated review agent catches what implementation agents and humans miss. The implementation agent is focused on making changes. The human reviewer is focused on correctness of those changes. Neither is systematically checking for patterns that are technically correct but behaviorally different. A review agent with a specific checklist fills that gap.
-
Constraint enforcement needs more than instructions — it needs guardrails. Telling an agent "do not touch these modules" in CLAUDE.md is necessary but not sufficient. The agent may follow an import chain into a forbidden module and make changes there. For hard constraints, use hooks that reject edits to protected files or directories. Instructions guide behavior; hooks enforce it.
What We'd Do Differently
- Write integration-level review prompts, not just module-level ones. The review agent checked modules in isolation. A second review pass scoped to cross-module interactions (shared types, import chains, integration test scenarios) would have caught the dictionary-ordering issues before the full integration test day.
- Increase test coverage before starting the migration. The 40% coverage gap was the biggest risk factor. Writing tests for the uncovered modules before migrating them would have been slower upfront but would have eliminated the scariest part of the verification phase.
- Use hooks to enforce the "do not touch" list. A pre-edit hook that rejected changes to the three manual-only modules would have prevented the two constraint violations without relying on human review to catch them.
- Run the exploration agent per-function, not per-module, for medium and high difficulty modules. The module-level difficulty rating was useful for planning but too coarse for execution. A function-level analysis would have identified the specific functions that needed human attention, letting the agents handle the rest of the module with less supervision.
Patterns Used
- Fan-Out Fan-In — for parallel module migration across four developers
- Exploration Agent — for pre-migration codebase analysis and difficulty assessment
- Review Agent — for post-migration verification of each module
- Shared Project Memory — for consistent migration rules across all agent sessions
- Parallel Worktrees — for isolated, conflict-free parallel editing