I Refactored a 50K-Line Codebase with Gemini CLI: What Worked and What Didn't
A real-world case study of using Gemini CLI to refactor a large Node.js codebase. Lessons learned, metrics, and honest assessment of AI-assisted refactoring.
Zhihao Mu

Introduction
Six months ago, I was staring at a codebase that had accumulated four years of engineering debt. It was a Node.js/Express backend powering a B2B SaaS platform — 50,000 lines of pure JavaScript, no TypeScript, no consistent error handling, and a test coverage percentage that I am too embarrassed to write here. The team had grown from two engineers to nine, and every new hire spent their first two weeks just trying to understand what userDataHelper.js actually did.
We had a refactoring project on the roadmap for eight months. It kept getting deprioritised. Then a production incident forced our hand: a type coercion bug in the billing module sent incorrect invoice totals to roughly 200 customers. Nobody had caught it in code review because nobody could hold the full call chain in their head. We fixed it manually in three hours, sent apology emails, and finally got executive sign-off to spend a full sprint on structural improvements.
I decided to try Gemini CLI as a primary assistant for this effort. I had been using it for isolated tasks — writing tests, explaining unfamiliar modules — but I wanted to see whether it could handle a sustained, multi-day refactoring engagement across a large codebase. This article documents what I found: the genuine wins, the frustrating failures, and the metrics that tell the real story.
TL;DR
- Gemini CLI's 1M+ token context window was the decisive advantage — feeding it the full codebase at once eliminated the context-switching that kills AI-assisted refactoring.
- JS-to-TypeScript migration was the highest-leverage task: Gemini CLI automated approximately 70% of the mechanical type annotation work.
- Function extraction and API route reorganisation worked excellently when bounded to well-defined modules.
- Complex business logic — especially multi-step conditional flows with implicit state — consistently produced incorrect refactoring suggestions.
- Database migration scripts were completely off-limits for AI assistance: the risk profile was too high and the output quality too unreliable.
- Overall result: 18% reduction in lines of code, test coverage up from 14% to 61%, build time cut from 4m 12s to 2m 38s over six weeks.
- The key insight: Gemini CLI is a force-multiplier for mechanical, pattern-based refactoring. It is not a replacement for domain knowledge on complex business rules.
The Starting Point
The codebase's problems were the kind that accumulate invisibly over time. No single commit introduced them. They were the natural entropy of a startup moving fast without architectural guardrails.
No type safety. The entire backend was plain JavaScript. Functions received objects and passed them along, often mutating properties in ways that callers did not expect. A function named processOrder might receive an order object, but nobody enforced what shape that object had to be. We had seen three separate production bugs in the past year that traced back directly to a missing property or an unexpected undefined.
Monolithic route handlers. Our Express route files had grown into god objects. One file — routes/admin.js — was 1,847 lines long and contained database queries, business logic, email sending, and Stripe API calls all interleaved. There was no service layer. Everything lived in route callbacks.
Callback-style async code. The codebase was written before async/await was idiomatic in Node.js. Large sections still used nested callbacks or .then() chains three to four levels deep, making error handling inconsistent and stack traces nearly useless.
No meaningful test suite. We had 14% code coverage, almost entirely on utility functions. The core business logic — billing, order processing, user provisioning — had zero tests. Every deployment was effectively a leap of faith.
Here is a representative example of what we were dealing with in the order module:
// routes/orders.js (before) — a typical handler from the codebase
router.post('/orders', authenticate, function(req, res) {
  var userId = req.user.id;
  var items = req.body.items;
  db.query('SELECT * FROM users WHERE id = ?', [userId], function(err, users) {
    if (err) {
      console.log(err);
      return res.status(500).send('error');
    }
    var user = users[0];
    if (!user.subscription_active) {
      return res.status(403).send('no subscription');
    }
    var total = 0;
    items.forEach(function(item) {
      total += item.price * item.qty;
    });
    // Apply discount — TODO: move this somewhere better
    if (user.plan === 'pro' && total > 100) {
      total = total * 0.9;
    }
    db.query('INSERT INTO orders (user_id, total, status) VALUES (?, ?, ?)',
      [userId, total, 'pending'],
      function(err, result) {
        if (err) {
          console.log(err);
          return res.status(500).send('error');
        }
        stripe.charges.create({
          amount: Math.round(total * 100),
          currency: 'usd',
          customer: user.stripe_customer_id
        }, function(err, charge) {
          if (err) {
            // Order created but payment failed — silent data inconsistency
            return res.status(402).send('payment failed');
          }
          res.json({ orderId: result.insertId, chargeId: charge.id });
        });
      }
    );
  });
});
This handler has at least five distinct problems: no input validation on req.body.items, swallowed errors with console.log, business logic hardcoded in the route, a silent data inconsistency when the charge fails after the order is inserted, and callback nesting that makes the control flow almost unreadable. Multiply this across 40 route files and you have our starting point.
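To make the third problem concrete: the inline total-and-discount arithmetic is exactly the kind of logic that extracts cleanly into a pure, testable function. A minimal TypeScript sketch of that extraction, with illustrative names (`OrderItem`, `calculateCartTotal`) that are mine, not from the real codebase:

```typescript
// Hypothetical extraction of the inline total/discount arithmetic above.
// Names are illustrative; the real service layer came later in the project.
interface OrderItem {
  price: number;
  qty: number;
}

export function calculateCartTotal(items: OrderItem[], plan: string): number {
  const subtotal = items.reduce((sum, item) => sum + item.price * item.qty, 0);
  // Mirrors the inline "Apply discount" TODO: pro plans get 10% off totals over 100
  return plan === 'pro' && subtotal > 100 ? subtotal * 0.9 : subtotal;
}
```

Once the arithmetic lives in a function like this, the silent mutation and validation problems in the handler become visible in isolation instead of being buried four callbacks deep.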
Setting Up Gemini CLI for Refactoring
I did not just open Gemini CLI and start typing. Effective large-scale refactoring with an AI assistant requires deliberate session management. Here is the setup I arrived at after some trial and error.
Starting each session from the project root. Gemini CLI builds its context from the directory you launch it in. Starting from the root gives it access to your package.json, existing type definitions, configuration files, and the full directory tree it can explore. I always ran:
cd ~/projects/api-backend
gemini
Using Plan Mode for multi-file tasks. Gemini CLI's Plan Mode (triggered with /plan in the REPL) is designed for tasks that span multiple files. Instead of immediately writing code, the model generates a structured plan — which files it will touch, what changes it will make in each, and in what order. For refactoring, this was invaluable. It let me review the approach before any files were modified and catch misunderstandings early.
> /plan Migrate routes/orders.js from callbacks to async/await.
Extract business logic into a new OrderService class.
Add TypeScript types for the Order and OrderItem interfaces.
Do not touch the database layer yet.
Gemini CLI would then output a plan like:
Plan:
1. Create types/order.ts — define Order, OrderItem, and OrderCreatePayload interfaces
2. Create services/OrderService.ts — extract calculateTotal(), applyDiscount(), createOrder()
3. Rewrite routes/orders.js → routes/orders.ts — import OrderService, use async/await,
add input validation with zod
4. Update routes/index.js to import from the new .ts file
Files to be created: types/order.ts, services/OrderService.ts
Files to be modified: routes/orders.js (renamed to .ts), routes/index.js
Files NOT touched: db/, models/, tests/ (per instruction)
This plan review step saved me several hours over the course of the project. On two occasions, Gemini CLI's initial plan included modifications to files I had explicitly excluded, and I caught them before any code was written.
Providing a GEMINI.md with project conventions. I created a GEMINI.md at the project root to give the model persistent context about our coding standards:
# Project: API Backend
## Language & Runtime
- Node.js 20, TypeScript 5.x (target for migration from JS)
- Express 4.x for HTTP routing
- PostgreSQL via pg-promise
## Conventions
- All new files must be TypeScript (.ts)
- Use Zod for runtime input validation
- Services live in src/services/, types in src/types/
- Use async/await, never callbacks or raw .then() chains
- Error handling: throw typed errors, catch at route level with errorHandler middleware
- Never use console.log — use the logger utility from src/utils/logger.ts
- SQL queries live in src/queries/, never inline in services
## What NOT to do
- Do not generate database migration files (handled separately)
- Do not modify anything in src/legacy/ without explicit instruction
- Do not change existing test files — only add new ones
This file was loaded at startup and visibly influenced the model's outputs. Before adding it, Gemini CLI would occasionally generate code using console.log or inline SQL. After, these violations essentially disappeared.
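For illustration, here is the shape of code the "throw typed errors, catch at route level" convention pushes the model toward. This is a hedged sketch with hypothetical class names, not the project's actual error hierarchy:

```typescript
// Hypothetical typed-error base matching the "throw typed errors, catch at
// route level with errorHandler middleware" convention. The middleware can
// read statusCode instead of pattern-matching on message strings.
export class AppError extends Error {
  constructor(message: string, public readonly statusCode: number) {
    super(message);
    this.name = new.target.name; // preserves the subclass name in logs
  }
}

export class SubscriptionInactiveError extends AppError {
  constructor() {
    super('User has no active subscription', 403);
  }
}
```

Conventions like this one are worth spelling out in GEMINI.md precisely because they are cheap to follow mechanically but expensive to retrofit.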
What Worked Well
JavaScript to TypeScript Migration
This was the highest-value use of Gemini CLI across the entire project. The mechanical work of adding type annotations to existing JavaScript — inferring parameter types from how functions are called, defining interfaces for data shapes, adding return type annotations — is exactly the kind of pattern-matching task that AI models excel at.
The workflow I settled on was module-by-module. I would give Gemini CLI a single JavaScript file and ask it to produce a fully annotated TypeScript version:
> Convert src/utils/dateHelper.js to TypeScript.
Infer types from usage patterns.
Add JSDoc comments for each exported function.
Do not change any runtime logic.
Here is a before/after example from dateHelper.js:
// Before: dateHelper.js
function formatForDisplay(date, includeTime) {
  if (!date) return '';
  var d = new Date(date);
  if (isNaN(d.getTime())) return 'Invalid date';
  var formatted = d.toLocaleDateString('en-US', { year: 'numeric', month: 'short', day: 'numeric' });
  if (includeTime) {
    formatted += ' ' + d.toLocaleTimeString('en-US', { hour: '2-digit', minute: '2-digit' });
  }
  return formatted;
}

function businessDaysBetween(startDate, endDate) {
  var start = new Date(startDate);
  var end = new Date(endDate);
  var count = 0;
  var cur = new Date(start);
  while (cur < end) {
    var day = cur.getDay();
    if (day !== 0 && day !== 6) count++;
    cur.setDate(cur.getDate() + 1);
  }
  return count;
}

module.exports = { formatForDisplay, businessDaysBetween };
// After: dateHelper.ts (generated by Gemini CLI, reviewed and accepted)
/**
 * Formats a date value for human-readable display.
 * Returns an empty string for null/undefined inputs and 'Invalid date'
 * for values that cannot be parsed as a valid Date.
 *
 * @param date - A Date object, ISO 8601 string, or Unix timestamp (ms)
 * @param includeTime - When true, appends HH:MM in 12-hour format
 * @returns Formatted date string, e.g. "Apr 10, 2026" or "Apr 10, 2026 02:30 PM"
 */
export function formatForDisplay(
  date: Date | string | number | null | undefined,
  includeTime = false
): string {
  if (date == null) return '';
  const d = new Date(date);
  if (isNaN(d.getTime())) return 'Invalid date';
  let formatted = d.toLocaleDateString('en-US', {
    year: 'numeric',
    month: 'short',
    day: 'numeric',
  });
  if (includeTime) {
    formatted += ' ' + d.toLocaleTimeString('en-US', { hour: '2-digit', minute: '2-digit' });
  }
  return formatted;
}

/**
 * Counts the number of business days (Monday–Friday) between two dates.
 * The start date is inclusive; the end date is exclusive.
 *
 * @param startDate - The start of the range
 * @param endDate - The end of the range (exclusive)
 * @returns Number of business days in the range
 */
export function businessDaysBetween(
  startDate: Date | string,
  endDate: Date | string
): number {
  const start = new Date(startDate);
  const end = new Date(endDate);
  let count = 0;
  const cur = new Date(start);
  while (cur < end) {
    const day = cur.getDay();
    if (day !== 0 && day !== 6) count++;
    cur.setDate(cur.getDate() + 1);
  }
  return count;
}
The output was almost always production-ready with minor tweaks. Across 38 utility and helper files, I accepted approximately 85% of the generated TypeScript directly. The remaining 15% required corrections — mostly around union types that were too broad, or cases where Gemini CLI inferred any instead of a more specific type.
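A representative example of that 15% correction category: where Gemini CLI fell back to `any`, narrowing the parameter to the union the call sites actually used was usually a one-line fix. The function below is a hypothetical illustration of the pattern, not code from the codebase:

```typescript
// As generated (hypothetical): parameter typed as `any`, losing all safety.
// function parseQty(value: any): number { ... }

// After review: narrowed to the union the call sites actually pass.
export function parseQty(value: string | number): number {
  const n = typeof value === 'number' ? value : parseInt(value, 10);
  return Number.isNaN(n) ? 0 : n;
}
```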
Extracting Functions from Monolithic Route Handlers
The large route handlers were next. The pattern I used was to feed Gemini CLI the full route file and ask it to identify extraction candidates first, before writing any code:
> Analyse routes/admin.js and identify which blocks of code
should be extracted into service functions.
List each candidate with: function name, line range, reason for extraction,
and suggested destination file.
Do not generate any code yet — just the analysis.
The analysis step produced a prioritised list that I could review and edit before authorising the extraction. This two-step approach — analyse first, then generate — consistently produced better results than asking Gemini CLI to do both in a single prompt.
After reviewing and trimming the list, I would authorise specific extractions:
> Extract the user permission checking logic (lines 45–89 in routes/admin.js)
into a new function called checkAdminPermissions in services/AdminService.ts.
The function should accept a userId: string and an action: AdminAction
(define the AdminAction enum too).
Return a Promise<boolean>.
Update the call site in routes/admin.js to use the new function.
The generated code was consistently clean and properly typed. Gemini CLI handled the enum definition, the function signature, the async wrapper, and the call-site update correctly in a single pass.
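For reference, the extraction described in that prompt would come out roughly like this. The enum members and the injected permission lookup are my stand-ins, since the real data-access code is not shown in this article:

```typescript
// Hedged sketch of the extracted permission check. AdminAction members and
// the injected lookup are illustrative; the real version hit the database.
export enum AdminAction {
  ManageUsers = 'manage_users',
  ViewBilling = 'view_billing',
  DeleteAccount = 'delete_account',
}

type PermissionLookup = (userId: string) => Promise<AdminAction[]>;

export async function checkAdminPermissions(
  userId: string,
  action: AdminAction,
  getPermissions: PermissionLookup
): Promise<boolean> {
  const granted = await getPermissions(userId);
  return granted.includes(action);
}
```

Injecting the lookup rather than importing the database client directly is what makes a function like this trivially testable, which mattered for the coverage push later.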
API Route Reorganisation
Our Express router was structured by entity (users, orders, products) but the handlers themselves had grown to mix concerns. We had user authentication logic in the order routes because they had been added in a hurry. We had admin-only endpoints mixed in with public ones in the same file.
I asked Gemini CLI to propose a new directory structure for routes/:
> The current routes/ directory has 12 files mixed by entity.
Propose a new directory structure that separates:
- Public routes (no auth required)
- Authenticated user routes
- Admin-only routes
- Webhook handlers (Stripe, third-party integrations)
Show the proposed structure as a file tree.
Then explain what would need to move and why.
The proposal was sensible and matched what an experienced backend engineer would have suggested. More usefully, after I approved the structure, I could ask Gemini CLI to execute the migration file by file, and it maintained consistency — each route file it generated used the same middleware stack, the same error-handling pattern, and the same import style throughout.
What Didn't Work
Complex Business Logic Refactoring
The billing module was where things broke down. Our subscription billing logic had accumulated edge cases over four years: trial periods, grandfathered pricing tiers, mid-cycle upgrades, proration calculations for seats, and a special handling path for enterprise customers negotiated before we had a standard pricing model. The logic was spread across three files and referenced a dozen environment variables whose interactions were not documented anywhere.
I asked Gemini CLI to help refactor the proration calculation:
> Refactor the proration logic in billing/subscriptions.js.
It should be extracted into a pure function called calculateProration
that accepts a ProrationType enum, current plan details, new plan details,
and the change date. Make it testable.
The generated function looked clean on the surface. It was TypeScript, it had tests, it handled the basic cases correctly. But when I ran it against our actual subscription records in staging, it produced wrong numbers for enterprise customers. The model had not understood that the enterprise_override flag in the user record bypassed the standard proration formula entirely — this logic was implicit in a comment three files away and a runtime check that the model had not picked up in its analysis.
I spent two hours debugging the discrepancy before I found it. The experience reinforced a hard rule: for any business logic where the correctness depends on implicit domain knowledge that is not explicitly encoded in the code, do not trust AI-generated refactoring without exhaustive testing against real production data.
Here is a simplified version of what went wrong:
// What Gemini CLI generated — mathematically correct but wrong for our domain
export function calculateProration(
  type: ProrationType,
  currentPlan: Plan,
  newPlan: Plan,
  changeDate: Date
): number {
  const daysRemaining = getDaysRemaining(changeDate, currentPlan.billingCycleEnd);
  const dailyRate = (newPlan.monthlyPrice - currentPlan.monthlyPrice) / 30;
  return Math.round(dailyRate * daysRemaining * 100) / 100;
}

// What it should have included — the enterprise bypass we had in place
export function calculateProration(
  type: ProrationType,
  currentPlan: Plan,
  newPlan: Plan,
  changeDate: Date,
  user: User // <— this parameter was not in the original extraction scope
): number {
  // Enterprise customers on negotiated contracts bypass standard proration
  if (user.enterprise_override) {
    return 0;
  }
  const daysRemaining = getDaysRemaining(changeDate, currentPlan.billingCycleEnd);
  const dailyRate = (newPlan.monthlyPrice - currentPlan.monthlyPrice) / 30;
  return Math.round(dailyRate * daysRemaining * 100) / 100;
}
The model did not hallucinate — it generated code that was internally consistent with the logic it could see. The problem was that the full business rule was never fully visible to it. This is a fundamental limitation, not a bug to be fixed with a better prompt.
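One cheap mitigation for this class of failure is a shadow comparison: run the legacy and refactored implementations side by side over real records and count disagreements before trusting the new code. A hedged sketch of such a harness (the types and names are mine, for illustration):

```typescript
// Hypothetical shadow-comparison harness: evaluate both implementations over
// the same records and collect the ones where they disagree.
type ProrationFn<T> = (record: T) => number;

export function compareImplementations<T>(
  records: T[],
  legacy: ProrationFn<T>,
  refactored: ProrationFn<T>
): { total: number; mismatches: T[] } {
  const mismatches = records.filter((r) => legacy(r) !== refactored(r));
  return { total: records.length, mismatches };
}
```

Run over staging data, a harness like this would have surfaced the enterprise_override discrepancy in minutes rather than two hours of debugging.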
Database Migration Scripts
I never got a single usable database migration out of Gemini CLI. I tried three different approaches: asking it to generate a raw SQL migration, asking it to generate a pg-promise transaction, and asking it to generate a Knex migration file. All three attempts produced code that I would not run in production.
The issues were not syntactic — the SQL was valid and the Knex API calls were correct. The issues were semantic: the generated migrations did not account for the actual state of our production database, they made assumptions about column nullability that did not match our existing data, and one of them would have produced a table lock on our largest table during business hours with no fallback rollback path.
Database migrations require precise knowledge of production data characteristics — row counts, null distribution, index usage patterns, constraint violations that exist in the data but not in the schema definition. An AI model working from the schema file alone simply does not have this information. This is not a Gemini CLI problem; it is a structural problem with AI-assisted data migrations in general.
My rule after this experience: Gemini CLI can help me think through a migration strategy and review a migration script I have written myself, but it will never write the migration for me.
Metrics and Results
We ran the refactoring project over six weeks, with Gemini CLI as the primary tool for all mechanical transformation tasks. Here is what changed:
| Metric | Before | After | Change |
|---|---|---|---|
| Total lines of code | 50,247 | 41,203 | -18% |
| TypeScript coverage | 0% | 78% | +78pp |
| Test coverage (overall) | 14% | 61% | +47pp |
| Test coverage (core business logic) | 0% | 52% | +52pp |
| Build time (CI) | 4m 12s | 2m 38s | -37% |
| Route files > 500 lines | 7 | 1 | -86% |
| Average function length | 42 lines | 18 lines | -57% |
| Production errors (30-day avg) | 23/day | 8/day | -65% |
| Mean time to onboard new engineer | ~12 days | ~5 days | -58% |
The lines-of-code reduction came primarily from deduplication. Gemini CLI was effective at identifying near-duplicate utility functions scattered across the codebase — there were seven different implementations of "format currency for display" — and consolidating them into a single canonical version. It found these duplicates much faster than a manual code review would have.
The test coverage jump was the metric I am most proud of. A large portion of the new tests were generated by Gemini CLI against the refactored service functions. When you extract a pure function like calculateOrderTotal(items, discountRules) from a route handler, testing it in isolation becomes trivial. The refactoring created the conditions for the tests; Gemini CLI wrote many of the tests itself.
Build time fell because TypeScript's incremental compilation is faster than the original Babel transpilation setup, and the smaller, better-separated modules enabled more aggressive code splitting.
Production error rate dropped partly because of the TypeScript migration catching type errors at build time rather than runtime, and partly because the extracted service layer made it possible to add proper error handling that had been impossible in the original callback-nested structure.
Lessons Learned
1. Context is the most important variable. The quality of Gemini CLI's output was directly proportional to how much relevant context it had. Sessions where I had provided GEMINI.md, run from the project root, and explained the broader goal before requesting specific changes were dramatically more successful than sessions where I jumped straight to a narrow request. Invest in context setup upfront — it pays back many times over.
2. Use the analyse-then-generate workflow for non-trivial changes. For anything more complex than a one-file utility conversion, always ask Gemini CLI to produce a plan or analysis before generating code. Review the plan, correct misunderstandings, and only then ask for the code. This two-step pattern caught the majority of errors before they were written into files.
3. The model's confidence is not a reliable quality signal. Gemini CLI generated incorrect business logic with the same confident tone it used for correct utility functions. Never assess output quality based on how the model presents the output. Assess it based on whether the output is correct for your specific domain. Run tests, not vibes.
4. Mechanical refactoring is where AI assistance is most valuable. Type annotation, function extraction, callback-to-async conversion, duplicate consolidation — these are pattern transformations with objectively correct outputs. Gemini CLI handled them reliably at scale. Business logic refactoring — where correctness depends on domain knowledge that is implicit rather than explicit in the code — is where it consistently fell short.
5. Scope your requests tightly. Open-ended requests ("refactor the billing module") produced inconsistent results. Tightly scoped requests ("extract the tax calculation from line 145–189 of billing/invoices.js into a pure function called computeTaxAmount in billing/taxService.ts") produced reliable, reviewable outputs. The more precisely I defined the transformation, the more predictable the result.
6. Build a review habit, not a trust habit. The most effective workflow was treating Gemini CLI's output the same way I treat a pull request from a capable but junior engineer: review everything, question anything that touches business logic, and run the full test suite before merging. The risk is not that the model is incompetent — it is that its competence can lull you into skipping review.
7. Track your session context carefully. Gemini CLI's context window is large, but long sessions accumulate noise. I found that sessions longer than roughly 90 minutes of active exchanges started producing outputs that contradicted earlier decisions or lost track of constraints I had established. Starting a fresh session with a clean, curated GEMINI.md was almost always more productive than continuing an aged session.
FAQ
Q: Did Gemini CLI make the refactoring faster overall, or did debugging its mistakes eat the time savings?
Net faster, but not uniformly. For the mechanical tasks — TypeScript migration, function extraction from well-defined modules, test generation for pure functions — I estimate a 3–4x speed increase compared to doing the work manually. For the complex business logic attempts, I estimate a slight net negative: the time spent prompting, reviewing, and then debugging incorrect outputs exceeded what manual refactoring would have taken. The lesson is to be selective about where you apply AI assistance.
Q: Did you use any other AI tools alongside Gemini CLI?
Yes. I used Claude Code for about 20% of the work, specifically for the agentic tasks where I wanted the model to run the test suite and iterate on failures without my involvement. Gemini CLI was my primary tool for analysis and code generation; Claude Code was my primary tool for "implement this and make the tests pass." The two tools complement each other well.
Q: How did the rest of the team feel about AI-assisted refactoring?
Mixed reactions initially, which I expected. Two engineers were enthusiastic from day one. Two were sceptical and wanted to review every AI-generated PR more carefully than they reviewed human-written code — which is actually the right instinct and I encouraged it. The scepticism faded as the TypeScript migration produced zero regressions over the first two weeks. By week four, everyone was using Gemini CLI for at least some portion of their daily work.
Q: Would you do it again the same way?
Mostly yes, with one change: I would establish the TypeScript configuration and the service layer architecture before starting the AI-assisted work. The biggest friction points came from asking Gemini CLI to make decisions about project structure that I should have made and documented first. The model is better at executing within a defined structure than at inventing the structure itself.
Q: What does the codebase still need that the refactoring did not address?
The database layer is still largely unrefactored. The raw SQL queries in src/queries/ are functional but not type-safe, and the connection pool management has technical debt we did not touch. We also have five legacy integration modules in src/legacy/ that are excluded from the TypeScript migration because they use APIs from deprecated third-party services that will be decommissioned this year. Once those services are gone, the legacy modules go with them — no point refactoring what you're about to delete.
Conclusion
Refactoring a 50,000-line codebase is never going to be a clean, linear process regardless of what tools you use. Six weeks with Gemini CLI did not transform our codebase into a perfect system. What it did was dramatically accelerate the mechanical work — the type annotations, the function extractions, the async conversions, the test scaffolding — that would otherwise have consumed the majority of engineering time and left no bandwidth for the architectural thinking that mattered most.
The numbers are real: 78% TypeScript coverage, test coverage up from 14% to 61%, production errors down 65%. Those improvements came from a combination of Gemini CLI's throughput on mechanical tasks and the team's own judgment on where to apply it, how to review its output, and when to step back and work manually.
If you are planning a similar effort, the single most important advice I can give is this: use Gemini CLI to do more of the right work faster, not to delegate decisions you should be making yourself. Define the architecture. Write the conventions. Know your business logic deeply. Then let the model handle the mechanical transformation at scale. The combination of human judgment and AI throughput is genuinely powerful. Either one alone would have produced a worse result.
Zhihao Mu
Full-stack Developer

Developer and technical writer passionate about AI-powered development tools. Building geminicli.one to help developers unlock the full potential of Gemini CLI.