2026-06-12

Two UI Migrations and 56 Million Tokens Later, Claude Fable Still Needed My Eyes

This week I aimed Claude at migrating the UI frameworks of two production apps. Every screen moved to a new framework, and nothing about how either app behaves changed. The migrations ran side by side and finished over two days. Two days is calendar time, not a stopwatch.

I started with brainstorming sessions to plan the migration, then played it like a DJ with a few records spinning. Each agent ran its own long sessions, and I dropped in with new information whenever one of them needed a nudge.

What the agents were doing

Changing UI code was the minority of what the agents did. Most of their time went to operating browsers:

Parsed the DOM
Took screenshots
Analyzed pixels for rendering issues
Filed reports back to me
Kept notes on where to go and what to look at on future visits

Those notes let each fresh session pick up cold from the repo and keep working. It’s slower than asking for sweeping changes in one shot with none of the browser checking, but the one-shot route backfires often enough that the slow route can come out ahead.

Both apps kept the old and new UI frameworks running at once during the migration, behind a thin compatibility layer. That detail mattered more than it sounds. The apps stayed shippable between batches, in case it made sense to ship part of the work at any moment.

The rhythm

The work ran in batches with pre-planned stopping points, and the full test suite ran at every boundary. Only the browser tests failed with any regularity, usually because a class name had drifted, and the assertions got corrected along the way. This might be a good case for test markers in the UI that never change, attributes that exist only for the tests, so the assertions stop caring about class names.
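Here is a minimal sketch of that test-marker idea, in Python. The `data-testid` attribute name, the markup, and the class names are illustrative assumptions, not what either app uses. The point is only that an assertion keyed to a dedicated attribute gives the same answer before and after the classes change.

```python
from html.parser import HTMLParser


class MarkerFinder(HTMLParser):
    """Records the tag name of every element that carries a data-testid."""

    def __init__(self):
        super().__init__()
        self.found = {}  # marker value -> tag name

    def handle_starttag(self, tag, attrs):
        marker = dict(attrs).get("data-testid")
        if marker is not None:
            self.found[marker] = tag


def find_markers(html):
    """Return {marker: tag} for all test markers in an HTML string."""
    finder = MarkerFinder()
    finder.feed(html)
    return finder.found


# The same button before and after a migration (hypothetical markup):
# the classes differ completely, the test marker does not.
OLD_HTML = '<button class="btn btn-primary" data-testid="save-order">Save</button>'
NEW_HTML = '<button class="rounded px-4 bg-blue" data-testid="save-order">Save</button>'

old_markers = find_markers(OLD_HTML)
new_markers = find_markers(NEW_HTML)
```

A browser test that looks up `save-order` would not notice the class change at all, which is the property the drifting assertions lacked.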

The stopping points were scheduled chances for me to QA and course correct. There was more correcting than I would have liked, though noticeably less than when I’ve pointed other models at changes this size. Then again, my own skills are sharpening at the same time, so who knows how much of the credit goes to me and how much to Fable.

Each time an agent missed a detail, I handed it the information along with instructions to improve its own tools so the same miss couldn’t slip through quietly again.

Cycles ended and sat waiting for my review. I stepped away, went to bed while a work cycle ran, and came back to QA in the morning. Most of the waiting was the agents waiting for me.

For scale, about 515 UI-related files changed across the two apps. This was the whole UI of two production systems, not a restyle.

Since we’re talking speed, a short rant about Anthropic’s Fable victory lap. The launch coverage quoted companies getting months of work done in days, and none of them disclosed their method. It’s possible they had forward-deployed Anthropic engineers coaching the work and free use of the highest effort modes. That’s not a typical situation, so we’ll see whether those stories keep coming once it’s typical teams paying their own bills. My two days are real, but they came with babysitting attached, and I’d rather report the number that way.

Fable felt elevated, a step up from what I’d used for work like this before, but it wasn’t blowing the doors off on this task. In the end I could have gotten by with other models just as well. A story like the Pokémon one helps explain why.

Anthropic showed Fable beating Pokémon FireRed from raw screenshots alone, no maps and no memory hooks, which earlier models couldn’t do. But look at how it won: it ground a single Charizard up to overpowered and bulldozed the whole game. One writeup called it “an overleveled Charizard”, the laziest strategy that works, the one any bored kid also lands on.

The vision and the endurance are the real advance. But the win was a grind, and that was what I saw too: point it at a job and it finds a path that works and hammers it. It’s certainly smarter than the quote above suggests, but not as fast to change course as you’d expect.

The toolkit that learned

At the start I told Claude to build its own tools for the job: skills, subagents, and probes, meaning scripts that measure pages. The toolkit improved in two directions from there.

Reactively, every miss I caught went back in as information plus instructions to self-improve the tooling, and each cycle missed less than the one before. Proactively, the agents adapted the tools after learning from their own mistakes, amending their skills in response to their own failed QA or failed tests.

They also kept a scratchpad of things they figured out along the way, for themselves or for a future session. Which account to use for which scenario. What one account can do that another can’t because of its role. They were getting familiar with the application the way a QA or support person would, and committing it to memory.

By the end, the migration recipe had picked up about ten discoveries nobody knew about at the start. None of them would have disrupted a user, but they were valid issues worth correcting, and I’m impressed the agents zeroed in on them.

This kind of work is relatively slow and token-expensive. Every fresh session re-reads a sizable docs corpus before doing anything, and screenshot verification burns tokens on every image. I’ve clocked 56 million tokens since turning on Fable, and these sessions account for most of it.

I started with Fable doing everything, and once I saw the subscription quotas climbing fast, I split the work, with Sonnet and some Opus taking batches and Fable leading. That brought the burn down, but no matter how you split it, using a model as eyes is the expensive part.

Next time I’d push more of the looking onto the browser itself, and I wouldn’t reserve all the work for Anthropic models. There are plenty of capable models at other providers that could take a share of it.

The biggest saver came from Claude itself. After a few misses, it decided that measuring pages directly, asking the browser for the real sizes, gaps, and positions, would beat screenshots for layout questions. It was right. The measurements were cheaper and more reliable, screenshots stayed in reserve for judgment calls, and the strategy went straight into the skills it was already improving.

What the machinery missed

Every class of miss got caught eventually, but several were caught by my eyes rather than the machinery. Here’s each one: what happened, and the rule that came out of it.

A CSS reset erased the form controls

A framework-level CSS reset deleted every switch thumb, checkbox glyph, and radio dot across an entire app. The behavior checks passed, because the state underneath still toggled fine. The screenshot judge passed too. I caught it by looking at a page.

The rule: passing checks don’t prove things render. A script now confirms the glyphs are actually drawn.

A seam the judges couldn’t see

A zero-pixel gap between a form field and its button sailed through two judging passes, because the focus ring visually filled the seam. I caught it twice before it stayed fixed.

The rule: measure the gaps. A script now reads the real spacing between elements instead of judging the picture.
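A rough sketch of what such a gap check could look like, in Python. It assumes the rectangles have already been read from the browser in the `x`, `y`, `width`, `height` shape that `getBoundingClientRect()` reports. The coordinates and the 8-pixel minimum are made-up values, and the function names are hypothetical.

```python
def horizontal_gap(left, right):
    """Pixels between the right edge of `left` and the left edge of `right`."""
    return right["x"] - (left["x"] + left["width"])


def check_gap(left, right, min_gap):
    """Measure the gap and report whether it meets the minimum."""
    gap = horizontal_gap(left, right)
    return {"gap": gap, "ok": gap >= min_gap}


# Hypothetical rectangles: the button starts exactly where the field ends.
field = {"x": 16, "y": 200, "width": 240, "height": 36}
button = {"x": 256, "y": 200, "width": 80, "height": 36}

result = check_gap(field, button, min_gap=8)
```

A focus ring can hide a seam from anything judging the picture, but it doesn't change these numbers, so the zero shows up regardless.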

The inventory had holes

The file manifest came from a grep. Four separate times something that emits HTML turned up outside it: helpers that render partials, custom form inputs, classes the framework generates on its own. Each one was late rework.

The rule: build the inventory from a checklist of every place that produces HTML, not from a single grep.
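One way that checklist might be sketched in Python. The entries and the Rails-style paths below are placeholders I'm assuming for illustration, not the real layout of either app. Classes the framework generates on its own don't live in a file, so they would need a different kind of entry than a glob.

```python
import tempfile
from pathlib import Path

# Hypothetical checklist: one entry per kind of place that can emit HTML.
HTML_SOURCES = {
    "templates": ["app/views/**/*.erb"],
    "helpers that render partials": ["app/helpers/**/*.rb"],
    "custom form inputs": ["app/inputs/**/*.rb"],
}


def build_inventory(root, checklist):
    """Map each checklist entry to the sorted files its patterns match under root."""
    root = Path(root)
    inventory = {}
    for source, patterns in checklist.items():
        files = set()
        for pattern in patterns:
            for path in root.glob(pattern):
                if path.is_file():
                    files.add(path.relative_to(root).as_posix())
        inventory[source] = sorted(files)
    return inventory


def empty_sources(inventory):
    """Entries that matched nothing: truly absent, or a pattern worth rechecking."""
    return [source for source, files in inventory.items() if not files]


# Demo against a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    for rel in ["app/views/orders/show.html.erb", "app/helpers/orders_helper.rb"]:
        target = Path(tmp, rel)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("")
    inventory = build_inventory(tmp, HTML_SOURCES)
    unmatched = empty_sources(inventory)
```

Reporting the empty entries matters as much as the matches: an entry that finds nothing is a question to answer before the migration starts, not after.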

A delegate lied

One sub-agent reported applying a change it hadn’t applied. Browser QA exposed it.

The rule: the orchestrator verifies diffs and rendered pages itself, because delegate reports are claims, not evidence. Hooks might enforce that automatically, or different instructions might. I haven’t tried either yet.
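Since I haven't tried either, this is only a sketch of the cheapest version of that check, in Python: compare the files a delegate says it changed against the output of `git diff --name-only`. The function name and the file paths are hypothetical, and this covers only the diff half of the rule, not the rendered pages.

```python
def unverified_claims(claimed_files, diff_name_only_output):
    """Return claimed files that do not appear in `git diff --name-only` output."""
    changed = {
        line.strip()
        for line in diff_name_only_output.splitlines()
        if line.strip()
    }
    return sorted(set(claimed_files) - changed)


# Hypothetical delegate report: two files claimed, only one actually changed.
claimed = [
    "app/views/orders/show.html.erb",
    "app/views/orders/edit.html.erb",
]
diff_output = "app/views/orders/show.html.erb\n"

missing = unverified_claims(claimed, diff_output)
```

An empty result doesn't prove the change is correct, only that something was touched where the report said it was. A non-empty result is enough to reject the report outright.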

The written instructions were wrong twice

A pre-resolved fix and a named “working exemplar” were both subtly off. Both got caught only because the implementing pass re-derived the fix from the code instead of executing the text.

The rule: treat prescriptions as hypotheses. Re-verify, then deviate with a recorded reason.

Dev data had gaps

One form path was never live-tested because no dev record exercised it, and the same root cause bit the second app.

The rule: every conditional render gets a dev-data instance before migration starts.
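A small sketch of how that rule could be checked, in Python. The condition names, the predicates, and the record fields are all invented for the example. In practice each predicate would mirror a real conditional in a template.

```python
# Hypothetical conditions: one predicate per conditional render in the UI.
CONDITIONS = {
    "order has a discount": lambda record: record.get("discount", 0) > 0,
    "order is archived": lambda record: record.get("archived", False),
}


def uncovered_conditions(records, conditions):
    """Names of conditions that no dev record satisfies."""
    return [
        name
        for name, predicate in conditions.items()
        if not any(predicate(record) for record in records)
    ]


# Hypothetical dev data: nothing here is archived, so that branch never renders.
dev_records = [
    {"id": 1, "discount": 5},
    {"id": 2},
]

gaps = uncovered_conditions(dev_records, CONDITIONS)
```

Anything in `gaps` is a branch that no amount of browser QA can reach until a record is seeded for it.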

Lint debt end-loaded

One of these apps is a Rails app, and this is the only place the stack matters. Its pre-commit hook only linted Ruby files, so template style offenses piled up silently until the final gate.

The rule: static guards on all relevant file types.
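As a sketch, the hook could start by asking which staged files have no guard at all. This Python version assumes a hypothetical mapping from file extension to a linter label. The labels and the staged paths are placeholders, not real configuration.

```python
from pathlib import PurePosixPath

# Hypothetical guard table, mirroring the situation above: only Ruby is covered.
GUARDS = {
    ".rb": "ruby linter",
}


def unguarded_files(staged_files, guards):
    """Staged files whose extension has no static guard configured."""
    return sorted(
        path for path in staged_files
        if PurePosixPath(path).suffix not in guards
    )


# Hypothetical staged files in one commit.
staged = [
    "app/models/order.rb",
    "app/views/orders/show.html.erb",
    "app/assets/stylesheets/forms.css",
]

exposed = unguarded_files(staged, GUARDS)
```

Failing or warning on a non-empty list at commit time spreads the lint debt across the batches instead of letting it land at the end.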

What improved

The loop tightened. Each QA stop found less than the one before, and the later review cycles ran nearly autonomously off a single handoff document, with no questions asked mid-cycle. The final review ended with a written disposition for every finding: fixed, tracked, or rebutted with evidence. The rebuttals cited a dated ledger of approved design decisions, so review noise never turned into code churn.

What I realized I was for

Three jobs.

Design arbiter

The theme and the dialog shell were chosen from live A/B variants before implementation started.

Perceptual QA

The catches above: noticing the things no check had been written for yet.

Scope cop

This mattered most during the design phase for the new UX. One agent kept trying to sneak new functionality into what was supposed to be a migration. The fix was memory, plus a better explanation of what a UI migration is.

The fair summary is that the agents did the work and most of the verification. I supplied taste, decisions, and the checks that hadn’t been codified yet.

What I’d do differently

A lot of misses were discovered mid-flight and fixed mid-flight. Next time I’d plan ahead more, in ways I never bothered to before, because before I wasn’t directing a machine to do my work.

The centerpiece is a verification matrix, declared up front: every page, crossed with every viewport, interactive state, data variant, and check type. Everything my eyes caught would have mapped to a cell in a matrix like that. The switch thumbs were a missing “visual state checked” cell; the form seam was a missing “geometry measured” cell. A dumb dashboard of those cells turns “I happened to notice” into “the dashboard shows a hole.”
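A minimal sketch of such a matrix in Python, under assumed values. The page names, viewports, states, and data variants are placeholders; two of the check names are borrowed from the cells mentioned above. The matrix is the cross product of every dimension, and a hole is any cell with no recorded result.

```python
from itertools import product

# Hypothetical dimensions; a real project would list every page and state.
PAGES = ["orders/index", "orders/edit"]
VIEWPORTS = ["mobile", "desktop"]
STATES = ["default", "focused"]
DATA_VARIANTS = ["typical"]
CHECKS = ["behavior", "visual state checked", "geometry measured"]


def build_matrix():
    """Every (page, viewport, state, data variant, check) cell, declared up front."""
    return set(product(PAGES, VIEWPORTS, STATES, DATA_VARIANTS, CHECKS))


def find_holes(matrix, completed):
    """Cells that were declared but never checked."""
    return sorted(matrix - completed)


matrix = build_matrix()

# Suppose only the behavior checks have run so far.
completed = {cell for cell in matrix if cell[4] == "behavior"}
holes = find_holes(matrix, completed)
```

With these placeholder dimensions the matrix has 24 cells and 16 of them are holes, every one a visual or geometry check. That is the shape of the switch-thumb and form-seam misses: behavior covered, the rest never declared.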

There’s more:

Build the file inventory from a checklist of every place that produces HTML, not from a grep.
Seed dev data for every conditional render before starting.
Put static guards on every relevant file type in pre-commit.
Measure layout with scripts and save screenshots for judgment calls.
When a human catches something, turn it into an automated check.

And keep what already worked: a written constitution before any code, a dated decision ledger, scheduled eyeball rounds at batch boundaries, A/B variants for design calls, and a review cycle that ends in written dispositions.

Where it ended up

By the late cycles the agents ran long sessions with little input from me. They flagged UI and UX problems on their own while verifying. Both apps came out with a better customer experience than they went in with.

When the work wound down, I asked for the tooling to be extracted into general-purpose tools for other projects, maybe a plugin eventually, since the second app had already proven the playbook travels.

What I actually ended up with is a better way to plan a migration, and my work done in a relatively short amount of time.