My Boots on the Ground Review of Anthropic's Fable

tldr; meh. Only marginally better than the previous model.

Anthropic released their latest model, Fable 5, which is based on their Mythos model. I’ve seen the initial hype, with people claiming that it one-shots apps and that everything is now solved. But the apps that I’ve seen it one-shot aren’t real apps.

So I spun up Claude, switched to Fable and gave it a pretty simple task in a real Rails codebase. Before handing it off I created a branch and then I purposefully broke some automated tests to see how it would handle it.

The Task

User recommendations are too loose right now. Allow users to tighten their recommendations on their profile. Set the default recommendation to medium (that’s not the default right now). There should be a tooltip for the user, so that they understand what this new setting does.

Seems pretty simple, right?

The Results

It got to work adding a migration, running it and then writing code, all without running the test suite first. If you’ve ever written code with tests associated with it you’ve probably got alarm bells going off in your head. It has now put itself in a situation where, on the next test run, it can’t tell whether its own changes broke the suite or whether those specs were already broken.
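
One way to stay out of that bind is to record which specs fail before touching anything, then compare afterwards. Here is a minimal sketch in plain Ruby. The spec locations are made up, and classify_failures is a hypothetical helper of mine, not something Fable or RSpec provides.

```ruby
require 'set'

# Hypothetical spec locations. In practice these would be collected from
# the test runner's output before and after the change.
baseline_failures = Set[
  'spec/models/user_spec.rb:12',
  'spec/models/order_spec.rb:40'
]
current_failures = Set[
  'spec/models/user_spec.rb:12',
  'spec/models/order_spec.rb:40',
  'spec/models/recommendation_spec.rb:7'
]

# Split the current failures into ones that were already there and ones
# that only appeared after the change.
def classify_failures(baseline, current)
  {
    pre_existing: (current & baseline).to_a.sort,
    introduced:   (current - baseline).to_a.sort,
    fixed:        (baseline - current).to_a.sort
  }
end

report = classify_failures(baseline_failures, current_failures)
puts report[:introduced]
```

Only the entries under :introduced are the change’s fault. Without the baseline set there is nothing to subtract, which is exactly where the model ended up.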

Right off the rip, it started trying to use an older Ruby hash syntax when A) it doesn’t need to and B) I’ve told Claude many times not to do that. Second, in specs it has the bad habit of using let statements and then using before blocks to access the data. This is not the preferred way of doing things. The preferred way is to use let!. I can’t tell you how many times I’ve had to tell Claude not to do this.
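
On the hash syntax point, here is a quick illustration. I’m assuming "older syntax" means hash rockets with symbol keys; the keys and values are made up for the example.

```ruby
# Assumption: "older syntax" refers to hash rockets with symbol keys.
old_style = { :name => 'Ada', :active => true }

# The shorter form, available since Ruby 1.9, builds the same Hash.
new_style = { name: 'Ada', active: true }

puts old_style == new_style

# Hash rockets are still needed when the keys are not symbols.
by_string = { 'name' => 'Ada' }
puts by_string.keys.first.class
```

Both literals produce an identical Hash, so the rocket form only adds noise when every key is a symbol.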

For the non-RSpec crowd: let is lazy, let! is eager.

describe User, '.count_active' do
  let(:lazy_user)  { create(:user, active: true) }
  let!(:eager_user) { create(:user, active: true) }

  it 'counts active users' do
    expect(User.count_active).to eq(1)  # passes — only eager_user exists;
                                        # lazy_user was never referenced, never created
  end

  context 'when lazy_user is referenced' do
    before { lazy_user }                # NOW the lazy block runs

    it 'counts both users' do
      expect(User.count_active).to eq(2)
    end
  end
end

After coding up the solution in the backend and frontend, it gets into trouble. It realizes that there are failing specs but doesn’t know if those existed before it started making changes or not. So it starts trying to stash its changes and switch to main to re-run the specs. That won’t work because of the database migration that it created: the migration has already been run, so the database no longer matches the schema on main.

I let it off the hook and finally tell it that those specs were failing previously. With that it gives me a summary of what changed and says that it’s done.

You might be thinking, “well, that’s not so bad” but it’s actually missed a critical piece of the puzzle. Remember when I said that the recommendation preference isn’t set to medium currently but I told the model to make it that way? The reason that I did that is to see if it would realize that there needed to be some data migration step for existing users.

In this context, users would see some funky behavior, which isn’t a show stopper, but that’s not the point of the test. The thing being changed could just as easily be a tax rate, a payment plan rate, or a payment processor. Seasoned software engineers know to point this out and then either let it ride with the funky behavior or take data migration into account. It’s not a big deal for the app in question but it could be catastrophic for another production system.
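
To make that decision concrete, here is a toy model in plain Ruby with no Rails involved. Everything in it is hypothetical: the recommendation_level field, the :loose label for the old behavior, and the assumption that existing rows have no stored value yet. It is a sketch of the choice that should have been surfaced, not the app’s real schema.

```ruby
# Toy model, no Rails. All names are hypothetical.
LEGACY_LEVEL = :loose   # assumed label for how recommendations behaved before
NEW_DEFAULT  = :medium  # the default requested in the task

# Assumption: existing users have no stored preference yet.
existing_users = [
  { id: 1, recommendation_level: nil },
  { id: 2, recommendation_level: nil }
]

# Explicit backfill: every row without a value gets a level chosen on purpose.
# Returns new hashes and leaves the input untouched.
def backfill(users, level)
  users.map do |user|
    user[:recommendation_level].nil? ? user.merge(recommendation_level: level) : user
  end
end

# Option 1: existing users keep what they had; only new users get medium.
keep_old_behavior = backfill(existing_users, LEGACY_LEVEL)

# Option 2: everyone moves to the new default, knowingly.
adopt_new_default = backfill(existing_users, NEW_DEFAULT)

puts keep_old_behavior.map { |u| u[:recommendation_level] }.inspect
puts adopt_new_default.map { |u| u[:recommendation_level] }.inspect
```

Either option can be the right call. What matters is that someone picks one deliberately, and in a Rails app that choice would typically live in a migration or a one-off task.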

Finally, it seemed to guess at some of the language in the tooltip, and the guess is wrong. A minor annoyance but it’s whatever in the scheme of things.

Initial Conclusion

I’m not an AI Doomer but I’m not an AI Zoomer (wait, is that the opposite of doomer? 🤷‍♂️ It is now. 😅). I want to see results, not hype, and so far in my initial testing it’s only marginally better than the previous model. Also, something I’ve noticed as I’ve continued to use it: the longer it runs, the worse it seems to get (just like the previous model). So sure, it’s better but it’s still making critical errors.