Getting more out of your open-source LLM with a Ralph Wiggum Loop

Later this year, I'm giving a talk at work on how we use open-source Large Language Models (LLMs) as a fallback in case some of the frontier model providers do something funky, like disabling models or driving up prices too much.

The talk I'm giving isn't just about how to host models. It is also about using open-source models effectively.

I learned that while open-source models are not as powerful as the frontier models, they can still be effective when used with the right approach. The big models are awesome, but with the right strategy, open-source models can beat the frontier models on cost because you don't have to pay for the tokens. The main trade-off is that it takes more time.

In this post, I'm giving a short preview of what's to come, because I found something out this weekend that I think can't wait until September. I learned that you can use a Ralph Loop to make open-source models effective at long-horizon tasks.

What is a Ralph Loop?

Last year in July, Geoffrey Huntley came up with this idea: let's do something really stupid and place an agent in a loop, giving it the same command over and over until it's finished.

You could write a simple bash script that looks like this:

while :; do cat TASK.md | claude-code; done

When I first saw this, I thought it was an amazingly stupid idea, and I never used it after reading about it, simply because it doesn't add value with frontier models. It burns more tokens, and those models work very well without this silly loop.

But the story turns out to be very different for open-source models, so it's worth exploring the Ralph loop in greater depth. The Ralph loop is, in fact, a lot more complicated than a while loop. I've made a drawing to figure things out.

The Ralph loop visualized

It all starts with a TASK.md file containing the task the agent should work on. This can be a more abstract explanation of what you want to achieve. This is a file you write (with or without help from another agent, maybe). This task description is then broken down into an implementation plan.

The implementation plan should have a specific format. You should have a list of implementation tasks the agent needs to perform, in the order they should be performed. So something like this:

- [ ] Create a database migration for the NextAuth schema
- [ ] Create the authentication middleware based on NextAuth
- [ ] Create unit tests for the authentication middleware

Fragment from the implementation plan

After you've broken down the larger task into a list of smaller tasks for the agent, you give the plan to the agent, who gets the following additional instructions:

Read {{input_file}}. Pick the first unchecked item "- [ ]", implement and verify it, then mark it "- [x]" in {{input_file}}. Do exactly one item, commit the changes, then stop.

The Ralph loop instructions
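
To make the placeholder concrete, here is a minimal sketch of how the `{{input_file}}` slot could be filled before the prompt is handed to the agent. The function name `render_instructions` and the file name `PLAN.md` are assumptions for illustration, not part of any particular tool.

```python
# The instruction text from the post, with its {{input_file}} placeholder intact.
RALPH_INSTRUCTIONS = (
    'Read {{input_file}}. Pick the first unchecked item "- [ ]", implement and '
    'verify it, then mark it "- [x]" in {{input_file}}. Do exactly one item, '
    "commit the changes, then stop."
)


def render_instructions(template: str, input_file: str) -> str:
    """Replace every {{input_file}} placeholder with the path to the plan."""
    return template.replace("{{input_file}}", input_file)


# PLAN.md is a hypothetical name for the implementation plan file.
prompt = render_instructions(RALPH_INSTRUCTIONS, "PLAN.md")
print(prompt)
```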

Once the agent is done, we'll check whether all tasks are marked complete and stop the loop if so.
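
The completion check can be sketched in a few lines of Python. This is an illustration that assumes the plan uses exactly the "- [ ]" and "- [x]" checklist format shown above; the function names are hypothetical.

```python
import re
from typing import List, Optional, Tuple

# Matches a checklist line and captures the mark (" ", "x" or "X") and the text.
CHECKBOX = re.compile(r"^\s*- \[( |x|X)\] (.+)$")


def parse_plan(text: str) -> List[Tuple[bool, str]]:
    """Return (done, description) for every checklist line in the plan."""
    items = []
    for line in text.splitlines():
        match = CHECKBOX.match(line)
        if match:
            items.append((match.group(1) != " ", match.group(2).strip()))
    return items


def first_unchecked(text: str) -> Optional[str]:
    """Return the description of the first open item, or None if there is none."""
    for done, description in parse_plan(text):
        if not done:
            return description
    return None


def all_done(text: str) -> bool:
    """True only when the plan has at least one item and every item is checked."""
    items = parse_plan(text)
    return bool(items) and all(done for done, _ in items)


plan = """\
- [x] Create a database migration for the NextAuth schema
- [ ] Create the authentication middleware based on NextAuth
- [ ] Create unit tests for the authentication middleware
"""
print(first_unchecked(plan))  # Create the authentication middleware based on NextAuth
print(all_done(plan))  # False
```

Treating an empty plan as "not done" is a deliberate choice in this sketch: a plan without any checklist items is more likely a broken file than a finished job.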

Why does Ralph work so well for open-source?

I tried to implement NextAuth on an application I'm working on with an instruction like:

Implement NextAuth in my application. Follow the guide located at https://next-auth.js.org/getting-started/example. Use GitHub authentication. Assume the necessary environment variables are available in `.env`.

My initial task for the agent

It failed because the LLM couldn't reason its way through the task. I'm using a 30B Qwen 3.6 model that can reason, just not as well as the Opus 4.8 behemoth from Anthropic.

I felt stupid because I knew this wouldn't work. And I was fairly disappointed with the results at first. It went so far that I concluded: my colleagues can never use this for real work.

I was wrong: when I tried the same task with a Ralph loop, it worked flawlessly. It took 12 minutes and 5 Ralph iterations to complete. And I could walk away and do other stuff, like making a cup of coffee.

By breaking the large, abstract plan into smaller steps, I'm reducing the work to something elementary and fairly stupid. This is what the open-source model excels at. The Ralph loop instructions I showed earlier in this post are central to this. I'm showing the agent the big picture to provide essential context, but then I want it to focus on a single task to simplify things.

There is another important element in the Ralph instructions. I'm telling the agent to verify the work. I have a fairly extensive set of rule-based validations that I force the agent to run:

  • I want the agent to run the Oxc formatter and linter to ensure it produces properly formatted, typed code.
  • I want the agent to run unit tests with Vitest and Storybook, ensuring proper test coverage for both the logic and user interface components.

I've added instructions to the AGENTS.md file in my project to help the agent perform the validation tasks. I even configured Husky pre-commit hooks to force the agent to run the validations.
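
As an illustration of what "rule-based" means here, the sketch below runs a list of checks in order and fails on the first one that returns a non-zero exit code. It is not the Husky setup from my project: the `run_validations` helper is hypothetical, and the commands are stand-ins that you would replace with your own formatter, linter, and test invocations.

```python
import subprocess
import sys
from typing import List


def run_validations(commands: List[List[str]]) -> bool:
    """Run each command in order; stop and report at the first failure."""
    for command in commands:
        result = subprocess.run(command)
        if result.returncode != 0:
            print(f"validation failed: {' '.join(command)}", file=sys.stderr)
            return False
    return True


# Stand-in commands so the sketch runs anywhere (assumption: replace these
# with your real format, lint, and test commands).
checks = [
    [sys.executable, "-c", "print('format and lint ok')"],
    [sys.executable, "-c", "print('unit tests ok')"],
]

ok = run_validations(checks)
print("all validations passed" if ok else "validations failed")
```

The point of the design is that the result is a plain pass or fail that the agent cannot argue with, which is what makes it useful as a gate before a commit.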

So, the Ralph loop isn't as simple as a while loop. It forces you to face a few realities about your project and why you're still the captain of the ship:

  • You need to have a specific structure for the work you want the agent to perform.
  • You need to have a strong (preferably rule-based) validation setup.

This requires human involvement because AI doesn't know what you want and will hallucinate all sorts of crap about your project. Neither good instructions nor good-quality validations are easy to get right. Both require iteration and constant supervision.

The tooling I'm using

I learned a lot from building the Ralph loop for my open-source LLM setup. It takes a bit of work to get going, but it's really powerful. So I made a tool to help me out: Bosun.

The name is inspired by a nautical theme. The bosun is the person connecting the captain to the deckhand. The captain tells the bosun the task, and the bosun puts the deckhands to work.

Bosun is a Python-based CLI tool that lets you run an implementation plan through a Ralph loop with the following command:

bosun implement TASK.md

For Bosun to work properly, you need to have Pi, an open-source coding agent, installed on your system. I chose Pi because it integrates easily with open-source models and has a very minimal interface that I can communicate with over an inter-process communication channel using JSON.

During implementation, I found that the agent would sometimes stall on a task. I added a check to the Bosun tool to help prevent stalls. When the task file hasn't been updated for three consecutive iterations, the tool stops.

I also added an early-stopping condition so we're not running the agent unnecessarily. When all tasks in an implementation plan are completed, the tool stops too.
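
The two stop conditions can be sketched together as follows. This is a simplified illustration, not Bosun's actual code: `ralph_loop`, `plan_is_complete`, and the `run_agent` callable are hypothetical names, and the real agent call is replaced by a fake that just checks off one item per iteration.

```python
import tempfile
from pathlib import Path
from typing import Callable

MAX_STALLED_ITERATIONS = 3  # from the post: stop after three unchanged iterations


def plan_is_complete(text: str) -> bool:
    """True when the plan has checklist items and none of them is unchecked."""
    lines = [line.strip() for line in text.splitlines()]
    has_items = any(line.startswith(("- [ ]", "- [x]")) for line in lines)
    return has_items and not any(line.startswith("- [ ]") for line in lines)


def ralph_loop(plan_path: Path, run_agent: Callable[[Path], None]) -> str:
    """Run the agent until the plan is complete ("complete") or stalls ("stalled")."""
    stalled = 0
    while True:
        before = plan_path.read_text()
        if plan_is_complete(before):
            return "complete"  # early stop: nothing left to do
        run_agent(plan_path)
        if plan_path.read_text() == before:
            stalled += 1  # the agent did not touch the plan this iteration
            if stalled >= MAX_STALLED_ITERATIONS:
                return "stalled"
        else:
            stalled = 0  # progress was made, so reset the counter


def fake_agent(path: Path) -> None:
    """Stand-in for a real coding agent: checks off the first open item."""
    path.write_text(path.read_text().replace("- [ ]", "- [x]", 1))


with tempfile.TemporaryDirectory() as tmp:
    plan = Path(tmp) / "PLAN.md"  # hypothetical plan file name
    plan.write_text("- [ ] first task\n- [ ] second task\n")
    print(ralph_loop(plan, fake_agent))  # complete
```

Passing the agent in as a callable keeps the loop testable without a model running, and swapping `fake_agent` for a function that does nothing shows the stall path: the loop gives up after three iterations.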

Give it a try

Please give the Ralph loop a try if you're disappointed in your open-source models. You're in for a surprise! Grab the tooling from my GitHub repository: https://github.com/wmeints/bosun