Introducing OpenRL: A self-hosted post-training API for fine-tuning LLMs

We are pleased to share a research preview of OpenRL, a new open-source project coming out of GKE Labs. OpenRL is a self-hosted training API for fine-tuning LLMs on your own Kubernetes cluster.

Why we built it

If you look at agentic RL on LLMs, it is incredibly easy to get bogged down in system complexity. To run a single RL loop, you have to coordinate a dozen different things: selecting and cleaning datasets, choosing RL environments, debugging training loops, managing reward signals, handling inference mismatches, allocating hardware, and managing infrastructure. The picture looks something like this:

Figure: An AI researcher and an infrastructure engineer staring at the hurdles in post-training on the way to the summit.

Each of these is a hard problem. What makes them harder still is how tightly AI research and infrastructure concerns are mixed together in today's tooling and frameworks.

We believe decoupling the infrastructure from AI research can make these problems more tractable, so that infrastructure engineers and AI researchers can tackle them independently. We have seen this pattern before with Kubernetes, which abstracted away the infrastructure and made life easier for both application developers and SREs.

So, can you abstract out post-training infrastructure? We believe so, and we drew both inspiration and validation from Tinker (from Thinking Machines). The Tinker APIs for post-training hit the Goldilocks zone: they hide all the post-training infrastructure behind four key APIs:

Figure: High-level components and their interaction in an OpenRL-based RL workflow.

The end result of this abstraction is that AI researchers get full flexibility over their RL loop, while infrastructure engineers can focus on scaling, orchestration, and reliability. OpenRL lets you run the same training APIs on your own infrastructure. This decoupling has other interesting benefits as well.
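To make the split concrete, here is a minimal sketch of what a researcher-owned loop can look like when the infrastructure sits behind a small client. Everything in it is an assumption for illustration: `StubTrainingClient` is an in-memory toy, its method names (`sample`, `forward_backward`, `optim_step`, `save_state`) are modeled loosely on Tinker-style primitives, and the signatures are not OpenRL's or Tinker's actual API. The "model" is a single number and the reward is a toy function.

```python
import random
from dataclasses import dataclass, field


@dataclass
class StubTrainingClient:
    """Hypothetical in-memory stand-in for a remote training service."""
    weight: float = 0.0          # the whole "policy" is one scalar mean
    pending_grad: float = 0.0    # gradient accumulated by forward_backward
    checkpoints: list = field(default_factory=list)

    def sample(self, n, rng):
        # Draw n actions from a unit-variance Gaussian centred on the weight.
        return [rng.gauss(self.weight, 1.0) for _ in range(n)]

    def forward_backward(self, actions, advantages):
        # REINFORCE-style estimate: advantage * d/d(weight) log-prob(action).
        self.pending_grad = sum(
            adv * (a - self.weight) for a, adv in zip(actions, advantages)
        ) / len(actions)

    def optim_step(self, lr):
        # Plain gradient ascent on expected reward, then clear the gradient.
        self.weight += lr * self.pending_grad
        self.pending_grad = 0.0

    def save_state(self):
        # "Checkpoint" by remembering the current weight.
        self.checkpoints.append(self.weight)


def reward(action, target=3.0):
    # Toy environment: actions closer to the target score higher.
    return -(action - target) ** 2


def rl_loop(client, steps=200, batch=64, lr=0.05, seed=0):
    """The researcher's loop: it only talks to the client, never to hardware."""
    rng = random.Random(seed)
    for _ in range(steps):
        actions = client.sample(batch, rng)
        rewards = [reward(a) for a in actions]
        baseline = sum(rewards) / len(rewards)
        advantages = [r - baseline for r in rewards]
        client.forward_backward(actions, advantages)
        client.optim_step(lr)
    client.save_state()
    return client.weight


if __name__ == "__main__":
    print(rl_loop(StubTrainingClient()))
```

Note that `rl_loop` knows nothing about GPUs or schedulers. In this sketch, swapping the stub for a client that talks to a remote endpoint would leave the loop itself unchanged, which is the decoupling described above.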

Sharing GPUs

Traditional RL loops are strictly sequential. The trainer waits for the sampler to finish rollouts, the sampler waits for the environment to score rewards (which is often bound by slow CPU or network tasks), and the whole loop sits blocked. Your expensive GPUs spend a lot of time doing nothing. The abstraction lets you run multiple RL jobs concurrently, so infrastructure engineers can pack their training and sampling steps together and keep the GPUs busier. The graph below shows the GPU consumption in OpenRL when running one, two, and three RL jobs concurrently.

Figure: The trainer/sampler duty cycle in OpenRL for scenarios with 1 RL job, 2 RL jobs, and 3 RL jobs respectively.
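The intuition behind packing can be checked with a toy simulation. The sketch below assumes a single GPU that serves one job at a time, and jobs that alternate between a GPU phase (sampling plus training) and an off-GPU phase (environment scoring). The phase durations and the first-come, first-served policy are made-up assumptions for illustration. They are not measurements from OpenRL, and this is not OpenRL's scheduler.

```python
def gpu_utilization(num_jobs, gpu_secs=2.0, off_gpu_secs=4.0, horizon=600.0):
    """Fraction of `horizon` the GPU is busy when `num_jobs` RL jobs share it.

    Each job repeats: gpu_secs on the GPU, then off_gpu_secs waiting on the
    environment. The GPU serves whichever job became ready first.
    """
    ready_at = [0.0] * num_jobs   # when each job next needs the GPU
    gpu_free_at = 0.0             # when the GPU finishes its current phase
    busy = 0.0
    while True:
        # Pick the job that has been waiting longest (earliest ready time).
        job = min(range(num_jobs), key=lambda j: ready_at[j])
        start = max(gpu_free_at, ready_at[job])
        if start >= horizon:
            break
        end = start + gpu_secs
        busy += min(end, horizon) - start   # clip work past the horizon
        gpu_free_at = end
        ready_at[job] = end + off_gpu_secs  # job goes off to score rewards
    return busy / horizon


if __name__ == "__main__":
    for jobs in (1, 2, 3):
        print(jobs, "job(s):", round(gpu_utilization(jobs), 2))
```

With these assumed timings, the toy model keeps the GPU busy about one third of the time for one job and about two thirds for two jobs, and it saturates the GPU with three. Real duty cycles depend on the actual phase lengths and on how the sampler and trainer are scheduled.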

Better UX

Once the infrastructure sits behind the APIs, developing the RL loop becomes a noticeably better experience, because AI researchers no longer have to wrangle complex Python dependencies such as CUDA. During R&D, you do not have to run the RL loop directly on the GPU machines. You can run it on your Mac and point it at the training APIs running on a Kubernetes cluster or on VMs.

Autoresearch

We believe that frontier AI research will become increasingly automated, and that abstracting out infrastructure as a building block is key to that. To demonstrate this, we added an autoresearch recipe inspired heavily by Karpathy's work. The recipe shows how to run parallel experiments that sweep parameters and improve the reward signal for our text-to-SQL recipe for Gemma models.

Figure: The autoresearch UI with multiple AI researchers conducting experiments in parallel in OpenRL.
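As a rough illustration of the parallel-sweep idea (not the recipe's actual code), the sketch below fans a small grid of configurations out over a thread pool and ranks them by final reward. `run_experiment` is a hypothetical placeholder that returns a synthetic score instead of training anything, and the swept parameters (learning rate and LoRA rank) are assumptions chosen for the example. Threads are used on the assumption that each real experiment would mostly wait on a remote training endpoint.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def run_experiment(lr, rank):
    """Hypothetical stand-in for one full RL run; returns a final reward.

    The score is synthetic, peaked at lr=1e-4 and rank=16. No model is trained.
    """
    return -abs(math.log10(lr) + 4) - abs(math.log2(rank) - 4) / 4


def sweep(learning_rates, ranks, max_workers=4):
    """Run every (lr, rank) combination in parallel, best score first."""
    configs = list(product(learning_rates, ranks))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so scores line up with configs.
        scores = list(pool.map(lambda cfg: run_experiment(*cfg), configs))
    return sorted(zip(scores, configs), reverse=True)


if __name__ == "__main__":
    for score, (lr, rank) in sweep([1e-5, 1e-4, 1e-3], [8, 16, 32]):
        print(f"lr={lr:g} rank={rank} reward={score:.2f}")
```

Because each experiment only needs a client for the training APIs, the number of concurrent runs in a setup like this would be limited by the capacity behind the endpoint rather than by the machine driving the sweep.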

What OpenRL is not

  • A managed service. OpenRL is self-hosted. We aim to make it easy for users to deploy and operate it on their own Kubernetes clusters.
  • An RL framework. OpenRL gives AI researchers full control over their RL loop.

Get started

We have made it easy to run OpenRL on your Mac, on NVIDIA GPUs, or on GKE. This lets you test your RL loop on your Mac and, when you are ready to scale, point it at the OpenRL endpoint running in your GKE cluster.

Try out our text-to-SQL example, which teaches the latest Gemma model to write SQL, in the guides.

One of the benefits of a Tinker-compatible endpoint is that you can use Tinker-Cookbook with OpenRL. Tinker-Cookbook is one of the best resources for post-training infrastructure for RL.

Future steps

We have started with a simple architecture focussing on LoRA fine-tuning and plan to evolve the project in the coming months, so please give it a try and share your feedback. A few things we are very excited to work on:

  • Full parameter fine-tuning
  • Multitenancy (simultaneous RL on different types of base models)

Acknowledgement

We have been inspired by the work of various open-source projects in the AI community, so a huge thank you to Thinking Machines, vLLM, PyTorch, prime-rl, verl, SkyRL, and llm-d.