We are pleased to share a research preview of OpenRL, a new open-source project coming out of GKE Labs. OpenRL is a self-hosted training API for fine-tuning LLMs on your own Kubernetes cluster.
Why we built it
If you look at agentic RL on LLMs, it is incredibly easy to get bogged down in system complexity. To run a single RL loop, you have to coordinate a dozen different things: selecting and cleaning datasets, choosing RL environments, debugging training loops, managing reward signals, handling inference mismatches, allocating hardware, and managing infrastructure. The picture looks something like this:
Each of these is a hard problem. But what makes it more complex is how tightly AI research and infrastructure concerns are mixed together in today's tooling and frameworks.
We believe decoupling the infrastructure from AI research can make these problems more tractable, so that infrastructure engineers and AI researchers can tackle them independently. We have seen this pattern before with Kubernetes, which abstracted away the infrastructure and made life easier for application developers and SREs.
So, can you abstract away post-training infrastructure? We believe so, and we drew huge inspiration and validation from Tinker (from Thinking Machines). The Tinker APIs for post-training hit the Goldilocks zone: they hide all the post-training infrastructure behind four key APIs:
The end result of this abstraction is that AI researchers get full flexibility over their RL loop, while infrastructure engineers can focus on scaling, orchestration, and reliability. OpenRL lets you run the same training APIs on your own infrastructure. This decoupling also has other interesting benefits.
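To make the split concrete, here is a minimal sketch of an RL loop written against a small training interface. The method names (`sample`, `forward_backward`, `optim_step`, `save_state`) loosely follow the primitives described in Tinker's public documentation, but the signatures, the `FakeBackend` class, the toy reward, and the checkpoint path are all assumptions made for illustration. They are not OpenRL's or Tinker's actual API.

```python
import random
from typing import Protocol


class TrainingClient(Protocol):
    """Hypothetical four-call surface; names and signatures are assumptions."""

    def sample(self, prompt: str, n: int) -> list[str]: ...
    def forward_backward(self, examples: list[tuple[str, str, float]]) -> float: ...
    def optim_step(self, learning_rate: float) -> None: ...
    def save_state(self, name: str) -> str: ...


class FakeBackend:
    """In-memory stand-in for a remote service so the sketch runs anywhere."""

    def __init__(self, seed: int = 0) -> None:
        self.rng = random.Random(seed)
        self.steps = 0      # optimizer steps applied so far
        self.pending = 0    # examples with accumulated, unapplied gradients

    def sample(self, prompt: str, n: int) -> list[str]:
        return [f"{prompt} -> candidate {self.rng.randint(0, 9)}" for _ in range(n)]

    def forward_backward(self, examples: list[tuple[str, str, float]]) -> float:
        self.pending += len(examples)
        # Placeholder "loss": the mean advantage of the batch.
        return sum(adv for _, _, adv in examples) / len(examples)

    def optim_step(self, learning_rate: float) -> None:
        self.steps += 1
        self.pending = 0

    def save_state(self, name: str) -> str:
        return f"fake://checkpoints/{name}-step{self.steps}"


def reward(completion: str) -> float:
    """Toy reward owned by the researcher: prefer even candidate ids."""
    return 1.0 if int(completion[-1]) % 2 == 0 else 0.0


def rl_loop(client: TrainingClient, prompts: list[str], iterations: int) -> str:
    """The researcher's side: sampling policy, rewards, and baselines live here."""
    for _ in range(iterations):
        batch = []
        for prompt in prompts:
            completions = client.sample(prompt, n=4)
            rewards = [reward(c) for c in completions]
            baseline = sum(rewards) / len(rewards)  # per-prompt mean baseline
            batch += [(prompt, c, r - baseline) for c, r in zip(completions, rewards)]
        client.forward_backward(batch)         # infrastructure computes gradients
        client.optim_step(learning_rate=1e-4)  # infrastructure applies the update
    return client.save_state("text-to-sql-demo")
```

Everything inside `rl_loop` is research logic; everything behind `client` is infrastructure. Swapping `FakeBackend` for a client that talks to a remote endpoint would not change the loop itself.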
Sharing GPUs
Traditional RL loops are strictly sequential. The trainer waits for the sampler to finish rollouts, the sampler waits for the environment to score rewards (often bound by slow CPU or network tasks), and the whole loop sits blocked, so your expensive GPUs spend much of their time idle. Putting training and sampling behind an API lets several RL jobs run at once, so infrastructure engineers can pack their training and sampling steps onto the same GPUs and raise utilization. The graph below shows GPU consumption in OpenRL when running one, two, and three RL jobs concurrently.
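The packing argument can be illustrated with a toy simulation. This is a sketch under strong assumptions, not a model of OpenRL's scheduler: there is a single shared GPU, every job alternates a fixed GPU phase with a fixed off-GPU wait (reward scoring, network calls), and the function name and durations are invented for the example.

```python
import heapq


def simulate_shared_gpu(num_jobs: int, gpu_seconds: float,
                        wait_seconds: float, horizon: float) -> float:
    """Return the fraction of `horizon` during which one shared GPU is busy.

    Each job repeats: `gpu_seconds` on the GPU, then `wait_seconds` off it.
    Jobs are served in the order they become ready (ties broken by job id).
    """
    if num_jobs < 1 or gpu_seconds <= 0 or wait_seconds < 0 or horizon <= 0:
        raise ValueError("need num_jobs >= 1, gpu_seconds > 0, "
                         "wait_seconds >= 0, horizon > 0")

    ready = [(0.0, job) for job in range(num_jobs)]  # (ready time, job id)
    heapq.heapify(ready)
    gpu_free_at = 0.0
    busy = 0.0

    while ready:
        ready_at, job = heapq.heappop(ready)
        start = max(ready_at, gpu_free_at)  # wait for the GPU if it is taken
        if start >= horizon:
            break  # every remaining job would start even later
        end = min(start + gpu_seconds, horizon)
        busy += end - start
        gpu_free_at = start + gpu_seconds
        # After its GPU phase the job goes off-GPU, then becomes ready again.
        heapq.heappush(ready, (gpu_free_at + wait_seconds, job))

    return busy / horizon


if __name__ == "__main__":
    for jobs in (1, 2, 3, 4):
        share = simulate_shared_gpu(jobs, gpu_seconds=1.0,
                                    wait_seconds=3.0, horizon=40.0)
        print(f"{jobs} job(s): GPU busy {share:.0%}")
```

With the assumed 1 s of GPU work per 3 s of waiting, a single job leaves the GPU idle for most of each cycle, and each additional job fills part of that gap until the GPU saturates. The numbers come from the assumed durations only; they are not OpenRL measurements.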
Better UX
Once you separate the infrastructure out behind the APIs, you start to see gains in the experience of developing the RL loop, because AI researchers no longer have to wrangle complex Python dependencies such as CUDA. When you are doing R&D, you do not have to run the RL loop directly on the machines with GPUs. You can simply run it on your Mac, pointing to the training APIs running on a Kubernetes cluster or on VMs.
Autoresearch
We believe that frontier AI research will become more and more automated, and that abstracting out infrastructure as a building block is key to that. To demonstrate this, we added an autoresearch recipe inspired heavily by Karpathy's work. The recipe shows how to run parallel experiments to perform a parameter sweep and improve the reward signal for our text-to-SQL recipe for Gemma models.
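The general shape of a parallel parameter sweep can be sketched as follows. This is not the autoresearch recipe itself: `run_experiment` is a hypothetical stand-in that returns a synthetic score instead of launching a real RL job, and the swept parameters (`learning_rate`, `lora_rank`) and their values are assumptions chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def run_experiment(learning_rate: float, lora_rank: int) -> float:
    """Hypothetical stand-in for one RL job.

    A real version would drive a training endpoint and return the final mean
    reward. This one returns a synthetic score that peaks at
    learning_rate=1e-4, lora_rank=16.
    """
    return -abs(learning_rate - 1e-4) * 1e4 - abs(lora_rank - 16) / 16


def sweep(learning_rates: list[float], lora_ranks: list[int],
          max_workers: int = 4) -> tuple[tuple[float, int], float]:
    """Run every configuration in the grid concurrently and return the best."""
    grid = list(product(learning_rates, lora_ranks))
    # Threads are enough here: each experiment would mostly wait on a remote
    # service, so the local process is I/O bound rather than CPU bound.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(lambda cfg: run_experiment(*cfg), grid))
    best_score, best_cfg = max(zip(scores, grid), key=lambda pair: pair[0])
    return best_cfg, best_score


if __name__ == "__main__":
    config, score = sweep([1e-5, 1e-4, 1e-3], [8, 16, 32])
    print(f"best config: lr={config[0]}, rank={config[1]}, score={score}")
```

Because the experiments only talk to a training API, the sweep driver itself needs no GPUs and can run wherever the researcher, or an automated agent, happens to be.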
What OpenRL is not
- A managed service. OpenRL is self-hosted. We aim to make it easy for users to deploy and operate it on their own Kubernetes clusters.
- An RL framework. OpenRL gives AI researchers full control over their RL loop.
Get started
We have made it easy to run OpenRL on your Mac, on NVIDIA GPUs, or on GKE. You can test your RL loop on a Mac and, when you are ready to scale, point the same loop at the OpenRL endpoint running in your GKE cluster.
Try our text-to-SQL example, which teaches the latest Gemma model SQL, in the guides.
One benefit of a Tinker-compatible endpoint is that you can use Tinker Cookbook with OpenRL. Tinker Cookbook is one of the best resources for post-training infrastructure for RL.
Future steps
We have started with a simple architecture focused on LoRA fine-tuning and plan to evolve the project over the coming months, so please give it a try and share your feedback. A few things we are very excited to work on:
- Full parameter fine-tuning
- Multitenancy (simultaneous RL on different types of base models)
Acknowledgement
We have been inspired by the work of many open-source projects in the AI community, so a huge thank you to Thinking Machines, vLLM, PyTorch, prime-rl, verl, SkyRL, and llm-d.