I am considering doing RL as a service for companies looking to finetune LLMs, and I have doubts. It is a lot more compute-intensive. it promises data efficiency, but training is more unstable, it is less straightforward to debug, and there are so many moving parts in infra and environment setup that make reproducibility very difficult unless you just have the compute to scale. was wondering how far RL for agents is from adoption? are there people experimenting with this in your work/training custom reasoning models? is it worth it?

Posted in