ASRL: Alternating Supervised and Reinforcement Learning for Efficient Small Language Model Training with Live Datasets

19 Sep

Author: Ouissam Drissi

Abstract: Look, here's the thing: training small language models to think properly is hard. Really hard. Especially when you're working with just 600 million parameters and need them to follow a specific format while actually being smart about it. I've been there. You try pure reinforcement learning and your model outputs garbage for the first 10 epochs. You try supervised learning and it just memorizes without understanding. So I built something different. ASRL (Alternating Supervised and Reinforcement Learning) switches between supervised fine-tuning and GRPO (Group Relative Policy Optimization) within each epoch. Not after completing all supervised training. Not as separate phases. Every. Single. Epoch. First the model learns from your actual examples, then it explores variations through RL. Rinse and repeat. The results? My 0.6B-parameter model learned my custom thinking format in 3 epochs instead of 12. It handles new data as it arrives without restarting training. And it actually understands what it's doing instead of just pattern matching. This isn't some theoretical framework; I built this because I needed it. My training data grows by 200 examples per hour, I have strict formatting requirements, and I'm running on limited hardware. Traditional methods failed me. ASRL didn't.
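To make the schedule concrete, here is a minimal structural sketch of the loop the abstract describes: new examples are absorbed at the top of every epoch, a supervised pass runs first, then a GRPO pass explores variations of the same prompts. The helper names (`fetch_new_examples`, `sft_phase`, `grpo_phase`, `reward_fn`) are hypothetical placeholders for illustration, not the actual ASRL implementation.

```python
# A minimal sketch of the ASRL training loop, assuming a growing live
# dataset of {"prompt": ..., "target": ...} examples. All helpers below
# are placeholder stubs -- the post does not publish the real code.

from typing import Callable, List


def fetch_new_examples(since_epoch: int) -> List[dict]:
    """Placeholder: pull whatever labeled examples arrived since last epoch
    (e.g. from a queue, a database, or a growing JSONL file)."""
    return []


def sft_phase(model, examples: List[dict]) -> None:
    """Placeholder: one supervised pass over the current examples
    (standard next-token cross-entropy on the formatted targets)."""
    ...


def grpo_phase(model, prompts: List[str], reward_fn: Callable[[str], float]) -> None:
    """Placeholder: one GRPO pass -- sample a group of completions per prompt,
    score each with reward_fn (e.g. format compliance plus task reward), and
    update the policy toward completions that beat the group average."""
    ...


def asrl_train(model, reward_fn: Callable[[str], float], num_epochs: int = 3) -> None:
    dataset: List[dict] = []
    for epoch in range(num_epochs):
        # Live data: absorb new examples without restarting training.
        dataset.extend(fetch_new_examples(since_epoch=epoch))

        # Phase 1, same epoch: learn directly from the actual examples.
        sft_phase(model, dataset)

        # Phase 2, same epoch: explore variations of those prompts via RL.
        prompts = [ex["prompt"] for ex in dataset]
        grpo_phase(model, prompts, reward_fn)
```

The point of the sketch is the interleaving: both phases sit inside one epoch body, so incoming data feeds the very next supervised pass instead of waiting for a separate RL stage to finish.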