A collection of policy-gradient reinforcement learning experiments covering code generation and mathematical reasoning.

| Branch | Description | Domain |
|---|---|---|
| MBPP | Code generation with GRPO + vLLM + SandboxFusion | MBPP coding tasks |
| gsm8k | Mathematical reasoning with GRPO | GSM8K dataset |
| math_posttrain | Post-training experiments | MATH dataset (Hendrycks) |

Each experiment lives in its own branch with complete code, documentation, and results. The main branch serves as an index to all experiments.