Policy Gradient Experiments

A collection of policy gradient and reinforcement learning experiments covering code generation and mathematical reasoning.

Experiments

Branch	Description	Domain
MBPP	Code generation with GRPO + vLLM + SandboxFusion	MBPP coding tasks
gsm8k	Mathematical reasoning with GRPO	GSM8K dataset
math_posttrain	Post-training experiments	MATH dataset (Hendrycks)

Each experiment lives in its own branch with complete code, documentation, and results. The main branch serves as an index to all experiments.
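
Two of the experiments above use GRPO. As a rough orientation for readers new to it, the group-relative advantage that gives GRPO its name can be sketched as below. This is a minimal illustration, not code taken from any branch: the function name `group_relative_advantages`, the `eps` constant, the use of the population standard deviation, and the example rewards are all assumptions made for the sketch.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions for the same prompt.

    Each completion's advantage is its reward minus the group mean,
    divided by the group standard deviation (plus eps to avoid dividing by zero).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards: 1.0 if a sampled completion passed its check, else 0.0.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Completions that score above their group's mean receive positive advantages and the rest receive negative ones. When every completion in a group earns the same reward, all advantages are zero and that group contributes no learning signal.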
