grpo_demo.ipynb: Training reduces accuracy from 53.0% to 49.0%

I'm trying to reproduce the Gemma fine-tuning shown in [your blog post](https://developers.googleblog.com/en/introducing-tunix-a-jax-native-library-for-llm-post-training/). I ran the `grpo_demo.ipynb` notebook on a `v5e-4` machine on GCP. 

**Expected Behavior**

I expected accuracy to improve after training, similar to the increase from 52.67% to 64.06% shown in your blog post.

**Actual Behavior**

After about an hour the training finished, and I saw a reduction in accuracy from 53% to 49%:

```
Before training: corr=53, total=100, accuracy=53.0%, partial_accuracy=61.0%, format_accuracy=65.0%
After training: corr=49, total=100, accuracy=49.0%, partial_accuracy=56.00000000000001%, format_accuracy=91.0%
```
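To make the comparison easier to read, here is a small sketch that parses the two evaluation lines above and prints the change per metric in percentage points. The `parse_metrics` helper and the `LOG` constant are my own illustration and are not part of Tunix or the notebook.

```python
import re

# Evaluation lines copied verbatim from the notebook output above.
LOG = """\
Before training: corr=53, total=100, accuracy=53.0%, partial_accuracy=61.0%, format_accuracy=65.0%
After training: corr=49, total=100, accuracy=49.0%, partial_accuracy=56.00000000000001%, format_accuracy=91.0%
"""


def parse_metrics(line):
    """Return {metric_name: percent} for every `name=value%` pair in a line.

    Hypothetical helper for this issue only; `corr` and `total` are skipped
    because they are raw counts without a trailing `%`.
    """
    return {name: float(value) for name, value in re.findall(r"(\w+)=([\d.]+)%", line)}


before, after = (parse_metrics(line) for line in LOG.splitlines())

# Change in percentage points, rounded to hide floating-point noise
# such as 56.00000000000001.
deltas = {name: round(after[name] - before[name], 1) for name in before}

for name, delta in deltas.items():
    print(f"{name}: {before[name]:.1f}% -> {after[name]:.1f}% ({delta:+.1f} points)")
```

Accuracy drops by 4 points and partial accuracy by 5 points, while format accuracy rises by 26 points, so the model appears to have learned the output format without getting better at the answers.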

I uploaded the notebook, including its output, to Colab so you can inspect it: https://colab.research.google.com/drive/1HfFkurkr-FSuDF_YZbFcPHFGNS5BI0i5. Note that I ran it on GCP rather than on Colab, because Colab does not offer `v5e-4` as far as I know.

**Steps to Reproduce the Problem**

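1. Run the `grpo_demo.ipynb` notebook from start to finish on a `v5e-4` machine on GCP, using the Tunix commit listed under Environment.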

**Environment**

- **OS:** Ubuntu 22.04.3 LTS
- **Project Version:** `git+https://github.com/google/tunix@f0d5d1e63e24f42647fd4e6122641d689f8bfd0e`

**Checklist**

- [X] I have searched the existing issues for a similar bug report.
- [X] I have provided all the required information in the "Environment" section.
- [X] I have provided a minimal, reproducible example.

**Would you like to help us fix it?**

Yes.


grpo_demo.ipynb: Training reduces accuracy from 53.0% to 49.0% #688
