Finish PR #98: replay fallback checkpoint compatibility + RL test updates #166
amcberkes wants to merge 38 commits into google:copybara_push from
Conversation
amcberkes commented Apr 26, 2026
- Completes PR #98
- Incorporates the replay fallback direction from Yukta's rl module finish #127
- Replaces Reverb-only checkpoint calls with ReplayBufferManager.save_checkpoint() (see the sketch after this list)
- Adds TFUniform fallback checkpoint save/restore support
- Updates RL tests for fallback compatibility
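A rough sketch of the dispatch this implies. Only ReplayBufferManager.save_checkpoint() is named in this PR; the attributes and the _write_metadata() helper below are illustrative assumptions:

```python
class ReplayBufferManager:
  """Hypothetical sketch; only save_checkpoint() is named in this PR."""

  def save_checkpoint(self):
    if self._reverb_server is not None:
      # Reverb persists its own tables through the server's client.
      self._reverb_server.localhost_client().checkpoint()
    else:
      # TFUniform fallback: checkpoint the buffer's variables and persist
      # construction metadata so restore can rebuild a matching buffer
      # (see the replay_buffer_metadata.json discussion below).
      self._tf_uniform_checkpointer.save()
      self._write_metadata()
```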
Code Review
This pull request introduces a comprehensive reinforcement learning (RL) framework, including scripts for agent training, evaluation, and configuration generation. It adds support for DDPG and SAC agents, implements custom observers for trajectory recording and visualization, and provides a replay buffer management system with a Reverb backend and a TFUniform fallback. Feedback focuses on critical logic errors such as a redundant training-step increment, potential resource leaks from uncleaned temporary directories in the evaluation script, and dead code and hardcoded configuration values that hinder maintainability.
```python
log_interval=args.log_interval,
checkpoint_interval=args.checkpoint_interval,
learner_iterations=args.learner_iterations,
...
train_step.assign_add(1)
```
Manually incrementing train_step here is redundant and likely incorrect. agent_learner.run(iterations=self.learner_iterations) already increments the train_step variable by the number of iterations performed. This double-incrementing will cause triggers (like evaluation and saving) to fire much more frequently than intended, as they are based on the global step count.
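A minimal sketch of the suggested fix, assuming the tf_agents Learner API; agent_learner and learner_iterations come from the hunk above, the loop condition and total_train_steps are illustrative:

```python
while train_step.numpy() < total_train_steps:
  # Learner.run() already advances the shared train_step variable by
  # `iterations`, and the interval triggers (eval, checkpoint) read that
  # same counter -- so no manual train_step.assign_add(1) afterwards.
  loss_info = agent_learner.run(iterations=self.learner_iterations)
```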
```python
self._num_timesteps_in_episode = self._environment.pyenv.envs[
    0
]._num_timesteps_in_episode
```
```python
model_structure_dir = None
if os.path.exists(os.path.join(policy_dir, "greedy_policy")):
  model_structure_dir = os.path.join(policy_dir, "greedy_policy")
  logger.info("Using model structure from greedy_policy directory")
else:
  raise ValueError(
      "No policy structure directories found in"
      f" {os.path.abspath(policy_dir)}"
  )
```
```python
self.temp_saved_model_policy_dirpath = create_merged_saved_model(
    self.saved_model_policy_dirpath
)
```
The ExperimentEvaluator class creates a temporary directory via create_merged_saved_model but does not provide a mechanism to clean it up. This will lead to an accumulation of temporary directories on disk. Consider adding a cleanup method or implementing the class as a context manager to ensure shutil.rmtree is called on self.temp_saved_model_policy_dirpath.
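A minimal sketch of that suggestion. The attribute name comes from the hunk above; the cleanup() method and the context-manager wiring are illustrative:

```python
import shutil


class ExperimentEvaluator:
  ...

  def cleanup(self):
    # Remove the merged SavedModel copy created by create_merged_saved_model.
    if self.temp_saved_model_policy_dirpath is not None:
      shutil.rmtree(self.temp_saved_model_policy_dirpath, ignore_errors=True)
      self.temp_saved_model_policy_dirpath = None

  def __enter__(self):
    return self

  def __exit__(self, exc_type, exc_value, traceback):
    self.cleanup()
    return False  # never swallow exceptions
```

Used as `with ExperimentEvaluator(...) as evaluator:`, this guarantees the temporary directory is removed even if evaluation raises.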
```python
def old_main(argv: Sequence[str]):
  if len(argv) > 1:
    raise app.UsageError("Too many command-line arguments.")

  # handle relative and absolute filepaths:
  config_filepath = FLAGS.eval_config_filepath
  if not os.path.isabs(config_filepath):
    config_filepath = os.path.join(ROOT_DIR, config_filepath)

  policy_dirpath = FLAGS.eval_policy_dirpath
  if (
      policy_dirpath is not None
      and not os.path.isabs(policy_dirpath)
      and policy_dirpath != "schedule"
  ):
    policy_dirpath = os.path.join(ROOT_DIR, policy_dirpath)

  evaluate_policy(
      experiment_name=FLAGS.eval_experiment_name,
      policy_dirpath=policy_dirpath,
      config_filepath=config_filepath,
      num_eval_episodes=FLAGS.num_eval_episodes,
  )
```
```python
if self.agent_type not in ['sac', 'ddpg']:
  raise ValueError(
      f'Agent {self.agent_type} has not (yet) been implemented. Please'
      " choose one of: ['sac', 'ddpg']."
  )
```
```python
data_spec=self.agent.collect_data_spec,
capacity=50000,  # Use default capacity
checkpoint_dir=new_buffer_path,  # Use the copied buffer path
sequence_length=2,
# should we keep these defaults, or use the dynamic parameter values?
)
```
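On the hardcoded defaults: since the follow-up comment below mentions a persisted replay_buffer_metadata.json, one option is to read the construction parameters back instead of pinning capacity=50000 and sequence_length=2. A hypothetical helper (the function name and JSON keys are assumptions):

```python
import json
import os


def load_replay_metadata(buffer_dir, default_capacity=50000,
                         default_sequence_length=2):
  """Hypothetical helper: read back buffer params persisted at save time."""
  meta_path = os.path.join(buffer_dir, "replay_buffer_metadata.json")
  if not os.path.exists(meta_path):
    return default_capacity, default_sequence_length
  with open(meta_path) as f:
    meta = json.load(f)
  # Rebuilding the buffer with the checkpointed capacity avoids the
  # save/restore shape mismatch the hardcoded values can cause.
  return (
      meta.get("capacity", default_capacity),
      meta.get("sequence_length", default_sequence_length),
  )
```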
I got the RL training running end-to-end on my Mac without dm-reverb. I switched replay handling to a TFUniform fallback path and added metadata persistence for replay capacity (replay_buffer_metadata.json), which fixed the restore mismatch I was hitting during train startup.
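A rough sketch of what that fallback path could look like. Only TFUniformReplayBuffer and the replay_buffer_metadata.json filename come from the comment above; the function names and the tf.train.Checkpoint scaffolding are assumptions:

```python
import json
import os

import tensorflow as tf
from tf_agents.replay_buffers import tf_uniform_replay_buffer


def save_tf_uniform_checkpoint(buffer, checkpoint_dir, capacity):
  """Checkpoint a TFUniformReplayBuffer plus the metadata to rebuild it."""
  ckpt = tf.train.Checkpoint(replay_buffer=buffer)
  ckpt.save(os.path.join(checkpoint_dir, "replay_buffer"))
  # Persist construction parameters alongside the variables so restore can
  # build an identically shaped buffer before loading the checkpoint.
  with open(os.path.join(checkpoint_dir, "replay_buffer_metadata.json"), "w") as f:
    json.dump({"capacity": capacity}, f)


def restore_tf_uniform_buffer(data_spec, checkpoint_dir):
  with open(os.path.join(checkpoint_dir, "replay_buffer_metadata.json")) as f:
    meta = json.load(f)
  # Rebuild with the persisted capacity, then load the saved variables.
  buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
      data_spec=data_spec,
      batch_size=1,
      max_length=meta["capacity"],
  )
  ckpt = tf.train.Checkpoint(replay_buffer=buffer)
  ckpt.restore(tf.train.latest_checkpoint(checkpoint_dir))
  return buffer
```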
@amcberkes awesome! Looks like there are two file conflicts that need resolution. Let me know once they are resolved and I will provide a review.