Soft Actor-Critic (SAC) algorithm + Hindsight Experience Replay (HER) implementation on Robosuite Panda robot and Gymnasium Pick And Place (Pytorch)

Implementation of SAC + HER on the Gymnasium Robotics Fetch Pick And Place sparse-reward environment, and of SAC without HER on the Gymnasium Robotics Fetch Reach dense-reward environment and the Robosuite Stack dense-reward environment.

Gymnasium environment: SAC + HER

The first notebook contains the code for Soft Actor-Critic + HER implemented on the Gymnasium Robotics Fetch Pick And Place environment. The base environment produces a sparse binary reward, which is why I implemented HER (Hindsight Experience Replay): it works better for sparse-reward environments. I wrote a custom class inheriting from gym.Wrapper to modify some attributes of the base environment (the x, y position of the base robot and the z position of the block). The observations are normalized with Welford's running algorithm before being passed to the networks, then clipped to the range [-5, 5] to stabilize training, mostly following the method used in the Hindsight Experience Replay paper even though the base algorithms differ (they used DDPG, I am using SAC). That clipping is probably mostly motivated by the use of tanh as the activation function for the actions, which saturates for large inputs. The running statistics are only updated during data collection; at training and test time I use the frozen statistics for normalization.
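
Below is a minimal sketch of such a normalizer (not the exact class from the notebook, and the name is hypothetical): Welford's online mean/variance, updated only while collecting data, then applied with frozen statistics and clipped to [-5, 5].

```python
import numpy as np

class RunningNormalizer:
    """Welford's online mean/variance with clipping (illustrative sketch)."""

    def __init__(self, shape, clip=5.0, eps=1e-8):
        self.mean = np.zeros(shape, dtype=np.float64)
        self.m2 = np.zeros(shape, dtype=np.float64)  # sum of squared deviations
        self.count = 0
        self.clip = clip
        self.eps = eps

    def update(self, x):
        # Welford's online update for a single observation
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def normalize(self, x, update=False):
        # update=True only during data collection; frozen otherwise
        if update:
            self.update(x)
        std = np.sqrt(self.m2 / max(self.count, 1)) + self.eps
        return np.clip((x - self.mean) / std, -self.clip, self.clip)
```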

The ratio of HER to regular transitions is 4:1, which the HER paper reports as one of the best choices (they showed that k between 4 and 8 performs best across most Gym Fetch tasks when coupled with the "future" strategy), so the strategy used here is "future". I also use a thread-safe Queue to prefetch training batches, starting from the warmup phase, and then pull the stored batches from the queue to reduce idle time during training. Threading matters here because sampling the HER batches with my sampling code (which might not be the optimal sampling method) is computationally expensive; after some tests, the thread + queue combo for data collection sped up training by more than 3x.
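
A hedged sketch of the "future" relabeling with k = 4 (a hypothetical helper, not the notebook's exact sampler): `compute_reward` is assumed to be the environment's goal-conditioned reward function (`env.unwrapped.compute_reward` for the Fetch tasks), and each episode step is assumed to be a dict with "achieved_goal" and "desired_goal" keys.

```python
import random

def her_relabel(episode, compute_reward, k=4):
    """Return original transitions plus k 'future'-relabeled copies per step."""
    transitions = []
    T = len(episode)
    for t, step in enumerate(episode):
        # keep the real transition with its original goal
        transitions.append({**step, "reward": compute_reward(
            step["achieved_goal"], step["desired_goal"], {})})
        # add k copies whose goal is an achieved goal from a *future* step
        for _ in range(k):
            future = random.randint(t, T - 1)
            new_goal = episode[future]["achieved_goal"]
            transitions.append({
                **step,
                "desired_goal": new_goal,
                "reward": compute_reward(step["achieved_goal"], new_goal, {}),
            })
    return transitions
```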

The second notebook contains SAC implemented on the Gymnasium Fetch Reach environment. It is a relatively easy environment to solve: I trained on the dense-reward version and did not use any "tricks" (no Welford's algorithm, etc.); the standard SAC implementation was enough to solve it after 1 million steps (the same number of steps I used to solve this environment with the TD3 algorithm).
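
For reference, the policy head in a standard SAC implementation is a tanh-squashed Gaussian; the following is a generic sketch of that formulation, not necessarily the notebook's exact network.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mu = nn.Linear(hidden, act_dim)
        self.log_std = nn.Linear(hidden, act_dim)

    def forward(self, obs):
        h = self.net(obs)
        mu, log_std = self.mu(h), self.log_std(h).clamp(-20, 2)
        dist = torch.distributions.Normal(mu, log_std.exp())
        pre_tanh = dist.rsample()               # reparameterized sample
        action = torch.tanh(pre_tanh)           # squash into [-1, 1]
        # log-prob correction for the tanh squashing
        log_prob = dist.log_prob(pre_tanh) - torch.log(1 - action.pow(2) + 1e-6)
        return action, log_prob.sum(-1, keepdim=True)
```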

Robosuite environment: SAC

Contains the code for SAC on the Robosuite Stack environment, using the Panda robot with a three-finger dexterous gripper. The goal of the Stack environment is to stack two blocks on top of one another; I use the dense-reward version for my implementation. The only real difference from the Gymnasium environment is the observation space (much larger here than in Gym, mostly because the robot is more complex and the grippers are different). Normalization follows the same method as in the Gymnasium Fetch Pick And Place environment (Welford's algorithm, with statistics updated only during data collection and then frozen for both training and test time).
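
A hedged sketch of how such an environment can be created; the keyword values, in particular the gripper name, are assumptions and may differ across robosuite versions.

```python
import robosuite as suite

env = suite.make(
    "Stack",
    robots="Panda",
    gripper_types="RobotiqThreeFingerGripper",  # assumed three-finger gripper name
    reward_shaping=True,           # dense reward version
    has_renderer=False,
    has_offscreen_renderer=False,
    use_camera_obs=False,          # low-dimensional (state) observations
)
```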

Training and vectorization method (trained on Kaggle)

I use sync mode with autoreset disabled for the Gymnasium Fetch Pick And Place environment instead of the async method because, after running many tests, the asynchronous method simply does not work in the Jupyter environment for either environment. One way to use async mode is to convert the notebook to a Python file, import that file as a dataset, and run the code with Python's built-in exec function. Another, more straightforward method is to add the path of the Python file to the system path with sys. (These two methods only work with the Gym env, not with the Robosuite env.)
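
A sketch of the sync vectorization setup, assuming Gymnasium >= 1.1 (where SyncVectorEnv accepts an autoreset_mode argument); the env ID suffix depends on the installed gymnasium-robotics version.

```python
import gymnasium as gym
import gymnasium_robotics

gym.register_envs(gymnasium_robotics)  # registers the Fetch environments

def make_env():
    # version suffix (-v2 / -v3) depends on your gymnasium-robotics install
    return gym.make("FetchPickAndPlace-v3")

envs = gym.vector.SyncVectorEnv(
    [make_env for _ in range(8)],
    autoreset_mode=gym.vector.AutoresetMode.DISABLED,
)
```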

I disabled autoreset because, with the autoreset mode set to "next step", the reward at the first step of each freshly reset episode is 0.0, which might lead to some instability during training: in the sparse setting, 0.0 is the reward for a solved state, so the environment signals success even though the task is not solved. The way I found to avoid that issue is to manually reset the done environments.
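
A sketch of that manual reset, continuing from the `envs` created above; the `options={"reset_mask": ...}` argument is an assumption based on how recent Gymnasium versions document the DISABLED autoreset mode, so check the docs of the version you use.

```python
import numpy as np

obs, _ = envs.reset(seed=0)
for step in range(1000):
    actions = envs.action_space.sample()      # replace with policy actions
    obs, rewards, terminated, truncated, infos = envs.step(actions)
    done = np.logical_or(terminated, truncated)
    if done.any():
        # with autoreset disabled, only the finished sub-envs are reset by hand
        obs, _ = envs.reset(options={"reset_mask": done})
```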

Training logs

For both environments, I tracked the same quantities: the variance of the actions generated by the policy during data collection, the variance of the actions generated by the policy during training, the entropy (temperature) optimization loss, the critics' losses, and the value of alpha.
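
The alpha value and entropy loss mentioned above come from the standard SAC temperature update; here is a generic sketch of it (not necessarily the notebook's exact code).

```python
import torch

act_dim = 4                                   # Fetch action dimension (dx, dy, dz, gripper)
target_entropy = -float(act_dim)              # common heuristic: -|A|
log_alpha = torch.zeros(1, requires_grad=True)
alpha_opt = torch.optim.Adam([log_alpha], lr=3e-4)

def update_alpha(log_prob):
    # log_prob: log pi(a|s) of actions sampled from the current policy
    alpha_loss = -(log_alpha * (log_prob + target_entropy).detach()).mean()
    alpha_opt.zero_grad()
    alpha_loss.backward()
    alpha_opt.step()
    return log_alpha.exp().item(), alpha_loss.item()  # current alpha and its loss
```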

References

  1. Soft Actor-Critic paper 1 (2018)
  2. Soft Actor-Critic paper 2 (2019)
  3. Hindsight Experience Replay paper
  4. SAC author's implementation (TensorFlow)
  5. OpenAI SAC implementation (PyTorch)
  6. Useful SAC implementation 2 (PyTorch)
  7. Useful SAC implementation 3 (PyTorch)
  8. Robosuite benchmark setup
