Soft Actor-Critic (SAC)
The Soft Actor-Critic (SAC) algorithm extends DDPG by 1) using a stochastic policy, which in theory can express multi-modal optimal policies. This also enables 2) entropy regularization based on the stochastic policy's entropy, which serves as a built-in, state-dependent exploration heuristic for the agent, instead of relying on uncorrelated noise processes as in DDPG [TODO: link] or TD3 [TODO: link]. Additionally, it 3) uses two Soft Q-networks to reduce the overestimation bias of Q-network based methods.
Original papers: The introduction of the SAC algorithm, and its later updates and improvements, can be traced chronologically through the following publications:
- Reinforcement Learning with Deep Energy-Based Policies
- Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
- Composable Deep Reinforcement Learning for Robotic Manipulation
- Soft Actor-Critic for Discrete Action Settings
Reference resources:
- haarnoja/sac
- openai/spinningup
- ikostrikov/pytorch-a2c-ppo-acktr-gail
- denisyarats/pytorch_sac
- DLR-RM/stable-baselines3
- haarnoja/softqlearning
- rail-berkeley/softlearning
Variants Implemented | Description |
---|---|
, docs | For continuous action space |
Below is our single-file implementation of SAC:
It has the following features:
- For continuous action space.
- Works with the observation space of low-level features.
- Works with the (continuous) action space.
- Numerically stable stochastic policy based on the openai/spinningup and ikostrikov/pytorch-a2c-ppo-acktr-gail implementations.
- Supports automatic entropy coefficient \(\alpha\) tuning, enabled by default.
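To make the "numerically stable stochastic policy" point concrete, here is a minimal, dependency-free sketch of a tanh-squashed Gaussian policy for a single action dimension. It is not the code of this implementation; the function names (`softplus`, `squashed_log_prob`, `sample_action`) are hypothetical, and the stable form of the tanh correction is the one commonly seen in Spinning Up-style code, assumed here for illustration.

```python
import math
import random


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def squashed_log_prob(u, mean, log_std):
    """Log-density of a = tanh(u) when u ~ Normal(mean, exp(log_std))."""
    std = math.exp(log_std)
    gaussian_log_prob = (
        -0.5 * ((u - mean) / std) ** 2 - log_std - 0.5 * math.log(2.0 * math.pi)
    )
    # log(1 - tanh(u)^2) rewritten as 2 * (log 2 - u - softplus(-2u)),
    # which stays finite even when tanh(u) rounds to exactly +/-1.
    log_det_jacobian = 2.0 * (math.log(2.0) - u - softplus(-2.0 * u))
    return gaussian_log_prob - log_det_jacobian


def sample_action(mean, log_std, rng):
    """Draw a squashed action in [-1, 1] together with its log-probability."""
    u = rng.gauss(mean, math.exp(log_std))
    return math.tanh(u), squashed_log_prob(u, mean, log_std)


# Hypothetical usage: mean and log_std would normally come from the actor network.
action, log_prob = sample_action(mean=0.0, log_std=-0.5, rng=random.Random(0))
```

The naive expression `log(1 - tanh(u)**2)` fails for large `|u|` because `tanh(u)` rounds to exactly 1 in floating point; the rewritten form avoids that case. The negative of the returned log-probability is a single-sample estimate of the policy entropy used for regularization.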
poetry install
# Pybullet
poetry install -E pybullet
## Default
python cleanrl/ --env-id HopperBulletEnv-v0
## Without Automatic entropy coef. tuning
python cleanrl/ --env-id HopperBulletEnv-v0 --autotune False --alpha 0.2
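The difference between the two commands above is whether \(\alpha\) stays fixed (here at 0.2) or is adjusted during training. As a rough sketch of how automatic tuning can work, the snippet below performs one plain gradient step on \(\log \alpha\). The helper names, the plain gradient-descent update, and the target entropy heuristic of minus the action dimension are assumptions for illustration, not a description of this implementation.

```python
import math


def alpha_loss_and_grad(log_alpha, log_probs, target_entropy):
    """Temperature loss -alpha * mean(log_pi + target_entropy) and its gradient."""
    alpha = math.exp(log_alpha)
    gap = sum(lp + target_entropy for lp in log_probs) / len(log_probs)
    loss = -alpha * gap
    # d/d(log_alpha) of -exp(log_alpha) * gap is -exp(log_alpha) * gap.
    grad = -alpha * gap
    return loss, grad


def tune_alpha(log_alpha, log_probs, target_entropy, lr=1e-3):
    """One gradient-descent step on log_alpha (an optimizer such as Adam is more typical)."""
    _, grad = alpha_loss_and_grad(log_alpha, log_probs, target_entropy)
    return log_alpha - lr * grad


# Hypothetical usage for a 1-dimensional action space.
action_dim = 1
target_entropy = -float(action_dim)  # assumed heuristic
log_alpha = 0.0                      # alpha starts at 1.0
log_alpha = tune_alpha(log_alpha, log_probs=[2.0, 3.0], target_entropy=target_entropy, lr=0.1)
alpha = math.exp(log_alpha)
```

In this sketch, when the estimated entropy (the negated mean log-probability) falls below the target, \(\alpha\) grows and the entropy bonus gains weight; when the policy is more random than the target, \(\alpha\) shrinks.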
Explanation of the logged metrics
Running python cleanrl/ will automatically record various metrics, such as actor or value losses, in TensorBoard. Below is the documentation for these metrics:
- : episodic return of the game
- : number of steps per second
- : for each Soft Q-value network \(Q_{\theta_i}\), \(i \in \{1,2\}\), this metric holds the mean squared error (MSE) between the soft Q-value estimate \(Q_{\theta_i}(s_{t}, a_t)\) and the entropy regularized Bellman update target estimated as \(r_t + \gamma \, Q_{\theta_{i}^{'}}(s_{t+1}, a') + \alpha \, \mathcal{H} \big[ \pi(a' \vert s_{t+1}) \big]\).
  More formally, the Soft Q-value loss for the \(i\)-th network is obtained by:
  $$ J(\theta_i) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}} \Big[ \big( Q_{\theta_i}(s, a) - y \big)^2 \Big] $$
  with the entropy regularized Bellman update target $$ y = r + \gamma \, Q_{\theta_{i}^{'}}(s', a') + \alpha \, \mathcal{H} \big[ \pi(a' \vert s') \big] $$, where \(a' \sim \pi( \cdot \vert s')\), \(\mathcal{H} \big[ \pi(a' \vert s') \big]\) represents the entropy of the policy, and \(\mathcal{D}\) is the replay buffer storing samples of the agent during training.
  [TODO: add the min over the target Q values]
- : averages losses/qf1_loss and the loss of the second Soft Q-value network, for comparison with algorithms using a single Q-value network.
- :
- : \(\alpha\) coefficient for entropy regularization of the policy.
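The Soft Q-value target and loss described in the list above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the code of this implementation: the function names are hypothetical, the `done` mask is an added assumption, the minimum is taken over both target networks (the step the TODO refers to), and the entropy bonus is applied inside the discounted term as the single-sample estimate \(-\alpha \log \pi(a' \vert s')\), as is common in SAC implementations.

```python
def soft_q_target(reward, done, gamma, alpha, q1_next, q2_next, next_log_prob):
    """Entropy-regularized Bellman target for one transition."""
    # Clipped double-Q: use the smaller of the two target-network estimates.
    min_q_next = min(q1_next, q2_next)
    # -alpha * log pi(a'|s') is a one-sample estimate of alpha * entropy.
    soft_value = min_q_next - alpha * next_log_prob
    # No bootstrapping past a terminal state (done = 1.0).
    return reward + (1.0 - done) * gamma * soft_value


def mean_squared_error(predictions, targets):
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(predictions)


def soft_q_losses(q1_preds, q2_preds, targets):
    """Per-network MSE losses plus their average, mirroring the logged metrics."""
    qf1_loss = mean_squared_error(q1_preds, targets)
    qf2_loss = mean_squared_error(q2_preds, targets)
    return qf1_loss, qf2_loss, 0.5 * (qf1_loss + qf2_loss)


# Hypothetical single-transition batch sampled from the replay buffer.
y = soft_q_target(reward=1.0, done=0.0, gamma=0.99, alpha=0.2,
                  q1_next=2.0, q2_next=3.0, next_log_prob=-1.0)
qf1_loss, qf2_loss, qf_loss = soft_q_losses([3.0], [4.0], [y])
```

Both Q-networks regress toward the same target `y`; only the target computation uses the minimum, which is what counteracts overestimation.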
Implementation details
Experiment results
PR vwxyzjn/cleanrl#146 tracks our effort to conduct experiments, and the reproduction instructions can be found at vwxyzjn/cleanrl/benchmark/sac.
Tracked experiments and game play videos: