Soft Actor-Critic (SAC)
Overview
The Soft Actor-Critic (SAC) algorithm extends the DDPG algorithm by 1) using a stochastic policy, which in theory can express multi-modal optimal policies. This in turn enables 2) entropy regularization based on the stochastic policy's entropy, which serves as a built-in, state-dependent exploration heuristic for the agent, instead of relying on uncorrelated noise processes as in DDPG [TODO: link] or TD3 [TODO: link]. Additionally, it incorporates 3) the usage of two Soft Q-networks to reduce the overestimation bias in Q-network-based methods.
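To make points 2) and 3) concrete, below is a minimal, self-contained sketch (with illustrative stand-in networks and names, not CleanRL's actual classes) of how the critic target combines an action sampled from the stochastic policy, the entropy bonus \(-\log \pi(a' \vert s')\), and the minimum over two target Q-networks:

```python
# Minimal sketch (illustrative stand-ins, not CleanRL's actual code) of the
# entropy-regularized critic target used by SAC.
import torch
import torch.nn as nn

obs_dim, act_dim, batch = 3, 1, 32
gamma, alpha = 0.99, 0.2  # discount factor and entropy coefficient

# Tiny stand-in networks for the stochastic policy and the two *target* soft Q-networks.
policy_mean = nn.Linear(obs_dim, act_dim)
policy_log_std = nn.Parameter(torch.zeros(act_dim))
qf1_target = nn.Sequential(nn.Linear(obs_dim + act_dim, 32), nn.ReLU(), nn.Linear(32, 1))
qf2_target = nn.Sequential(nn.Linear(obs_dim + act_dim, 32), nn.ReLU(), nn.Linear(32, 1))

# A fake batch of transitions (in practice these come from the replay buffer).
next_obs = torch.randn(batch, obs_dim)
rewards, dones = torch.randn(batch, 1), torch.zeros(batch, 1)

with torch.no_grad():
    # 1) stochastic policy: a' ~ pi(.|s'); exploration comes from the policy itself
    dist = torch.distributions.Normal(policy_mean(next_obs), policy_log_std.exp())
    u = dist.rsample()
    next_a = torch.tanh(u)  # squash into a Box action range (change-of-variables terms omitted)
    next_logp = dist.log_prob(u).sum(-1, keepdim=True)
    # 3) two soft Q-networks: take the minimum of the two target estimates
    sa = torch.cat([next_obs, next_a], dim=-1)
    min_q_next = torch.min(qf1_target(sa), qf2_target(sa))
    # 2) entropy regularization: H[pi(.|s')] is estimated by -log pi(a'|s')
    y = rewards + gamma * (1.0 - dones) * (min_q_next - alpha * next_logp)
```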
Original papers: The introduction of the SAC algorithm and its later updates and improvements can be chronologically traced through the following publications:
- Reinforcement Learning with Deep Energy-Based Policies
- Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
- Composable Deep Reinforcement Learning for Robotic Manipulation
- Soft Actor-Critic for Discrete Action Settings
Reference resources:
- haarnoja/sac
- openai/spinningup
- ikostrikov/pytorch-a2c-ppo-acktr-gail
- denisyarats/pytorch_sac
- DLR-RM/stable-baselines3
- haarnoja/softqlearning
- rail-berkeley/softlearning
| Variants Implemented | Description |
| --- | --- |
| `sac_continuous_action.py`, docs | For continuous action space |
Below is our single-file implementation of SAC:
sac_continuous_action.py
The sac_continuous_action.py has the following features:
- For continuous action space.
- Works with the `Box` observation space of low-level features.
- Works with the `Box` (continuous) action space.
- Numerically stable stochastic policy based on the openai/spinningup and ikostrikov/pytorch-a2c-ppo-acktr-gail implementations (see the sketch after this list).
- Supports automatic entropy coefficient \(\alpha\) tuning, enabled by default.
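The "numerically stable stochastic policy" bullet refers to how the log-probability of a tanh-squashed Gaussian action is computed. Below is a sketch of one common numerically stable formulation (the softplus identity used, e.g., by openai/spinningup); the exact code in sac_continuous_action.py may differ in details:

```python
# Sketch of a numerically stable log-probability for a tanh-squashed Gaussian policy,
# in the spirit of the openai/spinningup implementation referenced above.
import math
import torch
import torch.nn.functional as F

def squashed_gaussian_sample(mean: torch.Tensor, log_std: torch.Tensor):
    """Sample a tanh-squashed Gaussian action and its log-probability."""
    dist = torch.distributions.Normal(mean, log_std.exp())
    u = dist.rsample()                 # reparameterized pre-squash sample
    a = torch.tanh(u)                  # squash the action into (-1, 1)
    # Change of variables: log pi(a|s) = log N(u) - sum_i log(1 - tanh(u_i)^2).
    # The identity log(1 - tanh(u)^2) = 2 * (log 2 - u - softplus(-2u)) avoids taking
    # the log of a value that underflows to 0 when tanh saturates.
    log_prob = dist.log_prob(u).sum(-1)
    log_prob -= (2.0 * (math.log(2.0) - u - F.softplus(-2.0 * u))).sum(-1)
    return a, log_prob

mean, log_std = torch.zeros(4, 2), torch.full((4, 2), -0.5)
action, logp = squashed_gaussian_sample(mean, log_std)
```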
Usage
poetry install
# Pybullet
poetry install -E pybullet
## Default
python cleanrl/sac_continuous_action.py --env-id HopperBulletEnv-v0
## Without Automatic entropy coef. tuning
python cleanrl/sac_continuous_action.py --env-id HopperBulletEnv-v0 --autotune False --alpha 0.2
Explanation of the logged metrics
Running `python cleanrl/sac_continuous_action.py` will automatically record various metrics such as actor or value losses in Tensorboard. Below is the documentation for these metrics:
- `charts/episodic_return`: episodic return of the game
- `charts/SPS`: number of steps per second
- `losses/qf1_loss`, `losses/qf2_loss`: for each Soft Q-value network \(Q_{\theta_i}\), \(i \in \{1,2\}\), this metric holds the mean squared error (MSE) between the soft Q-value estimate \(Q_{\theta_i}(s_{t}, a_t)\) and the entropy-regularized Bellman update target estimated as \(r_t + \gamma \, \min_{j=1,2} Q_{\theta_{j}^{'}}(s_{t+1}, a') + \alpha \, \mathcal{H} \big[ \pi(a' \vert s_{t+1}) \big]\).
More formally, the Soft Q-value loss for the \(i\)-th network is obtained by minimizing
$$ J(\theta_i) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}} \Big[ \big( Q_{\theta_i}(s, a) - y \big)^2 \Big] $$
with the entropy regularized Bellman update target $$ y = r + \gamma \, \min_{j=1,2} Q_{\theta_{j}^{'}}(s', a') + \alpha \, \mathcal{H} \big[ \pi(a' \vert s') \big], $$ where \(a' \sim \pi( \cdot \vert s')\), \(\mathcal{H} \big[ \pi(a' \vert s') \big]\) represents the entropy of the policy, the minimum is taken over the two target Q-value networks, and \(\mathcal{D}\) is the replay buffer storing samples of the agent during training.
- `losses/qf_loss`: averages `losses/qf1_loss` and `losses/qf2_loss` for comparison with algorithms using a single Q-value network.
- `losses/actor_loss`: the policy loss \( \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi(\cdot \vert s)} \big[ \alpha \log \pi(a \vert s) - \min_{j=1,2} Q_{\theta_j}(s, a) \big] \), minimized with respect to the policy parameters via the reparameterization trick (see the sketch after this list).
- `losses/alpha`: the entropy regularization coefficient \(\alpha\) of the policy.
- `losses/alpha_loss`: the loss used for automatic tuning of the entropy coefficient, following \( J(\alpha) = \mathbb{E}_{a \sim \pi(\cdot \vert s)} \big[ -\alpha \log \pi(a \vert s) - \alpha \bar{\mathcal{H}} \big] \) from the SAC paper, where \(\bar{\mathcal{H}}\) is the target entropy. Only relevant when automatic entropy coefficient tuning (`--autotune`) is enabled.
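Below is a rough, illustrative sketch of how the actor and entropy-coefficient losses above can be computed (standard SAC updates with stand-in networks and names, not necessarily CleanRL's exact code):

```python
# Illustrative sketch of the quantities logged as losses/actor_loss and losses/alpha_loss.
import torch
import torch.nn as nn

obs_dim, act_dim, batch = 3, 1, 32
obs = torch.randn(batch, obs_dim)  # sampled from the replay buffer in practice

# Tiny stand-ins for the policy and the two soft Q-networks.
policy_mean = nn.Linear(obs_dim, act_dim)
policy_log_std = nn.Parameter(torch.zeros(act_dim))
qf1 = nn.Sequential(nn.Linear(obs_dim + act_dim, 32), nn.ReLU(), nn.Linear(32, 1))
qf2 = nn.Sequential(nn.Linear(obs_dim + act_dim, 32), nn.ReLU(), nn.Linear(32, 1))

# Automatic entropy tuning (--autotune): alpha = exp(log_alpha), target entropy -|A|.
target_entropy = -float(act_dim)
log_alpha = torch.zeros(1, requires_grad=True)
alpha = log_alpha.exp().item()

# a ~ pi(.|s) via the reparameterization trick (tanh correction omitted for brevity).
dist = torch.distributions.Normal(policy_mean(obs), policy_log_std.exp())
u = dist.rsample()
a, log_pi = torch.tanh(u), dist.log_prob(u).sum(-1, keepdim=True)

sa = torch.cat([obs, a], dim=-1)
min_q = torch.min(qf1(sa), qf2(sa))

# losses/actor_loss: E_s[ alpha * log pi(a|s) - min_j Q_j(s, a) ]
actor_loss = (alpha * log_pi - min_q).mean()

# losses/alpha_loss: E[ -log_alpha * (log pi(a|s) + target_entropy) ]
alpha_loss = (-log_alpha * (log_pi.detach() + target_entropy)).mean()
```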
Implementation details
TODO
Experiment results
PR vwxyzjn/cleanrl#146 tracks our effort to conduct experiments, and the reproduction instructions can be found at vwxyzjn/cleanrl/benchmark/sac.
Tracked experiments and game play videos: