Solving Continuous Control using Deep Reinforcement Learning (Policy-Based Methods)

Adrian Chow
6 min read · Jan 9, 2021

Introduction

An alternative to classical control methods is deep reinforcement learning. Both are used to solve an optimization problem for dynamic systems that have a target behavior. Classical control theory deals with the behavior of dynamical systems with inputs and how that behavior can be tuned using feedback. Deep RL, on the other hand, relies on an agent that is trained to learn a policy that maximizes a measurable reward. The following article elaborates on a Deep RL agent's ability to solve a continuous control problem, namely Unity's Reacher.

Real-World Robotics

The application of this project is targeted at, though not limited to, robotic arms. Once the control aspect of the stack is solved, one can then build behaviors on top of it to plan tasks.

Robotic Arm using Deep Reinforcement Learning (Source: ResearchGate)

Furthermore, it is possible to increase the productivity of manufacturing by sharing the trained policy. It has been shown that having multiple copies of the same agent share experience can accelerate learning, as I found while solving this project!

Multiple Robotic Arms using same Policy

The Environment

To solve the task of continuous control, I have used the Reacher environment on the Unity ML-Agents GitHub page.

In this environment, a double-jointed arm can move to target locations. A reward of +0.1 is provided for each step that the agent’s hand is in the goal location. Thus, the goal of your agent is to maintain its position at the target location for as many time steps as possible.

Reacher Environment (Source: Unity ML-Agents)

RL Problem Specifications

  • Goal of Agent: move to target locations
  • Rewards: +0.1 for every step the hand is in the goal location, +0.0 for every step it is not
  • Action Space: continuous, a 4-dimensional vector with each entry in [-1, +1]
  • State Space: continuous, 33-dimensional
  • Solving Condition: average score of +300.00 over 100 consecutive episodes.
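
To make these specifications concrete, here is a minimal sketch of loading the environment with the unityagents Python package and reading off the sizes above. The file_name path is an assumption; point it at your local Reacher build.

from unityagents import UnityEnvironment
import numpy as np

# Path to the downloaded Reacher build (assumption; adjust for your OS and folder layout).
env = UnityEnvironment(file_name="Reacher.app")

# The environment exposes a single "brain" that controls the arm(s).
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

# Reset in training mode and inspect the problem specifications.
env_info = env.reset(train_mode=True)[brain_name]
n_agents = len(env_info.agents)                     # 1 or 20 arms, depending on the build
state_size = env_info.vector_observations.shape[1]  # 33-dimensional state
action_size = brain.vector_action_space_size        # 4-dimensional action in [-1, +1]
print(n_agents, state_size, action_size)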

Learning Algorithm

To solve the environment, I used an Actor-Critic method: a Deep Reinforcement Learning agent that utilizes two neural networks, one to estimate the policy (the actor) and one to estimate the value function (the critic). The policy structure is known as the actor because it is used to select actions, and the estimated value function is known as the critic because it criticizes the actions made by the actor. Although I will go through the general algorithm, you can learn more from Chris Yoon's in-depth article about Actor-Critic Methods. His work was tremendously useful when exploring this topic.

DDPG Algorithm (Source: Here)

The specific Actor-Critic variant used here is the Deep Deterministic Policy Gradient (DDPG). The original paper can be found here. The idea stems from the success of Deep Q-Learning and extends it to continuous action spaces.

The main takeaways are:

  • Uses an Actor to choose the agent's actions deterministically.
  • Uses a Critic to estimate the value (expected return) of a state-action pair.
  • DDPG is an off-policy Actor-Critic algorithm: the policy used to interact with the environment is different from the policy being learned.
  • Uses soft updates to update the target networks.
  • Adds a controlled amount of noise to the actions to explore the action space.

Initialization:

  • Randomly initialize critic network Q(s, a|θ-Q) and actor µ(s|θ-µ) with weights θ-Q (critic) and θ-µ (actor).
  • Initialize target networks denoted by Q’ and µ’ with weights θ-Q’ ← θ-Q, θ-µ’ ← θ-µ
  • Initialize replay buffer R. The replay buffer is a data structure that stores experience tuples (named tuples) with the following fields: ["state", "action", "reward", "next_state", "done"]
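
Translating these initialization steps into code, here is a minimal sketch of the agent's constructor. The class and attribute names are illustrative, and it assumes the Actor, Critic, ReplayBuffer, and OUNoise classes sketched further below.

import torch.optim as optim

class Agent:
    """Sketch of a DDPG agent's initialization (names are illustrative)."""

    def __init__(self, state_size, action_size, seed=0):
        # Local and target Actor networks: µ(s|θ-µ) and µ'(s|θ-µ')
        self.actor_local = Actor(state_size, action_size, seed)
        self.actor_target = Actor(state_size, action_size, seed)
        self.actor_optimizer = optim.Adam(self.actor_local.parameters(), lr=1e-3)

        # Local and target Critic networks: Q(s, a|θ-Q) and Q'(s, a|θ-Q')
        self.critic_local = Critic(state_size, action_size, seed)
        self.critic_target = Critic(state_size, action_size, seed)
        self.critic_optimizer = optim.Adam(self.critic_local.parameters(),
                                           lr=1e-3, weight_decay=0.0)

        # Seeding both copies identically reproduces the copy step θ-Q' ← θ-Q, θ-µ' ← θ-µ.

        # Exploration noise process N and replay buffer R
        self.noise = OUNoise(action_size, seed)
        self.memory = ReplayBuffer(buffer_size=int(1e4), batch_size=128, seed=seed)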

Here is the implementation for the Actor and Critic Neural Networks.
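
A minimal PyTorch sketch of such networks. The hidden-layer sizes (400 and 300 units, taken from the DDPG paper) are an assumption; the exact sizes in my repository may differ.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    """Deterministic policy µ(s|θ-µ): maps a 33-dim state to a 4-dim action."""

    def __init__(self, state_size, action_size, seed, fc1_units=400, fc2_units=300):
        super().__init__()
        self.seed = torch.manual_seed(seed)  # same seed -> same initial weights for local/target
        self.fc1 = nn.Linear(state_size, fc1_units)
        self.fc2 = nn.Linear(fc1_units, fc2_units)
        self.fc3 = nn.Linear(fc2_units, action_size)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        return torch.tanh(self.fc3(x))       # tanh keeps each action component in [-1, 1]


class Critic(nn.Module):
    """Action-value function Q(s, a|θ-Q): maps a (state, action) pair to a scalar."""

    def __init__(self, state_size, action_size, seed, fc1_units=400, fc2_units=300):
        super().__init__()
        self.seed = torch.manual_seed(seed)
        self.fc1 = nn.Linear(state_size, fc1_units)
        self.fc2 = nn.Linear(fc1_units + action_size, fc2_units)  # action joins at the 2nd layer
        self.fc3 = nn.Linear(fc2_units, 1)

    def forward(self, state, action):
        xs = F.relu(self.fc1(state))
        x = torch.cat((xs, action), dim=1)
        x = F.relu(self.fc2(x))
        return self.fc3(x)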

Note that:

  • The Actor uses the Hyperbolic Tangent Activation Function (tanh) to limit the action space to [-1, 1].
  • Using the same seed results in identical initial weights for the local and target models.
  • The Critic factors in the action at the second fully-connected layer.

Here is the implementation of the Replay Buffer:
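
A minimal sketch of a standard uniform-sampling buffer using the fields listed above, assuming minibatches are returned as PyTorch tensors:

import random
from collections import deque, namedtuple
import numpy as np
import torch

class ReplayBuffer:
    """Fixed-size buffer R that stores experience tuples and samples them uniformly at random."""

    def __init__(self, buffer_size, batch_size, seed):
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size
        self.experience = namedtuple(
            "Experience", ["state", "action", "reward", "next_state", "done"])
        random.seed(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(self.experience(state, action, reward, next_state, done))

    def sample(self):
        experiences = random.sample(self.memory, k=self.batch_size)
        states = torch.from_numpy(np.vstack([e.state for e in experiences])).float()
        actions = torch.from_numpy(np.vstack([e.action for e in experiences])).float()
        rewards = torch.from_numpy(np.vstack([e.reward for e in experiences])).float()
        next_states = torch.from_numpy(np.vstack([e.next_state for e in experiences])).float()
        dones = torch.from_numpy(
            np.vstack([e.done for e in experiences]).astype(np.uint8)).float()
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.memory)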

Putting it together:

Training Loop

For episode e = 1 to M:

  • Initialize a random process N for action exploration
  • Receive initial observation state, s1

for i_episode in range(1, n_episodes + 1):
    env_info = env.reset(train_mode=True)[brain_name]  # reset the environment
    states = env_info.vector_observations               # get the initial state of each agent
    score = np.zeros(n_agents)                          # accumulate each agent's score

Here we sample our initial environment state and set up an array to accumulate each agent's score for the episode.

For step t = 1 to T:

for t in range(max_t):
    actions = agent.act(states)                                 # select actions (with noise)
    env_info = env.step(actions)[brain_name]                    # send actions to the environment
    next_states = env_info.vector_observations                  # observe the next states
    rewards = env_info.rewards                                  # observe the rewards
    dones = env_info.local_done                                 # check which agents are done
    agent.step(states, actions, rewards, next_states, dones)    # store experience and learn
    score += env_info.rewards                                   # accumulate the rewards
    states = next_states                                        # roll over to the next states
    if np.any(dones):                                           # stop if any agent is done
        break
  • Select action a(t) = µ(s(t)|θ-µ) + N(t) according to the current policy and exploration noise

Here we use the local Actor model to choose an action, with a bit of noise added for exploration. For that noise, DDPG implementations often use the Ornstein–Uhlenbeck process, which generates temporally correlated noise for exploration efficiency in physical control problems.
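
A minimal sketch of such a noise process and of an act method on the agent, assuming the Actor network above and Gaussian increments; the parameter values theta=0.15 and sigma=0.2 are common defaults, not necessarily the exact ones used in my repository.

import copy
import numpy as np
import torch

class OUNoise:
    """Ornstein-Uhlenbeck process: noise that is correlated from one step to the next."""

    def __init__(self, size, seed, mu=0.0, theta=0.15, sigma=0.2):
        self.mu = mu * np.ones(size)
        self.theta = theta
        self.sigma = sigma
        np.random.seed(seed)
        self.reset()

    def reset(self):
        self.state = copy.copy(self.mu)       # start each episode back at the mean

    def sample(self):
        x = self.state
        dx = self.theta * (self.mu - x) + self.sigma * np.random.standard_normal(len(x))
        self.state = x + dx                   # drift toward the mean plus a random kick
        return self.state


# Method of the Agent class sketched earlier.
def act(self, states, add_noise=True):
    """Choose actions with the local Actor, add exploration noise, and clip to [-1, 1]."""
    states = torch.from_numpy(states).float()
    self.actor_local.eval()
    with torch.no_grad():
        actions = self.actor_local(states).cpu().numpy()
    self.actor_local.train()
    if add_noise:
        actions += self.noise.sample()
    return np.clip(actions, -1, 1)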

  • Execute action a(t), then observe reward r(t) and new state s(t+1), where t is the current time step
env_info = env.step(actions)[brain_name]      # execute the actions
next_states = env_info.vector_observations    # observe s(t+1)
rewards = env_info.rewards                    # observe r(t)
dones = env_info.local_done                   # check whether the episode has ended
  • Store the transition [s(t), a(t), r(t), s(t+1)] in the Replay Buffer, R
agent.step(states, actions, rewards, next_states, dones)
  • The remaining parts of the algorithm are the mathematical learning steps that update the policy (the weights of the local and target networks).
Learning Steps of DDPG Algorithm

The agent takes a step and immediately samples randomly from the replay buffer to learn a bit more about the environment it is currently in. We can split the update into two parts: the Actor update and the Critic update. (Note: the Actor loss is negative because we are trying to maximize the return values predicted by the Critic.)
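
A sketch of that learning step as a method on the agent, assuming the networks and replay buffer sketched above and a discount factor gamma of 0.99; the soft_update helper it calls is shown in the next snippet.

import torch.nn.functional as F

# Method of the Agent class sketched earlier.
def learn(self, experiences, gamma=0.99):
    """One DDPG update from a minibatch sampled out of the replay buffer."""
    states, actions, rewards, next_states, dones = experiences

    # --- Critic update: minimize the TD error against the target networks ---
    actions_next = self.actor_target(next_states)
    q_targets_next = self.critic_target(next_states, actions_next)
    q_targets = rewards + gamma * q_targets_next * (1 - dones)   # no bootstrap past terminal states
    q_expected = self.critic_local(states, actions)
    critic_loss = F.mse_loss(q_expected, q_targets)
    self.critic_optimizer.zero_grad()
    critic_loss.backward()
    self.critic_optimizer.step()

    # --- Actor update: gradient ascent on the Critic's value of the Actor's actions ---
    actions_pred = self.actor_local(states)
    actor_loss = -self.critic_local(states, actions_pred).mean()  # negative sign = maximize
    self.actor_optimizer.zero_grad()
    actor_loss.backward()
    self.actor_optimizer.step()

    # --- Soft-update the target networks toward the local networks ---
    self.soft_update(self.critic_local, self.critic_target, tau=1e-3)
    self.soft_update(self.actor_local, self.actor_target, tau=1e-3)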

DDPG only trains the local networks (the networks that interact with the environment) and performs soft updates on the target networks. As you can see below, a soft update blends in only a small fraction of the local network weights, controlled by Tau.
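A sketch of that soft update, with Tau matching the training parameter listed further down:

# Method of the Agent class sketched earlier.
def soft_update(self, local_model, target_model, tau=1e-3):
    """θ_target <- tau * θ_local + (1 - tau) * θ_target, applied parameter tensor by tensor."""
    for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):
        target_param.data.copy_(tau * local_param.data + (1.0 - tau) * target_param.data)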

Yeah, so that's it! The model is able to learn over time and develops the ability to follow a goal location under continuous control. The full repository can be viewed here.

Plot of Results

Training Parameters:

  • Max Episodes: 1500
  • Max Time Steps: 3000
  • Buffer Size: 10000
  • Batch Size: 128
  • Gamma: 0.99
  • Tau: 1e-3
  • Actor Learning Rate = 1e-3
  • Critic Learning Rate = 1e-3
  • Weight Decay: 0.0
Training Progression

Ideas for Future Work

Another way to take this further is to look into other Actor-Critic methods like PPO, A3C, and D4PG, which use multiple (non-interacting, parallel) copies of the same agent to distribute the task of gathering experience; distributing experience collection this way can further improve learning. In addition, I would like to explore other control methods for active exploration.

This concludes the exploration of continuous control using reinforcement learning algorithms. Continuous control is just one of several applications that Actor-Critic methods excel at tackling. Please check out my previous post, where I tackle Navigation using Deep RL.
