Solving Continuous Control using Deep Reinforcement Learning (Policy-Based Methods)

Adrian Chow
6 min read · Jan 9, 2021

Introduction

An alternative to classical control methods is deep reinforcement learning. Both are used to solve an optimization problem for dynamic systems that have a target behavior. Classical control theory deals with the behavior of dynamical systems with inputs and how that behavior can be tuned using feedback. Deep RL, on the other hand, relies on an agent that is trained to learn a policy that maximizes a measurable reward. The following article elaborates on a Deep RL agent's ability to solve a continuous control problem, namely Unity's Reacher.

Real-World Robotics

The application of this project is targeted at, though not limited to, robotic arms. Once the control aspect of the stack is solved, one can then build behaviors on top of it to plan tasks.

Robotic Arm using Deep Reinforcement Learning (Source: ResearchGate)

Furthermore, it is possible to increase the productivity of manufacturing by sharing the trained policy. It has been shown that having multiple copies of the same agent share experience can accelerate learning, as I found while solving this project!

Multiple Robotic Arms using same Policy

The Environment

To solve the task of continuous control, I have used the Reacher environment on the Unity ML-Agents GitHub page.

In this environment, a double-jointed arm can move to target locations. A reward of +0.1 is provided for each step that the agent’s hand is in the goal location. Thus, the goal of your agent is to maintain its position at the target location for as many time steps as possible.

Reacher Environment (Source: Unity ML-Agents)

RL Problem Specifications

  • Goal of Agent: move to target locations
  • Rewards: +0.1 for every step the hand is in the goal location, +0.0 for every step it is not
  • Action Space: continuous, a 4-dimensional vector with each entry in [-1, +1]
  • State Space: continuous, 33-dimensional
  • Solving Condition: average score of +300.00 over 100 consecutive episodes.
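
To make these specifications concrete, here is a minimal sketch of loading the environment with the unityagents Python package and reading off the sizes above. The file_name path is an assumption; point it at your local Reacher build.

from unityagents import UnityEnvironment
import numpy as np

# Path to the downloaded Reacher build (assumption; adjust for your OS and folder layout).
env = UnityEnvironment(file_name="Reacher.app")

# The environment exposes a single "brain" that controls the arm(s).
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

# Reset in training mode and inspect the problem specifications.
env_info = env.reset(train_mode=True)[brain_name]
n_agents = len(env_info.agents)                     # 1 or 20 arms, depending on the build
state_size = env_info.vector_observations.shape[1]  # 33-dimensional state
action_size = brain.vector_action_space_size        # 4-dimensional action in [-1, +1]
print(n_agents, state_size, action_size)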

Learning Algorithm

To solve the environment, I used an Actor-Critic method: a Deep Reinforcement Learning agent that utilizes two neural networks, one to estimate the policy (the actor) and one to estimate the value function (the critic). The policy structure is known as the actor because it is used to select actions, and the estimated value function is known as the critic because it criticizes the actions made by the actor. Although I will go through the general algorithm, you can learn more from Chris Yoon's in-depth article about Actor-Critic Methods. His work was tremendously useful when exploring this topic.

DDPG Algorithm (Source: Here)

The specific Actor-Critic variant used here is the Deep Deterministic Policy Gradient (DDPG). The original paper can be found here. The idea stems from the success of Deep Q-Learning and extends it to continuous action spaces.

The main takeaways are:

  • Uses an Actor to choose the agent's actions deterministically.
  • Uses a Critic to estimate the value (expected return) of a state-action pair.
  • DDPG is an off-policy Actor-Critic algorithm: the policy used to interact with the environment is different from the policy being learned.
  • Uses soft updates to update the target networks.
  • Adds a controlled amount of noise to the actions to explore the action space.

Initialization:

  • Randomly initialize critic network Q(s, a|θ-Q) and actor µ(s|θ-µ) with weights θ-Q (critic) and θ-µ (actor).
  • Initialize target networks denoted by Q’ and µ’ with weights θ-Q’ ← θ-Q, θ-µ’ ← θ-µ
  • Initialize replay buffer R. The replay buffer is a data structure that stores experience tuples (named tuples) with the following fields: ["state", "action", "reward", "next_state", "done"]
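
Translating these initialization steps into code, here is a minimal sketch of the agent's constructor. The class and attribute names are illustrative, and it assumes the Actor, Critic, ReplayBuffer, and OUNoise classes sketched further below.

import torch.optim as optim

class Agent:
    """Sketch of a DDPG agent's initialization (names are illustrative)."""

    def __init__(self, state_size, action_size, seed=0):
        # Local and target Actor networks: µ(s|θ-µ) and µ'(s|θ-µ')
        self.actor_local = Actor(state_size, action_size, seed)
        self.actor_target = Actor(state_size, action_size, seed)
        self.actor_optimizer = optim.Adam(self.actor_local.parameters(), lr=1e-3)

        # Local and target Critic networks: Q(s, a|θ-Q) and Q'(s, a|θ-Q')
        self.critic_local = Critic(state_size, action_size, seed)
        self.critic_target = Critic(state_size, action_size, seed)
        self.critic_optimizer = optim.Adam(self.critic_local.parameters(),
                                           lr=1e-3, weight_decay=0.0)

        # Seeding both copies identically reproduces the copy step θ-Q' ← θ-Q, θ-µ' ← θ-µ.

        # Exploration noise process N and replay buffer R
        self.noise = OUNoise(action_size, seed)
        self.memory = ReplayBuffer(buffer_size=int(1e4), batch_size=128, seed=seed)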

Here is the implementation for the Actor and Critic Neural Networks.
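
A minimal PyTorch sketch of such networks. The hidden-layer sizes (400 and 300 units, taken from the DDPG paper) are an assumption; the exact sizes in my repository may differ.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    """Deterministic policy µ(s|θ-µ): maps a 33-dim state to a 4-dim action."""

    def __init__(self, state_size, action_size, seed, fc1_units=400, fc2_units=300):
        super().__init__()
        self.seed = torch.manual_seed(seed)  # same seed -> same initial weights for local/target
        self.fc1 = nn.Linear(state_size, fc1_units)
        self.fc2 = nn.Linear(fc1_units, fc2_units)
        self.fc3 = nn.Linear(fc2_units, action_size)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        return torch.tanh(self.fc3(x))       # tanh keeps each action component in [-1, 1]


class Critic(nn.Module):
    """Action-value function Q(s, a|θ-Q): maps a (state, action) pair to a scalar."""

    def __init__(self, state_size, action_size, seed, fc1_units=400, fc2_units=300):
        super().__init__()
        self.seed = torch.manual_seed(seed)
        self.fc1 = nn.Linear(state_size, fc1_units)
        self.fc2 = nn.Linear(fc1_units + action_size, fc2_units)  # action joins at the 2nd layer
        self.fc3 = nn.Linear(fc2_units, 1)

    def forward(self, state, action):
        xs = F.relu(self.fc1(state))
        x = torch.cat((xs, action), dim=1)
        x = F.relu(self.fc2(x))
        return self.fc3(x)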

Note that:

  • The Actor uses the Hyperbolic Tangent Activation Function (tanh) to limit the action space to [-1, 1].
  • Using the same seed results in identical initial weights for the local and target models.
  • The Critic factors in the action at the second fully-connected layer.

Here is the implementation of the Replay Buffer:
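
A minimal sketch of a standard uniform-sampling buffer using the fields listed above, assuming minibatches are returned as PyTorch tensors:

import random
from collections import deque, namedtuple
import numpy as np
import torch

class ReplayBuffer:
    """Fixed-size buffer R that stores experience tuples and samples them uniformly at random."""

    def __init__(self, buffer_size, batch_size, seed):
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size
        self.experience = namedtuple(
            "Experience", ["state", "action", "reward", "next_state", "done"])
        random.seed(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(self.experience(state, action, reward, next_state, done))

    def sample(self):
        experiences = random.sample(self.memory, k=self.batch_size)
        states = torch.from_numpy(np.vstack([e.state for e in experiences])).float()
        actions = torch.from_numpy(np.vstack([e.action for e in experiences])).float()
        rewards = torch.from_numpy(np.vstack([e.reward for e in experiences])).float()
        next_states = torch.from_numpy(np.vstack([e.next_state for e in experiences])).float()
        dones = torch.from_numpy(
            np.vstack([e.done for e in experiences]).astype(np.uint8)).float()
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.memory)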

Putting it together:

Training Loop

For episode e = 1 to M:

  • Initialize a random process N for action exploration
  • Receive initial observation state, s1

for i_episode in range(1, n_episodes + 1):
    env_info = env.reset(train_mode=True)[brain_name]  # reset the environment
    states = env_info.vector_observations               # get the initial state of each agent
    score = np.zeros(n_agents)                          # accumulate each agent's score

Here we sample our initial environment state and set up an array to accumulate each agent's score for the episode.

For step t = 1 to T:

for t in range(max_t):
    actions = agent.act(states)                                 # select actions (with noise)
    env_info = env.step(actions)[brain_name]                    # send actions to the environment
    next_states = env_info.vector_observations                  # observe the next states
    rewards = env_info.rewards                                  # observe the rewards
    dones = env_info.local_done                                 # check which agents are done
    agent.step(states, actions, rewards, next_states, dones)    # store experience and learn
    score += env_info.rewards                                   # accumulate the rewards
    states = next_states                                        # roll over to the next states
    if np.any(dones):                                           # stop if any agent is done
        break
  • Select action a(t) = µ(s(t)|θ-µ) + N(t) according to the current policy and exploration noise

Here we use the local Actor model to choose an action, with a bit of noise added for exploration. For that noise, DDPG implementations often use the Ornstein–Uhlenbeck process, which generates temporally correlated noise for exploration efficiency in physical control problems.
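
A minimal sketch of such a noise process and of an act method on the agent, assuming the Actor network above and Gaussian increments; the parameter values theta=0.15 and sigma=0.2 are common defaults, not necessarily the exact ones used in my repository.

import copy
import numpy as np
import torch

class OUNoise:
    """Ornstein-Uhlenbeck process: noise that is correlated from one step to the next."""

    def __init__(self, size, seed, mu=0.0, theta=0.15, sigma=0.2):
        self.mu = mu * np.ones(size)
        self.theta = theta
        self.sigma = sigma
        np.random.seed(seed)
        self.reset()

    def reset(self):
        self.state = copy.copy(self.mu)       # start each episode back at the mean

    def sample(self):
        x = self.state
        dx = self.theta * (self.mu - x) + self.sigma * np.random.standard_normal(len(x))
        self.state = x + dx                   # drift toward the mean plus a random kick
        return self.state


# Method of the Agent class sketched earlier.
def act(self, states, add_noise=True):
    """Choose actions with the local Actor, add exploration noise, and clip to [-1, 1]."""
    states = torch.from_numpy(states).float()
    self.actor_local.eval()
    with torch.no_grad():
        actions = self.actor_local(states).cpu().numpy()
    self.actor_local.train()
    if add_noise:
        actions += self.noise.sample()
    return np.clip(actions, -1, 1)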

  • Execute action a(t), then observe reward r(t) and new state s(t+1), where t is the current time step
env_info = env.step(actions)[brain_name]      # execute the actions
next_states = env_info.vector_observations    # observe s(t+1)
rewards = env_info.rewards                    # observe r(t)
dones = env_info.local_done                   # check whether the episode has ended
  • Store the transition [s(t), a(t), r(t), s(t+1)] in the Replay Buffer, R
agent.step(states, actions, rewards, next_states, dones)
  • The remaining parts of the algorithm are the mathematical learning steps that update the policy (the weights of the local and target networks).
Learning Steps of DDPG Algorithm

The agent takes a step and immediately samples randomly from the replay buffer to learn a bit more about the environment it is currently in. We can split the update into two parts: the Actor update and the Critic update. (Note: the Actor loss is negative because we are trying to maximize the return values predicted by the Critic.)
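
A sketch of that learning step as a method on the agent, assuming the networks and replay buffer sketched above and a discount factor gamma of 0.99; the soft_update helper it calls is shown in the next snippet.

import torch.nn.functional as F

# Method of the Agent class sketched earlier.
def learn(self, experiences, gamma=0.99):
    """One DDPG update from a minibatch sampled out of the replay buffer."""
    states, actions, rewards, next_states, dones = experiences

    # --- Critic update: minimize the TD error against the target networks ---
    actions_next = self.actor_target(next_states)
    q_targets_next = self.critic_target(next_states, actions_next)
    q_targets = rewards + gamma * q_targets_next * (1 - dones)   # no bootstrap past terminal states
    q_expected = self.critic_local(states, actions)
    critic_loss = F.mse_loss(q_expected, q_targets)
    self.critic_optimizer.zero_grad()
    critic_loss.backward()
    self.critic_optimizer.step()

    # --- Actor update: gradient ascent on the Critic's value of the Actor's actions ---
    actions_pred = self.actor_local(states)
    actor_loss = -self.critic_local(states, actions_pred).mean()  # negative sign = maximize
    self.actor_optimizer.zero_grad()
    actor_loss.backward()
    self.actor_optimizer.step()

    # --- Soft-update the target networks toward the local networks ---
    self.soft_update(self.critic_local, self.critic_target, tau=1e-3)
    self.soft_update(self.actor_local, self.actor_target, tau=1e-3)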

DDPG only trains the local networks (the networks that interact with the environment) and performs soft updates on the target networks. As you can see below, a soft update blends in only a small fraction of the local network weights, controlled by Tau.
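A sketch of that soft update, with Tau matching the training parameter listed further down:

# Method of the Agent class sketched earlier.
def soft_update(self, local_model, target_model, tau=1e-3):
    """θ_target <- tau * θ_local + (1 - tau) * θ_target, applied parameter tensor by tensor."""
    for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):
        target_param.data.copy_(tau * local_param.data + (1.0 - tau) * target_param.data)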

Yeah, so that's it! The model is able to learn over time and develops the ability to follow a goal location under continuous control. The full repository can be viewed here.

Plot of Results

Training Parameters:

  • Max Episodes: 1500
  • Max Time Steps: 3000
  • Buffer Size: 10000
  • Batch Size: 128
  • Gamma: 0.99
  • Tau: 1e-3
  • Actor Learning Rate = 1e-3
  • Critic Learning Rate = 1e-3
  • Weight Decay: 0.0
Training Progression

Ideas for Future Work

Another way to take this further is to look into other Actor-Critic methods like PPO, A3C, and D4PG, which use multiple (non-interacting, parallel) copies of the same agent to distribute the task of gathering experience; distributing experience collection this way can further improve learning. In addition, I would like to explore other control methods for active exploration.

This concludes the exploration of continuous control using reinforcement learning algorithms. Continuous control is just one of several applications that Actor-Critic methods excel at tackling. Please check out my previous post, where I tackle Navigation using Deep RL.
