Solving Navigation using Deep Reinforcement Learning (Value-Based Methods)

Adrian Chow
5 min read · Dec 25, 2020


Deep reinforcement learning is a subfield of machine learning that combines reinforcement learning and deep learning. Rather than using traditional RL methodologies, such as a Q-Table, artificial neural networks allow us to expand an agent’s ability to solve a complex environment. While this post focuses on tackling the navigation issue, there is a lot to learn about how a DQN (Deep Q-Network) can use non-linear approximators to estimate the optimal action-value function.

Artificial Neural Network (Source: VIASAT)

The Environment

To solve the task of navigation, I used an environment similar to the Food Collector environment on the Unity ML-Agents GitHub page. This version, however, is provided by Udacity as part of their Deep Reinforcement Learning Nanodegree.

In this environment, an agent navigates a large, square world collecting bananas. Each episode of this task is limited to 300 steps. A reward of +1 is provided for collecting a yellow banana, and a reward of -1 is provided for collecting a blue banana. Thus, the goal of the agent is to collect as many yellow bananas as possible while avoiding the blue ones.

Unity Environment

RL Problem Specifications:

  • The Goal of the Agent: Collect yellow bananas, avoid blue bananas
  • Rewards: +1 for collecting a yellow banana, -1 for collecting a blue banana
  • Action Space: Discrete, 4 actions [Forward, Back, Left, Right]
  • State Space: Continuous, 37 dimensions
  • Solving Condition: Average score of +13 over 100 consecutive episodes.

The state space has 37 dimensions and contains the agent’s velocity, along with a ray-based perception of objects around the agent’s forward direction. Given this information, the agent has to learn how best to select actions.
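For context, this is roughly how the Udacity-provided environment is loaded and inspected with the unityagents package; the "Banana.app" file name is platform-dependent and assumed here:

```python
from unityagents import UnityEnvironment

# Load the Udacity Banana environment (file name depends on your platform/build)
env = UnityEnvironment(file_name="Banana.app")
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

env_info = env.reset(train_mode=True)[brain_name]
print("Number of actions:", brain.vector_action_space_size)   # 4
print("State size:", len(env_info.vector_observations[0]))    # 37
```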

Learning Algorithm

Illustration of DQN Architecture (Source)

The learning algorithm was developed by researchers working towards human-level control on the challenging domain of classic Atari 2600 games! To learn more, you can read this research paper, which outlines the algorithm in full detail.

Initialization:

  • Initialize the Replay Buffer data structure: a memory bank/buffer with a capacity of N = 10000 memories should suffice. This data structure is essentially a deque that stores experiences as named tuples with the following fields: [“state”, “action”, “reward”, “next_state”, “done”]
  • Initialize Local Model: A feed-forward NN that will estimate the action-value function.
  • Initialize Target Model: A feed-forward NN that will estimate the target action-value function.

The implementation of the simple feed-forward model can be seen below. In addition, you can see the full implementation of the replay buffer here.
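Below is a minimal sketch of both pieces in PyTorch. The two hidden layers of 64 units and the simplified uniform sampling (without the tensor conversion) are assumptions; the versions in the repository may differ slightly.

```python
import random
from collections import deque, namedtuple

import torch.nn as nn
import torch.nn.functional as F

class QNetwork(nn.Module):
    """Simple feed-forward network mapping a 37-dim state to 4 action values."""

    def __init__(self, state_size=37, action_size=4, fc1_units=64, fc2_units=64):
        super().__init__()
        self.fc1 = nn.Linear(state_size, fc1_units)
        self.fc2 = nn.Linear(fc1_units, fc2_units)
        self.fc3 = nn.Linear(fc2_units, action_size)  # one Q-value per action

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        return self.fc3(x)

Experience = namedtuple("Experience",
                        ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    """Fixed-size deque that stores experience tuples and samples them uniformly."""

    def __init__(self, buffer_size=10000, batch_size=64):
        self.memory = deque(maxlen=buffer_size)  # oldest memories fall off automatically
        self.batch_size = batch_size

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        return random.sample(self.memory, k=self.batch_size)

    def __len__(self):
        return len(self.memory)
```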

For episode e = 1 to M:

  • Initialize the input frame x, a.k.a. the initial state. Most OpenAI Gym environments provide a simple API to grab the state after a reset.
for i_episode in range(1, n_episodes + 1):
    env_info = env.reset(train_mode=True)[brain_name]
    state = env_info.vector_observations[0]  # get the current state

For step t = 1 to T:

  • We break this down into two parts: SAMPLE and LEARN (a condensed code sketch of both appears after the learn step below)
  • Sample Step:
  1. Choose action A from state S using policy π ⟵ ε-Greedy( q(S, A, w) ).

2. Take action A, observe reward R and the next input frame xₜ₊₁

3. Prepare the next state S′ and store the experience tuple (S, A, R, S′, done) in the Replay Buffer

4. Set S ⟵ S’

  • Learn Step:
  1. Obtain random minibatch of tuples (“state”, “action”, “reward”, “next_state”, “done”) from Replay Buffer
  2. Set the TD target R + γ · maxₐ q(S′, a, w⁻), where w⁻ are the target model’s weights, and perform the update step on the local model
Update step on weights

3. Perform a soft update of the target model’s weights by mixing in a small portion (τ) of the local model’s parameters
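Putting the SAMPLE and LEARN steps together, a condensed agent might look like the sketch below. The class layout, the hyperparameter names, and the assumption that sampled experiences arrive as batched tensors are mine; the full version in the repository handles the buffer and tensor conversion explicitly.

```python
import random

import torch
import torch.nn.functional as F

GAMMA = 0.99  # discount factor
TAU = 1e-3    # soft-update mixing factor

class Agent:
    def __init__(self, qnetwork_local, qnetwork_target, optimizer, action_size=4):
        self.qnetwork_local = qnetwork_local      # estimates q(S, A, w)
        self.qnetwork_target = qnetwork_target    # estimates the TD target with w-
        self.optimizer = optimizer
        self.action_size = action_size

    def act(self, state, eps=0.0):
        """SAMPLE: epsilon-greedy action selection using the local model."""
        state = torch.from_numpy(state).float().unsqueeze(0)
        self.qnetwork_local.eval()
        with torch.no_grad():
            action_values = self.qnetwork_local(state)
        self.qnetwork_local.train()
        if random.random() > eps:
            return int(action_values.argmax(dim=1).item())  # greedy action
        return random.randrange(self.action_size)           # exploratory action

    def learn(self, experiences):
        """LEARN: gradient step on the local model, then soft-update the target model."""
        states, actions, rewards, next_states, dones = experiences  # batched tensors

        # TD target: R + gamma * max_a q(S', a, w-), zeroed when the episode is done
        q_targets_next = self.qnetwork_target(next_states).detach().max(1)[0].unsqueeze(1)
        q_targets = rewards + GAMMA * q_targets_next * (1 - dones)

        # Current estimate q(S, A, w) for the actions actually taken
        q_expected = self.qnetwork_local(states).gather(1, actions)

        loss = F.mse_loss(q_expected, q_targets)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        # Soft update: w- <- tau * w + (1 - tau) * w-
        for t_param, l_param in zip(self.qnetwork_target.parameters(),
                                    self.qnetwork_local.parameters()):
            t_param.data.copy_(TAU * l_param.data + (1.0 - TAU) * t_param.data)
```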

Yeah, so that’s it! The model is able to learn over time and develops the ability to collect yellow bananas while avoiding the blue ones. The full repository can be viewed here. The application of this project is targeted at, though not limited to, robotic navigation. Some ideas one could build off of this project are garbage-collector robots and multi-purpose agricultural robots!

Plot of Results

Training Parameters:

  • Max Episodes: 2000
  • Max Time Steps: 1000
  • Buffer Size: 10000
  • Batch Size: 64
  • Gamma: 0.99
  • Tau: 1e-3
  • Learning Rate: 5e-4
  • Epsilon (Start, End, Decay Rate): [0.10, 0.01, 0.987]
Training Progression
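For reference, here is a skeleton of how these parameters might drive the training loop. The dqn function name, the agent.step helper (store the experience, then learn every few steps), and the scoring bookkeeping are my own naming, based on the parameters listed above:

```python
from collections import deque

import numpy as np

def dqn(env, agent, brain_name,
        n_episodes=2000, max_t=1000,
        eps_start=0.10, eps_end=0.01, eps_decay=0.987):
    scores, scores_window = [], deque(maxlen=100)   # last 100 scores for the solve check
    eps = eps_start
    for i_episode in range(1, n_episodes + 1):
        env_info = env.reset(train_mode=True)[brain_name]
        state = env_info.vector_observations[0]
        score = 0
        for t in range(max_t):
            action = agent.act(state, eps)
            env_info = env.step(action)[brain_name]
            next_state = env_info.vector_observations[0]
            reward = env_info.rewards[0]
            done = env_info.local_done[0]
            agent.step(state, action, reward, next_state, done)  # store experience + learn
            state = next_state
            score += reward
            if done:
                break
        scores.append(score)
        scores_window.append(score)
        eps = max(eps_end, eps_decay * eps)  # decay exploration after each episode
        if np.mean(scores_window) >= 13.0:   # solving condition: avg of +13 over 100 episodes
            print(f"Environment solved in {i_episode} episodes!")
            break
    return scores
```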

Going Further…

Several improvements to the original Deep Q-Learning algorithm have been suggested.

  1. Double DQN

Deep Q-Learning tends to overestimate action values. Double Q-Learning has been shown to work well in practice to help with this. The main idea is to get a second opinion on the target action-value (TD target): the local model selects the best action for the next state, and the target model evaluates the value of that action. See the changes below:

Double DQN Changes
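A minimal sketch of the changed target computation, written as a standalone function here; the actual change in the repository lives inside the agent’s learn step:

```python
import torch

def double_dqn_targets(qnetwork_local, qnetwork_target,
                       rewards, next_states, dones, gamma=0.99):
    """Double DQN: the local net selects the next action, the target net evaluates it."""
    with torch.no_grad():
        best_actions = qnetwork_local(next_states).argmax(dim=1, keepdim=True)  # selection
        q_next = qnetwork_target(next_states).gather(1, best_actions)           # evaluation
    return rewards + gamma * q_next * (1 - dones)
```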

2. Prioritized Experience Replay

Deep Q-Learning samples experience transitions uniformly from a replay memory. Prioritized experience replay is based on the idea that the agent can learn more effectively from some transitions than from others, and the more important transitions should be sampled with higher probability. By assigning priority we can steer the agent towards understanding the more important experiences.

Prioritized Experience Replay (Udacity)
PER Replay Buffer Implementation

As you can see, the implementation is more involved than the previous replay buffer. It is important to note that prioritized experience replay adds additional time and space complexity to the algorithm.
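At its core, though, prioritized sampling comes down to a few lines: each stored transition i keeps a priority pᵢ (for example its last absolute TD error plus a small constant), transitions are sampled with probability P(i) = pᵢ^α / Σₖ pₖ^α, and importance-sampling weights (N · P(i))^(−β) correct the bias this introduces. The sketch below illustrates just that sampling step; the alpha and beta values are common defaults, not necessarily the ones used here.

```python
import numpy as np

def prioritized_sample(priorities, batch_size, alpha=0.6, beta=0.4):
    """Sample transition indices in proportion to priority^alpha and return
    the importance-sampling weights that correct for the non-uniform sampling."""
    priorities = np.asarray(priorities, dtype=np.float64)
    probs = priorities ** alpha
    probs /= probs.sum()                                     # P(i) = p_i^a / sum_k p_k^a
    indices = np.random.choice(len(priorities), batch_size, p=probs)
    weights = (len(priorities) * probs[indices]) ** (-beta)  # w_i = (N * P(i))^-beta
    weights /= weights.max()                                 # normalize to at most 1
    return indices, weights
```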

3. Dueling DQN

Dueling DQN

Currently, in order to determine which states are (or are not) valuable, we have to estimate the corresponding action value for each action. However, by replacing the traditional Deep Q-Network (DQN) architecture with a dueling architecture, we can assess the value of each state without having to learn the effect of each action. By using two streams (one estimating the state value, the other the per-action advantages), we can obtain a better estimate of the action-value.

Here I have implemented my own version of the dueling network with the help of Chris Yoon’s post; feel free to take a look.
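For illustration, a minimal dueling head in PyTorch: a value stream estimates V(s), an advantage stream estimates A(s, a), and the two are recombined as Q(s, a) = V(s) + A(s, a) − meanₐ A(s, a). The layer sizes and structure below are illustrative assumptions rather than a copy of that implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class DuelingQNetwork(nn.Module):
    """Feed-forward network with separate value and advantage streams."""

    def __init__(self, state_size=37, action_size=4, hidden=64):
        super().__init__()
        self.feature = nn.Linear(state_size, hidden)      # shared feature layer
        self.value = nn.Linear(hidden, 1)                 # V(s)
        self.advantage = nn.Linear(hidden, action_size)   # A(s, a)

    def forward(self, state):
        x = F.relu(self.feature(state))
        value = self.value(x)
        advantage = self.advantage(x)
        # Recombine: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)
        return value + advantage - advantage.mean(dim=1, keepdim=True)
```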

This concludes the exploration of Deep Q-Learning algorithms for navigation problems. Navigation is just one of several applications that DQNs excel at tackling. To see the full implementation, see this. Please stay tuned for more in the future…
