Multi-Agent Control using Deep Reinforcement Learning

Adrian Chow
4 min read · Jan 21, 2021

One difficult task in the Deep Reinforcement Learning space is having multiple agents interact and learn. The reality is that the world is a multi-agent environment, and intelligence is developed by interacting with other agents. A major breakthrough in this regard came when researchers at DeepMind developed AlphaGo, which learned to play Go by training in a multi-agent setting and went on to beat Lee Sedol, a professional Go player; its successor, AlphaGo Zero, learned entirely through self-play. Extensions of these ideas led to OpenAI Five, which was able to conquer a game as complex as Dota 2. Clearly, DRL in multi-agent environments has expanded the capabilities of reinforcement learning.

Source: OpenAI

I have taken on the challenge of implementing a multi-agent system, using my pre-existing knowledge of RL, to solve the Tennis environment from the Unity ML-Agents toolkit. The goal of this project is to teach two agents to play a simplified version of tennis. By applying deep reinforcement learning to this environment, two separate agents compete in order to refine their individual policies. Competition essentially becomes collaboration, as the agents push each other toward the optimal policy. The project is an implementation based on OpenAI's Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments research paper.

The Environment

Source: Unity-Technologies

RL Problem Specifications

  • Goal of Agent: keep the ball in play
  • Rewards: +0.1 each time an agent hits the ball over the net; -0.1 each time an agent lets the ball hit the ground or hits it out of bounds
  • Action Space: continuous, a 2-dimensional vector per agent with values in [-1, +1]
  • State Space: continuous, 8 variables per observation with 3 observations stacked (24 values per agent)
  • Solving Condition: Average score of +0.5 over 100 consecutive episodes.

The observation space consists of 8 variables corresponding to the position and velocity of the ball and racket. Each agent receives its own local observation. Two continuous actions are available, corresponding to movement toward (or away from) the net, and jumping.
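To make these shapes concrete, here is a minimal interaction sketch, assuming the Udacity-provided build of the Tennis environment and its unityagents Python wrapper; the file name and the random policy are placeholders.

```python
import numpy as np
from unityagents import UnityEnvironment

# Path to the pre-built Tennis environment (placeholder; point this at your build)
env = UnityEnvironment(file_name="Tennis.app")
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

env_info = env.reset(train_mode=True)[brain_name]
num_agents = len(env_info.agents)             # 2 agents
action_size = brain.vector_action_space_size  # continuous actions per agent
states = env_info.vector_observations         # shape (2, 24): 8 variables x 3 stacked observations

scores = np.zeros(num_agents)
while True:
    # Random policy, just to show the action shape: one row per agent, clipped to [-1, 1]
    actions = np.clip(np.random.randn(num_agents, action_size), -1, 1)
    env_info = env.step(actions)[brain_name]
    states = env_info.vector_observations
    scores += env_info.rewards
    if np.any(env_info.local_done):
        break

env.close()
```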

Learning Algorithm

The algorithm is a multi-agent variation of the standard DDPG algorithm that I have implemented in the past. If you are unfamiliar with the Deep Deterministic Policy Gradient algorithm, you can check out my other Medium post; this article is essentially an extension of that DDPG implementation. Here is the basis of the algorithm, and the code base can be found here:

DDPG Algorithm (Source: Here)

Some constraints/assumptions to make when implementing our multi-agent policy include:

  • The learned policies can only use local information (i.e. their own observations) at execution time
  • Do not assume a differentiable model of the environment dynamics
  • Do not assume any particular structure on the communication method between agents
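One way to read these constraints is the centralized-training, decentralized-execution split from the MADDPG paper: each actor sees only its own observation, while each critic is trained on the observations and actions of all agents. Below is a minimal PyTorch sketch of that split; the class names and layer sizes are illustrative, not taken from my actual networks.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Decentralized actor: maps one agent's local observation to its action."""
    def __init__(self, obs_size, action_size, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_size), nn.Tanh(),  # actions lie in [-1, 1]
        )

    def forward(self, obs):
        return self.net(obs)

class CentralizedCritic(nn.Module):
    """Centralized critic: scores the joint observations and joint actions of all agents."""
    def __init__(self, obs_size, action_size, num_agents, hidden=128):
        super().__init__()
        joint = num_agents * (obs_size + action_size)
        self.net = nn.Sequential(
            nn.Linear(joint, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, all_obs, all_actions):
        # all_obs: (batch, num_agents * obs_size); all_actions: (batch, num_agents * action_size)
        return self.net(torch.cat([all_obs, all_actions], dim=-1))
```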

Multi-Agent DDPG Initialization:

  • Notice that the number of agents is passed in as a parameter to the MADDPG class. In the case of the Tennis environment, there are two agents.
  • Individual agents, each with its own actor and critic, are created based on the number of agents (see line 13).
  • The replay buffer is the same as before, except that an experience now contains the states and actions of both agents at a given timestep.
  • Note that the DDPG agent follows the same class implemented in the previous project. See below:
DDPG, Source
  • Each DDPG agent still uses Ornstein-Uhlenbeck noise for action-space exploration.
  • The replay buffer is shared between the agents, which is why it is stored in the MADDPG class.
  • Since each experience holds multiple states and actions, they are concatenated into single tensors (see lines 100 and 115). A condensed sketch of this shared buffer is shown below.
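Since the full class lives in the embedded gist, here is a self-contained sketch of the shared buffer described in these bullets; the names and constructor arguments are stand-ins rather than my exact code.

```python
import random
from collections import deque, namedtuple

import numpy as np
import torch

Experience = namedtuple("Experience",
                        ["states", "actions", "rewards", "next_states", "dones"])

class SharedReplayBuffer:
    """One buffer for both agents: each entry stores the joint states and actions of a timestep."""
    def __init__(self, buffer_size=int(1e6), batch_size=256):
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size

    def add(self, states, actions, rewards, next_states, dones):
        # states/actions/next_states: numpy arrays of shape (num_agents, ...)
        self.memory.append(Experience(states, actions, rewards, next_states, dones))

    def sample(self):
        batch = random.sample(self.memory, k=self.batch_size)
        # Flatten the per-agent dimension so each sampled experience becomes a single tensor
        states = torch.tensor(np.stack([e.states.flatten() for e in batch]), dtype=torch.float32)
        actions = torch.tensor(np.stack([e.actions.flatten() for e in batch]), dtype=torch.float32)
        rewards = torch.tensor(np.stack([e.rewards for e in batch]), dtype=torch.float32)
        next_states = torch.tensor(np.stack([e.next_states.flatten() for e in batch]), dtype=torch.float32)
        dones = torch.tensor(np.stack([e.dones for e in batch]), dtype=torch.float32)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.memory)
```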

Multi-Agent DDPG Methods:

MADDPG, Source
  • At each MADDPG time step, each DDPG agent is updated as usual (by randomly sampling from the shared replay buffer), which allows the agents to learn independently of each other. A sketch of this orchestration is shown below.
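A rough sketch of that orchestration, assuming DDPGAgent instances that expose act, learn, and reset (as in my earlier DDPG implementation) and the shared buffer sketched above:

```python
import numpy as np

class MADDPG:
    """Thin wrapper that coordinates independent DDPG agents around one shared replay buffer."""
    def __init__(self, agents, memory, update_every=1, min_buffer=256):
        self.agents = agents          # one DDPGAgent per player (own actor, critic, targets, OU noise)
        self.memory = memory          # shared replay buffer
        self.update_every = update_every
        self.min_buffer = min_buffer
        self.t_step = 0

    def act(self, states, add_noise=True):
        # Each agent acts only on its own local observation (row i of the joint state)
        return np.vstack([agent.act(state, add_noise)
                          for agent, state in zip(self.agents, states)])

    def step(self, states, actions, rewards, next_states, dones):
        # Store the joint experience once, in the shared buffer
        self.memory.add(states, actions, rewards, next_states, dones)
        self.t_step += 1
        # Each DDPG agent learns from its own random sample of the shared buffer,
        # so the agents update independently of each other
        if len(self.memory) >= self.min_buffer and self.t_step % self.update_every == 0:
            for agent in self.agents:
                agent.learn(self.memory.sample())

    def reset(self):
        for agent in self.agents:
            agent.reset()  # e.g. reset each agent's OU noise process
```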

Multi-Agent DDPG Training:

  • Before we get into training, here are some of the hyperparameters that worked for me on this project. Notice that I increased the batch_size, due to the increased learning capacity required.
Hyperparameters
  • The critic and actor networks have mostly stayed the same. One minor change is that the critic now accounts for the states of both agents when calculating the return. This allows the critic to cover a broader state space and converge better.
  • The training loop is the same as usual; a sketch of it is included after the plot below.
  • The training results were roughly in line with the baseline benchmark presented for this environment. I was able to solve the environment in 1454 episodes.
Training Score-Episodes Plot. Source
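For completeness, here is a sketch of that training loop, assuming the unityagents environment handle from earlier and an MADDPG wrapper like the one sketched above; the episode cap and logging are incidental choices.

```python
from collections import deque

import numpy as np

def train(env, brain_name, maddpg, n_episodes=2000, target=0.5):
    """Run episodes until the 100-episode average of the per-episode score reaches `target`."""
    scores_window = deque(maxlen=100)
    all_scores = []
    for i_episode in range(1, n_episodes + 1):
        env_info = env.reset(train_mode=True)[brain_name]
        states = env_info.vector_observations
        maddpg.reset()
        scores = np.zeros(len(env_info.agents))
        while True:
            actions = maddpg.act(states)
            env_info = env.step(actions)[brain_name]
            next_states = env_info.vector_observations
            rewards = env_info.rewards
            dones = env_info.local_done
            maddpg.step(states, actions, rewards, next_states, dones)
            states = next_states
            scores += rewards
            if np.any(dones):
                break
        # The episode score is the maximum of the two agents' undiscounted returns
        episode_score = np.max(scores)
        scores_window.append(episode_score)
        all_scores.append(episode_score)
        if len(scores_window) == 100 and np.mean(scores_window) >= target:
            print(f"Solved in {i_episode} episodes, average score {np.mean(scores_window):.2f}")
            break
    return all_scores
```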

Possible Improvements

  • Modify the critic and actor to account for the agent's belief about the opponent's next move. If an agent can model its opponent's policy, it may be able to perform better.
  • Update the agents selectively or stochastically.
  • Decrease action exploration as the training episodes increase.
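For the last point, one simple option (not something I used in this project) is an exponentially decaying multiplier applied to the OU noise sample; the numbers below are purely illustrative.

```python
def noise_scale(episode, start=1.0, decay=0.999, floor=0.05):
    """Exponentially decaying multiplier for the OU noise added to actions."""
    return max(floor, start * decay ** episode)

# Usage sketch: scale the exploration noise each episode, e.g.
# actions = actor_output + noise_scale(i_episode) * ou_noise.sample()
```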
