Predator, Prey, Grass project

The Predator, Prey, Grass project aims to investigate the behavior of multiple agents in a simple closed grid world. Although only partially applicable to modern humans, this could shed light on ancestral human behavior. In the hunter-gatherer period of Homo sapiens, which lasted until roughly 10,000 years ago, humans could be predators as well as prey. Moreover, the limited availability of grass can be regarded as a proxy for the limited proximal resources available to humans further down the food chain.

To implement this, a multi-agent reinforcement learning (MARL) environment is trained using Proximal Policy Optimization (PPO). The learning agents, Predators (red) and Prey (blue), both expend energy moving around and replenish it by eating. Prey eat Grass (green), and Predators eat Prey if they end up on the same grid cell. For simplicity, in the base case the agents obtain all the energy of the eaten Prey or Grass. In reality the transfer is much smaller, because ecological efficiency is only around 10% in most cases, and certainly not 100%.
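
A minimal sketch of this energy transfer is shown below; the constant and function names are illustrative, not the project's actual API.

```python
# Minimal sketch of the energy transfer on eating; names are illustrative.
# The base case uses an efficiency of 1.0; a more realistic setting would
# be around 0.1 (roughly 10% ecological efficiency).
ENERGY_TRANSFER_EFFICIENCY = 1.0

def absorb_energy(eater_energy: float, eaten_energy: float) -> float:
    """Return the eater's new energy after consuming a Prey or Grass agent."""
    return eater_energy + ENERGY_TRANSFER_EFFICIENCY * eaten_energy
```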

Predators die of starvation when their energy reaches zero; Prey die either of starvation or when eaten by a Predator. Agents reproduce asexually when their energy level, built up by eating, rises above a certain threshold. The learning agents learn to execute movement actions, based on their partial observations of the environment (the transparent red and blue squares, respectively), to maximize cumulative reward.
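
A hypothetical per-step bookkeeping routine for a single agent could look like the sketch below; the cost and threshold values are placeholders, not the project's configuration.

```python
# Hypothetical per-step bookkeeping for a single agent; the cost and
# threshold values are placeholders, not the project's configuration.
MOVE_ENERGY_COST = 0.1
REPRODUCTION_THRESHOLD = 10.0

def lifecycle_step(energy: float, was_eaten: bool) -> tuple[float, str]:
    """Return the agent's updated energy and its fate after one step."""
    if was_eaten:                        # only applies to Prey
        return 0.0, "dead (eaten)"
    energy -= MOVE_ENERGY_COST           # moving (or staying) costs energy
    if energy <= 0.0:
        return 0.0, "dead (starvation)"
    if energy > REPRODUCTION_THRESHOLD:
        return energy, "reproduces (asexually)"
    return energy, "alive"
```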

The reward structure

The reward structure is initially kept as 'sparse' as possible. Both Predators and Prey receive 10 reward units only when they reproduce. That's it. In the base case there is no reward or penalty for eating, dying or moving. Despite this simple reward structure, both Predators and Prey exhibit elaborate emergent behaviors, because both need to gather sufficient energy to reproduce: Predators do so by learning to chase and eat Prey, and Prey learn to eat Grass while at the same time learning to avoid Predators.

The reward structure of the Predator-Prey-Grass environment has several optional components, which can be adjusted in the configuration file:

  • step_reward_predator
  • step_reward_prey
  • catch_reward_grass
  • catch_reward_prey
  • death_reward_prey
  • death_reward_predator
  • reproduction_reward_prey
  • reproduction_reward_predator

The emergent behavior described below can be observed with varying combinations of the reward parameters above.
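
For illustration, a base-case configuration could look like the sketch below; the keys follow the list above, while the dictionary itself and the concrete values (all zero except the sparse reproduction rewards) are assumptions rather than the project's actual defaults.

```python
# Illustrative base-case reward configuration; keys follow the list above,
# the dictionary layout and the values are assumptions, not actual defaults.
config = {
    "step_reward_predator": 0.0,
    "step_reward_prey": 0.0,
    "catch_reward_grass": 0.0,
    "catch_reward_prey": 0.0,
    "death_reward_prey": 0.0,
    "death_reward_predator": 0.0,
    "reproduction_reward_prey": 10.0,      # the only non-zero rewards
    "reproduction_reward_predator": 10.0,  # in the sparse base case
}
```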

Emergent behavior

On the micro-level (the individual agents' behaviors):

Predators:

  • Predators pursue Prey when they are in their observation range. When no Prey is in range, a common strategy for Predators is to hover around Grass agents and wait for incoming Prey. This is because Predators learn that Prey ultimately have to approach Grass in order to gain energy, which enables them to reproduce and collect rewards. The simplified configuration in the video below demonstrates this strategy: Predator number 0 hovers around Grass numbers 26, 31 and 41 before going after Prey number 12, once it enters the observation range of Predator 0.

  • Predators pursue Prey even when there is no reward for eating Prey in the reward function and/or no negative step/move reward. In such cases, they apparently learn to consume Prey as a means to reproduce in the future, a process that requires sufficient energy. Predators thus learn to consume Prey without the promise of an immediate reward, attaining only the (sparse) reward for reproduction.
  • Predators sometimes engage in cooperation with another Predator to catch a single Prey. However, the configuration only gives energy to the Predator that ultimately catches the Prey. Therefore, this cooperative behavior is likely attributable to the higher overall per-agent probability of catching a Prey when hunting in tandem.

Prey:

  • Prey try to escape attacking Predators. Moreover, Prey apparently discover the observation range of Predators and sometimes try to stay outside it. This is obviously only possible if Prey have a larger observation range than Predators (as in the base case).
  • Since Prey and Predators strictly move within a Von Neumann neighborhood (left/right/up/down/stay) rather than a Moore neighborhood, it might be tempting to assume that a Prey can consistently outmaneuver a given Predator and never be caught. However, this holds only under the conditions that:
    1) the agents move in a turn-based fashion, and
    2) no other Predators are involved in catching that specific Prey, i.e. there is no cooperation among Predators.

This theoretical escape possibility exists because a Prey, even when at risk of being caught on the immediately following turn, can always make a single step to a position that the threatening Predator needs at least two moves to reach (a feature of the Von Neumann neighborhood that can be exploited).
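
The sketch below illustrates this argument; the coordinates, helper names and grid size are hypothetical and not part of the project code.

```python
# Illustrative sketch of the turn-based escape argument under Von Neumann moves;
# helper names, coordinates and grid size are hypothetical, not project code.
VON_NEUMANN_MOVES = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]  # stay plus the four axis steps

def manhattan(a: tuple[int, int], b: tuple[int, int]) -> int:
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def best_escape_move(prey: tuple[int, int], predator: tuple[int, int], grid_size: int) -> tuple[int, int]:
    """Pick the prey move that maximizes the Manhattan distance to the predator."""
    candidates = [
        (prey[0] + dx, prey[1] + dy)
        for dx, dy in VON_NEUMANN_MOVES
        if 0 <= prey[0] + dx < grid_size and 0 <= prey[1] + dy < grid_size
    ]
    return max(candidates, key=lambda cell: manhattan(cell, predator))

# Away from the grid border, a Prey with an adjacent Predator can always
# reach a cell the Predator needs at least two moves to enter:
print(best_escape_move(prey=(5, 5), predator=(5, 6), grid_size=25))  # e.g. (6, 5), at distance 2
```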

However, this particular Prey behavior goes largely unnoticed in the benchmark display in practice, because the policies are trained in a parallel environment where all agents decide simultaneously, rather than in a turn-based fashion (note: the simulation is trained in parallel but evaluated turn-based).

At first glance, Prey sometimes let themselves be caught rather easily by Predators in the base case. This may be because no penalty is given for dying. When a (severe) penalty for being eaten by a Predator is added to the reward structure, Prey tend to be more careful and stay out of the way of Predators more often. This results in a higher average age of Prey.

Evading behavior of Prey towards Predators can be reinforced by penalizing death for Prey

In any case, Prey try to escape from Predators even when the penalty for dying is set to zero, because the “ultimate” reward for Prey is reproduction, which is optimized when Prey evade Predators. However, if an additional penalty for dying is introduced, Prey will try to avoid Predators even more. This can be concluded from the rising average age of Prey as the penalty is increased.
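
Building on the hypothetical configuration sketch above, such a penalty could be introduced as follows; the value -10.0 is illustrative only.

```python
# Hypothetical tweak; the key comes from the configuration list above,
# the concrete value is illustrative only.
config["death_reward_prey"] = -10.0        # penalize Prey for dying
config["reproduction_reward_prey"] = 10.0  # keep the sparse reproduction reward
```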

On the macro-level (the behavior of the total populations):

These emergent behaviors at the micro-level also lead to more complex dynamics at the ecosystem level. Over time, the populations of trained agents display a classic Lotka–Volterra pattern. This learned outcome is not obtained with a random policy.
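
For reference, the classic Lotka–Volterra model describes coupled oscillations of a prey density x and a predator density y. The grid world does not implement these equations, but the aggregate population counts of the trained agents show a qualitatively similar cycle:

```latex
\frac{dx}{dt} = \alpha x - \beta x y, \qquad \frac{dy}{dt} = \delta x y - \gamma y
```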

References