NeroMod  
The NERO game description
Demo
Updated Dec 14, 2011 by ikarpov

The NERO Machine Learning Game

In the NERO game, the user trains intelligent agents to perform well in battle. It is a machine learning game, i.e. the focus is on designing a set of challenges that allow agents to learn the necessary skills step by step. Learning takes place in real time, as the user is observing the game and changing the environment and behavioral objectives on the fly. The challenge for the player is to develop as proficient a team as possible.

The NERO game in OpenNERO is a simpler research and education version of the original NERO game, focusing on demonstrating learning algorithms interactively in order to make it clear how they work. The game environment is described first, followed by the two methods for training the agents (neuroevolution and reinforcement learning), how a team can be put together for battle, and the battle mode itself. Ways of extending the learning methods and handcoding the teams, as well as differences from the original NERO, are described at the end. To get a quick introduction to NERO, watch the video below.

NERO Environment

The player first enters the NERO-Training environment, where s/he develops a team and saves it. The player then enters the NERO-Battle environment, where s/he loads two competing teams that then battle each other.

The agents are simulated "Steve" robots seen also in the Maze and BlocksWorld environments. In NERO they have egocentric sensors:

  • laser range finders that sense the distance to the nearest object (a wall or a tree) in 5 directions around the agent: front, 45 degrees to each side, and 90 degrees to each side. These sensors have a range of 100. They return 1 if they don't intersect with any walls, and a value between 0 and 1 if a wall interrupts the ray at that fraction of its length.
  • radar sensors that return the distance to the flag in 5 overlapping sectors: -3..18, 12..90, -18..3, -12..-90, and -90..90 (behind the agent). These sensors have a range limited to 300 (about half of the field width). Flags can be used to train agents e.g. to move around walls.
  • radar sensors that trigger when enemies are detected within a sector of the space around the robot. The more enemies are detected or the closer they are, the higher the value of the corresponding radar sensor will be.
  • two sensors that depend on the distance and direction to the center of mass of the teammates. These sensors can allow agents to stick together or spread out.
  • a sensor that indicates whether the agent is facing an enemy within sensor range within 2 degrees. This sensor is useful to train agents that shoot well.
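
The normalization used by the laser range finders can be sketched as follows. This is only an illustration of the rule stated above (range 100, reading 1 when nothing is hit); the function name `laser_reading` and the use of `None` for "no intersection" are assumptions made for this sketch, not part of the OpenNERO API.

```python
LASER_RANGE = 100.0  # sensor range stated in the text


def laser_reading(hit_distance, sensor_range=LASER_RANGE):
    """Return the normalized reading of one range finder (hypothetical helper).

    hit_distance is the distance along the ray to the first obstacle,
    or None if the ray hits nothing.
    """
    if hit_distance is None or hit_distance >= sensor_range:
        return 1.0  # nothing within range
    # Fraction of the ray's length at which it was interrupted.
    return max(0.0, hit_distance / sensor_range)
```

A wall 25 units ahead would thus produce a reading of 0.25 on the front sensor, while an open field produces 1.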

Their effectors are

  • Forward/backward speed: -1..1 of maximum. The agents move at a rate of up to 1 unit per frame (MAX_MOVEMENT_SPEED in constants.py).
  • Turning speed: -1..1 of maximum. The agents can turn at a rate of up to 0.2 radians per frame (MAX_TURNING_RATE in constants.py).
Because the robot can shoot only forward, both directions of running are useful (forward for attacking, backward for retreating). If the robot tries to run into an object, it is simply stopped. However, in the current version the robots run through each other (to speed up the simulation).
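
As a rough illustration of how the two effector outputs translate into motion, the sketch below scales them by the two constants named above. The function `apply_action`, the 2D pose representation, and the choice to turn before moving are assumptions made for this example; the actual update in OpenNERO may differ.

```python
import math

MAX_MOVEMENT_SPEED = 1.0  # units per frame, as stated in the text
MAX_TURNING_RATE = 0.2    # radians per frame, as stated in the text


def apply_action(x, y, heading, move, turn):
    """Advance a pose by one frame (hypothetical helper, not OpenNERO code).

    move and turn are the effector outputs, each clamped to -1..1.
    """
    move = max(-1.0, min(1.0, move))
    turn = max(-1.0, min(1.0, turn))
    heading = heading + turn * MAX_TURNING_RATE  # assumption: turn first
    step = move * MAX_MOVEMENT_SPEED             # negative means backward
    return (x + step * math.cos(heading),
            y + step * math.sin(heading),
            heading)
```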

Note that there is no output for taking a shot. Instead, the agents shoot probabilistically. First, they have to be oriented within 2 degrees of the target; outside of that angle they never shoot. Similarly, if they are further than 600 length units away (roughly the width of the standard field), they never shoot. Between 600 and 300, their likelihood of shooting increases linearly as the distance decreases, and within 300, they always shoot. Within the 2 degrees, their accuracy increases linearly so that at 2 degrees they have a 50% chance of hitting, and if they are facing the center of the target exactly, they'll always hit. It is therefore possible to train agents to become better at shooting by getting closer and orienting more accurately towards the enemy.
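
The shooting rule can be written down as two piecewise-linear functions. The sketch below is a direct reading of the numbers in the paragraph above; the function names and the way the two probabilities are combined in `shot_hits` are assumptions for illustration, not the actual OpenNERO implementation.

```python
import random

MAX_FIRING_ANGLE = 2.0     # degrees; outside this angle the agent never shoots
ALWAYS_FIRE_RANGE = 300.0  # within this distance the agent always shoots
MAX_FIRE_RANGE = 600.0     # beyond this distance the agent never shoots


def fire_probability(distance, angle_deg):
    """Chance that an agent takes a shot in a given frame."""
    if abs(angle_deg) > MAX_FIRING_ANGLE or distance >= MAX_FIRE_RANGE:
        return 0.0
    if distance <= ALWAYS_FIRE_RANGE:
        return 1.0
    # Linear ramp from 0 at 600 units up to 1 at 300 units.
    return (MAX_FIRE_RANGE - distance) / (MAX_FIRE_RANGE - ALWAYS_FIRE_RANGE)


def hit_probability(angle_deg):
    """Chance that a shot lands: 1.0 dead center, 0.5 at 2 degrees off."""
    offset = abs(angle_deg)
    if offset > MAX_FIRING_ANGLE:
        return 0.0
    return 1.0 - 0.5 * offset / MAX_FIRING_ANGLE


def shot_hits(distance, angle_deg, rng=random):
    """Sample one frame: True only if the agent fires and the shot lands."""
    fired = rng.random() < fire_probability(distance, angle_deg)
    return fired and rng.random() < hit_probability(angle_deg)
```

For example, an agent 450 units away and 1 degree off target would fire in about half of the frames and hit with 75% of those shots.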

Their weapon is a laser gun that shoots a single instantaneous ray; it is blocked by walls and trees, but it has no effect on teammates. The red team shoots red rays and the blue team blue rays; if either ray hits a wall or a tree, it turns green. Each shot that hits an agent decreases the agent's hitpoints by one; once the hitpoints run out, the agent dies and is removed from the field. In battle, it is gone for good; during training, it is respawned with hitpoints and lifetime reset.

The standard NERO environment consists of an enclosed field with a wall in the middle and a couple of trees around, behind which the robots can take cover. Most of the environment can be modified during training, and it can also be changed to create interesting new battlefields if so desired.

The player can look around the environment as usual using the keyboard and mouse controls:

  • W - Up (also up arrow)
  • A - Left (also left arrow)
  • S - Down (also down arrow)
  • D - Right (also right arrow)
  • Q - Rotate Left
  • E - Rotate Right
  • R - Rotate Up
  • Z or scroll up - Zoom in
  • C or scroll down - Zoom out
  • F1 - Go to this page for help
  • F2 - Cycle through useful stats that can be displayed above each agent, including (in order) the time their brain has spent on the field, remaining hitpoints (percentage indicated by 5 dots), rtNEAT genome ID (indicating how recently the agent was introduced), rtNEAT species number (indicating how recently the species was introduced), and the current population champion. When you do that, the title bar of the NERO window indicates which data are currently shown.

Near the top left of the screen there's a single number that indicates the current frame rate of the display. It should be 24 or higher for a visually appealing animation, but may fall to 12 or lower on a slow machine (as well as currently on MacOSX due to an Irrlicht rendering issue).

NERO Training

In the training mode, the user selects one of the two training methods (neuroevolution or reinforcement learning) and manipulates the environment and the behavioral goals in order to train the agents to do what s/he wants.

Typically the training starts by deploying either an rtNEAT team or a Q-learning team, and then setting some of the goals (or fitness coefficients) in the parameter window (the sliders become active after a Deploy button is pressed). They are:

  • Stand Ground:
    • Positive: Punished for nonzero movement velocity
    • Negative: Rewarded for nonzero movement velocity
  • Stick Together:
    • Positive: Rewarded for small distance to center of mass of teammates.
    • Negative: Rewarded for large distance to center of mass of teammates.
  • Approach Enemy:
    • Positive: Rewarded for small distance to closest enemy agent
    • Negative: Rewarded for large distance to closest enemy agent
  • Approach Flag:
    • Positive: Rewarded for small distance to flag
    • Negative: Rewarded for large distance to flag
  • Hit Target:
    • Positive: Rewarded for hitting enemy agents
    • Negative: Punished for hitting enemy agents
  • Avoid Fire:
    • Positive: Punished for having hit points reduced
    • Negative: Rewarded for having hit points reduced

There are also a number of parameters that affect learning and that should be set appropriately (the default values are usually a good starting point):

  • Explore/Exploit: This slider has no effect on neuroevolution. With Reinforcement Learning, it determines the percentage of actions taken greedily (i.e. those with the best Q-values) vs. actions taken to explore the environment (i.e. randomly selected actions)
  • Lifetime: The number of action steps each agent gets to perform before being removed from the simulation (and restarted, or replaced by offspring).
  • Hitpoints: The amount of damage an agent can take before dying (Note: being hit by an enemy is 1 point of damage). In training, the agent is removed from where it is and respawned at the spawn location with lifetime and hitpoints reset.

The second part of the initialization is to set up the environment. An initial environment is already provided, and it is the same as the battle environment. The user can, however, add objects to it to design the training curriculum, through right clicking with the mouse:

When right clicking on empty space:

  • Add Wall: Generates a standard wall where you clicked
  • Place Flag: Generates a flag at the place where you clicked, or moves the flag there if one already exists. The flag has the appearance of a blue pole. The flags are useful for demonstration purposes, but not necessary in training for battle.
  • Place Turret: Generates an enemy at a location where you clicked. The enemy rotates and fires at anything in its line-of-fire, with the same probabilistic method as the agents themselves. It does not die no matter how many times it is hit.
  • Set Spawn location: Moves the location around which the agents are created to the location where you clicked. The locations and orientations of the agents are randomly chosen within a small circle around that point. The team is blue in training; in battle there are red and blue teams.

When right clicking on an object (i.e. a wall or a turret) that you placed:

  • Rotate Object: Rotates the object around the z-axis until the user left clicks.
  • Scale Object: Scales the object until the user left clicks.
  • Move Object: Lets you move the object until you left click.
  • Remove Object: Removes the object

The trees are sensed as small walls; in the current version they cannot be created or modified though.

Over-head display

By hitting the F2 key, you can cycle through additional information about each agent that may be useful during training. This "over-head" display shows up as a bit of text above each agent on the field. When an over-head display is active, the window title will change to say what is being displayed. Some of the information is specific to neuroevolution, and some is specific to RL.

  • fitness
    • for RL, this is the cumulative reward over the agent's lifetime. Because the meaning of the reward values can change with the adjustment of sliders, the exact meaning and units of this value depend on the current slider setting.
    • for rtNEAT, this is the relative fitness of the organism compared to the rest of the population. This is calculated as the weighted sum of the Z-scores (the number of standard deviations above or below population average) of the agent in each of the fitness slider categories.
  • time alive
    • for RL, this is simply the number of steps on the field that the current individual has been trained for. Generally, the longer an agent is fielded, the more experience it has, and, if using a hash table, the larger its representation in the team file.
    • for rtNEAT, this is the total time (in frames) that the phenotype has been on the field. Note that this can be larger than a single lifetime because the same network can be "re-spawned" several times if it is considered good enough, because rtNEAT is an elitist steady-state algorithm.
  • id
    • for RL, this is the body id of the individual, allowing you to keep track of its behavior over time.
    • for rtNEAT, this is the genome id, which you can use to track the behavior and to extract individuals from saved populations for use in combat teams.
  • species id
    • for RL, this shows the value 'q' for the default q-learner to allow you to distinguish RL agents from rtNEAT ones.
    • for rtNEAT, this shows the unique species number that the individual belongs to. rtNEAT uses speciation and fitness sharing in order to protect diversity within the evolving population.
  • champion:
    • not available for RL
    • for rtNEAT, this shows the label 'champ!' above the highest-ranked individual within the current population, allowing you to quickly check what the best behavior so far is according to the current fitness profile.
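
The rtNEAT fitness described above (a weighted sum of per-category Z-scores) can be sketched as follows. The category names, the data layout, and the use of the population standard deviation are assumptions made for this illustration; the sketch is not taken from the OpenNERO source.

```python
import statistics


def relative_fitness(raw_scores, weights):
    """Weighted sum of Z-scores per agent (illustrative sketch).

    raw_scores: one dict per agent, mapping a fitness category to its raw value.
    weights:    dict mapping each category to its slider weight.
    """
    fitness = [0.0] * len(raw_scores)
    for category, weight in weights.items():
        values = [scores[category] for scores in raw_scores]
        mean = statistics.mean(values)
        std = statistics.pstdev(values)  # assumption: population std. deviation
        if std == 0:
            continue  # every agent scored the same; category carries no signal
        for i, value in enumerate(values):
            fitness[i] += weight * (value - mean) / std
    return fitness
```

Because each category is expressed in standard deviations from the population average, categories measured in different units can be combined with the slider weights, and a negative weight turns a reward into a punishment.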

Neuroevolution (rtNEAT)

The rtNEAT neuroevolution algorithm is a method for evolving (through genetic algorithms) a population of neural networks to control the agents. See the paper on rtNEAT for more details.

When you press the "Deploy rtNEAT" button, a population of 50 agents is created and spawned on the field. Each agent is controlled by a simple neural network connecting the input sensors directly to outputs, with random weights. Over their lifetime, fitness is accumulated based on the behavior objectives specified with the sliders: if, e.g., approaching the enemy is rewarded, the time they spend near the enemy is multiplied by a large constant and added to the fitness.

After their lifetime expires, they are removed from the field one at a time. If their fitness was low, they are simply discarded. If their fitness was high, they will be put back into the field, and in addition, a new agent is generated by mutating the neural network (i.e. adding nodes and connections and/or changing connection weights) and crossing over its representation with another network with a high fitness. A balance of about 50% new individuals and 50% repeats is maintained in the field in the steady state (the explore/exploit slider has no effect on evolution). In this manner, evolution is running incrementally in the background, constantly evaluating and reproducing individuals.

Over time, evolution is thus likely to come up with more complex networks, including those with recurrent connections. Recurrency is useful e.g. when an agent needs to pursue an enemy around the corner (i.e. even though the enemy disappeared from view, activation in a recurrent network will retain that information). In other words, it allows disambiguating the state in a POMDP problem (where the state is partially observable).

When the population is saved, the genome of each agent is written into a text file. That file can be edited to form composite teams, reloaded for further training, or loaded into battle.

The rtNEAT algorithm is parameterized using the file neat-params.dat; you can edit it in order to experiment with different settings of the method (such as mutation and speciation rates, balance of old and new agents, etc.).

Reinforcement Learning (Q-learning)

The reinforcement learning method in NERO is a version of Q-learning (familiar from the Q-learning demo), using either static, linear discretization or a tile-coding function approximator. The agents learn during their lifetime to optimize the behavioral objectives.

When you press the "Deploy Q-learning" button, a Q-learning agent is created according to the specs in the file mods/_NERO/data/shapes/character/steve_blue_qlearning.xml. The <Python agent="NERO.agent.QLearningAgent()"> XML element can be changed to include keyword arguments that will be passed to the QLearning constructor. These parameters are:

  • gamma - reinforcement learning discount factor
  • alpha - learning rate
  • epsilon - exploration factor for epsilon-greedy action selection; this can also be changed during the NERO simulation by manipulating the "Explore/Exploit" slider
  • action_bins - discretize each continuous dimension in the action space into this many linear bins (there are 2 action dimensions in NERO: turning and moving)
  • state_bins - discretize each continuous dimension in the state space into this many linear bins (there are about 15 state dimensions in NERO)
  • num_tiles - the number of tiles you want to use in the tile-coding approximator
  • num_weights - the amount of memory reserved for storing function approximations in the tile-coding approximator

The last four parameters specify the discretization of the state and action dimensions so that the agent's state can be represented as a discretized table of Q-values, one for each state/action pair (these values are initialized to zero). If you choose to use the tile-coding approximator, be sure to set action_bins and state_bins to 0; conversely, if you wish to use the static bins, be sure to set num_tiles and num_weights to 0. The default Q-Learning agents are created with action_bins set to 3 and state_bins set to 5.
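
The static linear discretization can be illustrated with a small helper that maps a continuous value to one of a fixed number of equal-width bins. The function names and the explicit low/high bounds are assumptions for this sketch; OpenNERO's own approximator may handle bounds and edge cases differently.

```python
def to_bin(value, low, high, bins):
    """Map a value in [low, high] to a bin index in 0..bins-1."""
    if bins <= 0:
        raise ValueError("bins must be positive")
    fraction = (value - low) / (high - low)
    index = int(fraction * bins)
    # Clamp so that value == high (or out-of-range input) stays in the table.
    return min(max(index, 0), bins - 1)


def discretize(values, bounds, bins):
    """Discretize every dimension of a state or action vector."""
    return tuple(to_bin(v, lo, hi, bins) for v, (lo, hi) in zip(values, bounds))
```

With action_bins set to 3, for example, each of the two action dimensions (range -1..1) collapses to three choices, so `discretize((0.0, 1.0), [(-1, 1), (-1, 1)], 3)` yields `(1, 2)`. With about 15 state dimensions at 5 bins each, a dense table would need on the order of 10^11 entries, which is one reason to consider a sparse table or the tile-coding approximator.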

The population for the game is generated by cloning this agent 50 times; each agent gets its own Q-table to update, so different agents can learn different Q-values depending on their experiences.

Q-learning progresses as usual during the lifetime of these individuals, modifying the values in the table. Using the Explore/Exploit slider you can adjust the fraction of the actions taken greedily (i.e. those with the best Q-values) vs. actions taken to explore the environment (i.e. randomly selected actions). When the lifetime of an agent expires, it is respawned, and continues from the spawn location with its current Q-table.
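
For readers unfamiliar with the method, here is a minimal sketch of tabular Q-learning with epsilon-greedy action selection, using the gamma, alpha and epsilon parameters listed earlier. The class `TabularQ` and its default values are illustrative assumptions; this is the textbook algorithm, not the code of NERO.agent.QLearningAgent.

```python
import random
from collections import defaultdict


class TabularQ:
    """Textbook Q-learning over discrete states and actions (sketch)."""

    def __init__(self, actions, gamma=0.9, alpha=0.5, epsilon=0.1, rng=None):
        self.actions = list(actions)
        self.gamma = gamma      # discount factor
        self.alpha = alpha      # learning rate
        self.epsilon = epsilon  # probability of a random (exploring) action
        self.rng = rng or random.Random()
        self.q = defaultdict(float)  # Q-values start at zero

    def choose(self, state):
        """Epsilon-greedy: explore with probability epsilon, else exploit."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        """One Q-learning backup toward reward + gamma * max Q(next_state, .)."""
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])
```

Here a state would be the tuple of sensor bins and an action the pair of movement and turning bins. Lowering epsilon makes the team act on what it has learned; raising it makes the agents try more random actions.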

When the population is saved, the Q-tables of each individual are saved together with its parameters and the function approximation parameters, so that they can be loaded for further training and battle.

Training strategy

The game consists of trying to come up with a sequence of increasingly demanding goals, so that the agents will perform well in the end. It is a good idea to start with something simple, such as approaching the enemy. Once the agents learn that, place the enemy behind a wall so they learn to go around it. Then reward the agents for hitting the enemy as well. Then start penalizing them for getting hit. Introduce more enemies, and walls behind which the agents can take cover. You can also explore the effects of staying close to or apart from teammates, and standing ground or moving a lot. In this manner, you can create agents with distinctly different personalities, and thus possibly serving different roles in battle.

Achieving each objective will take some time. Within a couple of minutes you should see some of the agents perform the task sometimes; within 10-15 minutes, almost the entire team may converge. Using the F2 displays you can follow the behavior of the current champion, which agents are drawing fire and which are avoiding it, and with rtNEAT, observe which agents are new and which are old, and how speciation is progressing. Note that it is not always good to converge completely, because it may be difficult to learn new skills then. The trick is to discover a sequence where later skills build on earlier ones so that little has to be unlearned between them.

It is a good idea to train several teams, and then test them in the battle mode. In this manner, you can develop an understanding of what works and why, and can focus your training better. Based on that knowledge you can also decide how to put a good team together from several different trained teams, as will be described next.

Composing a Team for Battle

Note that you can train several different teams to perform different behaviors, for instance a team of attackers, defenders, snipers, etc. It may then be useful to combine agents with such different behaviors into a single team. Because the save files are simply text, you can form such composite teams simply by editing them by hand. You can also "clone" agents by copying them multiple times. You can even combine agents created by neuroevolution and reinforcement learning into a single team. The first 50 in the save file will be used in the battle; if there are fewer than 50 agents in the file, they will be copied until 50 are created in battle.
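
The 50-agent rule can be pictured with the following sketch. The text does not say in which order agents are copied when the file is short, so repeating them cyclically is an assumption here, and `fill_team` is a hypothetical helper rather than an OpenNERO function.

```python
from itertools import cycle, islice

TEAM_SIZE = 50  # number of agents fielded in battle, as stated in the text


def fill_team(agents, team_size=TEAM_SIZE):
    """Use the first team_size agents, repeating the list if it is too short."""
    if not agents:
        raise ValueError("a team file must contain at least one agent")
    # Assumption: copies are made by cycling through the file in order.
    return list(islice(cycle(agents), team_size))
```

Under this assumption, a file with 10 agents would field each of them five times, so the proportions of attackers, defenders, etc. in the file carry over to the battle team.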

The basic structure of the file is like this for rtNEAT teams:

genomestart 120
trait 1 0.112808 0.000000 0.305447 0.263380 0.991934 0.000000 0.306283 0.855288
...
node 1 1 1 1 FriendRadarSensor 3 90 -90 15 0
...
node 21 1 1 3
...
gene 1 1 22 0.041885 0 1.000000 0.041885 1
...
genomeend 120

In words, a population consists of one or more genomes. Each genome starts with a genomestart (followed by its ID) line and ends with a genomeend line. Between these lines, there are one or more trait lines followed by one or more input (sensor) lines, followed by some other node lines, followed by the gene lines.
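
Editing by hand works, but picking genomes out of a saved population can also be scripted. The sketch below relies only on the genomestart/genomeend structure described above; the function names are hypothetical, and it assumes that any lines outside a genomestart/genomeend pair can be ignored.

```python
def split_genomes(text):
    """Return a list of (genome_id, block_text) pairs from a saved population."""
    genomes = []
    current_id = None
    current_lines = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if fields[0] == "genomestart":
            current_id = int(fields[1])
            current_lines = [line]
        elif current_id is not None:
            current_lines.append(line)
            if fields[0] == "genomeend":
                genomes.append((current_id, "\n".join(current_lines)))
                current_id = None
    return genomes


def select_genomes(text, wanted_ids):
    """Build team-file text from the chosen genome IDs; repeats clone an agent."""
    by_id = dict(split_genomes(text))
    return "\n".join(by_id[genome_id] for genome_id in wanted_ids) + "\n"
```

The genome IDs to pass in are the ones shown by the F2 "id" display during training, e.g. `select_genomes(saved_text, [120, 120, 87])` to field genome 120 twice alongside genome 87 (the IDs here are made up).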

For RL teams, the file looks like this:

22 serialization::archive 5 0 0 0.8 0.8 0.1 3 3 ... 1 7 27 OpenNero::TableApproximator 1 0
0 0 0 18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 18 0 ...

22 serialization::archive 5 0 0 0.8 0.8 0.1 3 3 ... 1 7 27 OpenNero::TableApproximator 1 0
0 0 0 18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 18 0 ...

...

Each team member is represented by a bunch of numbers representing the stored Q table for the agent. Unlike rtNEAT teams, RL agents in this file are separated by one blank line.
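
Since RL agents are delimited by blank lines, a team file can be split into per-agent records and reassembled without interpreting the numbers at all. This is a minimal sketch with hypothetical function names; it treats each record as opaque text and assumes that no record contains an empty line of its own.

```python
def split_rl_agents(text):
    """Split a saved RL team into one text record per agent."""
    records = []
    current = []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            # A blank line closes the current agent's record.
            records.append("\n".join(current))
            current = []
    if current:
        records.append("\n".join(current))
    return records


def join_agents(records):
    """Write records back with exactly one blank line between agents."""
    return "\n\n".join(records) + "\n"
```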

Either way, you will probably want to pick and choose the individual agents from your training episodes that perform the best for the tasks you anticipate. You should assemble these agents into one file for the battle.

(Note: If you include reinforcement learning agents, you need to separate all agents in your submission file with one blank line. Also note: if you form a team by combining individuals from different rtNEAT runs, you currently cannot train such a combo team further (because rtNEAT training depends on historical markings that then would not match)).

Before you submit to the tournament, you should test your file by loading it into NERO_Battle and making sure it runs correctly. If you want, you can test your team e.g. against this sample team.

NERO Battle

In the NERO-Battle environment the user first loads the two teams: one is identified as Red and the other as Blue based on how the top of the head of the robots is painted. By default they spawn on opposite sides of the central wall in the standard environment (the environment and the spawn locations can be changed as in training mode).

The Hitpoints slider specifies how many times each agent can be hit before it dies and is removed from the battle. The game ends when one team is completely eliminated or when the time runs out, in which case the team that has more hits on the opponent wins. The current hitpoints are displayed in the title bar of the NERO window; the agent that delivered the winning shot will jump up and down in jubilation :-).
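
The end-of-game rule can be summarized in a few lines. The function below is a sketch of the rule as stated above; its name and arguments are hypothetical, and the handling of a tie in hits (reported as a draw) is an assumption, since the text does not cover that case.

```python
def battle_result(red_alive, blue_alive, red_hits, blue_hits, time_left):
    """Return "red", "blue", "draw", or None while the battle is still running.

    red_hits and blue_hits count the hits each team has scored on its opponent.
    """
    if blue_alive == 0 and red_alive > 0:
        return "red"   # blue team completely eliminated
    if red_alive == 0 and blue_alive > 0:
        return "blue"  # red team completely eliminated
    if time_left > 0 and red_alive > 0 and blue_alive > 0:
        return None    # battle continues
    # Time ran out: the team with more hits on the opponent wins.
    if red_hits > blue_hits:
        return "red"
    if blue_hits > red_hits:
        return "blue"
    return "draw"      # assumption: equal hits are not covered by the text
```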

The game starts when the user presses the Continue button. The agents are spawned only once, and they then have to move around in the environment and engage the other team. This is where the training pays off: the agents need to respond appropriately to the opponents' actions, employing different skills in different situations, such as attacking, retreating, sniping, ambushing, sometimes perhaps working together with teammates and sometimes independently of them. There is no a-priori winning strategy; the performance of the team depends on the ingenuity of its creator!

To see how the battle mode works, or see how well your team is doing, you can use this sample team.

NERO Tournament

A fun event in e.g. AI or machine learning courses is to organize a NERO tournament. The students develop teams, and the teams are then played against each other in a round-robin or a double-elimination tournament. One such tournament was held in Fall 2011 for the Stanford Online AI course; the tournament assignment is here.
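
For organizers, a round-robin schedule simply pairs every team with every other team once. The helper below is a small sketch using hypothetical team file names; it only lists the pairings and does not run any battles.

```python
from itertools import combinations


def round_robin(teams):
    """List every pairing of two distinct teams exactly once."""
    return list(combinations(teams, 2))


# Example with made-up team file names.
for red, blue in round_robin(["team_a.txt", "team_b.txt", "team_c.txt"]):
    print(red, "vs", blue)
```

With n teams this produces n * (n - 1) / 2 battles, which is worth estimating in advance given the time each battle takes.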

Extending NERO Methods

The ingenuity is not limited to simply training the agents with the methods that have been implemented in OpenNERO. The game is open source, and you can modify all aspects of it by changing the Python code (and in some cases, the C++ code). The main files are...

For instance, you can implement more sophisticated versions of the sensors and effectors, or entirely new ones such as line-of-fire sensors, or sending and receiving signals between the agents. You can implement more sophisticated function approximators for reinforcement learning, and even other neuroevolution and reinforcement learning algorithms. If you so desire, you can also program the agent behaviors entirely by hand.

Note that many such changes will require making corresponding changes to the battle mode as well, and therefore it will not be possible to use them in the NERO Tournament. However, as long as your team is represented in terms of genomes and Q-tables, it doesn't matter how that representation is created. That is, if your changes apply to training only, and your team can still be saved in the existing format, the team can be entered into the tournament. For instance, you can express behaviors in terms of rules and finite state automata based on the sensors and effectors in NERO, and then mechanically translate them into neural networks (see e.g. this paper). Those networks can then be represented as a genome and entered into the tournament.

Differences between OpenNERO and Original NERO

The NERO game in OpenNERO differs from the original NERO game in several important ways. First of all, whereas the original NERO was based on the Torque game engine, OpenNERO is entirely open source (based on the Irrlicht game engine and many other open-source components). This design makes it a good platform for research and education, i.e. it is possible for the users to extend it and to understand it fully.

Second, the original NERO was designed to demonstrate that machine learning games can be viable. It therefore aimed to be a more substantial game, and included many features such as more advanced graphics, sound, and user interface, as well as more detailed environments that made gameplay more enjoyable. The 2.0 version of NERO also included interactive battle where the human players specified targets and composed teams dynamically.

Third, OpenNERO includes reinforcement learning as an alternative method of learning for NERO agents. The idea is to demonstrate more broadly how learning can take place in intelligent agents, both for research and education.

Fourth, the original NERO included several features that have not yet been implemented in OpenNERO, but could be in the future. They include a sensor for line-of-fire (which may help develop more sophisticated behaviors); taking friendly fire into account; collisions among NERO agents; different types of turrets in training; a button that converges a population to a single individual; and a button that removes an individual from the population. We invite the users to implement such features, and perhaps others, in the game, and contribute them to OpenNERO!

Fifth, much of OpenNERO is written in Python (instead of C++), making it easier to understand and modify, again supporting research and education. Unfortunately, this has the result of slowing down the simulation by an order of magnitude. However, we believe that researchers and students have the patience it takes to "play" OpenNERO in order to gain better insight into how learning works in it.

Software Issues

OpenNERO is academic software and under (hyper)active development. It is possible that you will come across a bug in it, or a feature that should be implemented. If so, please report it here, so that everyone can see it and track it (please first check whether it has already been reported).

