Many real-world problems require complex coordination between multiple agents, whether those agents are people or algorithms. A machine learning technique called multi-agent reinforcement learning (MARL) has shown success here, primarily in two-team games such as Go, Dota 2, StarCraft, hide-and-seek, and capture the flag. But the human world is much messier than games: people face countless social dilemmas, from the interpersonal to the international, and they must decide not only how to cooperate but when to cooperate.
Researchers at OpenAI propose training AI agents with what they call randomized uncertain social preferences (RUSP), an augmentation that expands the distribution of environments on which reinforcement learning agents are trained. During training, agents share differing amounts of reward with one another; in addition, each agent has its own degree of uncertainty about these relationships, creating an asymmetry that the researchers believe pressures agents to learn socially reactive behaviors.
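To make the idea concrete, here is a minimal sketch of a RUSP-style reward transformation. It is an illustration under assumptions, not the paper's implementation: the function name `rusp_rewards`, the Dirichlet sampling of sharing weights, and the Gaussian observation noise are all choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def rusp_rewards(env_rewards, n_agents, noise_scale=1.0):
    """Sketch of RUSP-style reward sharing (illustrative, not the paper's code).

    Each agent trains on a random convex combination of all agents'
    environment rewards, and each agent observes only a noisy copy of
    the sharing weights, so its relationships are uncertain.
    """
    # Random reward-sharing matrix: row i holds how much agent i
    # values each agent's environment reward (rows sum to 1).
    T = rng.dirichlet(np.ones(n_agents), size=n_agents)
    shared = T @ env_rewards  # the reward signal each agent is trained on
    # Each agent gets its own independently noised view of the weights,
    # which is the source of the asymmetry described above.
    observed = T[None, :, :] + rng.normal(0.0, noise_scale,
                                          size=(n_agents, n_agents, n_agents))
    return shared, observed

# Example: three agents with raw environment rewards of +1, 0, and -1.
shared, observed = rusp_rewards(np.array([1.0, 0.0, -1.0]), n_agents=3)
```

Because each row of the sharing matrix sums to 1, every training reward stays within the range of the raw environment rewards; only the noisy observations differ between agents.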
To illustrate RUSP's potential, the coauthors had agents play Prisoner's Buddy, a grid-based game in which agents earn a reward for finding a "buddy." At each timestep, an agent acts either by choosing another agent or by choosing no one and sitting out the round. If two agents choose each other, each receives a reward of +2. If agent Alice chooses Bob but the choice is not reciprocated, Alice receives -2 and Bob receives +1. Agents who choose no one earn 0.
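The reward rules above can be sketched as a short payoff function. This is a reading of the rules as described here, not the paper's environment code; the function name and the decision to stack rewards when an agent is both mutually paired and chosen by a third party are assumptions of this sketch.

```python
def prisoners_buddy_rewards(choices):
    """Per-round rewards for Prisoner's Buddy as described in the article.

    choices[i] is the index of the agent that agent i picks,
    or None if agent i sits out the round.
    Mutual pick: +2 to each. Unreciprocated pick: -2 to the chooser,
    +1 to the chosen. Sitting out: 0.
    """
    rewards = [0] * len(choices)
    for i, j in enumerate(choices):
        if j is None:
            continue  # agent i sat out
        if choices[j] == i:
            rewards[i] += 2   # reciprocated: the loop credits each partner
        else:
            rewards[i] += -2  # chooser is penalized
            rewards[j] += 1   # chosen agent still gains

    return rewards

# Agents 0 and 1 pick each other; agent 2 sits out.
print(prisoners_buddy_rewards([1, 0, None]))   # [2, 2, 0]
# Agent 0 picks agent 1, who sits out: unreciprocated.
print(prisoners_buddy_rewards([1, None, None]))  # [-2, 1, 0]
```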
The coauthors also studied preliminary team dynamics in a far more complex, physics-based setting called Oasis. There, agents are rewarded for surviving: +1 for each timestep they stay alive and a large negative reward when they die. Their health declines with every timestep, but they can regain health by eating food pellets, and they can attack one another to reduce each other's health. When an agent's health drops below 0, it dies and, after 100 timesteps, respawns at the edge of the play area.
Oasis has only enough food to sustain two of its three agents, producing a social dilemma: to protect the food source and stay alive, agents must break symmetry and gang up on the third.
The researchers report that RUSP agents in Oasis performed far better than a "selfish" baseline, achieving higher reward and dying less often. (For agents trained with high levels of uncertainty, up to 90% of the deaths in an episode were suffered by a single agent, meaning two agents learned to form a partnership and largely exclude the third from the food source.) And in Prisoner's Buddy, RUSP agents effectively divided into teams that remained cohesive and stable throughout an episode.
The researchers note that RUSP is inefficient; in Oasis, 1,000 training iterations corresponded to roughly 3.8 million episodes of experience. Even so, they argue that RUSP and techniques like it warrant further exploration. "Reciprocity and team formation are hallmark behaviors of sustained cooperation in both animals and humans," they wrote in a paper submitted to the 2020 NeurIPS conference. "These essential practices are at the root of all of our social systems and are often explicitly written into them; reciprocal punishment was at the heart of Hammurabi's code of laws almost 4,000 years ago. If we are to see the emergence of more complex social structures and norms, understanding how simple forms of reciprocity evolve in artificial agents seems a wise first step."