Multi-agent Planning with High-Level Human Guidance

PRIMA 2020: Principles and Practice of Multi-Agent Systems (PRIMA 2020)

Abstract

Planning and coordination of multiple agents in the presence of uncertainty and noisy sensors is extremely hard. A human operator who observes a multi-agent team can provide valuable guidance to the team based on her superior ability to interpret observations and assess the overall situation. We propose an extension of decentralized POMDPs that allows such human guidance to be factored into the planning and execution processes. Human guidance in our framework consists of intuitive high-level commands that the agents must translate into a suitable joint plan that is sensitive to what they know from local observations. The result is a framework that allows multi-agent systems to benefit from the complex strategic thinking of a human supervising them. We evaluate this approach on several common benchmark problems and show that it can lead to dramatic improvement in performance.

This work was supported in part by the National Key R&D Program of China (Grant No. 2017YFB1002204), the National Natural Science Foundation of China (Grant No. U1613216, Grant No. 61603368), and the Guangdong Province Science and Technology Plan (Grant No. 2017B010110011).



Author information

Correspondence to Feng Wu.


A The Benchmark Problems

Fig. 3. The benchmark problems.

A.1 Meeting in a 3 \(\times\) 3 Grid

In this problem, as shown in Fig. 3(a), two robots R1 and R2 situated in a 3 \(\times\) 3 grid try to meet in the same cell as quickly as possible. There are 81 states in total since each robot can be in any of the 9 cells. Each robot has 5 actions: move up, down, left, right, or stay. The moving actions (i.e., all actions except stay) are stochastic: with probability 0.6 the robot moves in the desired direction, and with probability 0.1 each it moves in one of the other directions or stays in the same cell. There are 9 observations per robot; each robot can observe whether it is near one of the corners or walls. The robots may meet at any of the 4 corners. Once they meet at a corner, the agents receive a reward of 1. To make the problem more challenging, the agents are reset to their initial locations after they meet at a corner.
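To make the dynamics concrete, the following is a minimal Python sketch of the single-robot transition and the joint reward described above. The cell indexing (0–8, row-major), the helper names, and the reading that each of the four unintended outcomes occurs with probability 0.1 are our own assumptions, not quoted from the paper.

```python
import random

# Sketch of the single-robot dynamics of Meeting in a 3x3 Grid (assumed indexing).
# Cells are indexed 0..8 row-major; the corner cells are 0, 2, 6, and 8.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
CORNERS = {0, 2, 6, 8}

def move(cell, direction):
    """Deterministic effect of a move; the robot stays put if it would leave the grid."""
    row, col = divmod(cell, 3)
    dr, dc = MOVES[direction]
    nr, nc = row + dr, col + dc
    return nr * 3 + nc if 0 <= nr < 3 and 0 <= nc < 3 else cell

def sample_next_cell(cell, action):
    """With probability 0.6 the robot moves in the desired direction; with probability
    0.1 each it moves in one of the other three directions or stays. 'stay' is deterministic."""
    if action == "stay":
        return cell
    others = [d for d in MOVES if d != action]
    outcomes = [move(cell, action)] + [move(cell, d) for d in others] + [cell]
    return random.choices(outcomes, weights=[0.6, 0.1, 0.1, 0.1, 0.1])[0]

def reward(cell1, cell2):
    """Joint reward of 1 when both robots are in the same corner cell."""
    return 1.0 if cell1 == cell2 and cell1 in CORNERS else 0.0
```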

We design the high-level commands so that the robots are asked to meet at a specific corner. In more detail, the command set for this problem is \(C=\{ c_0, c_1, c_2, c_3, c_4 \}\), where \(c_0\) allows the agents to meet at any corner and \(c_i\) (\(i \ne 0\)) commands the robots to meet at the corner labeled \(i\) in Fig. 3(a). Depending on what we want to achieve in the problem, the commands can be more general (e.g., meet at either of the top corners) or more specific (e.g., meet at the top-left corner without going through the center). The reward function is implemented so that the robots are rewarded only when they meet at the specified corner (for \(c_0\), the original reward function is used). For example, \(R(c_1, \cdot , \cdot ) = 1\) only when they meet at the top-left corner and 0 otherwise.
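A minimal sketch of how such a command-conditioned reward could be implemented is shown below. The assignment of \(c_1\) to the top-left corner follows the paper's example; the assignments of \(c_2\)–\(c_4\) to particular corner cells are assumptions for illustration.

```python
# Hypothetical command-conditioned reward R(c, s, a) for the grid domain,
# using the same 0..8 row-major cell indexing as the previous sketch.
CORNERS = {0, 2, 6, 8}
COMMAND_CORNER = {1: 0, 2: 2, 3: 6, 4: 8}   # c_1 -> top-left (per the example); c_2..c_4 assumed

def command_reward(command, cell1, cell2):
    """Reward 1 only when both robots meet at the corner named by the command;
    c_0 (command == 0) keeps the original reward of meeting at any corner."""
    if cell1 != cell2 or cell1 not in CORNERS:
        return 0.0
    if command == 0:
        return 1.0
    return 1.0 if cell1 == COMMAND_CORNER[command] else 0.0
```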

During execution time, we simulate a random event that determines whether meeting at one of the corners yields a much higher reward (i.e., 100) and, if so, which corner. The event is a finite state machine (FSM) with 5 states, where state 0 means there is no highly rewarded corner and state \(i\) (\(1 \le i \le 4\)) means the corner labeled \(i\) has the highest reward. The transition function of this FSM is predetermined and fixed for all the tests, but it is not captured by the model and is unknown at planning time. Therefore, it is not considered in the agents' policies. The event can only be observed by the operator at runtime. This kind of stochastic event can be used to simulate a disaster-response scenario, where a group of robots with pre-computed plans is sent to search for and rescue victims at several locations (i.e., the corners). For each location, the robots must cooperate and work together (i.e., meet at the corner). As more information (e.g., messages reported by people nearby) is collected at the base station, one of the locations may become more likely to contain victims. Thus, the operator should guide the robots to search that location and rescue the victims there.
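The sketch below illustrates such a 5-state FSM and the runtime reward it induces. The paper does not give the actual (fixed) transition matrix, so the probabilities here are placeholder values that merely sum to 1 per row.

```python
import random

# Illustrative 5-state FSM for the runtime event; the rows are placeholders.
FSM_TRANSITIONS = [
    # to:  s0    s1    s2    s3    s4
    [0.60, 0.10, 0.10, 0.10, 0.10],  # s0: no highly rewarded corner
    [0.20, 0.80, 0.00, 0.00, 0.00],  # s1: corner 1 currently has the highest reward
    [0.20, 0.00, 0.80, 0.00, 0.00],  # s2: corner 2
    [0.20, 0.00, 0.00, 0.80, 0.00],  # s3: corner 3
    [0.20, 0.00, 0.00, 0.00, 0.80],  # s4: corner 4
]

def fsm_step(state):
    """Sample the next FSM state; only the human operator observes this at runtime."""
    return random.choices(range(5), weights=FSM_TRANSITIONS[state])[0]

def meeting_reward(fsm_state, met_corner_label):
    """Reward 100 when the robots meet at the currently highly rewarded corner,
    and the original reward of 1 for meeting at any other corner."""
    return 100.0 if fsm_state != 0 and met_corner_label == fsm_state else 1.0
```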

A.2 Cooperative Box-Pushing

In this problem, as shown in Fig. 3(b), two robots R1 and R2 in a 3 \(\times\) 3 grid try to push the large box (LB) together or to push the small boxes (SB) independently. Each robot can turn left, turn right, move forward, or stay, so there are 4 actions per robot. For each action, with probability 0.9 the robot turns in the desired direction or moves forward as intended, and with probability 0.1 it stays in the same position and orientation. Each robot has 5 observations to identify the object in front of it, which can be an empty field, a wall, the other robot, a small box, or the large box. For each robot, executing an action has a cost of 0.1 for energy consumption. If a robot bumps into a wall, the other robot, or a box without pushing it, it gets a penalty of 5. The standard reward function is designed to encourage cooperation: the reward for cooperatively pushing the large box is 100, while the reward for pushing a small box is just 10. Each run consists of 100 steps. Once a box is pushed to its goal location, the robots are reset to an initial state.
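For concreteness, below is a minimal, self-contained sketch of the per-robot motion dynamics and reward constants described above. Boxes and collisions are omitted, and the heading encoding and helper names are assumptions.

```python
import random

# Sketch of the per-robot dynamics of Cooperative Box-Pushing (boxes omitted).
HEADINGS = [(-1, 0), (0, 1), (1, 0), (0, -1)]   # north, east, south, west

ACTION_COST      = -0.1    # energy cost of executing any action
BUMP_PENALTY     = -5.0    # bumping into a wall, the other robot, or a box without pushing it
SMALL_BOX_REWARD = 10.0    # pushing a small box to its goal location
LARGE_BOX_REWARD = 100.0   # cooperatively pushing the large box to its goal location

def apply_action(row, col, heading, action):
    """With probability 0.9 the action succeeds; with probability 0.1 the robot keeps
    its current position and orientation."""
    if action == "stay" or random.random() < 0.1:
        return row, col, heading
    if action == "turn_left":
        return row, col, (heading - 1) % 4
    if action == "turn_right":
        return row, col, (heading + 1) % 4
    # "move_forward": stay put at the grid boundary (where the bump penalty would apply)
    dr, dc = HEADINGS[heading]
    nr, nc = row + dr, col + dc
    return (nr, nc, heading) if 0 <= nr < 3 and 0 <= nc < 3 else (row, col, heading)
```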

The high-level commands \(C=\{ c_0, c_1, c_2, c_3 \}\) are designed as follows: (\(c_0\)) the robots may push any box; (\(c_1\)) the robots should only push the small box on the left side; (\(c_2\)) the robots should only push the small box on the right side; (\(c_3\)) the robots should only push the large box in the middle. Specifying the corresponding reward function is straightforward. For \(c_0\), we use the original reward function. For \(c_i\) (\(1\le i \le 3\)), we reward the agents (\(+100\)) for pushing the designated box and penalize them (\(-100\)) for pushing any other box.
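A small sketch of this command-conditioned shaping is given below; the box identifiers are our own names, and the fallback to the original reward under \(c_0\) follows the text.

```python
# Hypothetical command-conditioned shaping for Box-Pushing (assumed box identifiers).
TARGET_BOX = {1: "small_left", 2: "small_right", 3: "large"}   # c_1, c_2, c_3

def command_push_reward(command, pushed_box, original_reward):
    """Under c_0 the standard reward is used; under c_i the agents get +100 for pushing
    the designated box and -100 for pushing any other box."""
    if command == 0 or pushed_box is None:
        return original_reward
    return 100.0 if pushed_box == TARGET_BOX[command] else -100.0
```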

Similar to the previous domain, we also simulate a random event representing a trapped animal in one of the cells labeled with numbers in Fig. 3(b). The animal is hidden behind a box, so the robots cannot see it with their cameras. However, if the robots push a box while an animal is on the other side of that box, the animal gets injured and the robots receive a high penalty of 100. The random event is modeled by an FSM with 5 states, where state 0 represents no animal and state \(i\) (\(1 \le i \le 4\)) means an animal is trapped in the cell labeled \(i\). If the animal gets injured, the FSM transitions to another state according to a predefined transition function. Again, this event is neither captured by the agents' model nor considered in their policies. The animal can only be observed by the operator during execution time, using an additional camera placed behind the boxes. This setting allows us to simulate scenarios where operators supervise robots performing risk-sensitive tasks, for example, robots doing construction work on a crowded street.
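The runtime penalty check could look as follows. Since the layout of the labeled cells in Fig. 3(b) is not reproduced here, the mapping from each box to the cell(s) behind it is an assumption for illustration.

```python
# Hypothetical runtime penalty for the hidden-animal event (assumed box-to-cell mapping).
ANIMAL_PENALTY = -100.0
CELLS_BEHIND = {"small_left": {1}, "large": {2, 3}, "small_right": {4}}

def animal_penalty(fsm_state, pushed_box):
    """State 0 means no animal; state i means an animal is trapped in the cell labeled i.
    The penalty applies only when a pushed box has the animal's cell behind it."""
    if fsm_state == 0 or pushed_box is None:
        return 0.0
    return ANIMAL_PENALTY if fsm_state in CELLS_BEHIND[pushed_box] else 0.0
```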

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Wu, F., Zilberstein, S., Jennings, N.R. (2021). Multi-agent Planning with High-Level Human Guidance. In: Uchiya, T., Bai, Q., Marsá Maestre, I. (eds) PRIMA 2020: Principles and Practice of Multi-Agent Systems. PRIMA 2020. Lecture Notes in Computer Science, vol 12568. Springer, Cham. https://doi.org/10.1007/978-3-030-69322-0_12

  • DOI: https://doi.org/10.1007/978-3-030-69322-0_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-69321-3

  • Online ISBN: 978-3-030-69322-0
