Abstract
Optimization of human-AI teams hinges on the AI’s ability to tailor its interaction to individual human teammates. A common hypothesis in adaptive AI research is that minor differences in people’s predisposition to trust can significantly impact their likelihood of complying with recommendations from the AI. Predisposition to trust is often measured with self-report inventories administered before interaction. We benchmark a popular measure of this kind against behavioral predictors of compliance, using datasets from three previous research projects. We find that the inventory is a less effective predictor of compliance than the behavioral measures. This suggests a general property: individual differences in initial behavior are more predictive of later compliance than differences in self-reported trust attitudes. This result also shows the potential for easily accessible behavioral measures to give an AI more accurate models of its human teammates without the use of (often costly) survey instruments.
Part of the effort behind this work was sponsored by the Defense Advanced Research Projects Agency (DARPA) under contract number W911NF2010011. The content of the information does not necessarily reflect the position or the policy of the U.S. Government or the Defense Advanced Research Projects Agency, and no official endorsements should be inferred.
Notes
1. Controlling for mission did not meaningfully change the interpretation of the results.
2. A likelihood ratio test is another method of comparing such models and will produce similar insights.
References
Ajzen, I.: The theory of planned behavior. Organ. Behav. Hum. Decis. Process. 50(2), 179–211 (1991)
Aliasghari, P., Ghafurian, M., Nehaniv, C.L., Dautenhahn, K.: Effect of domestic trainee robots’ errors on human teachers’ trust. In: Proceedings of the IEEE International Conference on Robot & Human Interactive Communication (RO-MAN), pp. 81–88. IEEE (2021)
Aliasghari, P., Ghafurian, M., Nehaniv, C.L., Dautenhahn, K.: How do different modes of verbal expressiveness of a student robot making errors impact human teachers’ intention to use the robot? In: Proceedings of the 9th International Conference on Human-Agent Interaction, pp. 21–30 (2021)
Amershi, S., et al.: Guidelines for human-AI interaction. In: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pp. 1–13 (2019)
Aroyo, A.M., Rea, F., Sandini, G., Sciutti, A.: Trust and social engineering in human robot interaction: will a robot make you disclose sensitive information, conform to its recommendations or gamble? IEEE Robot. Autom. Lett. 3(4), 3701–3708 (2018)
Ashleigh, M.J., Higgs, M., Dulewicz, V.: A new propensity to trust scale and its relationship with individual well-being: implications for HRM policies and practices. Hum. Resour. Manage. J. 22(4), 360–376 (2012)
Bargain, O., Aminjonov, U.: Trust and compliance to public health policies in times of covid-19. J. Publ. Econ. 192, 104316 (2020)
Barnes, M.J., Wang, N., Pynadath, D.V., Chen, J.Y.: Human-agent bidirectional transparency. In: Trust in Human-Robot Interaction, pp. 209–232. Elsevier (2021)
Brentano, F.: Psychology from an Empirical Standpoint. Routledge, Milton Park (2012)
Chater, N., Zeitoun, H., Melkonyan, T.: The paradox of social interaction: shared intentionality, we-reasoning, and virtual bargaining. Psychol. Rev. 129(3), 415 (2022)
Chi, O.H., Jia, S., Li, Y., Gursoy, D.: Developing a formative scale to measure consumers’ trust toward interaction with artificially intelligent (AI) social robots in service delivery. Comput. Hum. Behav. 118, 106700 (2021)
Dennett, D.C.: The Intentional Stance. MIT press, Cambridge (1987)
Elliot, J.: Artificial social intelligence for successful teams (ASIST) (2021). www.darpa.mil/program/artificial-social-intelligence-for-successful-teams
Gurney, N., Pynadath, D., Wang, N.: My actions speak louder than your words: when user behavior predicts their beliefs about agents’ attributes. arXiv preprint arXiv:2301.09011 (2023)
Gurney, N., Pynadath, D.V., Wang, N.: Measuring and predicting human trust in recommendations from an AI teammate. In: International Conference on Human-Computer Interaction, pp. 22–34. Springer (2022). https://doi.org/10.1007/978-3-031-05643-7_2
Hancock, P.A., Billings, D.R., Schaefer, K.E., Chen, J.Y., De Visser, E.J., Parasuraman, R.: A meta-analysis of factors affecting trust in human-robot interaction. Hum. Factors 53(5), 517–527 (2011)
Hoff, K.A., Bashir, M.: Trust in automation: integrating empirical evidence on factors that influence trust. Hum. Factors 57(3), 407–434 (2015)
Jessup, S.A., Schneider, T.R., Alarcon, G.M., Ryan, T.J., Capiola, A.: The measurement of the propensity to trust automation. In: Chen, J.Y.C., Fragomeni, G. (eds.) HCII 2019. LNCS, vol. 11575, pp. 476–489. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-21565-1_32
Kaelbling, L.P., Littman, M.L., Cassandra, A.R.: Planning and acting in partially observable stochastic domains. Artif. Intell. 101(1), 99–134 (1998)
Kaelbling, L.P., Littman, M.L., Moore, A.W.: Reinforcement learning: a survey. J. Artif. Intell. Res. 4, 237–285 (1996)
Lee, J.D., See, K.A.: Trust in automation: designing for appropriate reliance. Hum. Factors 46(1), 50–80 (2004)
Lutz, C., Tamó-Larrieux, A.: The robot privacy paradox: understanding how privacy concerns shape intentions to use social robots. Hum. Mach. Commun. 1, 87–111 (2020)
McKnight, D.H., Choudhury, V., Kacmar, C.: Developing and validating trust measures for e-commerce: an integrative typology. Inf. Syst. Res. 13(3), 334–359 (2002)
Merritt, S.M., Huber, K., LaChapell-Unnerstall, J., Lee, D.: Continuous Calibration of Trust in Automated Systems. Tech. rep., Missouri University-St. Louis (2014)
Millikan, R.G.: Biosemantics. J. Philos. 86(6), 281–297 (1989)
Mischel, W.: Personality and Assessment. Psychology Press, London (2013)
Nomura, T., Kanda, T., Suzuki, T.: Experimental investigation into influence of negative attitudes toward robots on human-robot interaction. AI Soc. 20(2), 138–150 (2006)
Nomura, T., Suzuki, T., Kanda, T., Kato, K.: Measurement of negative attitudes toward robots. Interact. Stud. 7(3), 437–454 (2006)
Ouellette, J.A., Wood, W.: Habit and intention in everyday life: the multiple processes by which past behavior predicts future behavior. Psychol. Bull. 124(1), 54 (1998)
Parasuraman, R., Riley, V.: Humans and automation: use, misuse, disuse, abuse. Hum. Factors 39(2), 230–253 (1997)
Pynadath, D.V., Gurney, N., Wang, N.: Explainable reinforcement learning in human-robot teams: the impact of decision-tree explanations on transparency. In: 2022 31st IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), pp. 749–756. IEEE (2022)
Pynadath, D.V., Wang, N., Kamireddy, S.: A markovian method for predicting trust behavior in human-agent interaction. In: Proceedings of the 7th International Conference on Human-Agent Interaction, pp. 171–178 (2019)
Rossi, A., Dautenhahn, K., Koay, K.L., Walters, M.L.: The impact of peoples’ personal dispositions and personalities on their trust of robots in an emergency scenario. Paladyn J. Behav. Robot. 9(1), 137–154 (2018)
Rossi, A., Dautenhahn, K., Koay, K.L., Walters, M.L., Holthaus, P.: Evaluating people’s perceptions of trust in a robot in a repeated interactions study. In: Wagner, A.R. (ed.) ICSR 2020. LNCS (LNAI), vol. 12483, pp. 453–465. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-62056-1_38
Schaefer, K.: The perception and measurement of human-robot trust (2013). stars.library.ucf.edu/etd/2688
Schrum, M.L., Johnson, M., Ghuy, M., Gombolay, M.C.: Four years in review: statistical practices of likert scales in human-robot interaction studies. In: Companion of the 2020 ACM/IEEE International Conference on Human-Robot Interaction, pp. 43–52 (2020)
Seeber, I., et al.: Machines as teammates: a research agenda on AI in team collaboration. Inf. Manage. 57(2), 103174 (2020)
Shneiderman, B.: Human-centered artificial intelligence: reliable, safe & trustworthy. Int. J. Hum. Comput. Interact. 36(6), 495–504 (2020)
Stevenson, D.C.: The Internet Classics Archive: On Interpretation by Aristotle (2009). https://classics.mit.edu/Aristotle/interpretation.html
Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT press, Cambridge (2018)
Tauchert, C., Mesbah, N., et al.: Following the robot? Investigating users’ utilization of advice from robo-advisors. In: Proceedings of the International Conference on Information Systems (2019)
Textor, C., Pak, R.: Paying attention to trust: exploring the relationship between attention control and trust in automation. In: Proceedings of the Human Factors and Ergonomics Society Annual Meeting, vol. 65, no. 1, pp. 817–821. SAGE Publications, Los Angeles, CA (2021)
Venkatesh, V.: Determinants of perceived ease of use: integrating control, intrinsic motivation, and emotion into the technology acceptance model. Inf. Syst. Res. 11(4), 342–365 (2000)
Wang, N., Pynadath, D.V., Hill, S.G.: The impact of POMDP-generated explanations on trust and performance in human-robot teams. In: Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, pp. 997–1005 (2016)
Wang, N., Pynadath, D.V., Hill, S.G.: Trust calibration within a human-robot team: Comparing automatically generated explanations. In: 2016 11th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pp. 109–116. IEEE (2016)
Wang, N., Pynadath, D.V., Rovira, E., Barnes, M.J., Hill, S.G.: Is It My Looks? Or Something I Said? The impact of explanations, embodiment, and expectations on trust and performance in human-robot teams. In: Ham, J., Karapanos, E., Morita, P.P., Burns, C.M. (eds.) PERSUASIVE 2018. LNCS, vol. 10809, pp. 56–69. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-78978-1_5
Wong, C.M.L., Jensen, O.: The paradox of trust: perceived risk and public compliance during the COVID-19 pandemic in Singapore. J. Risk Res. 23(7–8), 1021–1030 (2020)
Xu, A., Dudek, G.: OPTIMo: online probabilistic trust inference model for asymmetric human-robot collaborations. In: 2015 10th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pp. 221–228. IEEE (2015)
Yagoda, R.E., Gillan, D.J.: You want me to trust a robot? The development of a human-robot interaction trust scale. Int. J. Soc. Robot. 4(3), 235–248 (2012)
Appendices
Appendix A Figures
Appendix B Data and Models
1.1 B.1 Models
We used linear regression to model and test the predictive value of the various behavioral measures. The basic approach is to fit a reference model (Model 1) in which the outcome measure, future behavior, is predicted by the treatment conditions alone, such as:
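The display equation did not survive in this copy; based on the coefficients discussed below, a reference model of this form would be

```latex
\textrm{FB} = \beta_0 + \beta_\textrm{Treat}\,\textrm{Treat} + \varepsilon \qquad (1)
```

where FB is the future-behavior (compliance) measure and Treat encodes the treatment condition.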
We use this general form for the reference (null) model across the experiments. Note that there are different FB measures; which one is used in a given set of models depends on the independent variable in question. A given reference and test model, however, always use the same dependent variable.
Model 2: The first category of test models incorporates participants’ DTI score as an independent variable:
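The display equation is missing here; consistent with the reference model and the text that follows, it would take the form

```latex
\textrm{FB} = \beta_0 + \beta_\textrm{Treat}\,\textrm{Treat} + \beta_\textrm{DTI}\,\textrm{DTI} + \varepsilon \qquad (2)
```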
Note that the reference model is nested within this model. In other words, Eq. (2) represents the alternative hypothesis that adding the predictor variable accounts for more variance, i.e., significantly improves model performance. Comparing the two models is as simple as applying an F-test, which in this case tells us whether the more complex model yields a statistically different residual sum of squares (RSS). If it does, we can reject the null hypothesis that the reference model is sufficient in favor of the alternative hypothesis that adding the predictor variable(s) was warranted. The F-test, in this instance, can be thought of as a way to investigate the utility of adding DTI or other measures to the model (Footnote 2). When this test returns a p-value less than 0.05, we conclude that the RSS values of the two models differ significantly at \(\alpha = 0.05\); in other words, the new explanatory variable is warranted because it significantly increases the variance explained by the model (i.e., lowers the RSS).
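As a concrete illustration, the nested-model F-test reduces to a simple computation over the two models’ residual sums of squares. The sketch below is ours (function name and RSS values are illustrative only), showing the statistic that is then compared against the relevant F distribution:

```python
def nested_f_test(rss_reduced, rss_full, df_added, df_resid_full):
    """F-statistic for comparing two nested OLS models.

    rss_reduced    -- residual sum of squares of the reference model
    rss_full       -- RSS of the model with the added predictor(s)
    df_added       -- number of predictors added (1 when adding DTI alone)
    df_resid_full  -- residual degrees of freedom of the larger model
    """
    return ((rss_reduced - rss_full) / df_added) / (rss_full / df_resid_full)

# Hypothetical values: adding one predictor drops the RSS from 12.0 to 10.0,
# leaving 100 residual degrees of freedom in the larger model.
f_stat = nested_f_test(rss_reduced=12.0, rss_full=10.0,
                       df_added=1, df_resid_full=100)
# f_stat == 20.0; compare against the F(1, 100) critical value
# (roughly 3.94 at alpha = 0.05) or convert it to a p-value.
```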
Model 3: The next category of test models incorporates participants’ past behavior as independent variables, for example, the model:
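The display equation is missing here; with FC as the past-behavior predictor, it would take the form

```latex
\textrm{FB} = \beta_0 + \beta_\textrm{Treat}\,\textrm{Treat} + \beta_\textrm{FC}\,\textrm{FC} + \varepsilon \qquad (3)
```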
This model uses participants’ first choice to predict their compliance across all of the remaining choices they faced. Again, the reference model is nested within it, and a simple F-test reveals the utility of the FC predictor. We construct similar models for M1C, AFM, and AC-AFM (using the appropriate FB measures).
Model 4: Since DTI and the past behavior measures likely account for different variance in the models, directly comparing Eqs. (2) and (3) via \(R^2\) is not entirely informative. Thus, we introduce a third category of test models in which both DTI and a past behavior measure are included. In the case of FC, the model is:
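The display equation is missing here; combining the predictors of the two preceding test models, it would take the form

```latex
\textrm{FB} = \beta_0 + \beta_\textrm{Treat}\,\textrm{Treat} + \beta_\textrm{DTI}\,\textrm{DTI} + \beta_\textrm{FC}\,\textrm{FC} + \varepsilon \qquad (4)
```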
These models facilitate assessing whether the added complexity of including both DTI and the past behavior measure is warranted, again using F-tests. Finally, for readability, we refer to relevant statistics within the text of the manuscript but place regression and other tables for each set of models and their associated tests in the appendix. These tables include, in the same order as above, models that facilitate comparing DTI and the behavioral measures. Each regression table is followed by a table presenting the results of the relevant F-tests.
1.2 B.2 Data
We used data from experiments conducted as part of a long-term research project on explainability and AI. Participants in these experiments team with a simulated robot during reconnaissance missions. The missions involve entering buildings to determine whether threats are present. The robot goes first and is equipped with a camera, a microphone, and sensors for nuclear, biological, and chemical threats; these sensors are not perfectly reliable. Based on the data collected by its sensors, the robot makes a recommendation to the participant about putting on protective gear. The participant then chooses whether to wear the gear, i.e., whether or not to comply with the robot’s recommendation. When participants wear the gear, it always neutralizes any threat. If they do not wear it and encounter a threat, they die in the virtual world but, in reality, incur a prohibitive time penalty. Finally, participants incur a slight time delay (much smaller than the penalty for death in the virtual world) when equipping the gear.
In all three studies, the robot based its recommendations on the noisy sensor readings, used as input to a policy computed through either Partially Observable Markov Decision Processes (POMDPs) [19] or model-free reinforcement learning (RL) [20, 40] with a reward signal based on the time costs and deaths incurred. The robot performed significantly better than chance across the studies, meaning that compliance was highly correlated with making the normative choice, i.e., wearing the protective equipment at the right time.
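To make the recommendation pipeline concrete, the following sketch shows how a noisy binary threat sensor could drive a wear/skip recommendation via a Bayesian belief update. This is our own simplification for illustration, not the authors’ actual POMDP or RL policy; the function names, error rates, and threshold are assumptions:

```python
def posterior_threat(prior, reading, hit_rate=0.8, false_alarm=0.1):
    """P(threat | sensor reading) for a binary sensor with known error rates."""
    p_read_threat = hit_rate if reading else 1 - hit_rate
    p_read_clear = false_alarm if reading else 1 - false_alarm
    evidence = p_read_threat * prior + p_read_clear * (1 - prior)
    return p_read_threat * prior / evidence

def recommend_gear(prior, reading, threshold=0.2):
    """Recommend protective gear when the threat belief exceeds a threshold.

    A low threshold reflects the asymmetric costs in the missions: death (a
    prohibitive time penalty) is far worse than the slight delay of gearing up.
    """
    return posterior_threat(prior, reading) > threshold

# With a 30% prior, a positive reading pushes the belief well above 0.2,
# while a negative reading pulls it below the threshold.
```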
Participants in all three studies completed the 12-item DTI before starting their assigned mission(s). Full experimental details and the results of the treatment conditions are reported in the original publications, so for brevity’s sake we do not replicate those findings here. Note, however, that the n’s we report may differ from the original papers because of incomplete observations (some participants chose not to complete the DTI).
Study 1 participants (\(n=198\), Amazon Mechanical Turk) completed three missions, each with eight buildings [44]. They were randomly paired with one of two POMDP-based robot types: a high-ability robot that was never wrong or a low-ability robot that made mistakes 20% of the time (i.e., was 80% reliable). Both robot types were crossed with four recommendation-explanation conditions: none, confidence level, sensor readings version 1, and sensor readings version 2. The experiment was fully between subjects, meaning that each participant interacted with only one robot type and received only one type of explanation throughout the missions. The coefficient \(\beta _\textrm{Treat}\) in the models for Study 1 captures which information condition participants experienced. First compliance choice (FC) takes 1 if participants heeded the robot’s recommendation for the first building and 0 if not. Mission 1 compliance (M1C), on the other hand, is the fraction of times that a participant complied with the robot’s recommendations during the first mission. The compliance future behavior (FB) measure associated with FC for Study 1 is thus the fraction of times a participant complied for the remaining 23 buildings; for M1C, it is the fraction of times that a participant complied with the robot’s advice during missions two and three. Note that participants were not told whether they were interacting with the same robot across missions; instead, the robot started each mission as if it had never previously interacted with the participant.
Study 2 participants (\(n=53\), cadets at West Point) completed eight missions, each with a different POMDP-based robot [46]. In each mission, the human-robot team carried out a reconnaissance task covering 15 buildings. The mission order was fixed (i.e., buildings were always searched in the same order, both within and across missions), but the robot order was randomized. The \(2\times 2\times 2\) design crossed robot acknowledgment of mistakes (none/acknowledge), recommendation explanation (none/confidence), and embodiment (robot-like/doglike). Unlike Study 1, participants interacted with a different robot during each mission. Nevertheless, to demonstrate the robustness of the simple behavioral measures, we rely on the same first compliance choice (FC) as Study 1 and a similar mission 1 compliance (M1C). The compliance measures, obviously, cover a longer horizon: 119 and 105 buildings, respectively. The \(\beta _\textrm{Treat}\) of the models for Study 2 captures the robot type of the first mission. It is possible that the ordering of robot advisors mattered; however, the data are insufficient to specify a hierarchical model that would uncover such an effect.
Study 3 participants (\(n=148\), Amazon Mechanical Turk) completed one mission covering 45 buildings with an RL-based robot in a fully between-subjects design [14, 15, 31]. The treatment conditions held the robot’s ability constant but varied how it explained its recommendations: no explanation, explanation of its decision, or explanation of its decision and learning. Again, the first compliance choice (FC) is the same as in the previous two studies, and the FC outcome measure is the compliance fraction for the remaining 44 buildings. Mission 1 compliance (M1C) is not applicable given that the entire experiment consisted of a single mission. Because building order and robot performance were fixed across treatment conditions, however, the two additional compliance measures, choice after the first mistake (AFM) and average compliance through the first mistake (AC-AFM), become meaningful. The first mistake occurred in building six; thus, participants’ decision for building seven is the AFM measure, and the fraction of times they complied during the first seven buildings is AC-AFM. For both, the dependent variable is the fraction of times that a given participant complied during the remaining 38 buildings.
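For concreteness, the behavioral measures used across the three studies can be derived from a participant’s per-building compliance record in a few lines. This is a sketch under our own naming assumptions (the function and key names are illustrative, not from the original analysis code):

```python
def behavioral_measures(compliance, mission1_len=None, first_mistake_idx=None):
    """Derive behavioral predictors from a 0/1 per-building compliance record.

    compliance        -- 1 if the participant heeded the robot for that building
    mission1_len      -- buildings in mission 1 (8 in Study 1, 15 in Study 2)
    first_mistake_idx -- 0-based index of the robot's first mistake (Study 3)
    """
    measures = {"FC": compliance[0]}  # first compliance choice
    if mission1_len is not None:      # M1C: mission 1 compliance fraction
        measures["M1C"] = sum(compliance[:mission1_len]) / mission1_len
    if first_mistake_idx is not None:
        # AFM: the choice in the building right after the first mistake
        measures["AFM"] = compliance[first_mistake_idx + 1]
        # AC-AFM: average compliance up to and including that choice
        window = compliance[:first_mistake_idx + 2]
        measures["AC-AFM"] = sum(window) / len(window)
    return measures

# Study 3 style record: the robot's first mistake is in building six (index 5)
m = behavioral_measures([1, 1, 0, 1, 1, 1, 0, 1, 1], first_mistake_idx=5)
# m["FC"] == 1, m["AFM"] == 0, m["AC-AFM"] == 5/7
```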
Appendix C Tables
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Gurney, N., Pynadath, D.V., Wang, N. (2023). Comparing Psychometric and Behavioral Predictors of Compliance During Human-AI Interactions. In: Meschtscherjakov, A., Midden, C., Ham, J. (eds) Persuasive Technology. PERSUASIVE 2023. Lecture Notes in Computer Science, vol 13832. Springer, Cham. https://doi.org/10.1007/978-3-031-30933-5_12
Print ISBN: 978-3-031-30932-8
Online ISBN: 978-3-031-30933-5