1 Introduction

The development of autonomous systems has been ongoing for some time in the aviation, rail, medical and automotive industries, among others. It is unlikely that these systems will completely replace the human element in the near term; therefore, designers are focused on automation systems that serve as team members or partners with human operators, a concept known as “Human Autonomy Teaming (HAT).” HAT represents a significant shift away from the view of automation as a simple replacement for human functions. Nowhere is this change more apparent than in the aviation industry. In the 20th century, automation served to replace skilled crew members. New jet engines eliminated the need for the flight engineer, advanced navigation aids (e.g., VOR, INS) eliminated the need for a navigation officer, and improvements in radio communication eliminated the need for a communications officer. As a result, the standard five-person crew (two pilots, a navigator, a radio operator and a flight engineer) was reduced to just two pilots [1]. It is not surprising, then, that research and development since the end of the 20th century has considered further reductions in crew size.

1.1 Human Autonomy Teaming (HAT)

Reduced Crew Operations (RCO) or Single Pilot Operations, however, will not be feasible without support from advanced autonomous systems in combination with ground support [2, 3]. The Human Automation Teaming Laboratory at NASA Ames Research Center has been evaluating HAT autonomous systems for assisting pilots and dispatchers in these environments. In a program of automation support development and simulation evaluations, the HAT Laboratory has evaluated tools for a ground dispatcher assisting pilots dealing with off-nominal events and cockpit tools to assist reduced crews [4, 5]. At the same time, a conceptual model for HAT is being developed that consists of three tenets (for details see [3, 4, 6]).

Transparency.

The human operator must understand the intent and reasoning of the autonomous agent, and determine the factors used by the agent in arriving at a solution or recommendation. The operator must have knowledge of the general logic being used by the agent, and have an accurate mental model of its functioning. At the same time, the autonomous agent must understand the preferences, attitudes and states of the human team members. This latter specification may be one of the more challenging aspects facing designers of autonomous-agent crew members.

Bi-directional Communications.

Fast and effective communication between humans and autonomous agents is essential for effective HAT. Communication establishes shared knowledge of team goals, current status and errors, whether made by the human or the autonomy. Effective communication also requires that the human crew be able to direct the autonomous agent effectively and accurately, and override its decisions if necessary.

Operator-Directed Interface.

The risk of automation failure can override the benefits of an automation agent, so it is important the operator be able to allocate tasks depending on the current situation. This also serves to keep the operator in the loop.

1.2 Assessing the Effectiveness of HAT

The effectiveness of HAT will be determined by additional factors, based on previous research in the area of human-automation interaction, and human-human team performance. Therefore, evaluations of the HAT designs must include assessments at several system levels, and assess system and operator performance. Operator factors that will determine HAT effectiveness include the following:

Trust.

Effective HAT will depend on the degree of human trust in the autonomous agent. Trust is a complex state that is similar to, but not exactly the same as, human-human trust. Human crew members must place themselves in a position of uncertainty and vulnerability with respect to the autonomous agent, in the expectation that the agent is doing what it is supposed to do, or will communicate why it is unable to do so. Trust must be appropriately calibrated in order to avoid the negative consequences of overtrust (i.e., complacency) and undertrust (i.e., workload; [7]).

Workload.

When automation agents become an integral team member, it is critical that they not increase the workload of the human team members (e.g., [8]). Excessive workload may be produced by difficulty understanding the agent’s current reasoning, awkward communication procedures, and lack of trust in the agent.

Situation Awareness.

In addition to awareness of the goals, tasks, systems and environments, human crew members must be aware of the current status of the automated agent, and the automated agent must be aware of the current state of human crew members [9, 10].

Individual Differences.

Human team members can have varied skills that affect how they collaborate and accomplish mission goals. Human operators may also have different attitudes toward an automated agent as a crew member. It is therefore important that the agent know about these differences and take them into account when interacting with human team members [9].

1.3 NASA’s HAT Demonstrations

NASA’s HAT Laboratory has an ongoing program of design and evaluation of HAT tools for aerospace applications [4]. In 2016, collaboration tools for ground operators supporting RCO were developed based on the HAT tenets, and these were evaluated in a simulation demonstration [11, 12]. The tools were as follows.

Plays.

Using plays ensured that the HAT agent was operator directed. The ground operator initiated automation procedures from a set of plays in order to establish system goals and clarify roles and responsibilities between the automation agent and human operator. These plays were called in response to off-nominal events, and served to activate the Autonomous Constrained Flight Planner System (ACFP) and an electronic checklist that showed tasks allocated to either human or autonomous agent.

ACFP.

The ACFP, the major tool developed on HAT principles, determines alternative flight plans based on a number of factors such as weather, services, and fuel. Transparency was promoted by displaying the values of each factor used to determine the recommendations, and their relative weights in the decision. Bi-directional communication was enabled by displaying the weight values as sliders that could be adjusted. The operator could then request new recommendations.

Traffic Situation Display.

The Traffic Situation Display (TSD) provided additional transparency. The TSD is a 3-dimensional display of traffic in the area surrounding the currently serviced aircraft. Weather, turbulence, ATIS at the destination and other potential divert options could be provided at the operator’s request.

Voice I/O.

Voice commands for selecting plays or requesting information enhanced the principle of bi-directional communication. Voice also announced to the operator the current activity of the automation agent.

The simulation evaluation was based on a small sample. Nevertheless, preliminary results indicated that workload was lower with HAT tools compared to No-HAT tools. Operators took more time to uplink revised flight plans in the HAT condition, even when no adjustments were made to the recommendations of the autonomous recommender system [12]. Participants rated the HAT condition more favorably than the No-HAT condition: diversion recommendations were rated more acceptable, and confidence in the recommendations was higher. Moreover, HAT displays were preferred for keeping up with operationally important issues, ensuring situation awareness, integrating information, and reducing workload [11].

1.4 Present Investigation

The tools in Brandt et al. [11] were modified and installed on a tablet workstation, in order to provide them to line pilots. A simulation test was conducted with line pilots in a distributed simulation network that compared pilot performance, behavior and subjective responses to autonomous agents based on HAT vs. No-HAT principles. The present paper reports how pilots used these automated tools to deal with off-nominal events.

2 Method

2.1 Participants

Twelve pilots with Airline Transport Pilot (ATP) ratings participated in this simulation. All were line pilots (2 Captains and 10 First Officers). Eleven participants had over 5000 h of line experience (one had 3000–5000 h) and 10 had over 3000 h of glass cockpit experience. For additional details on the sample, see [13].

2.2 Apparatus

A distributed simulation network was established between University of Iowa and California State University Long Beach, with support from NASA Ames Research Center and Rockwell-Collins, Inc. Pilots flew a Boeing 737 motion-base simulator located in the Operations Performance Laboratory (OPL) at University of Iowa. Confederate dispatchers and air traffic controllers were located in the Center for Human Factors in Advanced Aeronautics Technologies (CHAAT) at California State University, Long Beach (CSULB), which also housed servers for the HAT tools. The simulation network was made possible by NASA’s Multi Aircraft Control System (MACS) and ADRS, along with additional tools for generating flight diversion recommendations, displaying automated checklists, and providing a Cockpit Display of Traffic Information (CDTI). Voice communications between the pilot, ground support and simulation personnel were accomplished via TeamSpeak software. For additional details of the distributed simulation configuration, see [5].

HAT tools were installed on a Microsoft Surface Prime tablet that was mounted on the left wall of the cockpit. The tablet contained separate pages that provided functions or information based on the flight phase: Enroute, Approach, Runway, Play and Alerts. These could be selected by touch or voice commands. The Approach and Runway pages provided charts and information regarding airports, runways, etc. Alerts served to initiate most off-nominal events found in the simulation. When an alert occurred, the Alerts button would turn orange and the specific alert would be listed on the page in red or orange, depending on severity.

The pilot acknowledged the alert, and utilized the Plays and Enroute pages to resolve the event. The information displayed depended on the automation condition (HAT or No-HAT). The Plays page contained a set of plays, each of which corresponded to one of the off-nominal events. The pilot would call the play, either via touch input (see Fig. 1) or voice commands. When a play was activated, the information on the page changed depending on the automation condition.

Fig. 1. Panel of plays available to the pilot, which could be activated by either touch or voice.

The Autonomous Constrained Flight Planner (ACFP) is a flight-planning recommender system for assisting pilots in generating and evaluating routes. The ACFP shows a table of airports based on four diversion recommendations. In the No-HAT condition, these four alternative routes were displayed, but no information as to the rationale used to arrive at the recommendations was shown. In effect, if the pilot did not like any of the recommendations, he or she would have to generate a diversion flight plan without assistance from the ACFP. In the HAT condition, the ACFP provided the recommended airports and the basis of its reasoning, as shown in Fig. 2. In addition to a rating of risk, the factors used in the decision-making process and their values were displayed in tabular form. The factor weight values were shown above the table as sliders. The pilot could generate a new set of recommendations by adjusting the relative weights (moving the sliders) and requesting a new set of options.

Fig. 2. ACFP display with factor weights in the HAT condition. In the No-HAT condition, only the top “Option” row is displayed.
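The slider-based re-ranking described for the ACFP can be pictured as a weighted sum over factor scores. The following is only a minimal sketch under stated assumptions: the scoring algorithm actually used by the ACFP is not described here, and the function name `rank_options`, the airport labels, the factor scores, and the weighted-sum rule itself are all hypothetical.

```python
from typing import Dict, List, Tuple


def rank_options(
    options: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
) -> List[Tuple[str, float]]:
    """Rank diversion options by a normalized weighted sum of factor scores.

    Assumption: every factor score lies in [0, 1] and higher is better.
    This is an illustrative rule, not the ACFP's documented method.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    scored = []
    for name, factors in options.items():
        score = sum(weights[f] * factors[f] for f in weights) / total
        scored.append((name, score))
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical airports and factor scores (not taken from the study).
options = {
    "AAA": {"weather": 0.9, "services": 0.4, "fuel": 0.8},
    "BBB": {"weather": 0.5, "services": 0.9, "fuel": 0.6},
}

# Equal slider positions.
equal = rank_options(options, {"weather": 1.0, "services": 1.0, "fuel": 1.0})
# Raising the "services" slider and requesting a new set of options.
adjusted = rank_options(options, {"weather": 1.0, "services": 4.0, "fuel": 1.0})
```

In this toy example the top-ranked option changes once the "services" weight is raised, which is the kind of feedback the sliders were meant to give the pilot.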

In the HAT Condition, an electronic checklist was also displayed based on the play that had been called. The checklists were based on the QRH manual, but included steps that would be performed by the automated agent. In the No-HAT Condition, traditional paper-based checklists were provided in the QRH manual located in the center console of the cockpit. For both conditions, additional paper documentation was provided on voice commands, ATIS and, in the Medical Emergency event, MedLink.

2.3 Experimental Design and Procedure

Each pilot flew six 12–15-min scenarios, three in the HAT Condition and three in the No-HAT Condition, with the order counterbalanced. Within each condition, the scenarios varied in the severity of the off-nominal event, as shown in Table 1. The off-nominal event began roughly 4 min into the scenario. The event was signaled by an alert on the tablet. Once the pilot acknowledged the alert, the plays were displayed and the pilot could select the play corresponding to the event. In the HAT Condition, this brought up the ACFP along with the automated checklist and weighting factors. In the No-HAT Condition, the pilot would find the appropriate checklist in the QRH manual and begin working through it. In the Medical Emergency event, the alert was delivered by a confederate experimenter serving as flight attendant and MedLink. Pilots also communicated via voice with air traffic control and ground dispatch for additional information regarding weather and alternative airports. All flight plan changes required clearance from air traffic control.

Table 1. Specific off-nominal events for each event severity level
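One way to picture the counterbalanced, blocked design is sketched below. The exact counterbalancing scheme is not reported, so the alternating assignment, the function name `assign_condition_order`, and the pilot numbering are assumptions made only for illustration.

```python
from typing import Dict, List

CONDITIONS = ("HAT", "No-HAT")


def assign_condition_order(n_pilots: int) -> Dict[int, List[str]]:
    """Alternate which condition block comes first across pilots.

    Assumption: three scenarios per condition are flown as one block,
    and odd-numbered pilots start with HAT. The study's actual scheme
    is not described in the text.
    """
    orders = {}
    for pilot in range(1, n_pilots + 1):
        first = CONDITIONS[(pilot - 1) % 2]
        second = CONDITIONS[pilot % 2]
        orders[pilot] = [first] * 3 + [second] * 3
    return orders


orders = assign_condition_order(12)
# Count how many pilots start with the HAT block.
hat_first = sum(1 for seq in orders.values() if seq[0] == "HAT")
```

With twelve pilots, this simple alternation gives an equal number of pilots starting in each condition.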

At the end of each scenario, pilots completed the NASA TLX workload questionnaire [14], Situation Awareness Rating Technique [15] and a questionnaire asking about the usefulness of the tools. After all scenarios were completed for one condition, the pilots completed a questionnaire that asked about the tools just used, and a Trust in Automation scale [16] (for results of the questionnaires, see [13]). After both conditions were completed, pilots were debriefed as to their preference for HAT agents and other concerns regarding the tools and simulations.

The design was repeated measures with the factors automation condition (HAT vs. No-HAT) and event severity (Low, Moderate and High). Here we report on the extent to which pilots utilized the information on the tablet relative to information on the instrument panel and paper documents, for HAT vs. No-HAT conditions. We recorded the eye gaze of the pilots in each scenario. Cameras were located on the tablet, on the instrument panel and on the right panel next to the first officer seat. The amount of time gazing at each major source of information (tablet, instrument panel and documents) was determined by examining the position of the participant’s sclera relative to the information source. Tablet gaze time was measured as time looking at the left-side panel of the cockpit. Instruments gaze time was the time spent looking forward at the cockpit instruments, including time for adjusting flight parameters. Documents gaze time was the time spent looking for and reading documents that were originally located in the center console, but often ended up on the pilot’s lap. It also included time to query the confederate flight attendant in some scenarios. Finally, other interactions with the tablet were recorded, such as the frequency of weight adjustments in the HAT Condition, and which, if any, of the alternative recommendations were selected. We also analyzed NASA TLX workload and SART situation awareness scores for each automation condition and event severity.

3 Results

3.1 Eye Gaze

Repeated measures ANOVAs were performed on the total time to resolve each off-nominal event with the factors Automation Condition (HAT vs. No HAT) and Event Severity (low, moderate, high). All effects were non-significant, indicating that the Automation Condition did not affect the time to resolve the event. As shown in the top row of Table 2, resolution times for HAT and No HAT Conditions were nearly identical. ANOVAs were also run on the time spent gazing at each source of information (tablet, instrument panel and documents). Because individual gaze times are related to total event times, we converted the gaze times to the proportion of time spent on each information source for each event, and ran repeated measures ANOVAs on these proportions as well.
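The conversion from gaze times to proportions can be sketched as follows. This is a minimal illustration assuming that the denominator is the total event-resolution time; the function name `gaze_proportions` and the numbers are hypothetical and are not data from the study.

```python
from typing import Dict


def gaze_proportions(
    gaze_seconds: Dict[str, float], event_seconds: float
) -> Dict[str, float]:
    """Convert per-source gaze times to proportions of event-resolution time."""
    if event_seconds <= 0:
        raise ValueError("event_seconds must be positive")
    return {source: t / event_seconds for source, t in gaze_seconds.items()}


# Hypothetical record for a single event, in seconds (not study data).
example = gaze_proportions(
    {"tablet": 120.0, "instruments": 90.0, "documents": 30.0},
    event_seconds=300.0,
)
```

Under this assumption the proportions need not sum to 1, because some of the event time may be spent looking at none of the three coded sources.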

Table 2. Mean gaze times and proportion of gaze times: HAT vs. No HAT

As shown in Table 2, in the HAT condition, participants spent significantly more time fixated on the tablet, F(1, 10) = 169.04; p < .003, and significantly less time looking at written documentation, F(1, 10) = 9.14; p < .013. The difference in time spent on the instrument panel was non-significant, however. More time on documents was required in the No-HAT condition because checklists were paper-based, and pilots had to find the QRH, locate the correct checklist, and follow it. In the HAT condition, electronic checklists were displayed on the tablet. When gaze times were converted to proportions of time, similar results were obtained. Pilots in the HAT condition spent roughly 47% of the event-resolution time focused on the tablet, and only 10% of the time on documents, compared with the No-HAT condition, in which 34% of the event-resolution time was spent looking at the tablet, and 18% on documents.

The proportion of time spent on each display also depended on Event Severity. Significant main effects of Event Severity were obtained for proportion of time on tablet, F(2, 20) = 4.24; p < .029, and proportion of time on instruments, F(2, 20) = 4.24; p < .029, but not on documentation. As shown in Fig. 3, relatively more time was spent on the tablet and less time on instruments for moderately severe events. In fact, post hoc analysis determined that, for each measure, the difference between low- and high-severity events was non-significant. For moderately severe events, pilots spent more time looking at the tablet and less time looking at instruments. All interactions between condition and event severity were non-significant.

Fig. 3. Percentage gaze time for each information source as a function of event severity

3.2 Workload and Situation Awareness

Repeated measures ANOVAs were also run on the post-scenario measures of workload (NASA TLX) and situation awareness (SART). For both measures, the effects of Automation Condition were non-significant. For workload only, a significant main effect of Event Severity was obtained, F(2, 20) = 16.14; p < .0001. As shown in Fig. 4, workload scores were highest for the high-severity events, with no difference in workload for low- and moderate-severity events. TLX scores for high-severity events (M = 46.48, SEM = 9.585) were on average roughly 14 points higher than those for moderate- and low-severity events (M = 32.65, SEM = 7.09; M = 32.49, SEM = 7.26 for moderate- and low-severity events, respectively), and this was confirmed with post hoc comparisons. Figure 4 also shows that across all levels of Event Severity, there were no differences in TLX scores between HAT and No-HAT conditions. Event Severity and Automation Condition did not significantly affect SART scores.

Fig. 4. NASA TLX scores for HAT and No-HAT conditions as a function of event severity

3.3 Other Measures of HAT Interactions

There was considerable variability between pilots in the frequency of interactions with the ACFP. For example, five of the twelve pilots never adjusted the weights in any HAT scenario; two pilots adjusted weights in only one scenario. This means that 7 of 12 pilots had little or no interaction with factor adjustments. Moreover, slider use depended on Event Severity: three pilots adjusted the weights in the low-severity condition, six pilots in the moderate-severity condition and four pilots in the high-severity condition. Note that three pilots adjusted weights at all severity levels.

We also counted the number of ACFP resolutions accepted by pilots based on automation condition and severity level, and the rank of the recommendation. As shown in Table 3, most pilots accepted the top-ranked recommendation for low-severity events, but as severity increased, fewer top-ranked recommendations were accepted. In fact, for the high-severity events, only 4 pilots accepted the highest-ranked solution in the HAT Condition, and 6 in the No-HAT Condition. For the high-severity events, 4 pilots in the HAT Condition, and 3 in the No-HAT Condition, rejected all recommendations.

Table 3. Number of resolutions accepted by rank of risk: automation condition (HAT vs. No HAT) and event severity

The lack of effect of automation condition on workload and situation awareness may have been due to differences between pilots in the amount of HAT-tool interaction, because pilots received minimal training on these tools. To investigate this possibility, we correlated the relative proportion of time spent on each display with workload and situation awareness measures separately for HAT and No-HAT Conditions. These correlations are shown along with significance values in Table 4. Significant correlations are shown in bold.

Table 4. Correlation between percent time on tablet, NASA TLX and SART. Significance level in parentheses.

In the HAT Condition, the proportion of time spent on the tablet was significantly and negatively correlated with TLX scores, meaning that when a greater proportion of time was spent with the HAT tools, subjective workload was lower. Moreover, time spent on the tablet was significantly and positively correlated with SART scores, indicating that more time on the tablet was associated with higher levels of subjective situation awareness. In the No-HAT Condition, time on tablet was unrelated to workload and situation awareness. However, TLX and SART scores were highly correlated with one another; low workload was associated with high SART situation awareness. Consequently, we computed semi-partial correlations between proportion of time on the tablet and each of TLX and SART. The semi-partial correlation between TLX and tablet time was reduced to −.28, and was marginally significant. The semi-partial correlation between SART and tablet time was reduced to .21, which was non-significant. In sum, when pilots spent more time interacting with the HAT tools, they reported lower subjective workload.
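For readers unfamiliar with semi-partial correlations, the standard textbook formula is sketched below. It is an illustration only: which variable was residualized in the present analysis is not stated, so the choice here (removing z from x but not from y) is an assumption, and the function names and sample values are hypothetical rather than study data.

```python
import math
from typing import Sequence


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("sequences must have equal length >= 2")
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


def semipartial(
    y: Sequence[float], x: Sequence[float], z: Sequence[float]
) -> float:
    """Correlation of y with the part of x that is linearly independent of z.

    Uses sr = (r_yx - r_yz * r_xz) / sqrt(1 - r_xz ** 2).
    """
    r_yx, r_yz, r_xz = pearson(y, x), pearson(y, z), pearson(x, z)
    return (r_yx - r_yz * r_xz) / math.sqrt(1.0 - r_xz ** 2)


# Hypothetical values only: y = tablet-time proportion, x = TLX, z = SART.
tablet = [0.50, 0.42, 0.38, 0.55, 0.30]
tlx = [30.0, 41.0, 45.0, 28.0, 52.0]
sart = [24.0, 20.0, 21.0, 26.0, 17.0]
sr_tlx = semipartial(tablet, tlx, sart)
```

Because the shared variance between x and z is removed before correlating with y, the semi-partial value is typically smaller in magnitude than the zero-order correlation when x and z are strongly related, which matches the pattern of reduced correlations reported above.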

4 Discussion

These preliminary results from an investigation of automation tools developed based on HAT tenets can be summarized as follows. First, HAT-designed tools did not affect the time to resolve off-nominal events, despite the fact that pilots spent more time on the tablet in the HAT condition, and less time on paper documents. Moreover, pilot workload and situation awareness were, on average, unchanged by HAT vs. No-HAT tools. On the one hand, this suggests that HAT did not increase workload, which is a good thing. On the other hand, HAT did not improve pilot situation awareness, which is surprising given the additional information provided in the HAT Automation Condition. One possible reason for this lack of difference is the limited sensitivity of the instruments themselves: changes in workload and situation awareness may be too subtle to be detected by subjective instruments.

Another possible explanation may lie in the variability between pilots in their use of the HAT-designed automation. For example, providing pilots with opportunities for modifying the weights used to arrive at flight diversions was intended to promote the HAT tenet of bi-directional communication. However, five pilots never used the sliders, and two adjusted the weights in only one scenario. This could mean that sliders are ineffective for bi-directional communication. Moreover, when the severity of the event was high, pilots were less likely to accept the recommendations of the ACFP. Perhaps the improved transparency of this tool becomes less important when rapid decisions are required, as in the case of medical emergencies or wheel well fires. In other words, pilots varied in their use of the tools. This is shown most clearly in the correlational analysis of time on tablet, workload and situation awareness. Greater time spent on the tablet was related to lower subjective workload. At this point, we are unable to determine why some pilots made more use of the tablet in HAT conditions than others, but clearly this factor must be considered when designing HAT agents. As pointed out by Chen et al. [9], individual differences in attentional control, spatial ability and gaming experience affect how operators interact with autonomous robots. It is possible that these differences played a role in the use of our HAT agent.