
1 Introduction

Remaining focused on presentations is becoming increasingly difficult in today's world, given the countless distractions offered by smartphones and other personal electronic devices, in addition to various external and ambient distractions. This places a considerable challenge on presenters to prepare their presentations carefully for their audience.

In a study, Ward et al. (2017) report that the mere presence of a phone, even in silent mode, makes it harder to stay focused and reduces cognitive performance compared to having no phone at all. But even without smartphones, the available attention span is limited. Reports on typical attention span durations vary widely. McKeachie and Svinicki (2013) mention a 10-min attention span, which they base on a study about note taking during lectures by Hartley and Davies (1978). Bunce et al. (2010) reported that students self-reported an attention decline as early as 4–5 min into a presentation, after the initial settling-in period, and that they alternated between attention and non-attention states in increasingly brief cycles over the course of a lecture. More generally, Wilson and Korn (2007) criticize the lack of consideration of individual differences in many attention span studies. A widely reported study conducted by Microsoft in 2013, which described an attention span of just 8 s, is disregarded here, as it essentially referred to the time people typically spend on a web page before moving on, not to actual attention span durations (Bradbury 2016).

Independent of the actual length of the attention span, the question is not only one of attention itself, but also of where members of the audience divert their attention to, or put differently, what might distract them. Pratto and Oliver (1991) describe negative social information as a powerful source for grabbing attention. Smith et al. (2003) reported a negativity bias in attention allocation, meaning that negative stimuli can divert attention away from the current focus more easily than positive ones.

As soon as some individuals in an audience get distracted, this state might be passed on to others subconsciously through mood transfer, which could possibly lead to an agitated audience (Tudor et al. 2013). This process is driven by the same mechanisms that underlie, for example, contagious yawning, which is described as nonverbal communication that synchronizes behavior within a group (Guggisberg 2011). The mechanism of mood transfer was originally discussed in research on birds, where one bird fleeing causes all others in the group to flee at virtually the same time (Lorenz 1935).

There are, of course, many ways presenters can prepare for these challenges in the first place. Selecting and preparing the information according to the interests of the expected audience is an important first step. Following best practices for preparing presentations is another important step towards capturing the attention of an audience.

However, an interesting question arises: could a presentation system support the presenter in a subliminal way to regain the attention of an audience? What would be needed to divert the audience's attention away from the countless distractions and back to the presentation? In this paper, we present a concept for such a system and describe a brief pilot study to test the feasibility of this approach.

Such a system should be able to judge the overall level of attention of an audience and to present stimuli in a way that recaptures the interest of distracted people without distracting those already focused on the presentation. In the next section, we review related work before presenting our approach.

2 Related Work

Measuring the level of attention in classroom settings has been of interest to researchers for a long time. One approach is the Münchner Aufmerksamkeitsinventar (MAI) by Helmke and Renkl. The basic idea is to observe students in a classroom during a lecture and sequentially code the attention state of individual students in fixed time intervals of 5 s. For this, attention states are classified in two layers. The first discriminates between no-task, on-task, and off-task states. In the second layer, the on-task state is further divided into compliance with task requirements, self-initiated activities, or externally initiated activities. The off-task state is further divided into an active (e.g. distracting others) or passive (e.g. daydreaming) state (Helmke and Renkl 1992). Hommel presented an adjusted variant, modified to better fit higher education, which uses videography to analyze all students in 30 s intervals over the whole lecture (Hommel 2012). Both approaches also consider the lecture context in addition to the attention state.

More advanced approaches make use of emotion recognition systems that analyze facial expressions (Magdin et al. 2016; Magdin and Prikler 2018; Heyjin 2017; Ayvaz and Gürüler 2017; Daouas and Lejmi 2018), eye movements (Krithika and Lakshmi 2016), voice characteristics, written text, or a combination of different modalities (Nedji et al. 2008; Bahreini et al. 2016) to deduce the emotional state of a person. These systems are mostly used and tested in e-learning contexts, where knowledge of the emotional state should allow for better feedback by the remote lecturer. However, most of these systems are optimized for observing individuals and are not suited as an overall measure for a whole audience. Furthermore, members of an audience (e.g. students in a lecture or listeners to presentations in general) would not feel comfortable if they were tracked individually by cameras.

The human eye has different capabilities in the foveal and the peripheral field of view. In its periphery, the eye has a higher temporal resolution and a higher sensitivity to light changes than in the foveal area. This higher sensitivity to light changes is caused by a higher rod density in the periphery (Curcio et al. 1990; Jonas et al. 1992; Wells-Grey et al. 2016). Rod-mediated vision also seems to cause more activation in the brain area associated with motion than cone-mediated vision (Hadjikhani and Tootell 2000). The difference between cone and rod density may have an evolutionary origin: humans with this feature were able to detect dangerous situations outside their current field of view faster, could react in time, and thus survived.

These characteristics might also be exploitable by a system that targets people not directly looking at it.

Adaptive systems that react to some form of user input in an indirect, non-obtrusive way to better support users have been used in various fields. Ward et al. (2000) demonstrated a novel text input interface, which used previous input and learned language statistics to make it easier to select the more probable succeeding characters. This mechanism enables efficient text entry even with an input modality that offers only one- or two-dimensional control. The same principle is used in modern touch keyboards that automatically enlarge the sensitive area around keys that are more likely to be typed next, making them easier to hit. Kempter et al. (2003) discussed the benefits of analogue communication in the human–computer interaction context, where a system's behavior can be influenced by analogue channels with no explicitly predefined meaning. Ritter (2011) described different approaches in which subliminal feedback loops in human–computer interaction, based on these analogue communication patterns, could be used to support users in a transparent and unobtrusive way. In a study widely criticized for its ethical aspects, Kramer et al. (2014) showed that emotional states can be transferred over social networks: users tend to take on the emotions conveyed in their news stream, without being aware of this process. These studies demonstrate that systems can influence people in subliminal ways.

While people might feel uncomfortable when they hear that they can be influenced in a subliminal way, influencing people is one of the cornerstones of advertising and is already happening around the clock.

3 Concept and Implementation

Based on the insights of the work described in the previous section, we aimed to create a presentation system that observes the audience, deduces an estimate for overall attention and applies evolving visual stimuli to recapture the attention of people looking away from the presentation.

Figure 1 depicts the basic stages of the system. The input processing layer consists of an image analysis module that takes input from a webcam and calculates a motion complexity index. This index is then forwarded to the adaptation loop stage. The input processing layer also contains an environment module that monitors air conditions inside the room, such as CO2 level, temperature, and humidity. These measures are currently logged but not yet used as input for the adaptation loop. The adaptation loop module implements a simple genetic algorithm that uses the motion complexity input as fitness values for its candidate adaptations. Once fitness values are available for each candidate in the population, a new generation of adaptation candidates is generated using a roulette wheel algorithm for parent selection. Each candidate is then applied in a situation where motion complexity exceeds the threshold and is again assigned a corresponding fitness value based on the measured motion complexity. In our current system, candidate solutions are encoded as a 10-bit string, as shown in Table 1.

Fig. 1.
figure 1

System overview of an adaptive presentation system, consisting of an input processing stage, an adaptation loop module, a presentation mapping stage, and the output layer.

Table 1. Example genetic encoding of visual overlay parameters, each consisting of 2 bits
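To make the adaptation loop more concrete, the following Python sketch illustrates roulette wheel parent selection over 10-bit candidate strings, with fitness derived from the motion complexity measured after a candidate has been applied. The fitness mapping, crossover operator, and mutation rate are illustrative assumptions, not the exact parameters of our implementation.

```python
import random

BITS = 10  # candidate length, matching the 10-bit encoding in Table 1

def roulette_select(population, fitness):
    """Pick one parent with probability proportional to its fitness."""
    total = sum(fitness)
    pick = random.uniform(0, total)
    running = 0.0
    for candidate, f in zip(population, fitness):
        running += f
        if running >= pick:
            return candidate
    return population[-1]

def crossover(parent_a, parent_b):
    """Single-point crossover (the concrete operator is an assumption)."""
    point = random.randint(1, BITS - 1)
    return parent_a[:point] + parent_b[point:]

def mutate(candidate, rate=0.05):
    """Flip each bit with a small probability (the rate is an assumption)."""
    return [bit ^ 1 if random.random() < rate else bit for bit in candidate]

def next_generation(population, motion_complexity):
    """Build a new generation once every candidate has been applied and its
    resulting motion complexity has been measured."""
    # Lower residual motion complexity -> higher fitness (illustrative mapping).
    fitness = [1.0 / (1.0 + mc) for mc in motion_complexity]
    return [mutate(crossover(roulette_select(population, fitness),
                             roulette_select(population, fitness)))
            for _ in range(len(population))]
```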

New evaluation cycles are triggered every 60 s during situations with excessive motion. The following 30 s are used as the timeframe for measuring the audience's reaction to a stimulus (see Fig. 2).
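A minimal sketch of this schedule is shown below; the callback names (apply_stimulus, measure_motion_complexity, motion_is_high) and the one-second sampling interval are hypothetical placeholders rather than part of the actual implementation.

```python
import time

CYCLE_INTERVAL = 60   # seconds between evaluation cycles
MEASURE_WINDOW = 30   # seconds used to measure the audience reaction

def run_evaluation_cycle(apply_stimulus, measure_motion_complexity, motion_is_high):
    """Apply the next candidate stimulus when motion is too high and record
    motion complexity for the following 30 s as its fitness input."""
    if not motion_is_high():
        return None
    apply_stimulus()
    samples = []
    start = time.time()
    while time.time() - start < MEASURE_WINDOW:
        samples.append(measure_motion_complexity())
        time.sleep(1.0)  # sample once per second (an assumption)
    return sum(samples) / len(samples)
```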

Fig. 2.
figure 2

Scheduled stimulus application and evaluation timeframes

The presentation mapper takes a candidate bit string as input and maps it to specific adaptation animations that are then overlaid on the presentation. The main idea is that these adaptations are very low-level and fast, effectively producing a lighting change, and are therefore more noticeable in the peripheral field of view while hardly being visible to people looking directly at the presentation.
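To illustrate the mapping step, the sketch below splits a 10-bit candidate into five 2-bit fields and translates them into overlay parameters. The parameter names and value tables shown here are hypothetical placeholders; the actual encoding is the one listed in Table 1.

```python
def decode_candidate(bits):
    """Split a 10-bit candidate into five 2-bit fields and map each field to
    an overlay parameter. Names and values are illustrative assumptions."""
    assert len(bits) == 10
    fields = [bits[i] * 2 + bits[i + 1] for i in range(0, 10, 2)]  # values 0..3
    return {
        "color":    ["red", "green", "blue", "yellow"][fields[0]],
        "opacity":  [0.05, 0.10, 0.15, 0.20][fields[1]],
        "duration": [0.5, 1.0, 1.5, 2.0][fields[2]],  # seconds
        "fade":     ["none", "in", "out", "in-out"][fields[3]],
        "region":   ["full", "top", "bottom", "edges"][fields[4]],
    }

# Example: decode_candidate([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
# -> {'color': 'blue', 'opacity': 0.2, 'duration': 1.0, 'fade': 'none', 'region': 'bottom'}
```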

Motion complexity (MC) inside the audience area is derived via an image processing algorithm that takes a video stream from a webcam as input and applies grayscale conversion, background subtraction, thresholding, and morphological operators to the individual images (see Fig. 3) using OpenCV. This results in a black-and-white image with connected, mutually independent motion areas.

Fig. 3.
figure 3

Image processing steps for retrieving motion complexity

From this, two different indices are calculated (see Eqs. 1 and 2).

$$ \text{MC}_{\text{rel}} = \frac{\text{image area with changes}}{\text{total image area}} $$
(1)
$$ \text{MC}_{\text{cnt}} = \text{count of independent movement areas} $$
(2)

MCrel should give an overview of the overall action taking place in the camera’s field of view and can also be used to detect lighting changes inside the room (e.g. a percentage higher than 50%). MCcnt reflects the number of independent disturbance areas and helps to counter the effect that moving objects closer to the camera have a bigger influence on MCrel.
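The processing chain and the two indices can be sketched with OpenCV roughly as follows. The choice of the MOG2 background subtractor, the kernel size, and the threshold value are assumptions for illustration; the text above only specifies the general steps.

```python
import cv2
import numpy as np

backsub = cv2.createBackgroundSubtractorMOG2()  # background subtraction (illustrative choice)
kernel = np.ones((5, 5), np.uint8)              # morphology kernel size is an assumption

def motion_complexity(frame):
    """Return (MC_rel, MC_cnt) for one video frame, following Eqs. 1 and 2."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)               # grayscale conversion
    mask = backsub.apply(gray)                                   # background subtraction
    _, mask = cv2.threshold(mask, 127, 255, cv2.THRESH_BINARY)   # thresholding
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)        # remove small noise
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)       # merge nearby regions

    mc_rel = cv2.countNonZero(mask) / float(mask.size)           # Eq. 1
    num_labels, _ = cv2.connectedComponents(mask)
    mc_cnt = num_labels - 1                                      # Eq. 2 (ignore background label)
    return mc_rel, mc_cnt

# Usage with a webcam stream:
# cap = cv2.VideoCapture(0)
# ok, frame = cap.read()
# if ok:
#     print(motion_complexity(frame))
```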

We set up a small pilot study to explore the usefulness of the two different motion complexity indices for estimating an audience’s attention level, and to test the feasibility of this approach in a real-life setting.

4 Pilot Study

In a small pilot study, we used a simplified version of the system described in the previous section, mainly to investigate the plausibility of the motion complexity indices and to see whether a background color overlay would cause a noticeable effect. Since the genetic algorithm adaptation loop would have added too much variability, we chose to apply a red background overlay of varying opacity with a maximum value of 20%, depending on the current motion complexity. Red was chosen because of its common signal meaning. This overlay also results in a lighting change depending on the opacity, which should mainly be recognizable from the visual periphery.
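As an illustration of this mapping, the overlay opacity could be derived from the current relative motion complexity roughly as sketched below; the linear scaling factor is an assumption, and only the 20% ceiling corresponds to the pilot setup.

```python
MAX_OPACITY = 0.20  # upper bound for the red overlay used in the pilot study

def overlay_opacity(mc_rel, scale=2.0):
    """Map relative motion complexity to an overlay opacity in [0, 0.2].
    The linear scaling is an illustrative assumption."""
    return min(MAX_OPACITY, max(0.0, mc_rel * scale))
```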

4.1 Method

The system was installed in a lecture room with the camera placed so that it observed the audience but did not cover the lecturer area. During the presentation, text-based and image-based slides were shown, both with and without dynamic overlays. Figure 4 shows the sequence of the lecture slides. The audience was not told in advance that there would be overlays. They were informed, however, that they would have to fill out a brief questionnaire after some of the slides. Each slide was shown for exactly 60 s.

Fig. 4.
figure 4

Timeline of the test sequence. Duration of each slide is 60 s

In the pilot study, 14 persons (6 female, 8 male) attended the test presentation. After each segment, the participants rated the following properties on a 7-step rating scale:

  • Q1: “I followed the presentation…” from “exactly (1)” to “not at all (7)”

  • Q2: “At the moment I feel…” from “very calm (1)” to “very agitated (7)”

  • Q3: “The presentation content was…” from “very exciting (1)” to “very boring (7)”

  • Q4: “The slides were…” from “very intrusive (1)” to “very conservative (7)”

4.2 Results

Table 2 lists the descriptive statistics of the questionnaires. The text-based slides with ongoing adaptations were rated the most boring (M = 4.36, SD = 1.50). Text-based slides without overlays were rated the most conservative (M = 4.0, SD = 1.47). The reported attention (Q1) was rated highest for the static pictures (M = 1.93, SD = 0.92) and adaptive pictures (M = 2.62, SD = 1.50). Agitation was rated lowest for the two static slide types (Mpictures = 2.71, SDpictures = 1.54; Mtext = 2.86, SDtext = 1.56). See Fig. 5 for a graphical illustration.

Table 2. Descriptive statistics of ratings
Fig. 5.
figure 5

Mean rating values on a scale from 1 to 7 of the different slide types (n = 14). Error bars show the 95% confidence interval.

T-tests were performed between the static and dynamic phases. The results show significant differences for the reported agitation level (Madaptive = 3.57, Mstatic = 2.79, p = .019) and for how boring the slides were rated (Madaptive = 4.18, Mstatic = 3.43, p = .007). See Table 3 for more results and Fig. 6 for a graphical overview.

Table 3. Rating differences between adaptive and static slides
Fig. 6.
figure 6

Differences between adaptive and static slides. Significant differences could be found for agitation (Q2) and excitation (Q3). Error bars show the 95% confidence interval.

To also check the influence of the slide type (picture or text), a t-test was performed between these groups. However, no significant differences could be identified.

Both motion complexity indices show a peak at the beginning of the lecture and at questionnaire slots 3, 6, and 9 (see Fig. 7), indicating that both measures reflect the settling-in period of a presentation and react to distractions during the presentation (filling out the questionnaires). It is also interesting to note that the motion complexity levels for the picture slides (phases 4, 5 and 7, 8) are above those for the text slides (phases 1, 2 and 10, 11). This could relate to the higher self-reported attention level for picture slides.

Fig. 7.
figure 7

Motion complexity indices through the course of the presentation. Peaks can be observed at the start, and at phases 3, 6, and 9 (questionnaire action).

5 Discussion and Conclusion

The results show that there was a significant difference between slides with and without adaptations, hinting that we might indeed be able to influence the perception of slides using low-level visual overlays. However, the difficult part is to find the right balance between noticeability in the peripheral field of view and in the central field of view. In the mapping used for the pilot study, the strongest visual overlays were too noticeable when looked at directly; this was probably the reason for the relatively high self-reported agitation level for the slides with adaptations. Overall, the derived motion complexity for picture slides was higher than for text slides. This contrasts with the reported agitation level, where the slides with adaptations were rated higher independent of slide content type, but is in line with the higher self-reported attention level for picture-based slides. A decline from the first to the second slide within a condition could be observed for each variation, which was to be expected due to the settling-in phase after the preceding questionnaire section.

One downside of our motion complexity indicators is their susceptibility to overreaction when massive motion occurs (e.g. a person leaving the room or coming in late). While MCcnt improves on this slightly compared to MCrel, the problem is still present. One workaround could be to set a narrow threshold for valid motion complexity levels and, if this threshold is exceeded, pause the system until things return to normal. This would also be advisable for filtering out abrupt lighting changes.
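A simple gate of this kind could look as follows; the concrete bounds are placeholders, not values derived from our data.

```python
MC_REL_MAX = 0.5   # placeholder: treat >50% changed area as a global event (e.g. lighting change)
MC_CNT_MAX = 10    # placeholder: too many independent motion areas at once

def should_pause(mc_rel, mc_cnt):
    """Pause the adaptation loop when motion complexity indicates a global
    disturbance rather than ordinary audience movement. Bounds are assumptions."""
    return mc_rel > MC_REL_MAX or mc_cnt > MC_CNT_MAX
```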

A possible concern is the influence of such a system on people with epilepsy. Even if our overlays stay within the same specifications as, for example, movies or games that may also appear in the peripheral field of view, further clarification with experts in this field is needed before using such a system in real-life settings.

In a next step, different stimuli will be tested in a lab setting regarding their potential to grab people's attention while they are focused on a different task and the stimulus screen is placed outside their foveal field of view. The most promising properties will then be encoded for use with the genetic algorithm. Furthermore, the motion complexity parameters could be compared to attention level ratings obtained with, for example, the ModAI instrument. Once attention levels are rated for specific lectures, the motion complexity algorithms can be fine-tuned using video recordings of the rated lectures.

After this, further tests of the overall system with more participants will have to be done.