Keywords

1 Introduction

To explore the space around us in real life, we rely on sensory cues in the environment that indicate the state of some property of the world that might be important to us. By following visual, auditory, haptic, olfactory and other environmental cues, we perceive the world around us through active exploration [1]. However, mobile virtual reality Head Mounted Devices (HMDs) such as Google Cardboard and Samsung Gear VR do not offer positional tracking capabilities to track the user’s movement in a virtual environment [2]. As a result, users cannot walk around to explore a virtual space with mobile HMDs. Furthermore, these devices cannot provide feedback through touch, smell or taste. This means that all interaction possibilities need to be conveyed through visual and audio cues in a mobile virtual reality narrative. Designers therefore need to understand how these audio-visual cues affect user experiences in mobile VR narratives.

Virtual reality is currently seeing a breakthrough in the consumer market due to the release of HMDs such as Google Daydream View, Samsung Gear VR, Oculus Rift and HTC Vive, to name a few [2]. These devices are expected to remove the barriers of screen-based narratives and put users in the middle of the content. These VR narratives usually include a storyline that has a specific beginning and ending. When audience members put on their VR headsets, they are transported into an immersive virtual environment. Even though they can look in any direction at any time, the designers of these narratives want them to pay attention to specific elements in that virtual world that are important to stitch the story together. These narratives are designed to give the audience the feeling of being a part of the story. But being inside an immersive world invites the audience to look around, and since they are in an unknown environment, they are immediately filled with questions such as: Where am I? What’s going on around me? What should I do next? It then becomes the responsibility of the designers to answer these questions quickly so that the audience can focus on the major elements of those narratives. This is where it becomes important [3] to find out what drives the audience to pay attention to the right elements in virtual reality narratives.

Given a 360° virtual space that allows looking in any direction, while users can only focus on 90° to 110° at any given time [4], this explorative study investigates:

  • How do people know where to look and how to proceed when presented with an immersive virtual narrative?

  • How do audio-visual cues contribute to their experience?

  • What design factors contribute to a positive experience and what design factors contribute to a negative experience for the user?

By building on existing theoretical accounts related to user experience in virtual reality and through a field-based study using a popular consumer-oriented device, this study analyzes users’ experiences from their point of view. Based on a thematic analysis of data collected through observations, think-aloud sessions and semi-structured interviews with 10 participants between the ages of 22 and 29, this study argues that when users experience an immersive virtual reality narrative, their curiosity drives them to look for clues to follow in that virtual environment. If the virtual world provides well-designed audio-visual cues to guide their attention throughout the narrative, users feel immersion and spatial presence. In contrast, a lack of cues in a virtual environment keeps users looking straight ahead throughout the narrative, which results in boredom. Regarding the lack of positional tracking in mobile VR, the findings indicate that users get confused and frustrated when their movements in the physical environment result in the whole virtual environment moving with them. This clash between real world perception and virtual world realization breaks their feeling of spatial presence in that VR environment. Finally, the findings suggest that excessive use of audio-visual cues forces users to keep switching their attention in multiple directions for fear of missing something important, eventually resulting in frustration and stress.

The rest of this paper is structured as follows. The background section examines relevant literature and establishes a gap in research. Section three discusses the methodology used in this study. Section four presents the findings. Section five discusses the findings in detail. The final section presents the conclusions, limitations and suggestions for future research.

2 Background

2.1 Immersion in VR Narratives

In the analysis of experiences, one of the terms that appears frequently in the literature is flow [5], which describes the mental state in which a person is fully engaged in an activity, with a feeling of intense involvement and energized focus. During a flow experience a person feels in control, loses their sense of the surroundings, and their awareness is narrowed down to the activity itself. Sutcliffe [6] argues that experience in a virtual world can be explained in terms of flow. He asserts that when the virtual world is well designed, a person in that virtual world feels immersed in a strong sense of presence, and the mediating virtual reality device and computer essentially disappear. Compared to screen-based media like cinema or television, virtual reality as a medium is much more flexible because the audience can change their vantage point at any given moment. While more control in a virtual narrative experience provides a greater feeling of involvement for the participants [7], it also takes control away from the designers, who want the audience to follow a specific storyline. This presents a big challenge for the designers of virtual reality narratives.

2.2 Role of Audio-Visual Cues in Perception

Since virtual environments are usually representations of real world environments, it is reasonable to follow the ecological psychology approach proposed by Gibson [1], which describes how different structures in the external world guide people’s everyday actions. According to Gibson’s theory, we perceive the world around us through our actions. We turn our heads to direct our attention to different visual stimuli, and we focus our attention to hear better and gather information about the action possibilities available around us [1]. Many researchers consider this to be more relevant to HCI than classical cognitive theories [8,9,10]. Regarding the spatial cues available in the environment for perceiving the world around us, previous studies indicate that most cues are linked to the visual modality, for example aerial perspective and relative brightness [1]. Along the same lines, spatial audio plays a big role in directing users’ involuntary attention towards possibilities of action within the virtual environment [11].

2.3 Role of Audio-Visual Cues in Spatial Presence

Wirth et al. [12] describe spatial presence as a two-step process. In the first step, the user draws upon available spatial cues to perceive the virtual environment as a plausible space. The virtual environment is more likely to be perceived as a plausible space if these audio-visual cues are both rich in quality and logically consistent. In the second step, the user experiences herself as being located within that perceived space by discovering possibilities of action within the virtual environment. Existing literature [13, 14] suggests that a steady stream of highly detailed information, supported by appropriate audio-visual spatial cues, effectively builds the virtual environment as a plausible place and increases the experience of spatial presence for the users. However, an excessive use of spatial cues can cause sensory overload and produce fatigue for the users [15].

2.4 A Gap in Research

Even though research in virtual reality has been conducted on perception, immersion and spatial presence, little is known about how users decide where to look and how to proceed in an immersive VR narrative, and what design factors contribute to that decision. There is also a lack of research that focuses on users’ experience with a consumer-oriented mobile virtual reality HMD that only offers applications that are purely narrative or have severely restricted interaction possibilities. Hence research is needed to explore how general consumers who have little or no experience with mobile virtual reality applications perceive this new medium. This study aimed to fill this gap in research.

3 Methodology

To follow a well-defined scientific research methodology for analyzing users’ experience with virtual reality narratives and to theorize a set of propositions about those experiences, this study followed a theory-informed inductive approach. Since qualitative studies help to uncover and interpret participants’ understanding of the phenomenon that they are involved in [16], a qualitative design was a good fit for an explorative study like this one, where participants’ behavior in a virtual reality narrative was being investigated.

3.1 Data Collection

Csikszentmihalyi and Robinson [17] argue that since experiences are subjective phenomena that cannot be externally verified, a researcher has to rely on the testimonies given by the participants. They also downplay the validity of relying on physiological measures alone to collect data to explain users’ experience [17]. Since most of the challenges and opportunities associated with users’ experience in virtual reality were not directly observable, in-depth semi-structured interviews were chosen as the main source of data for this study. Unlike surveys or questionnaires, in-depth interviews are flexible and dynamic, and they provide a more valid insight into the user’s perception of reality [18]. In qualitative studies that use semi-structured interviews, the primary instrument of data collection is the researcher [19]. This is important in capturing the subject’s point of view, as argued by some researchers [20] who assert that, due to the use of remote, inferential empirical materials, quantitative researchers seldom capture the users’ point of view of an experience.

3.2 VR Applications Used in This Study

The virtual reality applications listed in Table 1 were used during the study. “Oculus Home” is the central interface through which other applications can be found and downloaded. Existing VR applications were used in order to collect data from professionally designed immersive narratives. Since the focus of the study was to investigate how users experience a virtual reality narrative and what role audio-visual cues play in those experiences, it was important that participants were not using low quality prototypes that might not provide accurate data for analysis.

Table 1. Summary of virtual reality applications used in this study.

3.3 Equipment and Setup

This study was conducted in a living room setup to ensure the privacy of the participants and to provide them with an environment where a mobile VR headset was most likely to be used. Participants were invited one at a time to ensure they could act and talk freely. The HMD used in the research was a Samsung Gear VR coupled with a Samsung Galaxy S6 edge mobile phone. This HMD supports 3 degrees of freedom (DOF) with a field of view (FOV) of 96°. To block out ambient noise, in-ear headphones were used during the experiments. To ensure anonymity, a list of randomized participant IDs was prepared and one ID was assigned to each participant. Participants were given a consent form that described the research study in a nutshell. Participants were free to exit the experiment at any time. They were informed that the follow-up interviews would be audio recorded. Each participant had to read and sign the consent form before participating in the study.
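To give a sense of scale for these numbers, the short sketch below computes what share of the full horizontal panorama falls inside a 96° field of view at any one moment. It is an illustrative calculation only: the function name is hypothetical, and treating the FOV as a purely horizontal angle is a simplifying assumption.

```python
def visible_fraction(fov_degrees: float, panorama_degrees: float = 360.0) -> float:
    """Share of the horizontal panorama inside the field of view (simplified)."""
    if not 0 < fov_degrees <= panorama_degrees:
        raise ValueError("FOV must be positive and no larger than the panorama")
    return fov_degrees / panorama_degrees


gear_vr_fov = 96.0  # FOV reported for the HMD used in this study
fraction = visible_fraction(gear_vr_fov)
print(f"Visible: {fraction:.1%}, outside view: {1 - fraction:.1%}")
# prints "Visible: 26.7%, outside view: 73.3%"
```

Under this simplification, roughly three quarters of the surrounding scene is out of sight at any given moment, which is why cues that redirect attention matter in a 360° narrative.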

3.4 Semi Structured Interviews

In total, 10 participants between the ages of 22 and 29 participated in the study. Five of the participants were male and five were female. All trials followed the same structure. First, the researcher introduced the HMD to the participant with a quick demonstration of how it works. The participant then put on the headset, with help from the researcher if needed. Each participant was then instructed to explore the Oculus Home interface for a few minutes and then try out two of the applications, selected randomly from the list above. Participants were interviewed right after they had completed each of the applications. After the experiment, the participant was thanked for his or her time and was debriefed about the purpose of the study.

3.5 Data Analysis

All the recorded interviews were transcribed in detail and coded using the qualitative analysis software “NVivo”. After coding, the relevant data extracts were collated to find recurring themes in the dataset. Thematic analysis was used to analyze the recorded data, following the guidelines suggested by Braun and Clarke [21]. For each code, relevant data extracts were reviewed and compared against the whole data set to make sure that the emerging themes made sense and that no data extracts were being taken out of context from the interview transcripts.

4 Results and Analysis

4.1 Specific Observations

After coding the interviews, think-aloud data and observation notes, the entire data set was reviewed to identify themes relevant to positive or negative experiences of the users. During this phase of analysis, the use of audio-visual cues to attract and direct users’ attention stood out as one of the most important design factors affecting the users’ experience during the narrative. Table 2 provides a short summary of the initial themes along with examples of their relevant data extracts. A detailed thematic analysis can be found in [22].

Table 2. Initial themes from the dataset with salient data extracts from the interviews.

These initial themes were then reviewed to generate refined themes for further analysis. After naming, collating, defining and refining the specifics of each theme [22], the final thematic table was constructed to find patterns of answers for the research questions (Table 3).

Table 3. Refined themes from dataset with coded modes of user engagement.

4.2 Analysis of Results

In all the applications, when participants were placed inside the virtual narrative, they explored the space around them out of curiosity and looked for anything that grabbed their attention. When they found something that caught their attention, they kept looking in that direction until their attention was directed to some other element in the environment by a visual or audio cue. In all the applications used during the study, due to the lack of positional tracking in mobile VR, when the participants moved during the narrative, the whole VR environment moved with them. This came as a surprise to the participants, since they were expecting their movement to be tracked inside the virtual environment. It took a little time for the participants to get used to this conflict between expectation and reality, but once they did, they had no further difficulties following the narratives.
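The effect described above can be illustrated with a minimal sketch of rotation-only tracking versus positional tracking. This is not how any particular HMD runtime is implemented; the function name, the two-dimensional coordinates and the example values are all hypothetical and chosen only to show why the world appears to move with the user.

```python
def offset_to_object(object_pos, head_pos, positional_tracking):
    """Offset (x, z) in metres from the virtual camera to an object.

    object_pos: position of a virtual object in the scene.
    head_pos: physical position of the user's head relative to the start.
    positional_tracking: False models a rotation-only (3 DOF) headset.
    """
    # Without positional tracking the virtual camera stays at the origin,
    # so physical movement never changes the offset to any object.
    cam = head_pos if positional_tracking else (0.0, 0.0)
    return (object_pos[0] - cam[0], object_pos[1] - cam[1])


virtual_object = (0.0, 2.0)  # hypothetical object 2 m in front of the user
step_forward = (0.0, 1.0)    # the user physically steps 1 m forward

print(offset_to_object(virtual_object, step_forward, positional_tracking=True))
# prints "(0.0, 1.0)": the object gets closer, as in the real world
print(offset_to_object(virtual_object, step_forward, positional_tracking=False))
# prints "(0.0, 2.0)": the object keeps its distance, so the scene seems to follow the user
```

In the rotation-only case every object keeps the same offset no matter how far the user walks, which corresponds to the participants’ impression that the whole environment moved with them.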

When exploring the Oculus Home interface, the participants felt as if there was a big screen in front of them, and they kept looking straight ahead towards that conceptual screen throughout the experience, since no audio-visual cues directed their attention towards any other element in the surrounding space. To match the participants’ experience, this mode of engagement has been coded as the “screen mode”.

Participants also experienced this “screen” mode of engagement in Rosebud. Once they were placed inside the narrative, the participants looked all around out of curiosity, but the only element that caught their attention was the asteroid in front of them. Once they started focusing on what was in front of them, no other audio-visual cues directed their attention to any other direction in the narrative. The participants said that there was not much going on around them and they got bored quickly. It is interesting to note that even though Rosebud offered the most possibilities for interaction by letting the audience change camera angles, the participants still got stuck in the “screen” mode, since everything was happening in one direction and they were unable to affect the storyline even by changing camera angles.

In the case of Muse Revolt, participants experienced a different mode of engagement. When they entered the immersive world of this VR music video, multiple visual elements started attracting their attention at the same time. First they saw the band performing on the stage, but their attention was quickly drawn to the groups of people running all around them. While they were trying to follow the groups to find out what was going on, their attention was redirected again by several police cars coming into the scene. While several visual cues competed for their attention at the same time, the lack of directional audio cues made it even harder for the participants to decide which element of the narrative to focus on. This made them try to follow too many random cues, for fear of missing out (FOMO) on something important around them. Due to this excessive use of visual cues and the lack of any sort of guidance, they eventually got frustrated and said that there was simply too much going on throughout the narrative. To match the participants’ experience, this mode of engagement has been coded as the “FOMO mode”.

In the case of Invasion and Song for Someone, once the participants entered the VR environment, visual and directional audio cues directed their attention to the first element they needed to focus on. From that point onwards their focus was guided throughout the narrative from one element to the next. The audio-visual cues were well designed to make sure multiple cues were not asking for attention at the same time. The participants felt guided throughout the experience and they followed the audio-visual cues all around them without much effort. Since the narratives were gradually unfolding all around them, the participants felt immersed in those narratives. They also expressed the feeling of “being there” in those virtual environments. To match with the participants’ experience, this mode of engagement has been coded as the “guided mode”.
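One way a designer might check a narrative against this observation is to count how many cues compete for attention at the same moment. The sketch below is an assumption-laden illustration rather than a method used in this study: the function name is hypothetical, cues are reduced to (start, end) intervals in seconds, and the two example timelines are invented, not measured from any of the applications.

```python
def max_simultaneous_cues(cues):
    """Peak number of cues active at once.

    cues: list of (start_s, end_s) intervals, one per audio-visual cue.
    """
    events = []
    for start, end in cues:
        if end <= start:
            raise ValueError("each cue must end after it starts")
        events.append((start, 1))   # cue begins
        events.append((end, -1))    # cue ends
    # Tuples sort by time first; at equal times an end (-1) sorts before
    # a start (+1), so back-to-back cues do not count as overlapping.
    events.sort()
    active = peak = 0
    for _, delta in events:
        active += delta
        peak = max(peak, active)
    return peak


# Invented timelines for illustration only.
sequential_timeline = [(0, 5), (5, 9), (12, 15)]
crowded_timeline = [(0, 10), (2, 8), (3, 12)]

print(max_simultaneous_cues(sequential_timeline))  # prints "1"
print(max_simultaneous_cues(crowded_timeline))     # prints "3"
```

A peak of one corresponds to cues that hand attention over from one element to the next, while a higher peak flags moments where several cues ask for attention together. Where the acceptable limit lies is a design judgment that this sketch does not settle.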

The following table lists the modes participants engaged in throughout different narratives (Table 4):

Table 4. Modes the users engaged in with different applications.

One interesting finding from this study is the mismatch between user’s real world perception and virtual world realization due to the lack of positional tracking in mobile VR. This results in users requiring some additional time to get used to the VR environment.

5 Discussion

From the inductive analysis of user experiences with the mobile virtual reality narratives used in this study, it can be hypothesized that users have an overall positive experience when well-designed audio-visual cues are available throughout the experience and put them in a “guided” mode that follows the narrative flow. In this scenario users feel that they are in control, they know where to look and how to follow the cues, and they are not missing anything important. They also feel immersed in the virtual environment, which gives them a feeling of spatial presence. It can also be suggested that excessive use of audio-visual cues, or poorly designed cues, puts users in a mode of engagement where they try their best to follow the cues for fear of missing out (FOMO) on something important and end up feeling stressed and frustrated with the overall experience. They feel there is too much going on and that they have no control over the experience. Finally, the findings imply that a lack of audio-visual cues throughout the VR narrative puts users in a mode of engagement where they end up looking in one direction at a conceptual screen, which breaks the immersion and stops them from experiencing the feeling of “being there”, or spatial presence. Users get bored in this mode and eventually end up with a negative overall experience.

Comparing the results of the analysis with the previous studies presented in the background section shows that some of the findings are supported by previous literature.

5.1 Role of Audio-Visual Cues in Perception and Immersion

In guided mode, users experience several components of the psychological flow state [5]: they feel immersed in the virtual environment, lose track of the surrounding real environment, feel in control, and their focus gets directed to follow the storyline of the narrative through the available audio-visual cues. This results in enjoyment and a sense of spatial presence. This finding matches Sutcliffe [6] in terms of flow experience.

The role of audio-visual cues in attracting and directing users’ attention in a virtual environment agrees with the theory of ecological perception [1], which states that we turn our heads to direct our attention to different visual stimuli and we focus to hear better and gather information about action possibilities around us. The use of spatial audio to direct users’ attention in different directions throughout a narrative matches the findings on involuntary attention allocation in a virtual environment studied by Hendrix and Barfield [11].

5.2 Role of Audio-Visual Cues in Spatial Presence

When the users’ experience was directed by appropriate audio-visual cues, multiple users reported the experience of being spatially present. This fits the findings from existing literature [13, 14] that emphasize the use of appropriate audio-visual spatial cues to increase the chance of users feeling spatially present in a virtual environment. It is also important to point out the use of directional audio cues in both of the narratives where users experienced immersion and spatial presence.

In “FOMO” mode, users ended up having a frustrating experience because they felt that they were not in control. They became stressed, thinking that they might be missing something important in the storyline while too many things were going on at the same time, which in many cases broke their immersion. This agrees with the findings of Wirth et al. [12], who argue that a virtual environment is more likely to be perceived as a plausible space if its audio-visual cues are logically consistent. In “FOMO” mode, the inconsistencies in the audio-visual cues confuse the users, which in most cases eventually blocks them from experiencing spatial presence. The negative experience of the users also matches the findings of de Rijk et al. [15], who argue that an excessive use of spatial cues can cause sensory overload, producing fatigue for the users.

6 Conclusions

Noticing the exponential rise of virtual reality applications in 2016, the goal of this study was to explore user experience in mobile virtual reality narratives and to investigate what role audio-visual cues play in users’ positive or negative experiences. By using a consumer oriented mobile HMD and some popular VR applications, this study also examined whether the results relate to the findings from existing literature, where the research was conducted mostly in controlled environments using proprietary devices.

We know from the language of cinema that motion, color and contrast work well as visual cues to direct the audience’s attention where needed. But when audience members are placed inside the medium in virtual reality, there is always a chance that they have their backs turned to important elements. To avoid this situation, a designer can use visual cues inside the users’ field of view and audio cues outside the FOV to make users turn their heads to face the elements important for the narrative. While four narratives are not enough to generalize the findings, they can be taken as a basis for further investigation into the effect of audio-visual cues on user experience in virtual reality applications.
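The recommendation to use visual cues inside the field of view and audio cues outside it can be sketched as a simple decision rule. The code below is a minimal illustration under stated assumptions, not an implementation from any VR toolkit: the function names and return labels are hypothetical, only horizontal rotation (yaw) is considered, yaw is assumed to increase to the user’s right, and the default FOV of 96° is the value reported for the HMD in this study.

```python
def angular_offset(head_yaw: float, target_yaw: float) -> float:
    """Signed shortest angle in degrees from the head direction to the target.

    The result lies in (-180, 180]; positive means the target is to the right
    (assuming yaw increases to the user's right).
    """
    diff = (target_yaw - head_yaw) % 360.0
    return diff - 360.0 if diff > 180.0 else diff


def choose_cue(head_yaw: float, target_yaw: float, fov: float = 96.0) -> str:
    """Pick a cue type depending on whether the target is inside the FOV."""
    offset = angular_offset(head_yaw, target_yaw)
    if abs(offset) <= fov / 2:
        return "visual"  # target is already visible
    # Target is out of view: use directional audio on the nearer side.
    return "audio-right" if offset > 0 else "audio-left"


print(choose_cue(head_yaw=0.0, target_yaw=30.0))     # prints "visual"
print(choose_cue(head_yaw=0.0, target_yaw=120.0))    # prints "audio-right"
print(choose_cue(head_yaw=350.0, target_yaw=200.0))  # prints "audio-left"
```

Choosing the side with the shorter rotation keeps the head turn as small as possible. A real narrative would also need to consider vertical angles and the timing of cues, which this sketch leaves out.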

The main findings of this study can be summarized as the following:

6.1 Audio-Visual Cues Make or Break an Experience

One of the most important findings of this study is how different audio-visual cues attract and direct users’ attention throughout a VR experience. The analysis of the recorded data indicates that a virtual reality narrative needs well-designed audio-visual cues to guide users’ attention in a virtual environment and increase immersion, which results in a positive overall experience. It is also important to keep in mind the usefulness of spatial audio cues that direct users’ attention to elements of the VR experience that are not visible in their field of view.

6.2 The Sweet Spot Lies Between Boredom and Frustration

Another important finding is how the amount of available audio-visual cues affects user experience in a virtual narrative. It is clear from the findings of this study that excessive audio-visual cues put the users in “FOMO” mode resulting in their frustration and a negative overall experience. While too many cues bring frustration, lack of cues brings boredom since the users expect to see events happening all around them in an immersive VR environment. Only a limited number of well-designed audio-visual cues hits the sweet spot and guides the audience throughout the narrative without requiring much effort from them.

6.3 It Takes a Little Extra Time to Get Used to Mobile VR

When it comes to lack of positional tracking in mobile VR, it is clear from the findings that users get confused and frustrated when their movements in the physical environment result in the whole virtual environment moving with them. This clash between real world perception and virtual world realization breaks their feeling of spatial presence in that VR environment. Fortunately, once they get used to the limited tracking capabilities of the HMD, users can easily get back into the flow of the narrative, especially when well-designed audio-visual cues guide them throughout the experience. Based on this finding it can be suggested that users should always be given a little extra time in the beginning to get used to their movements in a mobile virtual environment.

6.4 Limitations and Future Work

Due to the small sample size and usage of inductive methods, no claims can be made towards the generalizability of the findings. While the data collection and analysis were done thoroughly and carefully, the results need to be tested in a controlled study to verify the findings. In particular, it must be established in future studies if the same three modes of engagement would re-emerge with new users and new applications.

Another limitation is the age group of the participants. The findings might be different if the participants were from an older generation who may be more hesitant towards new technology, or if the participants were children who may be more curious in a new environment. There are also cognitive differences in perception among different age groups, which might affect the modes of engagement proposed in this study.