1 Introduction

There is a long tradition in psychology of conceiving of behaviour as highly dependent on context [1]. This also applies to the field of human-computer interaction (HCI), where context is considered one of the main determinants of user behaviour when operating interactive systems [2]. The importance of the concept is underlined by the fact that it is also part of ISO FDIS 9241-210, which defines context as covering users, tasks and equipment, and the specific social and physical environment in which a product is used [3]. In this regard, a product’s usability can only be evaluated by taking into account the context in which the product will be used [4]. In other words, a system that is usable in one context may not be usable in another. Context factors, however, not only play an important role in the evaluation of a system’s usability but may influence the entire experience of a user. User experience (UX) can be considered an extension of the usability construct, which focuses on user cognition and performance, towards a more holistic approach that focuses on user emotions and considers the experience of a user interacting with an interface in its entirety. Such a holistic approach to the construct suggests that the various facets of experience should be assessed and reported individually in UX evaluations.

While users, tasks and equipment are routinely specified in UX studies, the environmental aspects of context are rarely considered in practice and research [5, 6]. However, empirical work has provided evidence that a number of context factors can influence the outcomes of UX tests, such as lab vs. field set-up [7], observer presence [8, 9], or the use of electronic recording equipment [10]. One characteristic of the usage environment that has received little attention in previous research is the usage domain, that is, whether a system is used and evaluated in a work or in a leisure context, and how this influences the outcomes of UX evaluations.

2 Current State of Research

2.1 The UX Construct

Up until the 1980s, most people experienced interactive technology almost exclusively in the workplace [11]. Since then, information technology has become an integral part of our daily lives and increasingly pervades all societal activities (e.g. personal computers have reached people’s homes, and mobile phones have become mobile computing devices). Today, the pervasiveness of interactive technology in all areas of people’s lives, including leisure, is a reality [12]. In this process, the distinction between technology for work and technology for leisure has become increasingly fuzzy, as devices often can no longer be described as clearly being one or the other. Many of today’s technical devices are dual-domain products [13], as they can be used in a work context just as well as in a leisure context (e.g. mobile phones and laptops).

Since research in HCI traditionally concentrated on interactive technology in the work domain, the discipline was primarily performance oriented, with the goal of providing highly usable interfaces that increase efficiency at work [14]. This is reflected in the usability definition of the International Organization for Standardization (ISO), which describes the concept in terms of the efficiency and effectiveness (and satisfaction) with which a user can accomplish certain tasks with a system [3]. As a result of a recent shift in the field towards a less functional and more experiential approach, the notion of UX has gained increasing interest in research and practice. Despite its popularity, the term UX is often criticised for being ill-defined and elusive [1, 15]. While some theorists adopted a rather holistic approach, describing UX as the totality of actions, sensations, thinking, feelings and meaning-making of a person in a specific situational context [e.g. 16, 17], others attempted to be more concise and focused on emotions or affective states when describing UX [e.g. 18,19,20]. This piece of research follows the holistic view, defining UX as an umbrella construct that encompasses the entire experiential space of a person interacting with a system. In this respect, UX enlarges the merely functional concept of usability (e.g. effectiveness, efficiency, subjective appraisal) by explicitly encompassing a broad range of other experiential components (cf. Fig. 1). This holistic view of UX, however, makes a meaningful and effective assessment of the construct difficult, because it is hardly feasible to measure everything a user experiences when interacting with a system. Furthermore, the holistic view makes it difficult to compute a single exact UX score, since it is not clear how measures of affective states are to be combined with indicators of performance and evaluations of satisfaction and workload. In this regard, we suggest defining the most relevant experiential facets or dimensions for each UX evaluation individually, based on the specific needs and requirements of the project. These facets or dimensions should then be assessed and reported individually (as suggested to some extent by [21] in their UX scale; see also [6, 15, 22]).

Fig. 1. The user experience construct and its components

2.2 Work vs. Leisure Domain

As a result of the shift from usability to UX, research in HCI has increasingly begun to evaluate leisure-oriented technology, such as portable digital audio players [23] or video games [24]. For dual-domain products, however, both domains are equally relevant, and different requirements may result from these contexts, which might need to be considered during product evaluation. In order to identify the domain-specific requirements, the differences between work and leisure need to be analysed. One previous study empirically compared the work and leisure domains but found little difference between them [13]. However, this may be due to the fact that, in addition to usage domain, product aesthetics was manipulated as a second factor. It was assumed that aesthetics would play a more important role in a leisure context than at work, but this was not confirmed. Since system usability is a more direct determinant of effectiveness and efficiency of use than aesthetics, its influence in different usage domains is worth examining.

In addition to the lack of empirical research in this field, there have also been difficulties in establishing a clear theoretical distinction between work and leisure, so that no widely accepted definition of the concepts has yet been proposed. Three approaches to distinguishing between the two domains have been discussed [25]. (a) The purely time-based or ‘residual’ definition of leisure is most commonly used. According to this approach, leisure is the time when people do not do paid or unpaid work, do not complete personal chores, and do not fulfil obligations. (b) An activity-based approach distinguishes between work and leisure by means of the specific behaviours people show in each domain. (c) The third approach conceives of work and leisure in terms of the attitudes people have towards their activities. Beatty and Torbert [25] argue that this third approach is the most promising for distinguishing leisure from other domains, and there is also some empirical evidence in its support. Several studies confirmed that people described work in terms of goal-directed, performance-oriented behaviour connected with external rewards, whereas leisure was associated with intrinsic satisfaction, enjoyment, novelty and relaxation [26, 27].

The distinction between the work and leisure domains according to this third approach allows for a more precise definition of domain-specific requirements for interactive technology. Since users perceive a work context as more goal- and performance-oriented than a leisure context, usability (e.g. effectiveness and efficiency of task completion) might be perceived by users as a more important requirement in a work context than in a leisure context.

2.3 Response Time as a Facet of System Usability

One aspect of system usability which previous research has shown to be directly relevant for various outcome variables is system response time (SRT). SRT is defined as the time from a user input to the moment the system starts to display its response [28]. Although delayed SRTs are less of a problem given today’s much greater processing power, delays may still be a problem in human-computer interaction [29, 30].

Negative effects of SRT delays have been shown at several levels. First, there is evidence that response time delays have a negative effect on user satisfaction with a system [31,32,33,34]. Systems with delayed responses are generally perceived as being less usable and more strenuous to operate, which also extends to web pages with long download times being judged to be less interesting [35]. Second, user performance has been shown to be impaired by SRT delays [28, 29, 36, 37]. Third, system delays have resulted in impaired psychophysiological well-being, increased anxiety, frustration and stress, and were even found to reduce job satisfaction [31, 38,39,40,41].

Various moderators of the effects of system response delays have been identified in the context of internet usage, such as webpage properties [42], user expectations [43], and processing information displays such as progress bars [44]. For example, users were less willing to accept download delays when websites were highly graphical compared to plain text documents [42]. It also emerged that information about the duration of the download had a positive effect on user evaluation [43], and progress bars as delay indicators performed best in terms of user preference and acceptability of the waiting time [44]. To our knowledge, only one study has examined SRT in a work context [31]. In a field study conducted in a large telephone circuit utility, observing professionals in their work domain, the authors reported that increased SRT impaired not only performance but also system evaluation and even job satisfaction. To our knowledge, no study has so far investigated SRT delays in a leisure context.

2.4 The Present Study

The main goal of the present study was to investigate the requirements that the two domains of work and leisure entail for UX design and evaluation. For this purpose, a UX test was conducted in which the two domains of work and leisure were experimentally modelled. The two types of testing context were created by combining several experimental manipulations: lab design (office vs. living room), task wording (work-related vs. leisure-related) and a priming task which directed participants’ attention towards their own work or leisure activities, respectively. As a second independent variable, system usability was manipulated through SRT delays.

As a test system, an internet site was specifically set up for the experiment, which was designed to offer realistic tasks for both contexts. Care was taken that the tasks for the two experimental conditions were comparable in terms of mental demands but only differed in type of context. The tasks used were information search tasks that required navigating through various levels of a menu hierarchy.

Typical measures for UX evaluation were recorded. Task performance was assessed by task completion rate, page inspection time, and efficiency of task completion. Self-report data was collected for emotion, task load and perceived usability.

Our hypotheses were as follows: (a) Test participants in the work context perform better and report higher perceived task demands than those in the leisure context, since the work context is perceived as more goal- and performance-oriented. (b) Performance is lower when working in the condition with low usability compared to working with high usability. (c) Perceived usability of the system and emotional reactions are less positive in the low usability condition, since the reduced system usability is reflected in participants’ evaluation and emotion. (d) At work, low usability causes a stronger decrease in perceived usability and in emotion than in the leisure context, since the negative impact of system delay on performance is perceived as more relevant in the goal- and performance-oriented work context.

3 Method

3.1 Participants

The sample consisted of 60 participants, aged between 19 and 44 years (M = 22.6 years; SD = 3.3), the majority of whom were female (60.3%). Participants were recruited among students at the University of Fribourg (Switzerland), and care was taken that none of them had previous usage experience with the specific mobile phone model employed in the experiment. To motivate participants to take part in the study, they could enter a prize draw (worth $50).

3.2 Experimental Design

A 2 × 2 between-subjects design was used to investigate the two independent variables usage domain and usability. Usage domain was varied at two levels (work vs. leisure context), as was usability (non-delayed vs. delayed system response time).

3.3 Measures and Instruments

Performance

The following three measures of user performance were recorded: (a) task completion rate (percentage of successfully completed tasks); (b) page inspection time (time a user stays on a page); (c) efficiency of task completion (minimum number of interactions needed for task completion divided by actual number of interactions). Participants were allowed to work on each task for a maximum of 5 min, after which a task was recorded as failed and participants moved on to the next task. All analyses of performance data took into account the shorter overall time participants had available in the delay condition (i.e. delay time was deducted from task completion time).
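For illustration, the three measures could be derived from the interaction logs roughly as follows (a minimal Python sketch; the log fields and function names are hypothetical and not part of the study’s actual software):

```python
# Minimal sketch of the three performance measures. The log structure and
# field names are illustrative assumptions, not the study's data format.

def task_completion_rate(tasks):
    """Percentage of successfully completed tasks."""
    return 100 * sum(task["solved"] for task in tasks) / len(tasks)

def mean_page_inspection_time(view_durations, delays):
    """Mean time (s) a user stays on a page, with system delay time deducted."""
    corrected = [view - delay for view, delay in zip(view_durations, delays)]
    return sum(corrected) / len(corrected)

def efficiency(minimum_interactions, actual_interactions):
    """Minimum number of interactions needed divided by the actual number."""
    return minimum_interactions / actual_interactions

# Example: a task solvable in 4 interactions that took 10 yields 0.4, i.e.
# 40% of the inputs lay on the direct path to the task goal.
print(efficiency(4, 10))  # 0.4
```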

Affective State

The PANAS (‘Positive and Negative Affect Schedule’ [45]) was used to measure short-term changes in affective states before and after task completion. The scale allows the assessment of two independent dimensions of affect: positive and negative affect. It has been shown to have good psychometric properties (Cronbach’s α = 0.84). The scale uses 20 adjectives describing different affective states (e.g. ‘interested’, ‘excited’, ‘strong’), whose intensity is rated on a 5-point Likert scale (‘very slightly or not at all’, ‘a little’, ‘moderately’, ‘quite a bit’, ‘extremely’).
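For illustration, subscale scores and the pre-post change score used in the later analysis could be computed along these lines (a minimal sketch; the assignment of item indices to the two subscales is a placeholder assumption):

```python
# Sketch of PANAS scoring: subscale means for positive and negative affect
# and the pre-post change score analysed later. The item indices assigned to
# each subscale are placeholders, not the actual questionnaire layout.
POSITIVE_ITEMS = list(range(0, 10))   # placeholder: 10 positive-affect adjectives
NEGATIVE_ITEMS = list(range(10, 20))  # placeholder: 10 negative-affect adjectives

def subscale_mean(ratings, items):
    """Mean of the selected items; ratings are 1-5 Likert responses."""
    return sum(ratings[i] for i in items) / len(items)

def affect_change(pre, post, items):
    """Change in affect from baseline (before tasks) to after task completion."""
    return subscale_mean(post, items) - subscale_mean(pre, items)
```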

Task Load

To assess task load, the well-established NASA Task Load Index (TLX) was used [46]. It measures six dimensions of task load: mental demands, physical demands, temporal demands, performance, effort and frustration. In the subsequent analysis, each dimension was given the same weight. Based on our data, psychometric properties were shown to be satisfactory for the translated scale (Cronbach’s α = 0.72).
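Because each dimension was given the same weight, the overall score corresponds to the unweighted (‘raw TLX’) mean of the six subscale ratings, for example:

```python
# Unweighted ("raw TLX") overall score: the simple mean of the six dimensions.
# The rating values below are purely illustrative.
tlx_ratings = {
    "mental_demand": 35, "physical_demand": 10, "temporal_demand": 30,
    "performance": 25, "effort": 40, "frustration": 20,
}
overall_tlx = sum(tlx_ratings.values()) / len(tlx_ratings)  # equal weights
print(overall_tlx)
```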

Perceived Usability

Perceived usability of the test system was measured with two instruments. First, we used a 100 mm visual analogue scale to obtain an overall evaluation of perceived usability (‘This website is usable’) [8]. The use of one-item scales to evaluate technical systems has been shown to be appropriate [e.g. 47]. Second, the PSSUQ (‘Post-Study System Usability Questionnaire’) [48] was applied, slightly modified to suit the test system in question (the term ‘system’ was replaced by ‘software’ to make sure that only the software and not the device was judged). The scale consists of 19 items rated on a 7-point Likert scale (ranging from ‘strongly agree’ to ‘strongly disagree’). The questionnaire was developed for use in lab-based usability tests, and the author [48] reports very good psychometric properties (Cronbach’s α > 0.90).

Previous Mobile Phone Experience

Previous mobile phone experience was assessed on a visual analogue scale ranging from 0 to 10 (labelled ‘not experienced’ and ‘very experienced’), on which participants reported an intermediate level of self-rated experience (M = 5.0). They reported using their devices 12.6 times per day on average. Mobile phone experience and daily usage were used as covariates in the analysis.

3.4 Materials: Mobile Phone, Server and Software

The test device was a Motorola Android smartphone. The web application ran on an XAMPP server. In the delay condition, a PHP script on the server generated a random system response delay of between 0 s and 3 s (1.3 s on average) whenever a new page was requested. These system delays were chosen on the basis of pre-tests, which showed that varying intervals were perceived as more disturbing than constant (and hence predictable) ones. In addition, the pilot tests had shown that latencies of more than 4 s were not considered realistic. A server log recorded the pages viewed, the time at which each page was accessed, the duration for which the page was displayed, and the duration of the delay.
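To make the delay manipulation concrete, the server-side logic can be sketched as follows (an illustrative Python sketch rather than the original PHP script; the uniform sampling of the delay is an assumption, as only the 0–3 s range and the 1.3 s mean are reported):

```python
import random
import time

def sample_delay(delayed_condition):
    """Return the response delay (in s) for one page request.

    Sampling uniformly from 0-3 s is an assumption; the study only reports
    delays between 0 and 3 s (1.3 s on average) that vary from request to
    request so that they remain unpredictable."""
    return random.uniform(0.0, 3.0) if delayed_condition else 0.0

def serve_page(render_page, delayed_condition, server_log):
    """Serve a requested page, inserting the sampled delay and logging it."""
    delay = sample_delay(delayed_condition)
    time.sleep(delay)
    page = render_page()
    # Server log: page viewed, access time and the size of the delay.
    server_log.append({"page": page, "accessed_at": time.time(), "delay": delay})
    return page
```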

The web application used for task completion was specifically set up for the experiment. It consisted of a tourist guide for a large European city with a hierarchical navigation system, offering a number of categories at each level and detail pages at the deepest level. Navigation options were ‘return to the previous page’, ‘return directly to the home page’, or selecting one of the displayed categories. Scrolling was necessary on some of the pages, which had more categories than the screen could display. Category labels were deliberately chosen such that it was not always obvious in which category the target page would be found, so that a trial-and-error approach to target search became necessary (e.g. a specific Asian restaurant was located under the category ‘Japanese’, while other available categories included ‘Asian’, ‘Chinese’, ‘German’, ‘Greek’, ‘Indonesian’, and ‘Italian’). A message on the target page stated clearly that the task had been solved and requested that the user return directly to the home page.

3.5 Procedure

Participants were randomly assigned to one of the four testing conditions. The testing sessions were conducted in a usability laboratory at the University of Fribourg. The experimental manipulation of the usage domain comprised three elements: laboratory set-up, task wording and a priming task. For the leisure condition, the lab was set up like a living room, containing a sofa (on which the participant was seated), wooden furniture with travel books, a (switched-off) TV set, plants on the window sill, and pictures on the wall. In the work condition, the laboratory contained several desks, a (switched-off) computer, a desk lamp, some folders and typical office stationery (stapler, etc.). The tasks were the same in all experimental groups with regard to the interactions required to accomplish them successfully. They differed, however, in their framing, with a work-specific framing in the work condition (e.g. plan a meeting in a café with colleagues to discuss a work-related assignment) and a leisure-specific framing in the leisure condition (e.g. plan a get-together with friends in a café). For the priming task in the work condition, participants were asked to imagine that they would be working the following two days and to think about what they would have to do during these days. A similar instruction was given in the condition simulating the leisure context.

The experimenter described the purpose of the experiment as testing the usability of a web application for smartphones and gave an overview of the experimental procedure. Participants filled in the PANAS and the questionnaire measuring previous mobile phone experience. The experimenter presented the test device, showed all functions of the web application and explained how to operate it (e.g. choosing categories, home, back, scrolling). Participants completed a practice trial to become familiar with the web application. They were given the opportunity to ask questions; then the instructions about the usage context were provided and participants were asked to start the introspection phase. After one minute of introspection for putting themselves into a specific work or leisure situation, the experimenter informed the participants about the first task. They had five minutes for each task but were not informed about this time constraint. If a task was not completed after five minutes, the experimenter thanked them and presented the next task. After the last task, the participants completed the PANAS a second time, then the NASA-TLX, the subjective usability questionnaires (one-item scale and PSSUQ), and finally the manipulation check. The participants were debriefed and could leave their e-mail address to take part in the prize draw. A testing session lasted about 45 min.

3.6 Manipulation Check

The manipulation check consisted of a visual analogue scale (0–100; ranging from ‘rather leisure-oriented’ to ‘rather work-oriented’), on which participants judged the situation in which they had completed the tasks. The results confirmed a significant effect of the context manipulation: participants rated the situation as significantly more work-oriented in the work context condition (M = 56.4; SD = 2.2) than in the leisure context condition (M = 27.7; SD = 2.4; t(58) = 4.82; p < 0.001; Cohen’s d = 1.24).
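For reference, this kind of check can be computed with standard tools, for instance as follows (a sketch with made-up ratings; it uses SciPy’s independent-samples t-test and a pooled-SD Cohen’s d):

```python
import numpy as np
from scipy import stats

def cohens_d(a, b):
    """Cohen's d based on the pooled standard deviation."""
    n1, n2 = len(a), len(b)
    pooled_sd = np.sqrt(((n1 - 1) * np.var(a, ddof=1) + (n2 - 1) * np.var(b, ddof=1))
                        / (n1 + n2 - 2))
    return (np.mean(a) - np.mean(b)) / pooled_sd

# Made-up manipulation-check ratings (0-100 visual analogue scale), 30 per group.
rng = np.random.default_rng(42)
work_ratings = rng.normal(56, 20, 30)
leisure_ratings = rng.normal(28, 20, 30)

t, p = stats.ttest_ind(work_ratings, leisure_ratings)
print(t, p, cohens_d(work_ratings, leisure_ratings))
```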

4 Results

Self-reported mobile phone experience, daily mobile usage and gender were entered as covariates in the analysis in order to control for their influence. However, none of the covariates had a significant influence on the reported findings. Therefore, the results of the analyses without covariates are reported.
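The analyses reported below are 2 × 2 between-subjects analyses of variance with partial η² as effect size. For readers wishing to reproduce this analysis scheme, a sketch using statsmodels could look as follows (the data frame, column names and values are illustrative assumptions, not the study’s data):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical data frame: one row per participant with the two factors and a
# dependent variable (here: task completion rate). Values are random placeholders.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "domain":    ["work"] * 30 + ["leisure"] * 30,
    "usability": (["delayed"] * 15 + ["undelayed"] * 15) * 2,
    "completion": rng.uniform(60, 100, 60),
})

model = ols("completion ~ C(domain) * C(usability)", data=df).fit()
anova = sm.stats.anova_lm(model, typ=2)

# Partial eta squared for each effect: SS_effect / (SS_effect + SS_residual)
ss_res = anova.loc["Residual", "sum_sq"]
anova["eta_p2"] = anova["sum_sq"] / (anova["sum_sq"] + ss_res)
print(anova)
```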

4.1 User Performance

Task Completion Rate

Data analysis showed significant differences in the number of completed tasks as a function of the usability manipulation (see Table 1). When operating the system with a delayed response, participants solved significantly fewer tasks (M = 83.3%) than when response time was not delayed (M = 93.3%; F = 5.28; df = 1, 56; p < 0.05; \( \eta_{{\text{partial}}}^{2} \) = .086). Testing context (work vs. leisure) had no effect and there was no interaction between usability and testing context (both F < 1).

Table 1. Measures of user performance, affective state, task load, and perceived usability as a function of usage domain and usability.

Task Completion Time

Overall task completion time differed significantly with regard to the usability manipulation (see Table 1). When operating the system with a delayed response, participants took longer to complete the tasks (corrected by the delay time) than when response time was not delayed (\( M_{\text{delayed}} \) = 526.7, SD = 145.4; \( M_{\text{undelayed}} \) = 397.0, SD = 126.5; F = 13; df = 1, 56; p < 0.05; \( \eta_{{\text{partial}}}^{2} \) = .19). Testing context (work vs. leisure) had no effect and there was no interaction between usability and testing context (both F < 1).

Page Inspection Time

As the data in Table 1 indicates, participants in the delay condition stayed significantly longer on a page (M = 5.84 s) than those working with a non-delayed system (M = 5.33 s). This difference was statistically significant (F = 4.46; df = 1, 56; p < 0.05; \( \eta_{{\text{partial}}}^{2} \) = .188). With regard to the other independent factors, there was neither a significant effect of testing context (F < 1) nor an interaction (F < 1).

Efficiency of Task Completion

Efficiency of task completion was determined by calculating the ratio of the minimum number of user inputs required to the actual number of user inputs made (cf. Sect. 3.3). The data in Table 1 indicate an overall medium level of efficiency of about M = 0.4. This efficiency index shows that 40% of the user inputs contributed towards task completion, whereas the remaining inputs did not lead directly to the task goal or were part of a less direct path towards task completion. This indicates that the tasks were reasonably difficult to solve. As the data in Table 1 suggest, there was little difference between conditions, which was confirmed by analysis of variance (all F < 1).

4.2 Subjective Ratings

Affective State

For the analysis of the emotional state of the user as a consequence of using the product, a comparison was made between the baseline measurement (i.e. prior to task completion) and a second measurement taken after task completion. This analysis revealed a change in positive affect as a function of SRT. While participants reported an increase in positive affect after task completion when working with a non-delayed system (M = 0.12), a decrease in positive affect was reported when working with a delayed system (M = −0.17; F = 4.67; df = 1, 56; p < 0.05; \( \eta_{{\text{partial}}}^{2} \) = .077). Regarding changes in negative affect, no significant effect of SRT was found (F < 1). As the data in Table 1 show, testing context had no effect on the change in positive affect, nor was there an interaction (both F < 1). Equally, testing context had no effect on the change in negative affect (F = 2.98; df = 1, 56; p > 0.05; \( \eta_{{\text{partial}}}^{2} \) = .051), nor was there an interaction (F < 1).

Task Load

The data for the overall NASA-TLX score are presented in Table 1. While they indicate an overall low task load, there was very little difference between experimental conditions. This was confirmed by analysis of variance, which revealed neither a main effect of the two independent factors nor an interaction between them (all F < 1). To evaluate whether any differences could be found at the single-item level, a separate analysis of the NASA-TLX items was carried out. This analysis did not show any significant effects either.

Perceived Usability

The data for perceived usability, as measured by the PSSUQ, are presented in Table 1. Interestingly, SRT did not have the expected effect on subjective usability evaluations, with ratings being nearly identical in both conditions (F < 1). Usability ratings appeared to be higher in the work domain than in the leisure domain, but this difference failed to reach significance (F = 2.01; df = 1, 52; p > 0.05). No interaction between the two factors was found (F < 1). An additional analysis examined the PSSUQ subscales separately but revealed the same pattern of results as for the overall scale. Finally, results for the one-item usability scale indicated an overall rating of M = 52.2 with little difference between the four experimental conditions (all F < 1), thereby confirming the pattern of results found for the PSSUQ.

4.3 Correlational Analysis of Data

In addition to the comparisons between experimental conditions using analyses of variance, the size of correlations between variables may provide further insights into the UX construct and the interplay of its components. This point has notably been addressed by Hornbæk and Law [49], who argue that studies in this domain should report such correlations in order to facilitate the interpretation and comparison of outcomes.

Interestingly, the correlation coefficients (see Table 2) indicate in general rather low correlations between the different UX measures. Significant relationships were found between the different measures of objective performance, whereas performance indicators showed only small correlations with evaluations of perceived usability and task load. In contrast, subjective evaluation of task load was negatively correlated with perceived usability. Furthermore, negative affect showed significant links with perceived task load and task completion rate (cf. Table 2).
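Such a correlation matrix is straightforward to compute; a sketch with placeholder data and assumed column names could look like this:

```python
import numpy as np
import pandas as pd

# Placeholder data: one row per participant, one column per UX measure.
# Column names and values are illustrative assumptions, not the study's data.
rng = np.random.default_rng(0)
ux = pd.DataFrame(rng.normal(size=(60, 5)),
                  columns=["completion_rate", "efficiency",
                           "perceived_usability", "task_load", "negative_affect"])

# Pairwise Pearson correlations between all UX measures (cf. Table 2).
print(ux.corr(method="pearson").round(2))
```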

Table 2. Correlations between different UX measures (N = 60).

5 Discussion

The aim of the study was to investigate the influence of usage domain on the outcomes of user experience evaluations and whether any such influence would be moderated by poor system usability in the form of SRT delays. Contrary to expectations, usage domain did not have the expected impact, with none of the measures showing differences between domains. In contrast, system response time showed the expected effects on performance and on user emotion whereas, surprisingly, no influence on perceived usability was observed.

Given that context of use has been considered an important determinant of usability [4] and that the two domains of work and leisure have been associated with different perceptions and behaviour [26], we expected that testing a product in one domain would produce different usability test results than testing it in the other. The manipulation check showed very clearly that participants perceived the leisure domain differently from the work domain. Despite this successful manipulation of context (involving different usability lab set-ups, domain-specific task instructions, and a priming task), there were no differences in usability test results, either for performance or for subjective measures. Although non-significant results must be interpreted with caution [50], the publication and discussion of such findings is nevertheless important [51,52,53].

A possible interpretation of this null result might be that there is no need for practitioners to test dual-domain products in both usage domains. The domains of work and leisure may not require specific consideration in test set-ups, as long as the relevant use cases are covered in the test. The absence of an interaction between usage domain and system usability strengthens this argument, suggesting that even under conditions of impaired system usability the work domain yields test results that are no different from those in the leisure domain. One previous study comparing the work and leisure domains also found little difference between these two application domains [13]. However, in that study the usability of the technical device was not manipulated. Taken together, that study and the present work provide first evidence that, across a range of conditions (i.e. different levels of product aesthetics and of product usability), the influence of usage domain appears to be of smaller magnitude than expected. The results support Lindroth and Nilsson’s [54] claim that environmental aspects of usage context are generally not an important issue in usability testing as long as stationary technology use is concerned (which was the case in the present study, as the smartphone was operated like a desktop device). While Lindroth and Nilsson did not empirically test their proposition, the present work provides first empirical evidence to support it. Additional research (i.e. replication studies) corroborating these results is required, however, before such null results can be interpreted as grounds for practitioners to refrain from considering usage context in UX evaluation.

While these null results do not (yet) allow us to make such a decisive statement with regard to the context-dependency of UX evaluation outcomes, the findings have some implications for researchers and practitioners interested in the domain dependence of UX evaluation. The results indicate that the manipulation of the usage context (as implemented in this piece of research) did not show the expected effect. Although the successful manipulation check indicated that participants did distinguish between the contexts, the manipulation showed no influence on UX measures. This may raise some concerns as to whether motivational processes associated with the usage context can be appropriately reproduced in the lab. However, it has to be noted that this problem would affect all lab-based UX assessment, independently of the domain. In addition, previous research has shown that lab-based testing often provides results similar to those of tests conducted in the field [55]. In this context, the sample recruited for this study needs to be considered as a limitation, since the work context of students may not be fully transferable to salaried workers. Therefore, it might be worthwhile to address this research question with an additional user sample. Nonetheless, these results provide a strong argument for additional field-based research addressing the influence of usage context on UX assessment, preferably making a direct comparison with lab-based data.

While usage domain had little impact on the results of usability testing, a number of effects of poor system usability were found, confirming several of our research hypotheses. First, it emerged that poor system usability had the hypothesised negative effect on task performance. When SRT was delayed, task completion rate was lower and participants spent more time on a page, compared to participants working with a system without delays. These findings are consistent with an extensive body of research showing a negative impact of delayed system response on performance [e.g. 31, 37]. One explanation for this effect is that users adapt their speed of task completion to SRT and work faster when the system responds more promptly [56]. An alternative explanation for longer page inspection times under delayed SRT could be that participants adjusted their strategy, moving from a trial and error approach to a more reflective one, thus reducing the number of delayed system responses. Previous work has shown that even short SRT delays made participants consider their actions more carefully [38, 57].

Second, poor usability had a negative effect on participants’ affective state, consistent with our hypothesis. When working with a delayed system, participants showed a stronger reduction in positive affect than when working with a non-delayed system. This finding is consistent with an extensive body of research, showing negative effects of delayed SRT on various aspects of affective states, such as frustration, anxiety, stress and impatience [e.g. 38,39,40]. The present study adds to these findings by showing that such effects on emotion may occur, even if such SRT delays are short.

Third, although poor system usability had a negative effect on performance and on participants’ emotional state, no such effect was found for perceived usability. This observation is of particular interest since other work found a substantial positive association between performance and preference [58]. While users generally provide more positive evaluations of more usable systems, Nielsen and Levy [58] also cite cases in their meta-analysis in which users preferred systems with which they performed worse. These systems, however, had rather short SRT delays, and performance impairments did not reach critical levels. The magnitude of the delay in our study might have been below that critical level and therefore did not affect perceived usability. An alternative explanation is that participants did perceive the delays but attributed them not to the application itself but to the server from which the pages were downloaded. Similar observations were made in other work, where users of internet-based software attributed delayed responses to the internet connection rather than to the software itself [59]. Overall, although we employed rather short SRT delays, most of our hypotheses were confirmed, which highlights the importance of paying attention to even short delays during system design, as they may affect performance and user emotion.

The correlational analysis of the different measures of the UX construct revealed in general rather low correlations. Objective performance measures, subjective evaluations of usability and affective states, as important aspects describing an experience episode, seem to be rather independent dimensions and hence should be assessed and reported separately when UX is the scope of measurement. A combination into a single UX score therefore does not seem useful. These findings are in line with the results of previous studies indicating that measures of user performance often have only a weak link to subjective measures of usability [49, 57; see also 60]. Radar charts might represent a useful and easy-to-understand way to display the results of a holistic UX evaluation consisting of different dimensions or facets. The results of this study could be represented as suggested in Fig. 2 with regard to the evaluation of the two different versions of the prototype.

Fig. 2. Presentation of the results of the UX evaluations of the two versions of the mobile app as a radar diagram
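A radar chart of this kind can be produced with standard plotting libraries; a minimal matplotlib sketch (with made-up scores normalised to a common 0–1 range; labels and values are illustrative only) could look as follows:

```python
import numpy as np
import matplotlib.pyplot as plt

# Facets of the UX evaluation and made-up, normalised scores (0-1) for the
# two prototype versions; labels and values are illustrative placeholders.
facets = ["Completion rate", "Efficiency", "Positive affect",
          "Perceived usability", "Low task load"]
delayed = [0.65, 0.40, 0.45, 0.52, 0.70]
undelayed = [0.80, 0.41, 0.60, 0.52, 0.72]

angles = np.linspace(0, 2 * np.pi, len(facets), endpoint=False).tolist()
angles += angles[:1]  # close the polygon

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
for scores, label in [(delayed, "delayed SRT"), (undelayed, "non-delayed SRT")]:
    values = scores + scores[:1]
    ax.plot(angles, values, label=label)
    ax.fill(angles, values, alpha=0.1)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(facets)
ax.legend(loc="lower right")
plt.show()
```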

The findings presented have several implications for research and practice. First, they provide initial evidence that results of a usability test can be transferred between the work and the leisure domain. This would facilitate usability testing of dual-domain products for practitioners, since not every usage context would have to be covered separately; they only need to ensure that the relevant tasks are included in the test set-up. Second, practitioners and researchers interested in context-dependent UX evaluation should address this issue in field-based research. Third, even rather short SRT delays can affect performance as well as user emotion, suggesting that careful consideration should be given to SRT in product design and evaluation.