1 Introduction

Usability is a fundamental attribute of software quality. The concept has been known for decades and is still evolving. The ISO 9241-210 standard defines usability as “the extent to which a system, product or service can be used by specified users to achieve specified goals with effectiveness, efficiency and satisfaction in a specified context of use” [1].

Lewis identifies two approaches to usability evaluation: (1) summative, “measurement-based usability”, and (2) formative, “diagnostic usability” [2]. Usability evaluation methods are usually classified into two categories: (1) usability testing, based on users’ participation, and (2) inspection methods, based on experts’ judgment.

For more than two decades, heuristic evaluation has arguably been the most popular usability inspection method [3]. When performing a heuristic evaluation, generic or specific heuristics may be used. Nielsen’s ten usability heuristics are well known, but many sets of specific (usually domain-related) usability heuristics have also been published [4, 5]. The results of a heuristic evaluation depend on several factors, but at least two of them are critical: (1) the evaluators’ expertise, and (2) the set of heuristics that is employed. Scales for assessing the quality of heuristics have also been proposed.

This paper presents a comparative study of evaluators’ perception of three sets of usability heuristics when evaluating the same product: Nielsen’s heuristics [6], a set of cultural-oriented heuristics for e-Commerce [7], and a set of heuristics for smartphone applications, SMASH [8]. Our experiment involved 18 graduate and 20 undergraduate Computer Science students enrolled in a Human-Computer Interaction (Usability and User eXperience oriented) course. We used Expedia.com as a case study [9]; both the web and mobile versions were evaluated.

2 Comparing Three Sets of Usability Heuristics: A Case Study

For several years we have conducted studies on evaluators’ perception of generic and specific usability heuristics [10,11,12]. All participants are asked to perform a heuristic evaluation of the same case study, and then to take part in a post-experiment survey.

We developed a questionnaire that assesses evaluators’ perception of a set of usability heuristics through 4 dimensions and 3 questions:

  • D1 – Utility: How useful the usability heuristic is.

  • D2 – Clarity: How clear the usability heuristic is.

  • D3 – Ease of use: How easy it was to associate identified problems with the usability heuristic.

  • D4 – Necessity of additional checklist: How necessary it would be to complement the usability heuristic with a checklist.

  • Q1 – Easiness: How easy was it to perform the heuristic evaluation, based on the given set of usability heuristics?

  • Q2 – Intention: Would you use the same set of usability heuristics when evaluating a similar software product in the future?

  • Q3 – Completeness: Do you think the set of usability heuristics covers all usability aspects for this kind of software product?

Each heuristic is rated individually, on 4 dimensions (D1 – Utility, D2 – Clarity, D3 – Ease of use, D4 – Necessity of additional checklist). The set of usability heuristics is also rated globally, through the 3 questions (Q1 – Easiness, Q2 – Intention, Q3 – Completeness). In all cases, we use a 5-point Likert scale (from 1 – worst, to 5 – best).
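
As a minimal illustration, the aggregation of the individual Likert ratings into per-dimension average scores (as reported later in Table 1) can be sketched as follows; the ratings below are hypothetical, not the actual experimental data:

```python
# Hypothetical Likert ratings (1-5) given by three evaluators to one
# heuristic, on the four dimensions D1-D4 of the questionnaire.
ratings = {
    "D1_utility":   [5, 4, 4],
    "D2_clarity":   [4, 4, 3],
    "D3_ease":      [3, 4, 3],
    "D4_checklist": [4, 5, 4],
}

# Average score per dimension, rounded to two decimals.
averages = {dim: round(sum(s) / len(s), 2) for dim, s in ratings.items()}
print(averages)  # D1_utility -> 4.33, D2_clarity -> 3.67, ...
```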

We conducted an experiment with 38 Computer Science graduate and undergraduate students enrolled in an introductory HCI course at Pontificia Universidad Católica de Valparaíso (Chile). We did not sample; all students enrolled in the HCI course participated in the experiment. The group composition was as follows:

  • 20 undergraduate students: 13 without previous experience in heuristic evaluation, and 7 with some experience based on Nielsen’s heuristics.

  • 18 graduate students: 11 without previous experience in heuristic evaluation, and 7 with some experience based on Nielsen’s heuristics.

All students evaluated the online travel agency Expedia.com, following the same protocol, but based on three different sets of heuristics:

  • Nielsen’s 10 usability heuristics [6],

  • A set of e-Commerce (cultural-oriented) heuristics [7],

  • SMASH, a set of usability heuristics for smartphone applications [8].

They evaluated the Expedia.com website (based on Nielsen’s and e-Commerce heuristics), as well as the Expedia mobile application (based on Nielsen’s, e-Commerce, and SMASH heuristics). After performing the heuristic evaluation, all participants were asked to rate their experience using the standard questionnaire described above.

Additionally, two open questions were asked:

  • OQ1: What did you perceive as most difficult to perform during the heuristic evaluation?

  • OQ2: What domain-related (online travel agencies) aspects do you think the set of usability heuristics does not cover?

3 Results and Discussion

Table 1 presents the average scores for dimensions and questions. Results are presented globally, and also grouped by students’ level (undergraduate or graduate) and level of expertise (with or without previous experience in heuristic evaluation).

Table 1. Average scores for dimensions and questions.

Heuristics’ perceived utility (D1) is high in all cases (average scores over 4.00). For students with previous experience, perceived utility is slightly in favor of the e-Commerce and SMASH heuristics compared to Nielsen’s heuristics. In general, students with previous experience rate the utility of the e-Commerce and SMASH heuristics higher than their novice colleagues do.

Students perceived the clarity (D2) of the e-Commerce and SMASH heuristics as better than that of Nielsen’s heuristics. Perceived clarity is consistently higher among students with previous experience.

Heuristics’ perceived ease of use (D3) is lower than their perceived utility and clarity. The ease-of-use perception is consistently better among students with previous experience. The e-Commerce and SMASH heuristics are perceived as slightly easier to use than Nielsen’s heuristics.

The perceived necessity of an additional checklist (D4) is highest for Nielsen’s heuristics; it remains relatively high for the e-Commerce and SMASH heuristics. It is generally higher for novices than for their more experienced colleagues.

The overall perception of easiness (Q1, how easy it was to perform the heuristic evaluation) is lower than the heuristics’ perceived utility, clarity, and ease of use. It is quite close to the neutral point of the scale (3). As expected, it is lower for novices than for more experienced students.

Even though heuristic evaluation is not perceived as an easy task, the intention of future use (Q2) is remarkably high for all three sets of heuristics. It is slightly higher among graduate students than among undergraduate students.

As expected, students consider that the e-Commerce and SMASH heuristics cover the usability aspects of online travel agencies better than Nielsen’s heuristics do (Q3). Their opinion of Nielsen’s heuristics is less favorable by roughly 1 point.

The descriptive statistics presented above were complemented with inferential statistics. As mentioned, all questionnaire items are based on a 5-point Likert scale. The observations are on an ordinal scale, and no assumption of normality could be made. Therefore, the survey results were analyzed using nonparametric statistical tests (Mann-Whitney U, Friedman, and Spearman ρ).

As the samples are independent, Mann-Whitney U tests were performed to check the hypotheses:

  • H0: there are no significant differences between the perceptions of students with different backgrounds,

  • H1: there are significant differences between the perceptions of students with different backgrounds.
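
Under these hypotheses, the Mann-Whitney U test on two independent samples can be sketched as follows (using SciPy; the ratings are hypothetical, not the actual experimental data):

```python
from scipy.stats import mannwhitneyu

# Hypothetical D1 (utility) ratings from two independent groups:
# students with and without previous heuristic-evaluation experience.
experienced = [5, 4, 5, 4, 4, 5, 4]
novices = [4, 3, 4, 4, 3, 5, 3, 4, 4, 3, 4, 3, 4]

# Two-sided test of H0: no difference between the two groups.
stat, p = mannwhitneyu(experienced, novices, alternative="two-sided")
if p <= 0.05:
    print(f"Reject H0 (U={stat}, p={p:.3f})")
else:
    print(f"Fail to reject H0 (U={stat}, p={p:.3f})")
```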

As the same group of students evaluated three different sets of heuristics, the Friedman test was performed to check the hypotheses:

  • H0: there are no significant differences between students’ perception of Nielsen’s, e-Commerce, and SMASH heuristics,

  • H1: there are significant differences between students’ perception of Nielsen’s, e-Commerce, and SMASH heuristics.
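
A Friedman test on three related samples (the same students rating all three sets of heuristics) can be sketched as follows; again, the ratings are hypothetical, not the actual experimental data:

```python
from scipy.stats import friedmanchisquare

# Hypothetical Q3 (completeness) ratings given by the same six students
# to the three sets of heuristics (repeated measures on one group).
nielsen = [3, 2, 3, 3, 2, 3]
ecommerce = [4, 4, 3, 4, 4, 4]
smash = [4, 3, 4, 4, 3, 4]

# H0: no difference between the three sets; reject when p <= 0.05.
stat, p = friedmanchisquare(nielsen, ecommerce, smash)
print(f"chi2={stat:.2f}, p={p:.3f}")
```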

Spearman ρ tests were performed to check the hypotheses:

  • H0: ρ = 0, the dimensions/questions D/Qm and D/Qn are independent,

  • H1: ρ ≠ 0, the dimensions/questions D/Qm and D/Qn are dependent.
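
A Spearman ρ test between two questionnaire items can be sketched as follows (with hypothetical per-student ratings, not the actual experimental data):

```python
from scipy.stats import spearmanr

# Hypothetical per-student ratings on two dimensions, D2 (clarity)
# and D3 (ease of use), for the same set of heuristics.
d2 = [4, 3, 5, 4, 2, 5, 3, 4]
d3 = [4, 3, 5, 3, 2, 4, 3, 4]

# H0: rho = 0 (independent); reject when p <= 0.05.
rho, p = spearmanr(d2, d3)
print(f"rho={rho:.2f}, p={p:.3f}")  # rho > 0 indicates a positive monotonic association
```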

In all tests, p-value ≤ 0.05 was used as the decision rule.

Table 2 shows the Mann-Whitney U test results when comparing students with and without previous experience. Significant differences occur in very few cases:

Table 2. Mann-Whitney U test results when comparing students with and without previous experience.

  • In the case of Nielsen’s heuristics, regarding Q1 (easiness) for undergraduate students, and regarding Q1 (easiness) and Q3 (completeness) for graduate students.

  • In the case of the e-Commerce heuristics, for none of the dimensions or questions.

  • In the case of SMASH heuristics, regarding D3 (ease of use) for undergraduate students.

As Table 3 shows, there are no significant differences between undergraduate and graduate students for any of the three sets of heuristics.

Table 3. Mann-Whitney U test results when comparing undergraduate and graduate students.

Table 4. Friedman test results when comparing students’ perception of Nielsen’s, e-Commerce, and SMASH heuristics.

Friedman test results (Table 4) show significant differences between students’ perception of Nielsen’s, e-Commerce, and SMASH heuristics in only three cases:

  • D3 (ease of use) and Q3 (completeness) for undergraduate students.

  • Q3 (completeness) for graduate students.

As the Mann-Whitney U test results show no significant differences between undergraduate and graduate students, and very few significant differences when comparing students with and without previous experience, Spearman ρ tests were performed for the whole group of 38 students.

In the case of Nielsen’s heuristics, 8 correlations occur between dimensions/questions (Table 5):

Table 5. Spearman ρ test for Nielsen’s heuristics: correlations between dimensions and questions.

  • A strong correlation between D2–D3: if heuristics are perceived as clear, they are also perceived as easy to use.

  • 4 moderate correlations between D1–D2, D3–Q1, D2–Q2, and Q2–Q3. If heuristics are perceived as clear, they are also perceived as useful, and there is also a declared intention of future use. If heuristics are perceived as easy to use, the whole heuristic evaluation is perceived as easy to perform. If the set of heuristics is perceived as complete, there is a declared intention of future use.

  • 3 weak correlations between D1–D3, D1–Q2, and D3–Q2. If heuristics are perceived as useful, they are also perceived as easy to use and there is also a declared intention of future use. If heuristics are perceived as easy to use, there is a declared intention of future use.

  • It is worth mentioning that there is no correlation between D4 (necessity of additional checklist) and any other dimension or question.

As Table 6 indicates, 16 correlations occur in the case of e-Commerce heuristics:

Table 6. Spearman ρ test for e-Commerce heuristics: correlations between dimensions and questions.

  • Dimension D1 is correlated with all other dimensions and questions. When heuristics are perceived as useful, they are also perceived as clear and easy to use (strong correlations D1–D2, D1–D3); however, there is also a declared necessity for an additional checklist (moderate correlation D1–D4). When heuristics are perceived as useful, the set of heuristics is perceived as complete, the whole heuristic evaluation is perceived as easy to perform (moderate correlations D1–Q1 and D1–Q3), and there is an intention of future use (weak correlation D1–Q2).

  • Dimension D3 is also correlated with all other dimensions and questions. When heuristics are perceived as easy to use, they are also perceived as useful (strong correlation D1–D3) and clear (very strong correlation D2–D3), and there is a perceived necessity for an additional checklist (moderate correlation D3–D4). The whole evaluation is perceived as easy to perform, there is a declared intention of future use of the e-Commerce heuristics, and the set of heuristics is perceived as complete (moderate correlations D3–Q1, D3–Q2, and weak correlation D3–Q3).

  • Dimension D2 is correlated with all other dimensions and questions except one (Q1). Correlations are very strong (D2–D3), strong (D1–D2), moderate (D2–Q2, D2–Q3), or weak (D2–D4).

  • Question Q3 is also correlated with all other dimensions and questions except one (D4). Correlations are moderate (D1–Q3, D2–Q3) or weak (D3–Q3, Q1–Q3, Q2–Q3).

  • Question Q2 is correlated with all other dimensions and questions except D4 and Q1. Correlations are moderate (D2–Q2, D3–Q2) or weak (D1–Q2, Q2–Q3).

  • Fewer correlations occur for Q1 (moderate correlations with D1 and D3, a weak correlation with Q3) and for D4 (weak to moderate correlations with the other dimensions, but none with the questions).

Table 7 highlights 15 correlations in the case of SMASH heuristics:

Table 7. Spearman ρ test for SMASH heuristics: correlations between dimensions and questions.

  • As in the case of the e-Commerce heuristics, dimension D1 is correlated with all other dimensions and questions (strong correlations D1–D2 and D1–Q2, a moderate correlation D1–D3, and weak correlations D1–D4, D1–Q1, and D1–Q3).

  • Dimensions D2 and D3 are correlated with all other dimensions and questions except Q3. Correlations are strong or moderate; there is a very strong correlation between D2–D3.

  • Dimension D4 is correlated with all other dimensions and questions except Q1 and Q3. Correlations are moderate or weak.

  • Question Q2 is correlated with all dimensions (weak to strong correlations), and with question Q3 (moderate correlation).

  • Question Q1 is correlated only with dimensions D1 (weak correlation), and D2 and D3 (moderate correlations).

  • Question Q3 is correlated only with dimension D1 (weak correlation) and question Q2 (moderate correlation).

Fewer correlations occur in the case of the generic (Nielsen’s) heuristics (8) than in the case of the specific heuristics (e-Commerce, 16; SMASH, 15). Where they occur, all correlations are positive.
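
The strength labels used above (weak, moderate, strong, very strong) are not given explicit thresholds in the text; a conventional mapping of ρ magnitudes, offered here only as a sketch with assumed cut-offs, could be:

```python
def correlation_strength(rho: float) -> str:
    """Conventional label for a Spearman rho magnitude.

    The cut-offs below are an assumption (a common rule of thumb),
    not thresholds taken from this study.
    """
    r = abs(rho)
    if r >= 0.80:
        return "very strong"
    if r >= 0.60:
        return "strong"
    if r >= 0.40:
        return "moderate"
    if r >= 0.20:
        return "weak"
    return "negligible"

print(correlation_strength(0.65))  # strong
```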

4 Conclusions

Heuristic evaluation is arguably the most popular usability inspection method. We systematically conduct studies on evaluators’ perception of generic and specific usability heuristics. We use a questionnaire that evaluates each heuristic individually (Utility, Clarity, Ease of use, Necessity of additional checklist), as well as the set of heuristics as a whole (Easiness, Intention, Completeness).

Performing heuristic evaluation based on Nielsen’s heuristics is standard practice when we teach Human-Computer Interaction courses. This time we asked students to evaluate the same product based on three sets of heuristics: Nielsen’s, e-Commerce, and SMASH heuristics.

The experiment involved graduate and undergraduate students. There were no significant differences between undergraduate and graduate students’ perception for any of the three sets of heuristics. When comparing students with and without previous experience, significant differences occurred in very few cases. The Friedman test results showed significant differences between students’ perception of Nielsen’s, e-Commerce, and SMASH heuristics in only three cases.

Correlations were fewer in the case of the generic (Nielsen’s) heuristics (8) than in the case of the specific heuristics (e-Commerce, 16; SMASH, 15). Where they occurred, all correlations were positive.

In general, students’ perception of the specific (e-Commerce and SMASH) heuristics was slightly better than of the generic (Nielsen’s) heuristics. As expected, students considered that the e-Commerce and SMASH heuristics cover the usability aspects of online travel agencies better than Nielsen’s heuristics do.

As future work, we intend to analyze the perception of each heuristic individually. We will also analyze students’ responses to the open questions.