1 Introduction

The user interface of Open Access (OA) repositories affects users’ performance and satisfaction. To support the ongoing development of these repositories, usability evaluations of their user interfaces need to be carried out. This research has two foci: to evaluate the usability of an Institutional Repository interface, as part of a university’s digital library, using Nielsen’s heuristics to uncover usability problems, and to examine the differences between user-interface experts and non-experts in uncovering problems with the interface.

1.1 What Is Usability?

In 1998, the term “user friendly” had reached such a level of vagueness and subjectivity that the term “usability” began to be used instead [1]. The International Organization for Standardization (ISO) [26] defines usability as “the extent to which a product can be used by specified users to achieve specified goals with effectiveness, efficiency and satisfaction in a specified context of use.”

Nielsen [2] suggests that usability cannot be measured along a single dimension; instead, it comprises five attributes: learnability, efficiency, memorability, error recovery, and satisfaction. Hix and Hartson [3] suggest that usability relies on factors including first impression, initial performance, long-term performance, and user satisfaction. Booth [4] and Brink et al. [5] share a similar viewpoint, defining usability in terms of effectiveness, efficiency, ease of learning, a low error rate, and pleasantness of use. Nielsen’s and the ISO’s definitions of usability are the most widely used [6, 27, 28].

1.2 Usability Evaluations

Evaluation is a basic step in the iterative design process. There is a variety of approaches to evaluating usability, including formal usability inspection by Kahn and Prail [7], the cognitive walkthrough by Wharton et al. [8], heuristic evaluation by Nielsen [2, 9, 10], contextual task analysis [11], and paper prototyping by Lancaster [12].

1.3 What Are Institutional Repositories?

Institutional Repositories (IRs) are popular among universities worldwide [13]. As a channel through which a university structures its contribution to the global community, an IR carries with it a responsibility to reassess both culture and policy and their relationship to one another [14].

Over the past fifteen to twenty years, research libraries have created, stored, managed, and preserved scholarly documents in digital form and made them available online via digital Institutional Repositories [15]. IRs host various types of documents [15]. One example of an IR platform is DSpace [16]. In 2000, the Hewlett-Packard Company (HP) and the MIT Libraries began a collaboration to build DSpace, an Institutional Repository for hosting the intellectual output of “multi-disciplinary” organizations in digital formats [17].

1.4 Nielsen’s List of Heuristics

The set of heuristics was constructed from a number of usability principles and interface guidelines [18]. The heuristics are: visibility of system status; match between the system and the real world; user control and freedom; consistency and standards; error prevention; recognition rather than recall; flexibility and efficiency of use; aesthetic and minimalist design; help users recognize, diagnose, and recover from errors; and help and documentation [9].

2 Related Work

Ping, Ramaiah, and Foo [19] tested the user interface of the Gateway to Electronic Media Services (GEMS) system at Nanyang Technological University. Their goal was to apply Nielsen’s heuristics to find strengths and weaknesses of the system. They found that the heuristic evaluation helped to uncover major problems, such as users being unable to obtain the search results they wanted, and suggested that uncovering these problems demonstrated that the GEMS system needed further development.

Qing and Ruhua [20] point out that usability evaluation of Discipline Repositories (DRs) offers digital library developers a critical understanding of four areas: understanding target users’ needs, finding design problems, creating a focus for development, and establishing the acceptability of such educational interactive tools. Three DRs were evaluated: arXiv, PMC (PubMed Central), and E-LIS. The three DRs differ in subject domain and design structure. The findings show that DRs inherit some already successful features from digital libraries (DLs). All three repositories provide only limited advanced search tools for displaying and refining search results.

Hovater et al. [21] examined the Virtual Data Center (VDC) interface, an open access web-based digital library that collects and manages research in the social sciences. The researchers conducted a usability evaluation followed by user testing. They found minor and major problems, including “lack of documentation, unfamiliar language, and inefficient search functionality”.

Zimmerman and Paschal [15] examined the digital collection of Colorado State University through tasks that focused only on the search functions of the website. A talk-aloud approach was used to observe participants. They found that two-fifths of users had problems downloading documents, which would discourage them from using the service. The findings suggest that the interface should be evaluated periodically to ensure the usability of its features.

Zhang et al. [22] evaluated three operational digital libraries: the ACM DL, the IEEE Computer Society DL, and the IEEE Xplore DL. Heery and Anderson [23] conducted a review of digital repositories and produced a report for repository software developers, in which they stress that engaging users is vital when developing open access repositories.

3 Heuristic Evaluation Study

The heuristic evaluation study was conducted on a DSpace installation that extends the university’s library services and enables users to browse the university’s collections and scholarly output. Our focus on evaluating Institutional Repositories (IRs) is motivated by the need to examine the usability of their interfaces, since usability evaluation of IRs is still fairly new. The research objectives of evaluating the university repository interface are:

  • To determine the usability problems of the University Repository interface.

  • To provide solutions and guidelines regarding the uncovered problems.

  • To provide the University’s development team with the suggested solutions for use in the iterative design process.

Two key questions are investigated: does evaluators’ expertise affect the reliability of the results of applying heuristic evaluation to the user interface, and does the number of evaluators affect that reliability? To answer the first question, we consider the following hypotheses:

  • Severe problems will be uncovered by experts while the minor problems will be uncovered by novices

  • Difficult problems can only be uncovered by experts and easy problems can be uncovered by both experts and novices

  • The best evaluator will be an expert

  • As Nielsen and Mack [24] reported for the traditional heuristic evaluation, experts will tend to produce better results than novices

  • The average number of problems uncovered by experts and novices will differ; experts are expected to find more problems than novices

To answer the second question (does the number of evaluators affect the reliability of the results?), we consider the following hypotheses:

  • A small set of evaluators (experts) can find about 75 % of the problems in the user interface as Nielsen and Mack [24] suggest.

  • More of the serious problems will be uncovered by the group (experts or non-experts) with the most members

3.1 Participants

To produce a reliable list of usability problems, multiple evaluators are better than one, because different people uncover different problems from different perspectives. A total of 16 participants, all university students, were recruited and divided into three groups: 10 experts, 4 amateurs, and 2 novices.

3.2 Tasks

The tasks were designed around the most important interface elements to be examined, based on the results of a previous user-personas study [29]. Each task description includes:

  • The goal of the task;

  • The type of the task (regular, important, or critical);

  • The actual steps that a typical user would follow to perform the task;

  • The possible problems that users might face while performing the task;

  • The time for an expert to reach the goal;

  • The scenario.

3.3 Methodology

We started by conducting a tutorial lecture on the heuristics and how evaluators should apply them to the interface dialogs during the evaluation session. Because examples are usually more effective than lecturing alone, the researcher explained the main concept of each heuristic and gave examples. This was meant to help evaluators carry out the evaluation without difficulty when referring to the heuristics. Evaluators who had not performed a heuristic evaluation before were required to attend the lecture to increase their knowledge of the heuristics and the overall method. Evaluators with prior heuristic-evaluation experience did not need to review the heuristics, but they did need to be trained in using the interface. The objective of this lecture was therefore to increase evaluators’ knowledge of how to apply the heuristics.

The study lasted 120 min. Participants started with the training session, followed by the evaluation session. A severity rating was then assigned to each uncovered usability problem. Finally, a solutions session was held to discuss the problems and propose guidelines for addressing them.

4 Results and Discussion

4.1 Problems Report

A report describing each uncovered problem was delivered to the developers of the University DSpace for development purposes. We believe that uncovering these problems would benefit any university that utilizes a DSpace repository as part of its digital library to maintain its scholarly output.

4.2 Number of Problems

For each problem and evaluator, data were coded as 1 for detected and 0 for not detected. Table 1 shows that the average number of problems found by experts was 6.8, while the averages for amateurs and novices were 3.5 and 2.5, respectively. The difference between the means was not significant (F(2,13) = 3.205, p = .075), with an effect size of η² = .330; some would consider this effect marginal. The lack of significance, combined with the sizable effect, is likely due to the small sample sizes.
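
As an illustration of this analysis, the sketch below shows how a one-way ANOVA and the corresponding η² (SS_between / SS_total) could be computed in Python. The per-evaluator totals are hypothetical placeholders chosen only to match the reported group means (6.8, 3.5, and 2.5), so the resulting F and p will not exactly reproduce the published values.

```python
# One-way ANOVA on per-evaluator problem counts, with eta-squared computed
# as SS_between / SS_total. The totals below are hypothetical placeholders
# chosen to match the reported group means, not the study data.
import numpy as np
from scipy import stats

experts  = np.array([16, 10, 9, 7, 6, 6, 6, 3, 3, 2])   # n = 10, mean = 6.8
amateurs = np.array([5, 4, 3, 2])                        # n = 4,  mean = 3.5
novices  = np.array([3, 2])                              # n = 2,  mean = 2.5

f_stat, p_value = stats.f_oneway(experts, amateurs, novices)

groups = [experts, amateurs, novices]
all_scores = np.concatenate(groups)
grand_mean = all_scores.mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_total   = ((all_scores - grand_mean) ** 2).sum()
eta_squared = ss_between / ss_total

print(f"F(2,13) = {f_stat:.3f}, p = {p_value:.3f}, eta^2 = {eta_squared:.3f}")
```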

Table 1. Evaluators’ performance

As would be expected, the largest difference was between Experts and Novices. However, further analyses indicated that Experts did not differ significantly from Amateurs (F(1,12) = 3.639, p = .081, η² = .233), that Experts did not differ significantly from Novices (F(1,10) = 3.141, p = .107, η² = .239), and that Amateurs did not differ significantly from Novices (F(1,4) = 1.333, p = .970, η² = .195).

Not surprisingly, the best evaluator was an expert (evaluator ID 10), who found 21 % of all problems (note that the total number of problems is the final count after the aggregation process, excluding the “non-issues”). In contrast, the best amateur found only 7.6 % of the total and the best novice only 4.5 %. The worst expert, amateur, and novice each found just 3 % of the total.

4.3 The Severity of Uncovered Problems

Of the 66 uncovered problems, 17 were classified as Catastrophic (Level 4), 17 as Major (Level 3), 21 as Minor (Level 2), and 11 as Cosmetic (Level 1). Minor problems were the most common, but the differences among severity levels were not significant in a chi-square analysis (χ²(3) = 3.09, p = .377). That more severe problems were not found in greater numbers is likely attributable to the fact that the DSpace website has been in use for a number of years; the majority of major and catastrophic problems have probably already been uncovered and fixed.
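
For reference, the reported chi-square statistic follows from testing the observed severity counts against a uniform expectation of 66/4 = 16.5 problems per level; a minimal sketch (using SciPy, whose chisquare defaults to uniform expected frequencies) is shown below.

```python
# Goodness-of-fit test of the severity distribution against a uniform expectation
# (66 / 4 = 16.5 problems per level), using the counts reported above.
from scipy.stats import chisquare

observed = [17, 17, 21, 11]        # Catastrophic, Major, Minor, Cosmetic
chi2, p = chisquare(observed)      # expected frequencies default to uniform
print(f"chi2(3) = {chi2:.2f}, p = {p:.3f}")   # roughly 3.09 and 0.38
```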

4.4 The Severity by Expertise Interaction

Nielsen [18] suggested that usability specialists are better at uncovering problems than novices. To examine this, we compared the types of usability problems uncovered by experts and novices. Each level of severity (Catastrophic, Major, Minor, Cosmetic, not including Non-Issues) was considered in isolation. The full analysis is a mixed ANOVA with one between-subjects factor (Group) and one within-subjects factor (Severity). This analysis indicated that there were no differences between groups (F(2,13) = 3.205, p = .075, η² = .330, as noted above), no differences across Severity (F(3,39) = 1.375, p = .265, η² = .051), and no interaction (F(3,39) = 0.521, p = .698, η² = .039). However, one must again be mindful of the small sample sizes. The means are provided in Table 2 and Fig. 1.
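
A sketch of how such a mixed ANOVA could be run in Python is given below, assuming the coded data are arranged in a long-format table with one row per evaluator and severity level; the column names and file path are illustrative rather than taken from the study, and pingouin reports partial η² (np2) rather than the η² values quoted above.

```python
# Mixed ANOVA (between: group; within: severity) on the coded detection counts.
# Assumes a long-format CSV with one row per evaluator x severity level;
# column names and the file path are illustrative.
import pandas as pd
import pingouin as pg

df = pd.read_csv("detections_long.csv")   # placeholder path
# expected columns: 'evaluator' (id), 'group' (Expert/Amateur/Novice),
#                   'severity' (Catastrophic/Major/Minor/Cosmetic), 'n_found'

aov = pg.mixed_anova(data=df, dv="n_found", within="severity",
                     subject="evaluator", between="group")
print(aov[["Source", "F", "p-unc", "np2"]])
```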

Table 2. Evaluators’ performance within each level of severity
Fig. 1. Evaluators’ mean performance as a function of problem severity

Because we were more concerned about the severity of the problems found by each group of evaluators, specific tests for each level of severity were computed. For Catastrophic problems (Level 4), the number of problems detected by Experts was higher than the number detected by Amateurs and Novices, but the difference was not significant. Further analyses revealed that Experts did not differ significantly from Amateurs (F(1,12) = 3.377, p = .091, η² = .220), that Experts did not differ significantly from Novices (F(1,10) = 0.714, p = .418, η² = .067), and that Amateurs did not differ significantly from Novices (F(1,4) = 1.333, p = .312, η² = .250). The same pattern held for Major (Level 3), Minor (Level 2), and Cosmetic (Level 1) problems. For Major problems, Experts did not differ from Amateurs (F(1,12) = 2.455, p = .143, η² = .170), Experts did not differ from Novices (F(1,10) = 4.276, p = .127, η² = .217), and Amateurs did not differ from Novices (F(1,4) = 0.333, p = .506, η² = .118). For Minor problems, Experts did not differ from Amateurs (F(1,12) = 0.489, p = .498, η² = .039), Experts did not differ from Novices (F(1,10) = 0.542, p = .478, η² = .051), and Amateurs did not differ from Novices (F(1,4) = 0.038, p = .855, η² = .009). Finally, for Cosmetic problems, Experts did not differ from Amateurs (F(1,12) = 0.023, p = .822, η² = .002), Experts did not differ from Novices (F(1,10) = 0.437, p = .524, η² = .042), and Amateurs did not differ from Novices (F(1,4) = 1.091, p = .355, η² = .214).
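
The per-severity pairwise comparisons follow the same pattern as the overall analysis; the sketch below illustrates the structure of these tests with hypothetical per-evaluator counts (only two severity levels are shown for brevity).

```python
# Pairwise group comparisons within each severity level (one-way ANOVAs).
# The per-evaluator counts are hypothetical placeholders, not the study data.
from itertools import combinations
from scipy.stats import f_oneway

counts = {
    "Experts":  {"Catastrophic": [3, 2, 2, 1, 1, 1, 1, 0, 0, 0],
                 "Major":        [2, 2, 1, 1, 1, 1, 0, 0, 0, 0]},
    "Amateurs": {"Catastrophic": [1, 1, 0, 0],
                 "Major":        [1, 0, 0, 0]},
    "Novices":  {"Catastrophic": [1, 0],
                 "Major":        [0, 0]},
}

for severity in ["Catastrophic", "Major"]:
    for g1, g2 in combinations(counts, 2):
        f, p = f_oneway(counts[g1][severity], counts[g2][severity])
        print(f"{severity}: {g1} vs {g2}: F = {f:.3f}, p = {p:.3f}")
```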

4.5 Does One Need Experts, Amateurs, and Novices?

Even though the differences were not significant, Experts consistently found more problems than Amateurs, and Amateurs consistently (except for Catastrophic problems) found more problems than Novices.

It would seem that experts find most of the problems, and more of the serious problems. However, the simple presentation in Table 3 is confounded by the fact that there were more Experts (n = 10) than Amateurs (n = 4) or Novices (n = 2); more people will naturally find more problems. As such, the per-evaluator analysis presented earlier (Tables 1 and 2) is a better measure of the capabilities of a single evaluator. Nevertheless, these data do provide an opportunity to estimate the number of evaluators of each type that would be required to find all problems. Using simple linear extrapolation (i.e., a ratio), as shown in Table 4, one could conclude that it would require 17 Novices, 24 Amateurs, or 12 Experts to find all the Catastrophic problems.

Table 3. Severity of problems uncovered by evaluators
Table 4. Number of evaluators who would be required to find all problems
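
The extrapolation in Table 4 can be expressed as a simple ratio: if a group of n evaluators jointly found f of the T problems at a given severity level, then roughly n × T / f evaluators of that type would be needed to find all T. The sketch below illustrates this rule with placeholder detection counts; because the actual Table 3 counts are not reproduced here, the printed numbers will not exactly match Table 4.

```python
# Linear-extrapolation rule behind Table 4. Detection counts per group are
# placeholders, not the Table 3 values, so the output is illustrative only.
from math import ceil

total_catastrophic = 17
group_sizes    = {"Novices": 2, "Amateurs": 4, "Experts": 10}
found_by_group = {"Novices": 2, "Amateurs": 3, "Experts": 14}   # hypothetical

for name, n in group_sizes.items():
    required = ceil(n * total_catastrophic / found_by_group[name])
    print(f"{name}: roughly {required} evaluators needed to find all Catastrophic problems")
```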

Implications: this is consistent with Nielsen’s observations [25]. The severity of the problems uncovered by experts tended to be higher than the severity of those uncovered by novices. Hence, one could conclude that a small set of expert evaluators is needed to find the severe usability problems.

4.6 Difficulty of Uncovering Problems

The performance of evaluators can be rated according to the difficulty of uncovering problems in the DSpace interface. By this we mean that an Easy problem is one found by many evaluators, whereas a Hard problem is one found by only a few evaluators, or even just one.

One can also rate the ability of each evaluator to find usability problems from Good to Poor. An evaluator who found many problems would have high ability whereas an evaluator who found few problems would have low ability. These two factors were investigated.

Some might assume that only experts can uncover difficult problems, while both experts and novices can uncover easy ones. This raises three questions: do experts, who are presumed to have a high ability to uncover problems, find only difficult problems? Do novices uncover only easy problems? Most importantly, can novices, who are presumed to have lower ability, find difficult problems? To address these questions, Fig. 2 summarizes the ability of evaluators to uncover problems. The blue diamonds represent the Novices, the red squares represent the Amateurs, and the green triangles represent the Experts. Each row represents one of the 66 problems, and each column represents one of the 16 evaluators.

Fig. 2. Problems found by evaluators

Figure 2 shows that the three groups of evaluators are fairly interspersed. One must be mindful that there are ties (e.g., three evaluators found 2 problems; two evaluators each found 3, 4, and 5 problems; three found 6; and one evaluator each found 7, 9, 10, and 16 problems). However, the top rows show that both Amateurs and Experts found the hardest problems, and all groups found the easiest problems (the lowest rows). Generally, the Experts cluster toward the upper right, while the Novices and Amateurs cluster toward the lower left.
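
The layout of Fig. 2 can be reproduced from the 66 × 16 binary detection matrix by ordering rows by problem difficulty (how many evaluators found each problem) and columns by evaluator ability (how many problems each evaluator found). The sketch below uses a random stand-in matrix, since the raw detection data are not reproduced here.

```python
# Ordering a binary detection matrix (rows = problems, columns = evaluators)
# by difficulty and ability, as in Fig. 2. The matrix is a random placeholder.
import numpy as np

rng = np.random.default_rng(0)
detections = rng.integers(0, 2, size=(66, 16))   # stand-in for the real 0/1 data

difficulty = detections.sum(axis=1)   # per problem: number of evaluators who found it
ability    = detections.sum(axis=0)   # per evaluator: number of problems found

row_order = np.argsort(difficulty)    # hardest problems (fewest finders) first
col_order = np.argsort(ability)       # weakest evaluators first, best evaluators last
ordered = detections[row_order][:, col_order]
```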

4.7 The Violated Heuristics and Type of Problems

It was essential to investigate the number of times each heuristic was violated. Figure 3 presents this information graphically.

Figure 4 shows the recommended priority levels for the violated heuristics, starting with the problems associated with heuristics 4, 8, 3, 5, and 7, respectively.

Fig. 3. Heuristics violated and type of problems

Fig. 4. Suggested priorities according to the violated heuristics and problem severity

4.8 Duplicate Problems with Different Severity Ratings

In some cases, two or more evaluators found the same problem but assigned it different severity ratings. For the purposes of analysis, we treated these duplicates as separate entries under each problem category, with a clear indication that they are duplicates.

5 Conclusion

Two main contributions were derived from the heuristic evaluation study. First, we added to the literature by incorporating the results of a previous study (“user personas”) to focus on the most important elements of the interface and to study users’ needs. Second, we extended the traditional heuristic evaluation by separating the sessions and adding a new one, the proposed-solutions session. The results show that applying heuristic evaluation to DSpace uncovered a large number of usability problems that, if fixed, will improve the service. The findings yield a list of usability problems classified by severity rating.

Two key questions were investigated: does evaluators’ expertise affect the reliability of the results of applying heuristic evaluation to the University DSpace user interface, and does the number of evaluators affect that reliability? To relate the initial hypotheses to the findings, we examined the evaluators’ performance according to three factors: the number of problems found by each evaluator, the severity of the uncovered problems, and the difficulty of uncovering them. The best evaluator was an expert who found 21 % of the total number of problems, while the best amateur found 7.6 %, which is consistent with the initial hypothesis that the best evaluator would be an expert. Nevertheless, no single evaluator, even an expert, can find all the usability problems, which agrees with Nielsen’s (1994) suggestion that it is advisable to have more than one evaluator inspect the interface. Whereas Nielsen found that one evaluator can find about 35 % of the usability problems in a user interface, in this study the best evaluator uncovered 21 % of the total. We also conclude that the majority of the problems found by experts were serious (catastrophic and major).

Finally, we believe that applying the heuristic evaluation methodology to DSpace-based Institutional Repositories, as part of digital libraries, would uncover usability problems and, once these are fixed, increase the libraries’ usability.