
1 Introduction

In materials science and many other research areas, numerous experiments are conducted and large amounts of data are collected. Archiving these experiments is not limited to storing the test results in some digital format [15]. It is equally important to store the underlying experimental parameters which define the varying independent and dependent variables together with the controlled and contextual parameters. Most of the time, these data are collected for the purpose of scientific publications, but the data themselves are not published along with the corresponding results. Consequently, experiments are hardly reproducible by the scientific community, as crucial details are often missing in the publication. Reproducibility of results is not only a challenge for publicly shared results, but also when data is stored internally within an institute over longer periods of time. Missing experimental details may lead to misinterpretation when re-examining the data or to involuntary duplication of experiments.

Many researchers, in theory, support an open research culture with publicly accessible data [7], ideally following the "FAIR" principles: Findable, Accessible, Interoperable, Reusable [13]. Data repositories, such as the Open Science Framework (OSF) [3] or Dendro [11], try to embed data storage into the research process, but they do not fully represent the structure of an experiment as part of the digital record.

When it comes to creating digital infrastructures for experimental data, scientists are demanding users:

  • On the one hand, they have little time for data entry, especially when they have to do it more than once.

  • On the other hand, they need to store complex and frequently unique experiments with large numbers of parameters.

Therefore, data repositories which address only the data storage aspect do not fully meet researchers' needs. Researchers avoid additional work, and data management is still not seamlessly embedded into their daily research workflow. One of the main challenges when documenting experiments is the need for repeated entry of similar experiment protocols which vary only in a small number of parameters. This process not only creates administrative overhead for the researchers but may also prevent them from discovering structures, similarities, or missing configurations in a series of related experiments.

In this paper, we focus on a method for structured input of scientific experiments which is flexible and not time-consuming, so that scientists can describe reproducible experiments and store the corresponding data in a way which can be easily understood by others. To this end, we designed an information system which allows the user to describe scientific experiments and the context in which they are conducted. We present the concept and implementation of an approach for the input of structured experimental series. Such a series describes a set of scientific experiments with their controlled parameters, independent variables, and dependent variables. To evaluate this prototype, we designed and conducted a laboratory usability study with naïve users as well as domain experts to compare two data input systems for experiments, with and without the ability to define and use an experimental series.

The paper is structured as follows: In Sect. 2, we discuss related work. Section 3 presents our series-based experiment management system, while Sect. 4 discusses the study design for our experiment. In Sect. 5, we give an overview of our results; Sect. 6 finally summarizes the paper.

2 Related Work

In recent years, a variety of research data repositories have been developed and made available to the scientific community. These systems represent different approaches to creating digital infrastructures for scientific data. One dimension in which the approaches differ is their disciplinary scope:

  • Some systems are designed as multi-domain systems, e.g. the Open Science Framework [3],

  • Others are domain-specific, focusing on a single scientific field or type of experiment, e.g. marine science [8] or systems biology [12].

Another dimension concerns the level of formal explication. Most systems store some metadata about the experiments using various standards, e.g. Dublin Core, but they differ in whether the content data itself is modeled explicitly or implicitly:

  • One frequent approach is to store data records and their metadata in some human-readable format, without an explicit formal model of the individual entities contained in these records, e.g. PANGEA [4] or OSF [3].

  • Others seek to provide structured information about the data themselves along with the published record, e.g. SEEK [14] or InfoSys [15].

Lastly, we see the genesis of the experimental records as an additional dimension to be considered.

  • Most of the aforementioned systems constitute post hoc data management: after the experiments have been conducted, the resulting data is formatted and entered into some system, e.g. [3, 4, 14] or [8].

  • Currently, only a few systems seek to support concurrent data management, which goes a step beyond: here the question arises of how to involve data management as early as possible in the research workflow and make it an integral part of it, e.g. [11] or [15].

An overview of the dimensions and possible realizations is given in Table 1. Orthogonal, but nonetheless relevant to the data management aspects, there are already some systems that try to help researchers find appropriate study designs and data collection tools, like [9, 10] or [5]. A common, but important aspect for all these systems is to design user-friendly interfaces which support domain experts without specific technical knowledge in their research.

Table 1. Dimensions of infrastructures for scientific data management

3 System Design

As seen in the related work, different types of systems in which experimental scientists can design and store their experimental data and results have recently emerged. In this section, we introduce and situate our approach. Our system:

  • is domain-specific for research experiments with various kinds of materials in materials science,

  • provides an explicit fine-grained model of the experimental conditions and results, including fixed parameters, independent and dependent variables,

  • facilitates the creation and management of series of related experiments,

  • is designed to be employed concurrently as part of the actual workflow at hand.

In a nutshell, the researcher designs and specifies the experiment using the system; this specification is then handed to the laboratory technician who, in turn, performs the experiment and fills in the specified measurements and results in the same system. Lastly, once the experiment is finished, the results are available to the researcher for further analysis.

3.1 Requirements for Experiment Management

To fulfil our goal of creating a usable information system rather than another burdensome requirement for the researchers involved, it is necessary to gain a precise understanding of the research workflow of experimental scientists. Other data infrastructure projects affirm the importance of including domain experts in the design process as well [6].

We analysed the procedure of planning and conducting experiments by running a series of structured and unstructured interviews as well as a number of lab visits where we could observe the scientists while they were doing experiments. For this purpose, we chose scientists from two materials science labs which work with different classes of materials, namely steel and fiber-reinforced materials. While the steel lab already uses an experiment result management system, the other lab uses a laboratory information management system (LIMS) which covers only the organizational aspects of conducting experiments, e.g. by whom and when the experiment took place, rather than contextual experiment data.

Within the interviews, some researchers reported misgivings about being forced to specify each experiment individually. Furthermore, we analysed the existing data stored in the experiment management system. This data showed that experiments are typically conducted by varying only one or two attributes out of the whole set. The other parameters remain fixed, but may be varied in different, but similar experiments. This consequently led to the inclusion of the concept of experimental series, which gives researchers the opportunity to subsume individual experiments, thereby further integrating the data storage process into their typical research processes.

3.2 The Experiment Series Model

Figure 1 gives an overview of the experiment series model. A series groups multiple individual experiments. It contains the fixed parameters common to all grouped experiments, covering both the examined material, i.e. the specimen, and the settings with their attributes. Every experiment inherits the fixed parameters of the associated experiment series and adds values for the independent variables. The independent variables in the different experiments instantiate and determine the values of the dependent variables stored in the test results. Since we want to focus on the experiment design and conduction part of the research workflow, we do not examine test results further in this paper. The specimen describes all attributes of the material and its treatments up to the start of the actual experiment, e.g. a steel with heat treatments. The settings describe the conditions under which the specimen is treated in the experiment, e.g. the test machine and its parameters.

Fig. 1. The experiment series model
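
To make the model concrete, the following minimal sketch shows one possible way to represent series and experiments as data structures. It is purely illustrative: the class and attribute names (ExperimentSeries, Experiment, fixed_specimen, etc.) are our own assumptions and not taken from the actual implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

Value = Union[str, float]  # attribute values are free text or numbers


@dataclass
class ExperimentSeries:
    """Groups experiments that share the same fixed parameters."""
    name: str
    fixed_specimen: Dict[str, Value] = field(default_factory=dict)   # e.g. material, heat treatment
    fixed_settings: Dict[str, Value] = field(default_factory=dict)   # e.g. test machine parameters
    independent_variables: List[str] = field(default_factory=list)   # attributes varied per experiment
    experiments: List["Experiment"] = field(default_factory=list)

    def add_experiment(self, values: Dict[str, Value]) -> "Experiment":
        """Add an experiment; only the independent variables need values."""
        missing = set(self.independent_variables) - values.keys()
        if missing:
            raise ValueError(f"missing values for independent variables: {missing}")
        experiment = Experiment(series=self, variable_values=dict(values))
        self.experiments.append(experiment)
        return experiment


@dataclass
class Experiment:
    """Inherits the fixed parameters of its series and adds the variable values."""
    series: ExperimentSeries
    variable_values: Dict[str, Value]
    results: Dict[str, Value] = field(default_factory=dict)  # dependent variables (test results)

    def effective_parameters(self) -> Dict[str, Value]:
        """All parameters of this experiment: inherited fixed ones plus its own values."""
        return {**self.series.fixed_specimen,
                **self.series.fixed_settings,
                **self.variable_values}


# Hypothetical usage: a tensile-test series varying only the temperature.
series = ExperimentSeries(
    name="Tensile tests, steel X",
    fixed_specimen={"material": "steel X", "heat treatment": "annealed"},
    fixed_settings={"test machine": "M-200", "strain rate": 0.001},
    independent_variables=["temperature"],
)
for temperature in (20.0, 200.0, 400.0):
    series.add_experiment({"temperature": temperature})
```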

Compared to a test-result-oriented experiment model, the experiment series model promises multiple benefits:

  • First, we gain additional information about the structure and relationship between multiple experiments.

  • Second, there is much less input effort required for the researcher (see Footnote 1).

The differences become clear at the instance level. In Fig. 2 we show the same example with and without an experiment series. All fixed parameters are captured in a series instance (Fig. 2b), while the experiment only contains the independent variables, i.e. those attributes which are varied across the multiple experiment instances. In the hierarchically structured, test-result-oriented model (Fig. 2a), the settings need to be redefined for every individual specimen.

Fig. 2. An example of two experiment instances with and without an experiment series

Fig. 3. Specimen input step during creation of a new experiment series. The users select all independent variables and enter values for all fixed parameters.

3.3 Interface for Experiment Series

Figure 3 shows the interface for specimen attributes during the creation of a new experiment series. For each attribute, the form features three parts: first, the name of the attribute; second, a checkbox for specifying whether the attribute is a fixed parameter or an independent variable; and third, an input field in which the user enters the value of a fixed parameter. All independent variables need to be explicitly defined as such. To do so, the user de-selects the checkbox situated to the left of the input field. After this, the attribute is marked as an independent variable and the user can no longer enter a value. By asking the user to define independent variables explicitly rather than leaving the value empty, it is possible to differentiate between attributes which do not have a value, e.g. because they were forgotten during data input, and those which are actually independent variables. This may help to avoid mistakes during the data input phase. When adding an experiment instance to an existing experiment series, only the values of the independent variables need to be specified, of which there are typically only a few.
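
The distinction between a forgotten value and an explicitly marked independent variable can be expressed in a few lines. The following sketch is a hypothetical validation helper, not part of the actual system; names such as validate_series_form and FormRow are our own.

```python
from typing import Dict, List, Optional, Tuple

# One form row per attribute: (attribute name, checkbox state, entered value).
# A de-selected checkbox marks the attribute as an independent variable,
# which also disables its value field.
FormRow = Tuple[str, bool, Optional[str]]


def validate_series_form(rows: List[FormRow]) -> Dict[str, List[str]]:
    """Classify form rows into fixed parameters, independent variables and
    fixed parameters that are still missing a value (probably forgotten)."""
    report: Dict[str, List[str]] = {"fixed": [], "independent": [], "missing_value": []}
    for name, is_fixed, value in rows:
        if not is_fixed:
            report["independent"].append(name)       # explicitly marked as variable
        elif value is None or not value.strip():
            report["missing_value"].append(name)     # fixed parameter without a value
        else:
            report["fixed"].append(name)
    return report


# "temperature" was de-selected (independent), "grain size" was left empty (forgotten).
rows = [("material", True, "steel X"),
        ("temperature", False, None),
        ("grain size", True, "")]
print(validate_series_form(rows))
# {'fixed': ['material'], 'independent': ['temperature'], 'missing_value': ['grain size']}
```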

Besides the more intuitive interface, the experiment series model allows multiple experiments to be neatly arranged on screen, as shown in Fig. 4. The user can get an overview of all attributes, fixed and variable, of a range of experiments without navigating through multiple experiment instances and comparing their individual values. In the upper part of an experiment series detail page, all fixed parameters are listed, while in the lower part all experiment instances with the values of their independent variables are catalogued.

Fig. 4. Overview of an experiment series. All fixed parameters and their values are listed in the upper part. All experiment instances with the values of the independent variables are listed in the lower part.

4 Study Design

To evaluate the series-oriented input paradigm for the specification of experimental designs, we designed and conducted a laboratory experiment with a total of 28 participants. As a benchmark we employed a simplified version of the current experiment management system which has been deployed and is being used by the materials science department. This system features a strict hierarchical model, where Projects can have multiple, albeit unrelated, Experiments based on individual Materials, as depicted in the example shown in Fig. 2a. The existing benchmark system will henceforth be referred to as System B and compared to the new series-based system, called System S. Both systems follow the same basic design, using the same styles and forms of interaction. Next, we introduce our hypotheses regarding the series-based experiment management approach and the corresponding study design.

4.1 Hypotheses

With our study we want to examine the following hypotheses regarding the usability of the new approach. We focus on three usability criteria specifically: (1) efficiency, (2) safety (error avoidance), and (3) user satisfaction. The corresponding hypotheses are:

  1. H1

    Data entry with the series-based System S is faster than using the benchmark System B.

  2. H2

    Users make fewer input errors with the series-based System S than with the benchmark System B.

  3. H3

    Users are more satisfied with the series-based System S than with the benchmark System B.

4.2 Task Design

We use tasks from a real-world scenario from materials science. The participants were asked to enter data from four different sets of experiments. For this, we selected attributes and possible values from the existing experiment management system. As our focus lies on the structured input and not on typing speed, we limited the tasks to a small number of attributes per experiment. The data to be entered was presented to the participants on a sheet of paper which they could consult during data entry.

To examine input correctness, we did not include any kind of input verification in the study prototype, e.g. no "Not saved" alert or verification of input fields. In this way, we avoid influencing the correctness of user inputs and gain better insight into how well users understand the system.

Table 2 summarizes the four tasks and the minimum effort required for input with the two systems, measured in the number of operations. An input operation \(O_I\) corresponds to the input of a single attribute value, e.g. entering free text or selecting a value from a selection list. A structure operation \(O_S\) is an operation which manipulates the structure, e.g. adding or saving additional experiments.

Table 2. Tasks and their input (\(O_I\)) and structure (\(O_S\)) operations

4.3 Participants

To collect more information about the system and to be able to perform a quantitative analysis, we used naïve users as well as domain experts for our system evaluation. We performed our study with twenty naïve users and eight expert users. The naïve users were divided into four groups of five people each. We mixed a between-subjects and a within-subjects design, as shown in Table 3. To avoid ordering effects, half of the participants started with the series-based system (System S), the others started with the benchmark system (System B). The tasks were comparable in both runs, using the same structure and changing only the face values of the attributes. In Run 1, each group of users interacted with only one type of system, either System B or System S. Thereafter, in Run 2, half of the users switched systems for four additional tasks while the other half of each group continued with the system they had been assigned to initially. Each subject had to fill out a standardized usability questionnaire, the System Usability Scale (SUS) [2]. The switchers filled out one questionnaire after the four tasks of Run 1 and another after the four tasks of Run 2, that is, once before changing systems and once at the end, while the continuers filled out the questionnaire after all eight tasks. At the end, we conducted a structured interview to collect statements about experienced differences. The group of expert users was treated slightly differently, as they have prior experience with the experiment management system, i.e. System B. All of them started with a single task using System B and continued with all four tasks using System S.

To get used to the systems, all participants watched an introductory video before they started with the tasks. This video gave a short introduction to experiments in materials science and demonstrated the mechanics of the respective system.

Table 3. Experiment design

5 Results

Based on the study design introduced in Sect. 4, we examine our hypotheses H1, H2 and H3 in turn. Furthermore, we summarize pertinent findings from the qualitative data gathered in the interviews.

Fig. 5. Mean times needed and standard deviations for fulfilling the tasks with System S (blue) and System B (red) (Color figure online)

5.1 Efficiency

  1. H1

    Data entry with the series-based System S is faster than using the benchmark System B.

To examine this, we measured the time a user needed from the start of a task until they finished it by pressing a "Finish Task" button in the user interface. To avoid priming effects, we examined only the data obtained in Run 1 for all groups of naïve users in this evaluation.

Figure 5 shows the mean task durations together with their standard deviations for all \(n_A=n_B=10\) participants in their first run. As expected, the task duration with the benchmark system (System B, red bars) increases from Task 1 to Task 4, corresponding to the increasing effort required (in number of operations, see Table 2). The task duration therefore increases as the tasks become more extensive. In comparison, task duration with the series-based System S does not increase substantially from Task 1 to Task 4. This is in line with the only moderate increase in the required number of operations for these tasks and a progressive familiarization of users with the system. The standard deviation for the series-based system is almost half of that of the benchmark system, which shows that users are consistently more efficient.

An independent-samples t-test was conducted to compare the task durations of the two systems. This test revealed a significant difference in task duration for Task 2 (\(t(18)=-2.64, p=0.017\)), Task 3 (\(t(18)=-3.47, p=0.0027\)) and Task 4 (\(t(18)=-2.44, p=0.026\)). As expected, there is no significant difference between the systems for Task 1, which does not contain an experiment series. Overall, these results support our hypothesis H1.
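
For reference, such a comparison can be reproduced with a few lines of SciPy. The sketch below uses made-up task durations rather than the measured data from the study; only the test itself, an independent-samples t-test with \(df=18\) for two groups of ten participants, corresponds to the analysis described above.

```python
import numpy as np
from scipy import stats

# Hypothetical task durations in seconds (NOT the study's measurements).
durations_s = np.array([95, 102, 88, 110, 99, 91, 105, 97, 93, 100])       # System S, n = 10
durations_b = np.array([130, 145, 120, 160, 138, 150, 125, 142, 135, 155])  # System B, n = 10

# Independent-samples t-test; with 10 + 10 participants, df = 18.
t_stat, p_value = stats.ttest_ind(durations_s, durations_b)
print(f"t(18) = {t_stat:.2f}, p = {p_value:.4f}")
```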

A special case is Task 4, which users took unexpectedly long to solve with the benchmark system. As shown in Table 2, Task 4 can be solved with fewer operations than Task 3 using the benchmark system when exploiting the hierarchical structure of the data. Some participants did not recognize this and proceeded exactly as for Task 2 and Task 3, although this is not the optimal solution with this system. A reason for this behavior might be that the hierarchical structure of the benchmark system is not as easy to understand as expected.

5.2 Error-Avoidance

  1. H2

    Users make fewer input errors with the series-based System S than with the benchmark System B.

To measure the input errors the users made, we compared the entered data with the optimal solution and counted every deviation from the sample solution. This includes errors in input operations, for example typing errors or selecting the wrong value from a selection list, and errors in structure operations, for example unsaved data.
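
As an illustration of this counting procedure, the sketch below compares entered experiments against a sample solution. It is a simplified, hypothetical version (experiments matched by position; the function name count_errors is our own) rather than the exact procedure used in the evaluation.

```python
from typing import Dict, List

Record = Dict[str, str]  # one entered experiment: attribute name -> value


def count_errors(entered: List[Record], solution: List[Record]) -> Dict[str, int]:
    """Count deviations from the sample solution.
    Structure errors: missing or surplus experiments; input errors: wrong values."""
    structure_errors = abs(len(entered) - len(solution))
    input_errors = 0
    for got, expected in zip(entered, solution):
        for attribute, value in expected.items():
            if got.get(attribute) != value:
                input_errors += 1
    return {"structure": structure_errors, "input": input_errors}


# One experiment is missing, one temperature value was mistyped.
solution = [{"temperature": "20"}, {"temperature": "200"}, {"temperature": "400"}]
entered = [{"temperature": "20"}, {"temperature": "2000"}]
print(count_errors(entered, solution))  # {'structure': 1, 'input': 1}
```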

In both settings, the users made few errors. On average, each user made \(\bar{x}=1.675\) (\(s=1.64\)) errors over the course of the tasks, comprising \(\bar{x}_{S}=0.85\) (\(s_S=1.39\)) structure errors and \(\bar{x}_{I}=0.825\) (\(s_I=1.28\)) input errors.

The most frequent error was a structural one in Task T1 with System S, i.e. an experiment without variation: in this case, an empty experiment should be created, since the series is only an abstract data model and does not allow test results to be entered. In only nine of twenty runs with System S was the experiment created correctly; in the other eleven runs, the experiment was missing. In a production system with all functions and test results, this error would likely not occur, so it is not a key concern for the system.

Because of the small number of errors, a statistical analysis is not meaningful. This would be different in a production setting with outside influences and should be examined in a separate experiment.

5.3 User Satisfaction

  1. H3

    Users are more satisfied with the series-based System S than with the benchmark System B.

To investigate the usability of the system, we used the System Usability Scale (SUS) [2], a standardized questionnaire for usability evaluation. The continuers (Groups 1 and 4) filled out one SUS questionnaire after both runs, while the switchers filled out two, one after each run. Figure 6 shows the mean and standard deviation for the \(n_S=15\) naïve users with System S, the \(n_B=15\) users with System B and the \(n_E=8\) users of the expert group. With \(\bar{x}_S=77.83\) (\(s_S=17.03\)), the series-based system is rated as "good" [1], while the benchmark system (\(\bar{x}_B=59.5\), \(s_B=20.79\)) is only rated as "OK". An independent-samples t-test was conducted to compare the SUS scores of both systems, with the following result: \(t(28)=2.64\), \(p=0.013\). There is a significant difference between the two systems and the corresponding null hypothesis can be rejected.
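
The SUS score itself is computed according to the standard scheme [2]: each of the ten items is answered on a 1-5 Likert scale, the item contributions are summed and scaled to a 0-100 range. The following is a minimal sketch of that computation; the function name sus_score is our own.

```python
from typing import Sequence


def sus_score(responses: Sequence[int]) -> float:
    """System Usability Scale score (0-100) from ten item responses, each 1-5 [2].
    Odd-numbered items are positively worded (contribution: response - 1),
    even-numbered items are negatively worded (contribution: 5 - response)."""
    if len(responses) != 10:
        raise ValueError("SUS requires exactly ten item responses")
    contributions = [(r - 1) if i % 2 == 0 else (5 - r) for i, r in enumerate(responses)]
    return sum(contributions) * 2.5


# A fairly positive rating ends up in the "good" range.
print(sus_score([4, 2, 4, 1, 5, 2, 4, 2, 4, 2]))  # 80.0
```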

Fig. 6. SUS scores of both systems and the expert group

For the two-system groups, Group 2 and Group 4, we also evaluated the differences between the ratings of the two systems. For this, we ran a dependent-samples t-test. With \(t(4)=4.60\), \(p=0.010\) (Group 2, S-B) and \(t(4)=-2.81\), \(p=0.048\) (Group 4, B-S), the differences are significant for both groups.

5.4 Qualitative Evaluation

In addition to the evaluation of the quantitative measures, we interviewed all naïve users after they finished both runs. These interviews confirm our findings from the quantitative analysis: 9 out of 10 users from Group 2 and Group 3 (the groups which handled both systems) find the series-based System S more intuitive, and 10 out of 10 users from those groups prefer the series-based System S over the benchmark System B. Although we expected the series-based system to be perceived as more complex, the users viewed the benchmark system as more complex and unclear because of its hierarchical structure.

For the series-based system, we additionally received constructive feedback; in particular, the input of a series using the keyboard only was mentioned very often. Typically, the users did the series input in two steps, first selecting the independent variables using the mouse and then entering all fixed parameters using the keyboard. When switching from one input field to the next, it is necessary to skip the "Independent variable" checkbox by pressing the Tab key twice. This additional effort was criticized frequently, although the users noted that the layout with checkbox and input field next to each other was very clear.

One of the most common mistakes when using the series-based system was the missing additional experiment in Task T1. In the interviews, some users brought up this mistake themselves. It seems that the learning effect from using the system multiple times with different tasks raises awareness of our concept of experiment series and its abstract characteristics.

5.5 Expert Evaluation

In addition to the naïve users, we evaluated the system with eight domain experts from the two research labs which we had interviewed before. The expert group comprised laboratory technicians as well as researchers, all of whom deal with experiments in their daily professional routine.

The expert group filled out a SUS questionnaire for the series-based System S. The corresponding results are displayed in the right bar of Fig. 6. With \(\bar{x}=77.5\) and \(s=16.48\), their ratings were in the same range as those of the naïve user group.

Besides the general questions about the system, for which we received the same answers as from the naïve users, we asked them additional questions on how they rate the system and how well it fits into their typical research workflow. All domain experts were satisfied with the experiment-series-based system and clearly preferred it over the benchmark system, which is comparable to the system some of them are currently using. Seven out of eight domain experts answered that the system fulfils their requirements for designing experiments. One test user did not answer this question clearly.

6 Discussion and Outlook

This paper investigates a shift in the explicit model for scientific experiments used for research data management. We claim that fitting such models to the research workflow is of paramount importance to systems that aim to provide more than black-box records of experiments. Modelling the data explicitly certainly brings about a range of advantages such as comparability and enhanced data mining and knowledge discovery. Nevertheless, converting data into a structured model in a post-hoc manner requires a lot of effort from the involved researchers, which is why we seek to integrate the data acquisition into the daily workflow of designing, conducting and analysing experiments. In our minds this explains why a system with a better fitting model, even if structurally more complex than the previous one, outperforms the original system in the key usability factors of efficiency, error-avoidance and user satisfaction.

We also want to emphasize that user satisfaction, even when dealing with a professional system not intended for a general consumer market, is critical, as researchers will opt out of using it and revert to their familiar spreadsheets and idiosyncratic formats if they do not like the system and find it too cumbersome to use.

As a next step, we will deploy the new model in the InfoSys system (see Footnote 2), a research data management system from materials science. We will conduct further long-term evaluations during its daily productive usage. At the same time, data mining efforts are underway to discover hidden experimental series in the existing data, thereby highlighting the benefits of having explicit fine-grained models of the data.

We are also working on further individualization of the input interfaces so that scientists can individually vary the relevant attributes and the order of the production steps involved in creating the test specimen. These efforts will ultimately lead towards the conversion of the static system into an authoring system where researchers collaboratively specify the model as they use the system. In our minds, this will facilitate the expansion from a single-domain system into a multi-domain environment for managing research data from various fields.