
1 Introduction

Traditionally, medical images have been analysed qualitatively. This type of analysis relies on the experience and knowledge of the specialised radiologists in charge of producing the report, which entails a high temporal and economic cost. The rise of computer image analysis techniques and the improvement of computer systems led to the advent of quantitative analysis. In contrast to qualitative analysis, quantitative analysis aims to measure different characteristics of a medical image (for example, the size, texture or function of a tissue or organ and the evolution of these features over time) to provide radiologists and physicians with additional, objective information as a diagnostic aid. In turn, quantitative analysis requires image acquisitions of the highest possible quality to ensure the accuracy of the measurements.

An imaging biomarker is a characteristic extracted from medical images, regardless of the acquisition modality. These characteristics must be measured objectively and should depict changes caused by pathologies, biological processes or surgical interventions [21, 27]. The availability of large population sets of medical images, the increase in their quality and the access to affordable intensive computing resources have enabled the rapid extraction of a huge number of imaging biomarkers from medical images. This process transforms medical images into mineable data and analyses the extracted data for decision support. This practice, known as radiomics, provides information that cannot be visually assessed by qualitative radiological reading and that reflects the underlying pathophysiology. The methodology is designed to be applied at a population level to extract relationships with the clinical endpoints of the disease, which can then be used to guide the management of an individual patient's disease.

The execution of medical image processing tasks, such as the extraction of imaging biomarkers, sometimes requires high-performance computing infrastructures and, in some cases, specific hardware (GPUs) that is not available in most medical institutions. Cloud service providers make it possible to access specific and powerful hardware that fits the needs of the workload [28]. Another interesting advantage of Cloud platforms is the capability of fitting the infrastructure capacity to the dynamic workload, thus improving cost containment. The seamless transition from local image processing and data analytics development environments to cloud-based production-quality processing services poses a Continuous Integration and Deployment (DevOps) problem that is not properly addressed by current platforms.

1.1 Motivation and Objectives

The objective of this work is to design and implement a cloud-based platform that addresses the needs of developing and exploiting medical image processing tools. To this end, the work focuses on the development of an architecture built on the following principles:

  • Platform-agnostic, so the same solution can be deployed on different public and on-premises cloud offerings, adapting to the different needs and requirements of the users and avoiding vendor lock-in.

  • Capable of integrating high-performance computing and storage back-ends to deal with the processing of massive sets of medical images.

  • Seamlessly and automatically integrating development, pre-processing, validation and production within the same platform.

  • Open, reusable, extensible, secure and traceable.

1.2 Requirements

A requirement elicitation and analysis process was performed, leading to the identification of 9 mandatory requirements, 3 recommended requirements and 3 desirable requirements. The requirements are described in Tables 1, 2 and 3.

Table 1. Requirements of the infrastructure (Mandatory, Recommended, Desirable)
Table 2. Requirements for job execution.
Table 3. Requirements with respect to the data.

2 State of the Art

Since the appearance of Cloud services, a large number of applications have been adapted to facilitate the access of application users to Cloud infrastructures. In the field of biomedicine, we find examples in [22, 32]. On the other hand, there are works that offer pre-configured platforms with a large number of tools for bioinformatics analysis. An example is the Galaxy Project [7], a web platform that can be deployed on public and on-premises Cloud offerings (e.g. using CloudMan [5] for Amazon EC2 [1] and OpenStack [15]).

Another example is Cloud BioLinux [4], a project that offers a series of pre-configured virtual machine images for Amazon EC2, VirtualBox and Eucalyptus [6]. Finally, the solution proposed in [30] is specifically designed for medical image analysis using ImageJ [8] on an on-premises infrastructure based on Eucalyptus.

Before taking a decision on the architecture, an analysis was carried out at three levels: container technologies, resource managers and job scheduling.

Containers are a set of methodologies and technologies that aim at isolating execution environments at the level of processes, network namespaces, disk areas and resource limits. Containers are isolated with respect to: the host filesystem (they can only see a limited section of it, using techniques such as chroot or FreeBSD Jails), the processes running on the host (only the processes executed within the container are visible, for example, using namespaces), and the resources the container can use (the processes in a container can be bound to a CPU, memory or I/O share, for example, using cgroups).

Containers are used for application delivery, isolation and lightweight encapsulation of resources within the same tenant, execution of processes with incompatible dependencies and improved system administration. Three of the most prominent container technologies in the market are Docker [23], Linux Containers [10] and Singularity [24]. Docker has become the most popular option for application delivery due to its convenient and rich ecosystem of tools. However, Docker containers run under the root user space and do not provide multi-tenancy. On the other side, Singularity runs containers in user space, but access to specific devices is complex. LXC/LXD is better in terms of isolation but has limited support (e.g. LXD only works on Ubuntu [11]). The solution selected for container isolation is Docker running on top of isolated virtual machines.
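
As an illustration of these isolation mechanisms, the following minimal sketch launches a containerised tool with CPU, memory and network limits through the standard Docker CLI; the image name, mount path and limit values are illustrative assumptions, not part of the original work.

```python
# Minimal sketch (illustrative image name, paths and limits): running a
# containerised tool with cgroup-based resource limits and an isolated
# network namespace via the standard Docker CLI.
import subprocess

cmd = [
    "docker", "run", "--rm",
    "--cpus", "2",                        # cgroup CPU quota: at most 2 cores
    "--memory", "15g",                    # cgroup memory limit
    "--network", "none",                  # detached network namespace
    "--read-only",                        # immutable container filesystem
    "-v", "/mnt/shared/case001:/data",    # input/output sandbox mounted from shared storage
    "quibim/emphysema:1.0",               # hypothetical biomarker image
]
subprocess.run(cmd, check=True)
```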

Resources must be provisioned and allocated to deploy containers. Resource Management Systems (RMSs) deal with the principles of managing a pool of resources and splitting them across different workloads. RMSs manage the resources of physical and virtual machines, reserving a fraction of them for a specific workload. RMSs cover different functionalities, such as resource discovery, resource monitoring, resource allocation and release, and coordination with job schedulers. We identify three technologies to orchestrate resources: Kubernetes [9], Mesos [2] and EC3 [17, 18].

Finally, job schedulers manage the remote execution of jobs on the resources provided by the RMS. Job schedulers retrieve job specifications from different interfaces, run them on the remote nodes using a dedicated input and output sandbox for each job, monitor their status and retrieve the results.

Job schedulers may provide other features, such as fault tolerance and complex jobs (bag of tasks, parallel or workflow jobs), and interact closely with the RMS to acquire and release the needed resources. We consider in this analysis Marathon [12], Chronos [3], Kubernetes [9] and Nomad [13]. Marathon and Chronos require a Mesos Resource Management System and can deploy containers as long-term fault-tolerant services (Marathon) or periodic jobs (Chronos). Kubernetes has the capacity of deploying containers (mainly Docker but not limited to it) as services, or running batch jobs as containers. However, none of them deals seamlessly with both non-containerised and container-based jobs. In this sense, Nomad can deal with multi-platform hybrid workloads with minimal installation requirements.
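
To make the job abstraction concrete, the following minimal sketch shows the kind of containerised batch job description such a scheduler consumes, expressed as a Python dictionary mirroring Nomad's JSON job format; the field names are assumed from that format, while the identifiers, image and resource figures are purely illustrative.

```python
# Minimal sketch of a containerised batch job description in the style of
# Nomad's JSON job format (field names assumed from Nomad's API; the job ID,
# image and resource figures are illustrative only).
emphysema_job = {
    "Job": {
        "ID": "emphysema-small-001",
        "Name": "emphysema-small",
        "Type": "batch",                  # run to completion, not a long-lived service
        "Datacenters": ["dc1"],
        "TaskGroups": [{
            "Name": "quantification",
            "Count": 1,
            "Tasks": [{
                "Name": "emphysema",
                "Driver": "docker",       # containerised execution
                "Config": {"image": "registry.example.com/quibim/emphysema:1.0"},
                "Resources": {"CPU": 2000, "MemoryMB": 15000},
            }],
        }],
    }
}
```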

Fig. 1. Overview of Container Delivery architecture

Fig. 2. Overview of Container Execution architecture

3 Architecture

The service-oriented architecture is described and implemented in a modular manner, so that components can easily be replaced. In Sect. 3.1, the architecture is described in a technology-agnostic way, so that different solutions can fit into it. Section 3.2 describes all architecture components and how they fulfil the use case requirements identified in Sect. 1.2. Finally, Sect. 3.3 shows the final version of the architecture, including the technologies selected and how each one addresses the requirements.

3.1 Overview of the Architecture

The system architecture addresses the requirements described in Sect. 1.2. The architecture can be divided into two parts: Container Delivery (CD) and Container Execution (CE). It should be noted that, although all the components of the architecture can be installed on different nodes, in some cases they could be installed on the same node to reduce costs. As can be seen in Fig. 1, the CD architecture is composed of three components: the Container Image Registry, the Source Code Manager (SCM) and the Continuous Integration (CI) tool.

Figure 2 depicts the CE architecture. The system consists of four types of logical nodes: the Front-end, job scheduler nodes, working nodes and the Container Image Registry (this component also appears in the CD architecture described above). The Front-end and job scheduler nodes are interconnected through a private network. The Front-end exposes a REST API, which allows load-balanced communication with the REST APIs of the job scheduler nodes. Furthermore, it contains the Resource Management Service, which is composed of the Cloud Orchestrator and the horizontal elasticity manager. Job scheduler nodes host the master services of the Job Scheduler (JS). Moreover, as the Front-end is the gateway between users and the job scheduler service, a service providing authorisation and load balancing (between job scheduler nodes) is required. The working nodes run the Job Scheduler executors and locally mount a volume provided by a global external storage. It should be noted that the set of working nodes can be heterogeneous.

3.2 Components

  • Resource Management Service (RMS): it is in charge of deploying the resources, configuring them and reconfiguring them according to changes in the workload. The requirements stated in Sect. 1.2 focus on facilitating deployment, stronger isolation, scalability, application release management and generic authentication and authorisation mechanisms. The RMS may need to interact with the infrastructure provider to deploy new resources (or undeploy them) and to configure the infrastructure accordingly. Furthermore, the deployment should be maintainable, reliable, extendable and platform-agnostic.

  • Job Scheduling service: it performs the execution of containerised jobs requested remotely by a client application through the REST API. The job scheduler service must include a monitoring system to provide up-to-date information on the status of the jobs running in the system. Clients submit jobs through the load balancer service, providing a job description formed by any additional information required by the biomarkers platform, the information related to the container image, the input and output sandbox, and the software and hardware requirements (such as GPUs, memory, etc.).

  • Horizontal Elasticity service: it is necessary to fulfil the resource scalability requirement, which is strongly related to the Job Scheduler and the Resource Manager. The horizontal elasticity tool has to be able to monitor the job scheduler queue, the running jobs and the current working node resources in order to scale the infrastructure in or out. It is desirable that the horizontal elasticity manager can keep a certain number of nodes always idle or stopped to reduce the time that jobs remain queued.

  • Source Code Manager (SCM): it is required to manage the source code produced by the developers. Given the release management requirement and the complexity of the development, relying on this kind of tool is mandatory.

  • Container Image Registry: a Container Image Registry is necessary to store and deliver the biomarker applications. Biomarker applications may be bound to Intellectual Property Rights (IPR) restrictions, so the Container Image Registry must be private. For this reason, authentication mechanisms are required for obtaining images from the registry. Working nodes pull the application images when the container image is not present locally or when a more recent version exists.

  • Continuous Integration (CI) tool: CI eases the development cycle because it automates building, testing and pushing the biomarker application (with a certain version or tag) to the Container Image Registry. Developers (or the CI experts) define this workflow, and some CI tools can trigger these tasks for each SCM commit (a minimal sketch of this build-test-push cycle is shown after this list).

  • Storage: biomarker applications make use of legacy code and standard libraries that expect data to be provided through a POSIX filesystem. For this reason, the storage technology must allow the data to be mounted as a filesystem inside the container.
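
As a minimal sketch of the build-test-push cycle referred to above, the script below drives the standard Docker CLI; the registry host, image tag and test command are assumptions, and a real deployment would express these steps in the CI tool's own job definition rather than in an ad-hoc script.

```python
# Minimal sketch of the CI build-test-push cycle (registry host, image tag
# and test command are illustrative; the node is assumed to be already
# authenticated against the private registry).
import subprocess

IMAGE = "registry.example.com/quibim/emphysema:1.2.0"

def run(*cmd):
    subprocess.run(cmd, check=True)

run("docker", "build", "-t", IMAGE, ".")             # build the biomarker image
run("docker", "run", "--rm", IMAGE, "pytest", "-q")  # run the test suite inside the image
run("docker", "push", IMAGE)                         # deliver it to the private registry
```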

3.3 Detailed Architecture

The previous sections describe the architecture and its components. After the technology study carried out in Sect. 2 and the identification of the features of each architecture component in Sect. 3.2, the technologies selected to address the requirements are the following:

  • Jenkins as the CI tool, due to the wide variety of available plugins (such as SCM plugins). Furthermore, it allows workflows to be easily defined for each task using a file named Jenkinsfile. Jenkins provides means to satisfy requirements RI4, RI5 and RI6.

  • GitHub as the SCM, because it supports both private (commercial license) and public repositories, which are linked to the CI tool. Furthermore, there is a Jenkins plugin that can scan a GitHub organization and create Jenkins tasks for each repository (and also for each branch) that contains a Jenkinsfile, addressing requirement RI4. This component meets requirements RI5, RI6 and RI7.

  • HashiCorp Nomad as the job scheduler. We selected Nomad instead of Kubernetes because Kubernetes is limited to containerised workloads, whereas Nomad also deals seamlessly with non-containerised jobs (see Sect. 2). Furthermore, Nomad incorporates job monitoring that can be consulted by users. Additionally, it is designed to work in High Availability mode, addressing requirement RI7. HashiCorp Consul is used for service discovery, as Nomad integrates with it natively. By using this job scheduler, the architecture meets requirements RI2, RI4 (job versioning), RE1, RE2 and RE3, and satisfies RI5 and RI6 through its Access Control List (ACL) feature (a sketch of submitting a job to Nomad through the Front-end is shown after this list).

  • Docker as the container platform, because it is the most popular container technology and it is supported by a wide variety of job schedulers. It provides the resource isolation required by RI2 and supports the version management (by tagging the different images) required by RI4.

  • As Docker is the container platform used in this work, Docker Hub and Docker Registry are used as the public and private container image registries, respectively. Requirements RI4 and RI6 are addressed by using Docker.

  • Infrastructure Manager (IM) [17] as the orchestrator, because it is open source, cloud-agnostic and provides the functionality required to fulfil the use case requirements RI1, RI3, RI5 and RI6. In [25], IM is used to deploy 50 simultaneous nodes.

  • CLUES [19] has been chosen to address RI3 because it is open source and can scale infrastructures up or down through IM by monitoring the Nomad job queue. CLUES can auto-scale the infrastructure according to different types of workloads [16, 25].

  • EC3 as the RMS, a tool for system administrators that combines IM and CLUES to configure, create, suspend, restart and remove infrastructures. By using EC3, the system addresses requirements RI1, RI3, RI5 and RI6.

  • Azure Files as the storage back-end, since the experimentation is carried out on Azure and it is the current storage solution of QUIBIM. It allows the data to be mounted (entirely or partially) as a POSIX filesystem, as demanded by requirement RD1. It also provides the mechanisms required to fulfil RI5, RI6 and RD2. Additionally, the same filesystem can be mounted concurrently by several nodes.

  • HAProxy for load balancing, because it is reliable, open source and supports LDAP or OAuth for authentication.
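
The following minimal sketch illustrates how a client could submit a job description like the one sketched in Sect. 2 through the Front-end load balancer to Nomad's REST API and poll its status; the Front-end URL and credentials are illustrative assumptions, and the endpoints shown are Nomad's standard job registration and query paths to the best of our knowledge.

```python
# Minimal sketch of submitting a containerised batch job through the
# Front-end load balancer to the Nomad REST API (URL, credentials, image
# and resource figures are illustrative assumptions).
import requests

FRONTEND = "https://frontend.example.com"   # HAProxy entry point (assumed)
AUTH = ("biomarker-user", "secret")         # credentials enforced at the load balancer (assumed)

payload = {"Job": {
    "ID": "emphysema-small-001",
    "Name": "emphysema-small",
    "Type": "batch",
    "Datacenters": ["dc1"],
    "TaskGroups": [{"Name": "quantification", "Tasks": [{
        "Name": "emphysema",
        "Driver": "docker",
        "Config": {"image": "registry.example.com/quibim/emphysema:1.0"},
        "Resources": {"CPU": 2000, "MemoryMB": 15000},
    }]}],
}}

# Register the job and then query its status through the same entry point.
resp = requests.post(f"{FRONTEND}/v1/jobs", json=payload, auth=AUTH)
resp.raise_for_status()
status = requests.get(f"{FRONTEND}/v1/job/emphysema-small-001", auth=AUTH).json()
print(status.get("Status"))                 # e.g. "pending", "running" or "dead"
```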

Fig. 3. Proposed architecture with selected technologies.

It should be remarked that the proposed architecture (depicted in Fig. 3) is a simplification of Fig. 2. In the experiment performed, the job scheduling services are deployed on the Front-end node, but users still connect to the job schedulers through the load balancer service, so this simplification does not affect the communication between users and services. Furthermore, to reduce costs, the CI tool (Jenkins) and the Docker private registry share the same resource.

4 Results

The experiments have been performed on the public Cloud provider Microsoft Azure. The infrastructure is composed of three types of nodes. The front-end node corresponds to the A2 v2 instance, which has two Intel(R) Xeon(R) E5-2660 CPUs at 2.20 GHz, 3.5 GB of RAM, 20 GB of hard drive and two network interfaces. IM version 1.7.6 and CLUES version 2.2.0b have been installed on this node, and HAProxy 1.8.14-52e4d43 and Consul 1.3.0 run in Docker containers on the same node. The second type of node, smallwn, corresponds to the NC6 instance, with six Intel(R) Xeon(R) E5-2690 v3 CPUs at 2.60 GHz, 56 GB of RAM, 340 GB of hard drive, one NVIDIA Tesla K80 and one network interface. Finally, the largewn node type corresponds to the D13 v2 instance, which has eight Intel(R) Xeon(R) E5-2690 v3 CPUs at 2.60 GHz, 56 GB of RAM, 400 GiB of hard drive and one network interface. The operating system is CentOS Linux release 7.5.1804. Nomad version 0.8.6 and Docker version 18.09.0 (build 4d60db4) are installed on all nodes.

4.1 Deployment

The infrastructure configuration is coded into Ansible roles and RADL recipes, which include parameters to differentiate among the deployments. The roles reference a local repository of packages or specific versions, as well as certified containers, to minimise the impact of changes in public repositories. The deployment time is the time required to create and configure a resource. The deployment of the front-end takes 29 min 28 s on average, and the time required to configure each worker node is 9 min 30 s.

4.2 Use Case - Emphysema

The use case selected was the automatic quantification of lung emphysema biomarkers from computed tomography images. This pipeline features a patented air-thresholding algorithm from QUIBIM [26, 29] for emphysema quantification and an automatic lung segmentation algorithm. Two versions were implemented: a fast one with a rough lung segmentation, to be used during the interactive inspection and validation of parameters, and another one with higher segmentation accuracy for the batch, production case. This brings the need to support short and long cases, which take 4 and 20 min, respectively.

The small cases correspond to executions that take minutes to complete. In order to provide QoS, CLUES is configured to always provide more resources than required (one node always free). As these jobs are very fast, small-job nodes that remain IDLE for too long are suspended rather than removed, so that the deployment time is avoided when they are needed again. The large cases of QUIBIM biomarkers can take many hours and use a huge amount of resources, so the deployment time is negligible in comparison. For this reason, although the large case of this work takes roughly the same time as the deployment and the application does not need all the resources of the VM, large-job nodes are not suspended and restarted, and only one large job can run at a time on a VM. The "small" emphysema job used in this work consumes 15 GB of RAM and 2 vCPUs, so three small emphysema jobs can run simultaneously on the same node.
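
The packing figure quoted above can be checked with a short calculation; the sketch below uses the NC6 capacity reported in Sect. 4 and the per-job footprint of the small emphysema case.

```python
# Minimal sketch of the packing argument: how many "small" emphysema jobs
# (15 GB of RAM and 2 vCPUs each) fit on one NC6 working node
# (56 GB of RAM and 6 vCPUs).
node_ram_gb, node_vcpus = 56, 6
job_ram_gb, job_vcpus = 15, 2

jobs_per_node = min(node_ram_gb // job_ram_gb, node_vcpus // job_vcpus)
print(jobs_per_node)  # -> 3, i.e. three small jobs can run concurrently per node
```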

The main goal of the experiment is to demonstrate the capabilities of the proposed architecture. The experiment consists of submitting 35 small jobs and 5 large jobs, in order to ensure that there are workload peaks that require starting up new VMs, as well as idle periods long enough to remove VMs (in the case of large-job nodes) or suspend them (in the case of small-job nodes). Table 4 shows the time frames in which the jobs are submitted.

Table 4. Scheduling of the jobs to be executed.

Figure 4 shows the number of jobs (vertical axis) over time (horizontal axis) in the different states: SUBMITTED, STARTED, FINISHED, QUEUED and RUNNING. The first three metrics denote the cumulative number of jobs that have been submitted, have actually started and have been completed over time, respectively. The remaining two metrics denote the number of jobs that are queued or concurrently running at a given time.

As depicted in Fig. 4, the length of the queue does not grow above ten jobs. The delay between the submission and the start of a job (the horizontal distance between the submitted and started lines) is negligible during the first twenty minutes of the experiment. After a period without new submissions (between 00:24:00 and 00:47:00), the workload grows again due to the submission of new jobs, triggering the deployment of four new nodes (as can be seen in Fig. 5).

The largest number of queued jobs (10) is reached during this second deployment of new resources at 01:15:09, although it decreases to four only three minutes later (and to two at 01:23:00). Besides, the last four queued jobs were large jobs. It should be pointed out that large jobs show greater delays between submission and start than small jobs, as they use dedicated nodes and this type of node is removed after 1 min without allocated jobs (a higher value will be used in production).

Figure 5 depicts the status of the nodes throughout the experiment, which can be USED (executing jobs), IDLE (powered on and without allocated jobs), POWON (being started, restarted or configured), POWOFF (being suspended or removed), OFF (not deployed or suspended) or FAILED. CLUES is configured to ensure that a small node is always active (in the IDLE or USED status). It can be seen in Fig. 5 that two nodes are deployed at the start of the experiment. Then, during the period without new submissions, the running jobs end their execution, so these nodes are powered off after one minute in the IDLE status. As the workload grows, the system deploys four new nodes between 00:50:00 and 01:13:00. Besides, CLUES tried to deploy one more node (a large node) but, as the quota limit of the cloud provider's account had been reached, the deployment failed. It should be noted that the system is resilient to this type of problem and successfully completed the experiment. After two hours, all jobs are completed, so CLUES suspends or removes all nodes (except one small node, which must always remain active).

Fig. 4. Status of jobs during the experiment.

Fig. 5. Status of working nodes.

5 Conclusions and Future Work

This paper has presented a platform-agnostic and elastic architecture, built from a set of open-source tools, for the execution of medical imaging biomarkers. Considering the technical requirements defined in Sect. 1.2, the experiment with a real use case and the results presented, it can be concluded that the proposed architecture fulfils all the requirements. In Sect. 4.2, a combination of 40 batch jobs was scheduled to be executed at specific times, and the cluster managed to execute all of them by adjusting the available resources. Furthermore, when resources remain unused for too long, the corresponding nodes are suspended (or removed). The architecture uses IM and CLUES, which have proven good scalability [25] and the capability to work with unplanned workloads [16].

The proposed architecture is not only concerned with the execution of batch jobs; it also provides developers with a workflow that eases the building, testing, delivery and version management of their applications.

Future work includes implementing the proposed architecture in the QUIBIM ecosystem, testing other solutions for distributed storage such as Ceph [33] or OneData [20], studying Function-as-a-Service (FaaS) frameworks for executing batch jobs (such as SCAR [31] or OpenFaaS [14]) and using Kubernetes to ensure that the services (Nomad, Consul and HAProxy) are always up.