
1 Introduction

In High Performance Computing (HPC), the demand for computation power is steadily increasing [27]. To meet the challenge of ever-higher performance, while being constrained by ever-growing energy costs, the architecture of supercomputers also grows in complexity at the whole-machine scale. This complexity arises from various factors: firstly, the size of the machines (supercomputers now integrate millions of cores); secondly, the heterogeneity of the resources (various architectures of computing nodes, mixed workloads of computing and analytics, nodes dedicated to I/O, etc.); and lastly, the interconnection topology. The architectural evolution of interconnection networks at the whole-machine scale poses two main challenges, described as follows. First, the community has proposed several types of topologies, including hierarchies and heterarchies (which are based on well-suited structured topologies); the trend today is to create mixed solutions of tree-like machines with local structured topologies [22]. Second, the interconnection network is usually unique within the machine, which means that the network is shared by various mixed data flows. Sharing such a single multi-purpose interconnection network begets complex interactions (e.g., network contention) between running applications. These interactions have a strong impact on the performance of the applications [4, 15], and hamper the users' understanding of the system [11]. As the volume of processed data increases, so does the impact of the network.

We propose in this work a generic framework for interference-aware scheduling. More precisely, we identify two main types of interleaved flows: the flows induced by data exchanges for computations, and the flows related to I/O. Rather than explicitly taking into account these network flows, we address the issue of harmful or inefficient interactions by constraining the shape of the allocations. Such an approach aims at taking into account the complexity of the new HPC platforms in a qualitative way that is more likely to scale properly. The scheduling problem is then defined as an optimization problem with the platform (nodes and topology) and the jobs' description as input. The objective is to minimize the maximum completion time, maximize the throughput or optimize any other relevant objective while enforcing constraints on the allocations.

The purpose of this paper is to describe the methodology for interference-aware scheduling. The design of algorithms and the corresponding simulations/experiments are a complementary part of this work: we are currently studying efficient solutions for assessing this methodology, but this paper does not focus on that point.

2 General Problem Setting

Model. A platform is a set \(\mathcal {V}\) of \(m\) nodes divided into two sets: \(m^{\text {C}}\) nodes dedicated to computations, \(\mathcal {V}^{\text {C}}\), and \(m^{{\textit{I/O}}}\) nodes that are entry points to a high performance file system, \(\mathcal {V}^{{\textit{I/O}}}\). The nodes are indexed by \(i \in 0, \dots , m - 1\). This numbering provides an arbitrary ordering of the nodes. We distinguish two interesting distributions of the nodes:

  1. coupled I/O, where some compute nodes are also entry points for the I/O operations (i.e., \(\mathcal {V}^{{\textit{I/O}}}\subseteq \mathcal {V}^{\text {C}}= \mathcal {V}\));

  2. separate I/O, when there is no overlap between compute and I/O nodes (i.e., \(\mathcal {V}^{{\textit{I/O}}}\cap \mathcal {V}^{\text {C}}= \emptyset \)).

We also distinguish two ways of interacting with the I/O nodes, namely, shared I/O when any number of jobs can access an I/O node at any time, and exclusive I/O when an I/O node is exclusively allocated to a job for the job's lifespan. We further annotate node symbols with \({\star }^{{\textit{I/O}}}\) (\({\star }^{\text {C}}\), resp.) if there is a need to distinguish I/O nodes (compute nodes, resp.).

The nodes communicate through an interconnection network, either with a given structured topology (i.e., a connected graph of the interconnect) or with a hierarchical topology (tree-like interconnect). The localization of every node within the topology is known. We define the distance that intrinsically derives from a topology as follows:

Definition 1

(Distance). The distance \({\text {dist}}\left( i,i'\right) \) between two nodes \(i\) and \(i'\) (either compute or I/O) is defined as the minimum number of hops to go from \(i\) to \(i'\). For hierarchical topologies, the distance is defined as the number of traversed levels (switches) to go from \(i\) to \(i'\).
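For a structured topology given as a connected graph, this distance is simply a shortest path measured in hops. Below is a minimal sketch of its computation (in Python; the adjacency map adj is a hypothetical representation of the interconnect, not part of the model above):

```python
from collections import deque

def dist(i, j, adj):
    """Minimum number of hops between nodes i and j (Definition 1),
    computed by breadth-first search over the interconnect graph."""
    if i == j:
        return 0
    seen = {i}
    queue = deque([(i, 0)])
    while queue:
        u, d = queue.popleft()
        for v in adj[u]:
            if v == j:
                return d + 1
            if v not in seen:
                seen.add(v)
                queue.append((v, d + 1))
    raise ValueError("nodes %r and %r are not connected" % (i, j))
```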

Batch schedulers are a critical part of the software stack managing supercomputers: their goal is to efficiently allocate resources (nodes from \(\mathcal {V}\) in our case) to the jobs submitted by the users of the platform. The jobs are queued in a set \(\mathcal {J}\) of \(n\) jobs. Each job \(j\) requires a number of compute nodes \(q^{\text {C}}_j\) and some I/O nodes \(q^{{\textit{I/O}}}_j\). The I/O node requirements can either be a number of nodes (unpinned I/O), or a dedicated subset of \(\mathcal {V}^{{\textit{I/O}}}\) (pinned I/O). The number of allocated nodes is fixed (i.e., the job is rigid [17]). We denote by \(\mathcal {V}(j)\) the nodes allocated to the job \(j\). Each job \(j\) requires a certain time \(p_j\) to be processed, and it is independent of all other jobs. Once a job starts executing, it runs until completion (i.e., it cannot be preempted). Finally, any compute node is able to process at most one job at any time.
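As an illustration, a minimal sketch of this input data (Python dataclasses; the field names are ours, introduced only for illustration):

```python
from dataclasses import dataclass
from typing import Optional, Set

@dataclass
class Job:
    """A rigid, non-preemptible job, as described above."""
    q_compute: int                        # required compute nodes, q^C_j
    q_io: int                             # required I/O nodes, q^{I/O}_j (unpinned I/O)
    p: float                              # processing time p_j
    pinned_io: Optional[Set[int]] = None  # dedicated I/O subset (pinned I/O), if any

@dataclass
class Platform:
    compute_nodes: Set[int]  # V^C, indexed in 0..m-1
    io_nodes: Set[int]       # V^{I/O}; a subset of compute_nodes under coupled I/O
```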

Before presenting the constraints we consider in this work, we need to precisely define the network flows we target. We distinguish two types of flows, directly deriving from the fact that we are dealing with two kinds of nodes.

Definition 2

(Communication types). We distinguish two types of communications (see Fig. 1):

  • compute communications are the communications induced by data exchanges for computations. Such communications occur between two compute nodes allocated to the same application.

  • I/O communications are the communications induced by data exchanges between compute nodes and I/O nodes. Such communications occur when compute nodes read input data, checkpoint the state of the application, or save output results.

Fig. 1. Illustration of the two distinguished types of communications. Note that some communications stay within the allocation, while others do not. White nodes represent compute nodes, and black nodes represent I/O nodes.

As stated in the introduction, we do not aim at finely modeling the context of execution. We propose here to model the platform in such a way that network interactions are implicitly taken into account. We enrich the scheduling problem with additional geometric constraints on the allocations, deriving from the platform topology or the application structure.

Most scheduler implementations are naive, in the sense that they allocate resources greedily. This is known to impact performance [15], and is the core difference between parallel machine scheduling and packing problems. Constraining the allocations to enhance performance is, however, not a new idea. For example, Lucarelli et al. studied the impact of enforcing contiguity or locality constraints in backfilling scheduling [23]. They showed that enforcing these constraints can be done at a small computational cost, and has minimal negative impact on usual metrics such as makespan (i.e., maximum completion time), flow-time (i.e., absolute time spent in the system), or stretch (i.e., time spent in the system relative to each job size). One may refer to [9, 14] for a detailed definition of classic optimization objectives in scheduling.
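For completeness, denoting by \(C_j\) the completion time of a job \(j\) and by \(r_j\) its release time (the notation \(r_j\) is ours and introduced only here), these objectives write: the makespan is \(\max_{j \in \mathcal{J}} C_j\), the flow-time of \(j\) is \(C_j - r_j\), and its stretch is \((C_j - r_j) / p_j\).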

We go further with this model as we target heterogeneous machines, and distinguish network flows. We seek the following properties for a constraint:

  • It captures part of the execution context: enforcing the constraint should help minimize nocuous effects arising from the execution context.

  • It derives from minimal reliable data: constraints on the allocations are enforced ahead of the scheduling decisions. As a result, the proposed constraints only use the topology of the interconnection network and the size of the allocation as input data.

  • It is cheap to compute: enumerating the list of allocations respecting some constraints cannot be a performance bottleneck for the scheduler.

In the two following sections, we study in more detail how to take these constraints into account for structured topologies and for hierarchical topologies.

3 Intrinsic Constraints for Structured Topologies

For the sake of clarity, we consider here a 2D-torus (the same constraints also hold for other regular topologies such as higher-dimensional tori or hypercubes).

Avoiding Compute-Communication Interactions. Considering this classification of network flows, we first present three constraints targeting compute communications.

Definition 3

(Connectivity). An allocation \(\pi \) is said to be connected iff there exists a subset \(\mathcal {V}_\pi \) of \(\mathcal {V}^{{\textit{I/O}}}\) such that \(\left( \pi \cap \mathcal {V}^{\text {C}}\right) \cup \mathcal {V}_\pi \) is connected in the graph-theory sense. \(\mathcal {V}_\pi \) may be empty.

The connectivity constraint ensures, for a given allocation, that there exists a path without interference between any pair of compute nodes of the allocation. Depending on the interconnection topology, however, this can either require support for dynamic routing or demand that the application implement its own routing policy. Moreover, it may lead to islets of isolated compute nodes. Hence, although satisfactory from the graph-theoretical point of view, the connectivity constraint is not sufficient to ensure that compute communications do not interfere. We propose the convexity constraint with the goal of overcoming these limits.
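To make Definition 3 concrete: since enlarging \(\mathcal {V}_\pi \) can only help, an allocation is connected iff its compute nodes all lie in a single connected component of the subgraph induced by these compute nodes together with all I/O nodes. A minimal sketch of this check (Python, reusing a hypothetical adjacency map adj):

```python
from collections import deque

def is_connected(alloc_compute, io_nodes, adj):
    """Definition 3: check whether some subset of I/O nodes makes the
    allocation's compute nodes connected. It suffices to test with the
    full set of I/O nodes, since adding nodes preserves connectivity."""
    if not alloc_compute:
        return True
    allowed = set(alloc_compute) | set(io_nodes)
    start = next(iter(alloc_compute))
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v in allowed and v not in seen:
                seen.add(v)
                queue.append(v)
    return set(alloc_compute) <= seen  # all compute nodes reached?
```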

Definition 4

(Convexity). An allocation is said to be convex iff it is impossible for compute communications from any other potential allocation to share an interconnect link with respect to the underlying routing algorithm.

By taking into account the effective routing policy, and by forbidding any potential sharing, the convexity constraint does forbid interactions.

Note that the convexity constraint dominates the connectivity constraint, as stated in the following Proposition.

Proposition 1

Given any topology, any convex allocation is connected (Fig. 2).

Fig. 2. Example of a convex allocation (dotted orange contour), and a non-convex, but connected allocation (dashed blue contour). The underlying topology is a 2D-torus, with dimension-order routing. White nodes represent compute nodes, and black nodes represent I/O nodes. (Color figure online)

Definition 5

(Contiguity [6, 23]). An allocation is said to be contiguous if and only if the nodes of the allocation form a contiguous range with respect to the nodes' ordering.

One has to note that the contiguity constraint is intrinsically unidimensional, as it relies on the nodes' ordering. For topologies such as trees, lines or rings, the ordering is natural. On higher-dimensional topologies, no natural ordering exists, and an arbitrary mapping is needed. A usual strategy to order nodes is to use space-filling curves (e.g., Z-order curve [24], Hilbert curve [20], etc.) as they enforce a strong spatial locality. Albing proposes various orderings that may be better suited for HPC use cases, and a method to evaluate them [3]. Contiguity is an interesting relaxation of convexity, as it offers good spatial locality properties for a reasonable computing cost. It is however unable to ensure that no two jobs can interact.
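As an illustration of such an ordering, a minimal sketch of the Z-order curve on a 2D grid (Python; the coordinate width of 16 bits is an arbitrary assumption): the key of a node interleaves the bits of its coordinates, and sorting nodes by key yields an ordering with strong spatial locality.

```python
def z_order_key(x, y, bits=16):
    """Morton key of node (x, y): interleave the coordinate bits."""
    key = 0
    for b in range(bits):
        key |= ((x >> b) & 1) << (2 * b)      # x bits at even positions
        key |= ((y >> b) & 1) << (2 * b + 1)  # y bits at odd positions
    return key

# Arbitrary ordering of a 4x4 torus along the Z-order curve; contiguity
# is then enforced with respect to this one-dimensional numbering.
nodes = [(x, y) for x in range(4) for y in range(4)]
ordering = sorted(nodes, key=lambda n: z_order_key(*n))
```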

Avoiding I/O-Communication Interactions. The constraints presented so far are well suited to take into account the compute communications, but not the I/O communications. Indeed, the compute communications may occur between any pair of compute nodes within an allocation: we usually describe this pattern as all-to-all communications. I/O communications, on the other hand, generate traffic towards a few identified nodes in an all-to-one or one-to-all pattern. Hence, we propose the locality constraint, whose goal is to limit the impact of the I/O flows to the periphery of the job allocations (see Fig. 3). We must emphasize that the locality constraint proposed here is not related to the locality constraint previously described by Lucarelli et al. [23].

Definition 6

(Locality). A given allocation for a job \(j\) is said to be local iff it is connected, and every I/O node from \(\mathcal {V}^{{\textit{I/O}}}(j)\) is adjacent to a compute node from \(\mathcal {V}^{\text {C}}(j)\), with respect to the underlying topology. In other words, \(\mathcal {V}^{{\textit{I/O}}}(j)\) is a subset of the closed neighborhood of \(\mathcal {V}^{\text {C}}(j)\).

Fig. 3. Given an allocation (dotted orange contour) for a job \(j\), the allocation is local iff \(j\) uses a subset of the I/O nodes marked with the orange dot. Foreign compute nodes potentially impacted by I/O communications of \(j\) are depicted in gray: these nodes can only be in the neighborhood of the allocation thanks to the locality constraint. The underlying topology is a 2D-torus, with dimension-order routing. White nodes represent compute nodes, and black nodes represent I/O nodes.

Interestingly, the locality constraint enforces a bound on the number of concurrent jobs that can target a given I/O node.

Proposition 2

Given any topology and any I/O node \(i\), at any time, the number of local jobs targeting \(i\) cannot exceed the number of compute nodes adjacent to \(i\).

As a consequence, if the I/O nodes can be shared, the number of concurrent jobs targeting a given I/O node is bounded by the degree of this I/O node. This bound obviously also holds for exclusive I/O, but is of limited interest in this case.
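A minimal sketch of the locality check of Definition 6 and of the bound of Proposition 2 (Python, reusing the hypothetical adjacency map adj and the is_connected sketch from above):

```python
def is_local(alloc_compute, alloc_io, all_io_nodes, adj):
    """Definition 6: connected, and every allocated I/O node is
    adjacent to at least one compute node of the allocation."""
    if not is_connected(alloc_compute, all_io_nodes, adj):
        return False
    return all(any(v in alloc_compute for v in adj[i]) for i in alloc_io)

def local_jobs_bound(io_node, adj, compute_nodes):
    """Proposition 2: at any time, at most this many local jobs can
    target io_node (its number of adjacent compute nodes)."""
    return sum(1 for v in adj[io_node] if v in compute_nodes)
```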

4 Intrinsic Constraints for Hierarchical Topologies

Hierarchical platforms are composed of computing nodes and communication switches. The interconnect is a tree whose leaves are the computing nodes and whose internal nodes correspond to the switches. A group of leaves connected to the same switch is a cluster. The cost of communications inside a cluster is negligible, while external communications require crossing all the switches along the unique path from one node to another. Figure 4 depicts a model of a hierarchical platform.

Fig. 4. Example of a hierarchical topology. White nodes represent compute nodes, and black nodes represent internal nodes (switches).
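A minimal sketch of this tree model and of the induced distance (Python, with a hypothetical parent array encoding the tree; here we count the internal nodes traversed on the unique path, one possible reading of Definition 1 for hierarchical topologies):

```python
def depth(node, parent):
    """Depth of a node in the tree; parent[root] is None."""
    d = 0
    while parent[node] is not None:
        node = parent[node]
        d += 1
    return d

def tree_dist(a, b, parent):
    """Number of switches (internal nodes) traversed between leaves a
    and b: walk both up to their lowest common ancestor."""
    if a == b:
        return 0
    edges, da, db = 0, depth(a, parent), depth(b, parent)
    while da > db:
        a, da, edges = parent[a], da - 1, edges + 1
    while db > da:
        b, db, edges = parent[b], db - 1, edges + 1
    while a != b:
        a, b, edges = parent[a], parent[b], edges + 2
    return edges - 1  # internal nodes on a path = edges - 1
```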

Avoiding Compute-Communication Interactions. In tree-like topologies, the three constraints introduced for torus topologies should be revisited. Specifically, the convexity constraint is not relevant for hierarchical topologies since it implies that the internal nodes (switches) should be exclusively used by a single application, which would significantly degrade the platform utilization. On the other hand, the contiguity constraint can be naturally applied, by considering an arbitrary order of the children of any internal node and then numbering the leaves from left to right. Finally, the definition of the connectivity constraint does not directly apply to hierarchical topologies.

The main characteristic of the hierarchical topologies is that there is no reason to distinguish among nodes that are connected under a common switch. However, the distance between two nodes of the same allocation is very important. In what follows, we define two new constraints that are better suited to tree-like topologies.

Definition 7

(Proximity). A given allocation \(\pi \) for a job \(j\) satisfies the proximity constraint iff the quantity \(\max _{i,i' \in \pi } {\text {dist}}\left( i,i'\right) \) is minimized.

In other words, the maximum distance between any two computing nodes assigned to the job \(j\) should be minimum. Hence, the allocation affects the minimum number of levels of the tree (see Figs. 5b and c).

Definition 8

(Compacity). A given allocation \(\pi \) for a job \(j\) is called compact iff the quantity \(\sum _{i,i' \in \pi } {\text {dist}}\left( i,i'\right) \) is minimized.

Intuitively, the compacity constraint intends not only to use the minimum number of levels in the tree, but also to ensure two qualitative properties of the allocation (see Fig. 5c). First, compacity implies that an allocation spans as few clusters as possible. Second, if a cluster is used, the compacity constraint aims at maximizing the number of nodes allocated within this cluster.

Fig. 5. Illustration of the constraints on a hierarchical topology. The depicted allocations contain five compute nodes (i.e., \(q_{j} = 5\)). White nodes represent compute nodes, and black nodes represent internal nodes (switches).
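A minimal sketch of both quantities (Python, reusing the hypothetical tree_dist sketch from above); a scheduler enforcing these constraints would pick, among feasible allocations of the required size, one minimizing the corresponding score:

```python
from itertools import combinations

def proximity_score(alloc, parent):
    """Definition 7: maximum pairwise distance within the allocation."""
    return max(tree_dist(a, b, parent) for a, b in combinations(alloc, 2))

def compacity_score(alloc, parent):
    """Definition 8: sum of pairwise distances within the allocation
    (up to a factor 2, since each unordered pair is counted once)."""
    return sum(tree_dist(a, b, parent) for a, b in combinations(alloc, 2))
```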

Avoiding I/O-Communication Interactions. In the previous presentation of hierarchical topologies, the I/O nodes have been implicitly placed at the switch level, as is common in many existing architectures [1]. Let us note that our analysis also holds when the I/O nodes are located at the leaf level, as is the case in some architectures such as the interconnect of the private cloud [25].

5 Related Work

Tackling the nocuous interactions arising from the context of execution (or, more specifically, network contention) can be seen as a scheduling problem with uncertainties. Within this framework, there exist two main approaches to abate the uncertainty: either preventing some uncertainties from happening (proactive approach), or mitigating the impact of the uncertainties (reactive approach) [5]. We start by reviewing some related works on the prevention/mitigation of interactions before discussing monitoring techniques.

Interactions Prevention. Some steps have been taken towards integrating more knowledge about the communication patterns of applications into the batch scheduler. For example, Georgiou et al. studied the integration of TreeMatch into SLURM [19]. Given the communication matrix of an application, the scheduler minimizes the load of the network links by smartly mapping the application's processes on the resources. This approach however is limited to tree-like topologies, and does not consider the temporality of communications. Targeting mesh/torus topologies, the works of Tuncer et al. [29] and Pascual et al. [26] are noteworthy. Another way to prevent interactions is to force the scheduler to use only certain allocation shapes with good properties: this strategy has been implemented in the Blue Waters scheduler [15]. The administrators of Blue Waters let the scheduler pick a shape among \(460\) precomputed cuboids.

Yet, the works cited above only target compute communications. HPC applications usually rely on highly tuned libraries such as MPI-IO, parallel netCDF or HDF5 to perform their I/O. Tessier et al. propose to integrate topology awareness into these libraries [28]. They show that performing data aggregation while considering the topology allows diminishing the bandwidth required to perform I/O. The CLARISSE approach proposes to coordinate the data staging steps while considering the full I/O stack [21].

Interactions Mitigation. Given a set of applications, Gainaru et al. propose to schedule the I/O flows of concurrent applications [18]. Their work aims at mitigating I/O congestion within the interconnect once applications have been allocated computation resources. To achieve such a goal, their algorithm relies on past I/O patterns of the applications to either maximize the global system utilization, or minimize the maximum slowdown induced by sharing bandwidth. Deeper in the I/O stack, at the I/O node level, the I/O flows can be reorganized to better match the characteristics of the storage devices [8].

Application/Platform Instrumentation. The approaches discussed above require knowledge of the application communication patterns (either compute or I/O communications). A lot of effort has been put into developing tools to better understand the behavior of HPC applications. Characterizing I/O patterns is key as it allows the developers to identify performance bottlenecks, and allows the system administrators to better configure the platforms. Some tools, such as Darshan [10], instrument the most used I/O libraries, and record every I/O-related function call. The gathered logs provide valuable data for postmortem analysis. Taking a complementary path, Omnisc'IO aims at predicting I/O performance during execution [13]. The predictions rely on a formal grammar to model the I/O behavior of the instrumented application.

These instrumentation efforts allow for a better use of the scarce communication resources. However, as they are application-centric, they fail to capture inter-application interactions. Monitoring the platform is a way of getting insight into the inter-application interactions [2, 16]. For example, the OVIS/LDMS system deployed on Blue Waters collects \(194\) metrics on each of the \(27648\) nodes every minute [2]. Among the metrics of interest are the network counters: the number of stalls is a good indicator of congestion [12].

6 Conclusion and Future Work

The goal of this paper was to propose a methodology for handling data communications in modern parallel platforms, for both structured topologies and hierarchical interconnects. Our proposal was to identify relevant constraints that can easily be integrated into an optimization problem. We have successfully applied this methodology to a specific topology (line/ring of processors) [7].

Defining constraints that work well for any kind of topology has proven troublesome. This raises the question of whether a topology-agnostic heuristic can be designed at a reasonable cost while achieving decent performance. If not, it would be interesting to classify the topologies, and propose class-specific constraints and heuristics. We are currently working on the design of a generic heuristic that can address several topologies.

The proposed constraints are strongly expected to have a positive impact on performance, as they implicitly emphasize data locality. We have, however, not yet verified through experiments that these constraints indeed have a positive impact on network usage. The benefits of these experiments will be twofold: first, validating the proposed constraints; second, providing feedback to design better-suited constraints.