1 Introduction

In the past few years, data science has grown considerably in importance and now heavily influences many domains, ranging from economy and finance to biology and medicine. As we rely more and more on data science for making decisions, we become increasingly vulnerable to programming errors.

Fig. 1.
figure 1

Overview of the program semantics presented in the paper. The dependency semantics, derived by abstraction of the trace semantics, is sound and complete for data usage. Further sound but not complete abstractions are shown on the right.

Programming errors can cause frustration, especially when they lead to a program failure after hours of computation. However, programming errors that do not cause failures can have more serious consequences as code that produces an erroneous but plausible result gives no indication that something went wrong. A notable example is the paper “Growth in a Time of Debt” published in 2010 by economists Reinhart and Rogoff, which was widely cited in political debates and was later demonstrated to be flawed. Notably, one of the flaws was a programming error, which entirely excluded some data from the analysis [23]. Its critics hold that this paper led to unjustified adoption of austerity policies for countries with various levels of public debt [30]. Programming errors in data analysis code for medical applications are even more critical [27]. It is thus paramount to achieve a high level of confidence in the correctness of data science code.

The likelihood that a programming error causes some input data to remain unused is particularly high for data science applications, where data goes through long pipelines of modules that acquire, filter, merge, and manipulate it. In this paper, we propose an abstract interpretation [14] framework to automatically detect unused input data. We characterize when a program uses (some of) its input data using the notion of dependency between the input data and the outcome of the program. Our notion of dependency accounts for non-determinism and non-termination. Thus, it encompasses notions of dependency that arise in many different contexts, such as secure information flow and program slicing [1], as well as provenance or lineage analysis [9], to name a few.

Following the theory of abstract interpretation [12], we systematically derive a new program semantics that precisely captures exactly the information needed to reason about input data usage, abstracting away from irrelevant details about the program behavior. Figure 1 gives an overview of our approach. The semantics is first expressed in a constructive fixpoint form over sets of sets of traces, by partitioning the operational trace semantics of a program based on its outcome (cf. outcome semantics in Fig. 1), and a further abstraction ignores intermediate state computations (cf. dependency semantics in Fig. 1). Starting the development of the semantics from the operational trace semantics enables a uniform mathematical reasoning about programs semantics and program properties (Sect. 3). In particular, since input data usage is not a trace property or a subset-closed property [11] (Sect. 4), we show that a formulation of the semantics using sets of sets of traces is necessary for a sound validation of input data usage via fixpoint approximation [28].

This clear design principle provides a unifying framework for reasoning about existing analyses based on dependencies. We survey existing analyses and identify key design decisions that limit or facilitate their applicability to input data usage, and we assess their precision. We show that non-interference analyses [6] are sound for proving that a terminating program does not use any of its input data; although this is too strong a property in general. We prove that strongly live variable analysis [20] is sound for data usage even for non-terminating programs, albeit it is imprecise with respect to implicit dependencies between program variables. We then derive a more precise static analysis similar to dependency analyses used in the context of backward program slicing [37]. Finally, we demonstrate the value of expressing these analyses as abstract interpretations by combining them with an existing abstraction of compound data structures such as arrays and lists [16]. This allows us to detect unused chunks of the input data, and thus apply our work to realistic data science applications.

2 Trace Semantics

The semantics of a program is a mathematical characterization of its behavior when executed for all possible input data. We model the operational semantics of a program as a transition system \(\langle \varSigma ,\tau \rangle \) where \(\varSigma \) is a (potentially infinite) set of program states and the transition relation \(\tau \subseteq \varSigma \times \varSigma \) describes the possible transitions between states [12, 14]. Note that this model allows representing programs with (possibly unbounded) non-determinism. The set \(\varOmega {\mathop {=}\limits ^{{\tiny \mathrm{def}}}}\{s \in \varSigma ~|~\forall s' \in \varSigma :~\langle s,s' \rangle \not \in \tau \}\) is the set of final states of the program.

In the following, let \(\varSigma ^n {\mathop {=}\limits ^{{\tiny \mathrm{def}}}}\left\{ s_0 \cdots s_{n-1} \mid \forall i < n: s_i \in \varSigma \right\} \) be the set of all sequences of exactly n program states. We write \(\varepsilon \) to denote the empty sequence, i.e., \(\varSigma ^0 {\mathop {=}\limits ^{{\tiny \mathrm{def}}}}\left\{ \varepsilon \right\} \). Let \(\varSigma ^\star {\mathop {=}\limits ^{{\tiny \mathrm{def}}}}\bigcup _{n \in \mathbb {N}} \varSigma ^n\) be the set of all finite sequences, \(\varSigma ^+ {\mathop {=}\limits ^{{\tiny \mathrm{def}}}}\varSigma ^\star \setminus \varSigma ^0\) be the set of all non-empty finite sequences, \(\varSigma ^\omega \) be the set of all infinite sequences, \(\varSigma ^{+\infty }{\mathop {=}\limits ^{{\tiny \mathrm{def}}}}\varSigma ^+ \mathrel {\cup } \varSigma ^\omega \) be the set of all non-empty finite or infinite sequences and \(\varSigma ^{\star \infty } {\mathop {=}\limits ^{{\tiny \mathrm{def}}}}\varSigma ^\star \mathrel {\cup } \varSigma ^\omega \) be the set of all finite or infinite sequences of program states. In the following, we write \(\sigma \sigma '\) for the concatenation of two sequences \(\sigma , \sigma ' \in \varSigma ^{\star \infty }\) (with \(\sigma \varepsilon = \varepsilon \sigma = \sigma \), and \(\sigma \sigma ' = \sigma \) when \(\sigma \in \varSigma ^\omega \)), \(T^+ {\mathop {=}\limits ^{{\tiny \mathrm{def}}}}T \mathrel {\cap } \varSigma ^+\) and \(T^\omega {\mathop {=}\limits ^{{\tiny \mathrm{def}}}}T \mathrel {\cap } \varSigma ^\omega \) for the selection of the non-empty finite sequences and the infinite sequences of \(T \in \mathcal {P}\left( \varSigma ^{\star \infty }\right) \), and \(T \mathrel {;}T' {\mathop {=}\limits ^{{\tiny \mathrm{def}}}}\left\{ \sigma s \sigma ' \mid s \in \varSigma \wedge \sigma s \in T \wedge s \sigma '\in T'\right\} \) for the merging of two sets of sequences \(T \in \mathcal {P}\left( \varSigma ^+\right) \) and \(T' \in \mathcal {P}\left( \varSigma ^{+\infty }\right) \), when a finite sequence in T terminates with the initial state of a sequence in \(T'\).

Fig. 2.
figure 2

First fixpoint iterates of the trace semantics \(\varLambda \).

Given a transition system \(\langle \varSigma ,\tau \rangle \), a trace is a non-empty sequence of program states described by the transition relation \(\tau \), that is, \(\langle s,s' \rangle \in \tau \) for each pair of consecutive states \(s, s' \in \varSigma \) in the sequence. The set of final states \(\varOmega \) and the transition relation \(\tau \) can be understood as sets of traces of length one and length two, respectively. The trace semantics \(\varLambda \in \mathcal {P}\left( \varSigma ^{+\infty }\right) \) generated by a transition system \(\langle \varSigma ,\tau \rangle \) is the union of all finite traces that are terminating with a final state in \(\varOmega \), and all infinite traces. It can be expressed as a least fixpoint in the complete lattice [12]:

(1)

where the computational order is . Figure 2 illustrates the first fixpoint iterates. The fixpoint iteration starts from the set of all infinite sequences of program states. At each iteration, the final program states in \(\varOmega \) are added to the set, and sequences already in the set are extended by prepending transitions to them. In this way, we add increasingly longer finite traces, and we remove infinite sequences of states with increasingly longer prefixes not forming traces. In particular, the i-th iterate builds all finite traces of length less than or equal to i, and selects all infinite sequences whose prefixes of length i form traces. At the limit we obtain all infinite traces and all finite traces that terminate in a final state in \(\varOmega \). Note that \(\varLambda \) is suffix-closed.

The trace semantics \(\varLambda \) fully describes the behavior of a program. However, to reason about a particular property of a program, it is not necessary to consider all aspects of its behavior. In fact, reasoning is facilitated by the design of a semantics that abstracts away from irrelevant details about program executions. In the next sections, we define our property of interest and use abstract interpretation [14] to systematically derive, by successive abstractions of the trace semantics, a semantics that precisely captures such a property.

3 Input Data Usage

A property is specified by its extension, that is, the set of elements having such a property [14, 15]. Thus, properties of program traces in \(\varSigma ^{+\infty }\) are sets of traces in \(\mathcal {P}\left( \varSigma ^{+\infty }\right) \), and properties of programs with trace semantics in \(\mathcal {P}\left( \varSigma ^{+\infty }\right) \) are sets of sets of traces in \(\mathcal {P}\left( \mathcal {P}\left( \varSigma ^{+\infty }\right) \right) \). Accordingly, a program P satisfies a property \(\mathcal {H}\in \mathcal {P}\left( \mathcal {P}\left( \varSigma ^{+\infty }\right) \right) \) if and only if its semantics \([\![{P} ]\!] \in \mathcal {P}\left( \varSigma ^{+\infty }\right) \) belongs to \(\mathcal {H}\):

$$\begin{aligned} P \models \mathcal {H}\Leftrightarrow [\![{P} ]\!] \in \mathcal {H}\end{aligned}$$
(2)

Some program properties are defined in terms of individual program traces and can be equivalently expressed as trace properties. This is the case for the traditional safety [26] and liveness [4] properties of programs. In such a case, a program P satisfies a trace property \(\mathcal {T}\) if and only if all traces in its semantics \([\![{P} ]\!]\) belong to the property: \(P \models \mathcal {T}\Leftrightarrow [\![{P} ]\!] \subseteq \mathcal {T}\).

Program properties that establish a relation between different program traces cannot be expressed as trace properties [11]. Examples are security properties such as non-interference [21, 35]. In this paper, we consider a closely related but more general property called input data usage, which expresses that the outcome of a program does not depend on (some of) its input data. The notion of outcome accounts for non-determinism as well as non-termination. Thus, our notion of dependency encompasses non-interference as well as notions of dependency that arise in many other contexts [1, 9]. We further explore this in Sects. 8 to 10.

Let each program P with trace semantics \([\![{P} ]\!]\) have a set \(\mathrm {I}_P\) of input variables and a set \(\mathrm {O}_P\) of output variablesFootnote 1. For simplicity, we can assume that these variables are all of the same type (e.g., boolean variables) and their values are all in a set \(\mathrm {V} \) of possible values (e.g., where is the boolean value true and is the boolean value false). Given a trace \(\sigma \in [\![{P} ]\!]\), we write \(\sigma [0]\) to denote its initial state and \(\sigma [\omega ]\) to denote its outcome, that is, its final state if the trace is finite or \(\bot \) if the trace is infinite. The input variables at the initial states of the traces of a program store the values of its input data: we write \(\sigma [0](i)\) to denote the value of the input data stored in the input variable i at the initial state of the trace \(\sigma \), and \(\sigma _1[0] \ne _i \sigma _2[0]\) to denote that the initial states of two traces \(\sigma _1\) and \(\sigma _2\) disagree on the value of the input variable i but agree on the values of all other variables. The output variables at the final states of the finite traces of a program store its result: we write \(\sigma [\omega ](o)\) to denote the result stored in the output variable o at the final state of a finite trace \(\sigma \). We can now formally define when an input variable \(i \in \mathrm {I}_P\) is unused with respect to a program with trace semantics \([\![{P} ]\!] \in \mathcal {P}\left( \varSigma ^{+\infty }\right) \):

$$\begin{aligned} \begin{aligned} \textsc {unused}_i([\![{P} ]\!]) {\mathop {=}\limits ^{{\tiny \mathrm{def}}}}~&\forall \sigma \in [\![{P} ]\!], v \in \mathrm {V} :\sigma [0](i) \ne v \mathrel {\Rightarrow } \\&\exists \sigma ' \in [\![{P} ]\!]:\sigma '[0] \ne _i \sigma [0] \mathrel {\wedge } \sigma '[0](i) = v \mathrel {\wedge } \sigma [\omega ] = \sigma '[\omega ] \end{aligned} \end{aligned}$$
(3)

Intuitively, an input variable i is unused if all feasible program outcomes (e.g., the outcome \(\sigma [\omega ]\) of a trace \(\sigma \)) are feasible from all possible initial values of i (i.e., for all possible initial values v of i that differ from the initial value of i in \(\sigma \), there exists a trace with initial value v for i that has the same outcome \(\sigma [\omega ]\)). In other words, the outcome of the program is the same independently of the initial value of the input variable i. Note that this definition accounts for non-determinism (since it considers each program outcome independently) and non-termination (since a program outcome can be \(\bot \)).

Fig. 3.
figure 3

Simple program to check if a student has passed three school subjects. The programmer has made two mistakes at line 7 and at line 9, which cause the input data stored in the variables \(\texttt {english}\) and \(\texttt {science}\) to be unused.

Example 1

Let us consider the simple program P in Fig. 3. Based on the input variables \(\texttt {english}\), \(\texttt {math}\), and \(\texttt {science}\) (cf. lines 1–3), the program is supposed to check if a student has passed all three considered school subjects and store the result in the output variable \(\texttt {passing}\) (cf. line 11). For mathematics and science, the student is allowed a bonus based on the input variable \(\texttt {bonus}\) (cf. line 8 and 9). However, the programmer has made two mistakes at line 7 and at line 9, which cause the input variables \(\texttt {english}\) and \(\texttt {science}\) to be unused.

Let us now consider the input variable \(\texttt {science}\). The trace semantics of the program (simplified to consider only the variables \(\texttt {science}\) and \(\texttt {passing}\)) is:

where each state \((v_1v_2)\) shows the boolean value \(v_1\) of \(\texttt {science}\) and \(v_2\) of \(\texttt {passing}\), and \(\_\) denotes any boolean value. We omitted the trace suffixes for brevity. The input variable \(\texttt {science}\) is unused, since each result value ( or ) for \(\texttt {passing}\) is feasible from all possible initial values of \(\texttt {science}\). Note that all other outcomes of the program (i.e., non-termination) are not feasible.

Let us now consider the input variable \(\texttt {math}\). The trace semantics of the program (now simplified to only consider \(\texttt {math}\) and \(\texttt {passing}\)) is the following:

In this case, the input variable \(\texttt {math}\) is used since only the initial state yields the result value for \(\texttt {passing}\) (in the final state ).    \(\blacksquare \)

The input data usage property \(\mathcal {N} \) can now be formally defined as follows:

$$\begin{aligned} \mathcal {N} {\mathop {=}\limits ^{{\tiny \mathrm{def}}}}\left\{ [\![{P} ]\!] \in \mathcal {P}\left( \varSigma ^{+\infty }\right) \mid \forall i \in \mathrm {I}_P:\textsc {unused}_i([\![{P} ]\!]) \right\} \end{aligned}$$
(4)

which states that the outcome of a program does not depend on any input data. In practice one is interested in weaker input data usage properties for a subset J of the input variables, i.e., \(\mathcal {N} _J {\mathop {=}\limits ^{{\tiny \mathrm{def}}}}\{[\![{P} ]\!] \in \mathcal {P}\left( \varSigma ^{+\infty }\right) \mid \forall i \in J \subseteq \mathrm {I}_P:\textsc {unused}_i([\![{P} ]\!])\}\).

In the following, we use abstract interpretation to reason about input data usage. In the next section, we discuss the challenges to the application of the standard abstract interpretation framework that emerge from the fact that input data usage cannot be expressed as a trace property.

4 Sound Input Data Usage Validation

In the standard framework of abstract interpretation, one defines a semantics that precisely captures a property \(\mathcal {S}\) of interest by abstraction of the trace semantics \(\varLambda \) [12]. Then, further abstractions \(\varLambda ^\natural \) provide sound over-approximations \(\gamma (\varLambda ^\natural )\) of \(\varLambda \) (by means of a concretization function \(\gamma \)): \(\varLambda \subseteq \gamma (\varLambda ^\natural )\). For a trace property, an over-approximation \(\gamma ([\![{P} ]\!]^\natural )\) of the semantics \([\![{P} ]\!]\) of a program P allows a sound validation of the property: since \([\![{P} ]\!] \subseteq \gamma ([\![{P} ]\!]^\natural )\), we have that \(\gamma ([\![{P} ]\!]^\natural ) \subseteq \mathcal {S}\Rightarrow [\![{P} ]\!] \subseteq \mathcal {S}\) and so, if \(\gamma ([\![{P} ]\!]^\natural ) \subseteq \mathcal {S}\), we can conclude that \(P \models \mathcal {S}\) (cf. Sect. 3). This conclusion is also valid for all other subset-closed properties [11]: since by definition \(\gamma ([\![{P} ]\!]^\natural ) \in \mathcal {S}\Rightarrow \forall T \subseteq \gamma ([\![{P} ]\!]^\natural ):T \in \mathcal {S}\), we have that \(\gamma ([\![{P} ]\!]^\natural ) \in \mathcal {S}\Rightarrow [\![{P} ]\!] \in \mathcal {S}\) (and so we can conclude that \(P \models \mathcal {S}\)). However, for program properties that are not subset-closed, we have that \(\gamma ([\![{P} ]\!]^\natural ) \in \mathcal {S}\not \Rightarrow [\![{P} ]\!] \in \mathcal {S}\) [28] and so we cannot conclude that \(P \models \mathcal {S}\), even if \(\gamma ([\![{P} ]\!]^\natural ) \in \mathcal {S}\) (cf. Eq. 2).

We have seen in the previous section that input data usage is not a trace property. The example below shows that it is not a subset-closed property either.

Example 2

Let us consider again the program P and its semantics \([\![{P} ]\!]_\texttt {science}\) and \([\![{P} ]\!]_{\texttt {math}}\) shown in Example 1. We have seen in Example 1 that the semantics \([\![{P} ]\!]_{\texttt {science}}\) belongs to the data usage property \(\mathcal {N} \): \([\![{P} ]\!]_{\texttt {science}} \in \mathcal {N} \). Let us consider now the following subset T of \([\![{P} ]\!]_{\texttt {science}}\):

In this case, the input variable \(\texttt {science}\) is used. Indeed, we can observe that T coincides with \([\![{P} ]\!]_{\texttt {math}}\) (except for the considered input variable). Thus \(T \not \in \mathcal {N} \) even though \(T \subseteq [\![{P} ]\!]_{\texttt {science}}\).    \(\blacksquare \)

Since input data usage is not subset-closed, we are in the unfortunate situation that we cannot use the standard abstract interpretation framework to soundly prove that a program does not use (some of) its input data using an over-approximation of the semantics of the program: \(\gamma ([\![{P} ]\!]^\natural ) \in \mathcal {N} _J \not \Rightarrow [\![{P} ]\!] \in \mathcal {N} _J\).

We solve this problem in the next section, by lifting the trace semantics \([\![{P} ]\!] \in \mathcal {P}\left( \varSigma ^{+\infty }\right) \) of a program P (i.e., a set of traces) to a set of sets of traces [28]. In this setting, a program P satisfies a property \(\mathcal {H}\) if and only if its semantics is a subset of \(\mathcal {H}\):

(5)

As we will explain in the next section, now an over-approximation of allows again a sound validation of the property: since , we have that (and so we can conclude that \(P \models \mathcal {H}\)).

More specifically, in the next section, we define a program semantics that precisely captures which subset J of the input variables is unused by a program P. In later sections, we present further abstractions that over-approximate the subset of the input variables that may be used by P, and thus allows a sound validation of an under-approximation \(J^\natural \) of J: . In other words, this means that every input variable reported as unused by an abstraction is indeed not used by the program.

5 Outcome Semantics

We lift the trace semantics \(\varLambda \) to a set of sets of traces by partitioning. The partitioning abstraction \(\alpha _Q:\mathcal {P}\left( \varSigma ^{+\infty }\right) \rightarrow \mathcal {P}\left( \mathcal {P}\left( \varSigma ^{+\infty }\right) \right) \) of a set of traces T is:

$$\begin{aligned} \alpha _Q(T) {\mathop {=}\limits ^{{\tiny \mathrm{def}}}}\left\{ T \cap C \mid C \in Q \right\} \end{aligned}$$
(6)

where \(Q \in \mathcal {P}\left( \mathcal {P}\left( \varSigma ^{+\infty }\right) \right) \) is a partition of sequences of program states.

More specifically, to reason about input data usage of a program P, we lift the trace semantics \([\![{P} ]\!]\) to by partitioning it into sets of traces that yield the same program outcome. The key insight behind this idea is that, given an input variable i, the initial states of all traces in a partition give all initial values for i that yield a program outcome; the variable i is unused if and only if these initial values are all the possible values for i (or the set of values is empty because the outcome is unfeasible, cf. Eq. 3). Thus, if the trace semantics \([\![{P} ]\!]\) of a program P belongs to the input data usage property \(\mathcal {N} _J\), then each partition in must also belong to \(\mathcal {N} _J\), and vice versa: we have that , which is precisely what we want (cf. Eq. 5).

Let \(T^+_{o = v}\) denote the subset of the finite sequences of program states in \(T \in \mathcal {P}\left( \varSigma ^{+\infty }\right) \) with value v for the output variable o in their outcome (i.e., their final state): \(T^+_{o = v} {\mathop {=}\limits ^{{\tiny \mathrm{def}}}}\left\{ \sigma \in T^+ \mid \sigma [\omega ](o) = v \right\} \). We define the outcome partition \(O \in \mathcal {P}\left( \mathcal {P}\left( \varSigma ^{+\infty }\right) \right) \) of sequences of program states:

$$\begin{aligned} O {\mathop {=}\limits ^{{\tiny \mathrm{def}}}}\left\{ \varSigma ^+_{o_1 = v_1, ..., o_k = v_k} \mid v_1, \dots , v_k \in \mathrm {V} \right\} \cup \left\{ \varSigma ^\omega \right\} \end{aligned}$$

where \(\mathrm {V} \) is the set of possible values of the output variables \(o_1, \dots , o_k\) (cf. Sect. 3). The partition contains all sets of finite sequences that agree on the values of the output variables in their outcome, and all infinite sequences of program states (i.e., all sequences with outcome \(\bot \)). We instantiate \(\alpha _Q\) above with the outcome partition to obtain the outcome abstraction \(\alpha _\bullet :\mathcal {P}\left( \varSigma ^{+\infty }\right) \rightarrow \mathcal {P}\left( \mathcal {P}\left( \varSigma ^{+\infty }\right) \right) \):

$$\begin{aligned} \alpha _\bullet (T) {\mathop {=}\limits ^{{\tiny \mathrm{def}}}}\left\{ T^+_{o_1 = v_1, ..., o_k = v_k} \mid v_1, \dots , v_k \in \mathrm {V} \right\} \mathrel {\cup } \left\{ T^\omega \right\} \end{aligned}$$
(7)

Example 3

The program P of Example 1 has only one output variable \(\texttt {passing}\) with boolean value or . Let us consider again the trace semantics \([\![{P} ]\!]_{\texttt {math}}\) shown in Example 1. Its outcome abstraction \(\alpha _\bullet ([\![{P} ]\!]_{\texttt {math}})\) is:

Note that all traces with different result values for the output variable \(\texttt {passing}\) belong to different sets of traces (i.e., partitions) in \(\alpha _\bullet ([\![{P} ]\!]_{\texttt {math}})\). The empty set corresponds to the (unfeasible) non-terminating outcome of the program.    \(\blacksquare \)

We can now use the outcome abstraction \(\alpha _\bullet \) to define the outcome semantics \(\varLambda _\bullet \in \mathcal {P}\left( \mathcal {P}\left( \varSigma ^{+\infty }\right) \right) \) as an abstraction of the trace semantics \(\varLambda \):

Definition 1

The outcome semantics \(\varLambda _\bullet \in \mathcal {P}\left( \mathcal {P}\left( \varSigma ^{+\infty }\right) \right) \) is defined as:

$$\begin{aligned} \varLambda _\bullet {\mathop {=}\limits ^{{\tiny \mathrm{def}}}}\alpha _\bullet (\varLambda ) \end{aligned}$$
(8)

where \(\alpha _\bullet \) is the outcome abstraction (cf. Eq. 7) and \(\varLambda \in \mathcal {P}\left( \varSigma ^{+\infty }\right) \) is the trace semantics (cf. Eq. 1).

The outcome semantics contains the set of all infinite traces and all sets of finite traces that agree on the value of the output variables in their outcome.

In the following, we express the outcome semantics \(\varLambda _\bullet \) in a constructive fixpoint form. This allows us to later derive further abstractions of \(\varLambda _\bullet \) by fixpoint transfer and fixpoint approximation [12]. Given a set of sets of traces S, we write \(S^+_{o = v} {\mathop {=}\limits ^{{\tiny \mathrm{def}}}}\left\{ T \in S \mid T = T^+_{o = v} \right\} \) for the selection of the sets of traces in S that agree on the value v of the output variable o in their outcome, and \(S^\omega {\mathop {=}\limits ^{{\tiny \mathrm{def}}}}\left\{ T \in S \mid T = T^\omega \right\} \) for the selection of the sets of infinite traces in S. When \(S^+_{o = v}\) (resp. \(S^\omega \)) contains a single set of traces T, we abuse notation and write \(S^+_{o = v}\) (resp. \(S^\omega \)) to also denote T. The following result gives a fixpoint definition of \(\varLambda _\bullet \) in the complete lattice , where the computational order is defined (similarly to , cf. Sect. 2) as:

Theorem 1

The outcome semantics \(\varLambda _\bullet \in \mathcal {P}\left( \mathcal {P}\left( \varSigma ^{+\infty }\right) \right) \) can be expressed as a least fixpoint in as:

(9)

where .

Fig. 4.
figure 4

First iterates of the outcome semantics \(\varLambda _\bullet \) for a single output variable o.

Figure 4 illustrates the first fixpoint iterates of the outcome semantics for a single output variable o. The fixpoint iteration starts from the partition containing the set of all infinite sequences of program states and the empty set (which represents an empty set of finite traces). At the first iteration, the empty set is replaced with a partition of the final states \(\varOmega \) based on the value v of the output variable o, while the infinite sequences are extended by prepending transitions to them (similarly to the trace semantics, cf. Eq. 1). At the next iterations, all sequences contained in each partition are further extended, and the final states that agree on the value v of o are again added to the matching set of traces that agree on v in their outcome. At the limit, we obtain a partition containing the set of all infinite traces and all sets of finite traces that agree on the value v of the output variable o in their outcome.

To prove Theorem 1 we first need to show that the outcome abstraction \(\alpha _\bullet \) preserves least upper bounds of non-empty sets of sets of traces.

Lemma 1

The outcome abstraction \(\alpha _\bullet \) is Scott-continuous.

Proof

We need to show that for any non-empty ascending chain C of sets of traces with least upper bound \(\sqcup C\), we have that , that is, \(\alpha _\bullet (\sqcup C)\) is the least upper bound of \(\alpha _\bullet (C)\), the image of C via \(\alpha _\bullet \).

First, we know that \(\alpha _\bullet \) is monotonic, i.e., for any two sets of traces \(T_1\) and \(T_2\) we have . Since \(\sqcup C\) is the least upper bound of C, for any set T in C we have that and, since \(\alpha _\bullet \) is monotonic, we have that . Thus \(\alpha (\sqcup C)\) is an upper bound of \(\left\{ \alpha _\bullet (T) \mid T \in C \right\} \).

To show that \(\alpha (\sqcup C)\) is the least upper bound of \(\alpha _\bullet (C)\), we need to show that for any other upper bound U of \(\alpha _\bullet (C)\) we have . Let us assume by absurd that . Then, there exists \(T_1 \in \alpha _\bullet (\sqcup C)\) and \(T_2 \in U\) such that : \(T_1^+ \supset T_2^+\) or \(T_1^\omega \subset T_2^\omega \). Let us assume that \(T_1^+ \supset T_2^+\). By definition of \(\alpha _\bullet \), we observe that \(T_1\) is a partition of and, since is the least upper bound of C, U cannot be an upper bound of \(\alpha _\bullet (C)\) (since \(T_2\) does not contain enough finite traces). Similarly, if \(T_1^\omega \subset T_2^\omega \), then U cannot be an upper bound of \(\alpha _\bullet (C)\) (since \(T_2\) contains too many infinite traces). Thus, we must have and we can conclude that \(\alpha (\sqcup C)\) is the least upper bound of \(\alpha _\bullet (C)\).    \(\square \)

We can now prove Theorem 1 by Kleenian fixpoint transfer [12].

Proof

(Sketch). The proof follows by Kleenian fixpoint transfer. We have that is a complete lattice and that \(\phi ^{+\infty }\) (cf. Eq. 1) and \(\varTheta _\bullet \) (cf. Eq. 8) are monotonic function. Additionally, we have that the outcome abstraction \(\alpha _\bullet \) (cf. Eq. 7) is Scott-continuous (cf. Lemma 1) and such that \(\alpha _\bullet (\varSigma ^\omega ) = \left\{ \varSigma ^\omega , \emptyset \right\} \) and \(\alpha _\bullet \circ \phi ^{+\infty }= \varTheta _\bullet \circ \alpha _\bullet \). Then, by Kleenian fixpoint transfer, we have that . Thus, we can conclude that .    \(\square \)

Finally, we show that the outcome semantics \(\varLambda _\bullet \) is sound and complete for proving that a program does not use (a subset of) its input variables.

Theorem 2

A program does not use a subset J of its input variables if and only if its outcome semantics \(\varLambda _\bullet \) is a subset of \(\mathcal {N} _J\):

$$\begin{aligned} P \models \mathcal {N} _J \Leftrightarrow \varLambda _\bullet \subseteq \mathcal {N} _J \end{aligned}$$

Proof

(Sketch). The proof follows immediately from the definition of \(\mathcal {N} _J\) (cf. Eq. 3 and Sect. 4) and the definition of \(\varLambda _\bullet \) (cf. Eq. 8).    \(\square \)

Example 4

Let us consider again the program P and its semantics \([\![{P} ]\!]_{\texttt {science}}\) shown in Example 1. The corresponding outcome semantics \(\alpha _\bullet ([\![{P} ]\!]_{\texttt {science}})\) is:

Note that all sets of traces in \(\alpha _\bullet ([\![{P} ]\!]_{\texttt {science}})\) belong to \(\mathcal {N} _{\left\{ \texttt {science}\right\} }\): the initial states of all traces in a non-empty partition contain all possible initial values ( or ) for the input variable \(\texttt {science}\). Thus, P satisfies \(\mathcal {N} _{\left\{ \texttt {science}\right\} }\) and, indeed, the input variable \(\texttt {science}\) is unused by P.    \(\blacksquare \)

As discussed in Sect. 4, we now can again use the standard framework of abstract interpretation to soundly over-approximate \(\varLambda _\bullet \) and prove that a program does not use (some of) its input data. In the next section, we propose an abstraction that remains sound and complete for input data usage. Further sound but not complete abstractions are presented in later sections.

6 Dependency Semantics

We observe that, to reason about input data usage, it is not necessary to consider all intermediate state computations between the initial state of a trace and its outcome. Thus, we can further abstract the outcome semantics \(\varLambda _\bullet \) into a set \(\varLambda _\rightsquigarrow \) of (dependency) relations between initial states and outcomes of a set of traces.

We lift the abstraction defined for this purpose on sets of traces [12] to \(\alpha _\rightsquigarrow :\mathcal {P}\left( \mathcal {P}\left( \varSigma ^{+\infty }\right) \right) \rightarrow \mathcal {P}\left( \mathcal {P}\left( \varSigma \times \varSigma _\bot \right) \right) \) on sets of sets of traces:

(10)

where \(\varSigma _\bot {\mathop {=}\limits ^{{\tiny \mathrm{def}}}}\varSigma \cup \left\{ \bot \right\} \). The dependency abstraction \(\alpha _\rightsquigarrow \) ignores all intermediate states between the initial state \(\sigma [0]\) and the outcome \(\sigma [\omega ]\) of all traces \(\sigma \) in all partitions T of S. Observe that a trace \(\sigma \) that consists of a single state s is abstracted as a pair \(\langle s, s \rangle \). The corresponding dependency concretization function \(\gamma _\rightsquigarrow :\mathcal {P}\left( \mathcal {P}\left( \varSigma \times \varSigma _\bot \right) \right) \rightarrow \mathcal {P}\left( \mathcal {P}\left( \varSigma ^{+\infty }\right) \right) \) over-approximates the original sets of traces by inserting arbitrary intermediate states:

$$\begin{aligned} \gamma _\rightsquigarrow (S) {\mathop {=}\limits ^{{\tiny \mathrm{def}}}}\left\{ T \in \mathcal {P}\left( \varSigma ^{+\infty }\right) \mid \left\{ \langle \sigma [0], \sigma [\omega ] \rangle \in \varSigma \times \varSigma _\bot \mid \sigma \in T \right\} \in S \right\} \end{aligned}$$
(11)

Example 5

Let us consider again the program of Example 1 and its outcome semantics \(\alpha _\bullet ([\![{P} ]\!]_{\texttt {math}})\) shown in Example 3. Its dependency abstraction is:

which explicitly ignores intermediate program states.    \(\blacksquare \)

Using \(\alpha _\rightsquigarrow \), we now define the dependency semantics \(\varLambda _\rightsquigarrow \in \mathcal {P}\left( \mathcal {P}\left( \varSigma ^{+\infty }\right) \right) \) as an abstraction of the outcome semantics \(\varLambda _\bullet \).

Definition 2

The dependency semantics \(\varLambda _\rightsquigarrow \in \mathcal {P}\left( \mathcal {P}\left( \varSigma ^{+\infty }\right) \right) \) is defined as:

$$\begin{aligned} \varLambda _\rightsquigarrow {\mathop {=}\limits ^{{\tiny \mathrm{def}}}}\alpha _\rightsquigarrow (\varLambda _\bullet ) \end{aligned}$$
(12)

where \(\varLambda _\bullet \in \mathcal {P}\left( \mathcal {P}\left( \varSigma ^{+\infty }\right) \right) \) is the outcome semantics (cf. Eq. 8) and \(\alpha _\rightsquigarrow \) is the dependency abstraction (cf. Eq. 10).

Neither the Kleenian fixpoint transfer nor the Tarskian fixpoint transfer can be used to obtain a fixpoint definition for the dependency semantics, but we have to proceed by union of disjoint fixpoints [12]. To this end, we observe that the outcome semantics \(\varLambda _\bullet \) can be equivalently expressed as follows:

(13)

where \(\varLambda _\bullet ^+\) and \(\varLambda _\bullet ^\omega \) separately compute the set of all sets of finite traces that agree on their outcome, and the set of all infinite traces, respectively.

In the following, given a set of traces \(T \in \mathcal {P}\left( \varSigma ^{+\infty }\right) \) and its dependency abstraction \(\alpha _\rightsquigarrow (T)\), we abuse notation and write \(T^+\) (resp. \(T^\omega \)) to also denote \(\alpha _\rightsquigarrow (T)^+ {\mathop {=}\limits ^{{\tiny \mathrm{def}}}}\alpha _\rightsquigarrow (T) \cap (\varSigma \times \varSigma )\) (resp. \(\alpha _\rightsquigarrow (T)^\omega {\mathop {=}\limits ^{{\tiny \mathrm{def}}}}\alpha _\rightsquigarrow (T) \cap (\varSigma \times \left\{ \bot \right\} )\)). Similarly, we reuse the symbols for the computational order , least upper bound , and greatest lower bound , instead of their abstractions. We can now use the Kleenian and Tarskian fixpoint transfer to separately derive fixpoint definitions of \(\alpha _\rightsquigarrow (\varLambda _\bullet ^+)\) and \(\alpha _\rightsquigarrow (\varLambda _\bullet ^\omega )\) in .

Lemma 2

The abstraction \(\varLambda _\rightsquigarrow ^+ {\mathop {=}\limits ^{{\tiny \mathrm{def}}}}\alpha _\rightsquigarrow (\varLambda _\bullet ^+) \in \mathcal {P}\left( \mathcal {P}\left( \varSigma \times \varSigma \right) \right) \) can be expressed as a least fixpoint in as:

(14)

Proof

(Sketch). By Kleenian fixpoint transfer (cf. Theorem 17 in [12]).    \(\square \)

Lemma 3

The abstraction \(\varLambda _\rightsquigarrow ^\omega {\mathop {=}\limits ^{{\tiny \mathrm{def}}}}\alpha _\rightsquigarrow (\varLambda _\bullet ^\omega ) \in \mathcal {P}\left( \mathcal {P}\left( \varSigma \times \varSigma \right) \right) \) can be expressed as a least fixpoint in as:

(15)

Proof

(Sketch). By Tarskian fixpoint transfer (cf. Theorem 18 in [12]).    \(\square \)

The fixpoint iteration for \(\varLambda _\rightsquigarrow ^+\) starts from the set containing only the empty relation. At the first iteration, the empty relation is replaced by all relations between pairs of final states that agree on the values of the output variables. At each next iteration, all relations are combined with the transition relation to obtain relations between initial and final states of increasingly longer traces. At the limit, we obtain the set of all relations between the initial and the final states of a program that agree on the final value of the output variables. The fixpoint iteration for \(\varLambda _\rightsquigarrow ^\omega \) starts from the set containing (the set of) all pairs of states and the \(\bot \) outcome, and each iteration discards more and more pairs with initial states that do not belong to infinite traces of the program.

Now we can use Lemmas 2 and 3 to express the dependency semantics \(\varLambda _\rightsquigarrow \) in a constructive fixpoint form (as the union of \(\varLambda _\rightsquigarrow ^+\) and \(\varLambda _\rightsquigarrow ^\omega \)).

Theorem 3

The dependency semantics \(\varLambda _\rightsquigarrow \in \mathcal {P}\left( \mathcal {P}\left( \varSigma \times \varSigma _\bot \right) \right) \) can be expressed as a least fixpoint in as:

(16)

Proof

(Sketch). The proof follows immediately from Lemmas 2 and 3.    \(\square \)

Finally, we show that the dependency semantics \(\varLambda _\rightsquigarrow \) is sound and complete for proving that a program does not use (a subset of) its input variables.

Theorem 4

A program does not use a subset J of its input variables if and only if the image via \(\gamma _\rightsquigarrow \) of its dependency semantics \(\varLambda _\rightsquigarrow \) is a subset of \(\mathcal {N} _J\):

$$\begin{aligned} P \models \mathcal {N} _J \Leftrightarrow \gamma _\rightsquigarrow (\varLambda _\rightsquigarrow ) \subseteq \mathcal {N} _J \end{aligned}$$

Proof

(Sketch). The proof follows from the definition of \(\varLambda _\rightsquigarrow \) (cf. Eq. 12) and \(\gamma _\rightsquigarrow \) (cf. Eq. 11), and from Theorem 2.    \(\square \)

Example 6

Let us consider again the program P and its outcome semantics \(\alpha _\bullet ([\![{P} ]\!]_{\texttt {science}})\) from Example 4. The corresponding dependency semantics is:

and, by definition of \(\gamma _\rightsquigarrow \), we have that its concretization \(\gamma _\rightsquigarrow (\alpha _\rightsquigarrow (\alpha _\bullet ([\![{P} ]\!]_{\texttt {science}})))\) is an over-approximation of \(\alpha _\bullet ([\![{P} ]\!]_{\texttt {science}})\). In particular, since intermediate state computations are irrelevant for deciding the input data usage property, all sets of traces in \(\gamma _\rightsquigarrow (\alpha _\rightsquigarrow (\alpha _\bullet ([\![{P} ]\!]_{\texttt {science}})))\) are over-approximations of exactly one set in \(\alpha _\bullet ([\![{P} ]\!]_{\texttt {science}})\) with the same set of initial states and outcome. Thus, in this case, we can observe that all sets of traces in \(\gamma _\rightsquigarrow (\alpha _\rightsquigarrow (\alpha _\bullet ([\![{P} ]\!]_{\texttt {science}})))\) belong to \(\mathcal {N} _{\left\{ \texttt {science}\right\} }\) and correctly conclude that P does not use the variable \(\texttt {science}\).    \(\blacksquare \)

At this point we have a sound and complete program semantics that captures only the minimal information needed to decide which input variables are unused by a program. In the rest of the paper, we present various static analyses for input data usage by means of sound abstractions of this semantics, which under-approximate (resp. over-approximate) the subset of the input variables that are unused (resp. used) by a program.

7 Input Data Usage Abstractions

We introduce a simple sequential programming language with boolean variables, which we use for illustration throughout the rest of the paper:

$$\begin{aligned} e&::= v \mid x \mid \texttt {not}~e \mid e~\texttt {and}~e \mid e~\texttt {or}~e&{\text {(expressions)}} \\ s&::= \texttt {skip} \mid x = e \mid \texttt {if}~e:s~\texttt {else}:s \mid \texttt {while}~e:s \mid s~s&{\text {(statements)}} \end{aligned}$$

where v ranges over boolean values, and x ranges over program variables. The \(\texttt {skip}\) statement, which does nothing, is a placeholder useful, for instance, for writing a conditional \(\texttt {if}\) statement without an \(\texttt {else}\) branch: \(\texttt {if}~e:s~\texttt {else}:\texttt {skip}\). In the following, we often simply write \(\texttt {if}~e:s\) instead of \(\texttt {if}~e:s~\texttt {else}:\texttt {skip}\). Note that our work is not limited by the choice of a particular programming language, as the formal treatment in previous sections is language independent.

In Sects. 8 and 9, we show that existing static analyses based on dependencies [6, 20] are abstractions of the dependency semantics \(\varLambda _\rightsquigarrow \). We define each abstraction \(\varLambda ^\natural \) over a partially ordered set called abstract domain. More specifically, for each program statement s, we define a transfer function \(\varTheta ^\natural [\![{s} ]\!]:\mathcal {A} \rightarrow \mathcal {A}\), and the abstraction \(\varLambda ^\natural \) is the composition of the transfer functions of all statements in a program. We derive a more precise static analysis similar to dependency analyses used for program slicing [37] in Sect. 10. Finally, Sect. 11 demonstrates the value of expressing such analyses as abstract domains by combining them with an existing abstraction of compound data structures such as arrays and lists [16] to detect unused chunks of input data.

8 Secure Information Flow Abstractions

Secure information flow analysis [18] aims at proving that a program will not leak sensitive information. Most analyses focus on proving non-interference [35] by classifying program variables into different security levels [17], and ensuring the absence of information flow from variables with higher security level to variables with lower security level. The most basic classification comprises a low security level L, and a high security level H: program variables classified as L are public information, while variables classified as H are private information.

In our context, if we classify input variables as H and all other variables as L, possiblistic non-interference [21] coincides with the input data usage property \(\mathcal {N} \) (cf. Eq. 4) restricted to consider only terminating programs. However, in general, (possibilistic) non-interference is too strong for our purposes as it requires that none of the input variables is used by a program. We illustrate this using as an example a non-interference analysis recently proposed by Assaf et al. [6] that is conveniently formalized in the framework of abstract interpretation. We briefly present here a version of the originally proposed analysis, simplified to consider only the security levels L and H, and we point out the significance of the definitions for input data usage.

Let \(\mathcal {L} {\mathop {=}\limits ^{{\tiny \mathrm{def}}}}\left\{ L, H\right\} \) be the set of security levels, and let the set \(\mathrm {X} \) of all program variables be partitioned into a set \(\mathrm {X} _L\) of variables classified as L and a set \(\mathrm {X} _H\) of variables classified as H (i.e., the input variables). A dependency constraint \(L \rightsquigarrow x\) expresses that the current value of the variable x depends only on the initial values of variables having at most security level L (i.e., it does not depend on the initial value of any of the input variables). The non-interference analysis \(\varLambda _\mathrm {F} \) proposed by Assaf et al. is a forward analysis in the lattice where \(\mathrm {F} {\mathop {=}\limits ^{{\tiny \mathrm{def}}}}\left\{ L \rightsquigarrow x \mid x \in \mathrm {X} \right\} \) is the set of all dependency constraints, , and \(S_1 \sqcup _\mathrm {F} S_2 {\mathop {=}\limits ^{{\tiny \mathrm{def}}}}S_1 \cap S_2\). The transfer function \(\varTheta _\mathrm {F} [\![{s} ]\!]:\mathcal {P}\left( \mathrm {F} \right) \rightarrow \mathcal {P}\left( \mathrm {F} \right) \) for each statement s in our simple programming language is defined as follows:

where \(\textsc {w}(s)\) denotes the set of variables modified by the statement s, and \(\mathcal {V} _\mathrm {F} [\![{e} ]\!]S\) determines whether a set of dependencies S guarantees that the expression e has a unique value independently of the initial value of the input variables. For a variable x, \(\mathcal {V} _\mathrm {F} [\![{x} ]\!]S\) is true if and only if \(L \rightsquigarrow x \in S\). Otherwise, \(\mathcal {V} _\mathrm {F} [\![{e} ]\!]S\) is defined recursively on the structure of e, and it is always true for a boolean value v [6]. An assignment \(x = e\) discards all dependency constraints related to the assigned variable x, and adds constraints \(L \rightsquigarrow x\) if e has a unique value independently of the initial values of the input variables. This captures an explicit flow of information between e and x. A conditional statement \(\texttt {if}~e:s_1~\texttt {else}:s_2\) joins the dependency constraints obtained from \(s_1\) and \(s_2\), if e does not depend on the initial values of the input variables (i.e., \(\mathcal {V} _\mathrm {F} [\![{e} ]\!]S\) is true). Otherwise, it discards all dependency constraints related to the variables modified in either of its branches. This captures an implicit flow of information from e. The initial set of dependencies contains a constraint \(L \rightsquigarrow x\) for each variable x that is not an input variable. We exemplify the analysis below.

Example 7

Let us consider again the program P from Example 1 (stripped of the input and print statements, which are not present in our simple language):

figure a

The analysis begins from the set of dependency constraints \(\left\{ L \rightsquigarrow \texttt {passing}\right\} \), which classifies input variables as H and all other variables as L. The assignment at line 1 leaves the set unchanged as the value of the expression on the right-hand side of the assignment does not depend on the initial value of the input variables. The set remains unchanged by the conditional statement at line 2, even though the boolean condition depends on the input variable \(\texttt {english}\), because the variable \(\texttt {passing}\) is not modified. Finally, at line 3 and 4, the analysis captures an explicit flow of information from the input variable \(\texttt {bonus}\) and an implicit flow of information from the input variable \(\texttt {math}\). Thus, the set of dependency constraints becomes empty at line 3, and remains empty at line 4.

Observe that, in this case, non-interference does not hold since the result of the program depends on some of the input variables. Therefore, the analysis is only able to conclude that at least one of the input variables may be used by the program, but it cannot determine which input variables are unused.    \(\blacksquare \)

The example shows that non-interference is too strong a property in general. Of course, one could determine which input variables are unused by running multiple instances of the non-interference analysis \(\varLambda _\mathrm {F} \), each one of them classifying a single different input variable as H and all other variables as L. However, this becomes cumbersome in a data science application where a program reads and manipulates a large amount of input data.

Moreover, we emphasize that our input data usage property is more general than (possibilistic) non-interference since it also considers non-termination. We are not aware of any work on termination-sensitive possibilistic non-interference.

Example 8

Let us modify the program P shown in Example 7 as follows:

figure b

In this case, since the loop at line 2 does not modify the output variable \(\texttt {passing}\), the non-interference analysis \(\varLambda _\mathrm {F} \) will leave the initial set of dependency constraints \(\left\{ L \rightsquigarrow \texttt {passing}\right\} \) unchanged, meaning that the result of the program does not depend on any of its input variables. However, the input variable \(\texttt {english}\) is used since its value influences the outcome of the program: the program terminates if \(\texttt {english}\) is true, and does not terminate otherwise.    \(\blacksquare \)

The example demonstrates that the analysis is unsound for a non-terminating program.Footnote 2 We show that the non-interference analysis \(\varLambda _\mathrm {F} \) is sound for proving that a program does not use any of its input variables, only if the program is terminating. We define the concretization function \(\gamma _\mathrm {F} :\mathcal {P}\left( \mathrm {F} \right) \rightarrow \mathcal {P}\left( \mathcal {P}\left( \varSigma \times \varSigma \right) \right) \):

(17)

The abstraction function \(\alpha _\mathrm {F} :\mathcal {P}\left( \mathcal {P}\left( \varSigma \times \varSigma \right) \right) \rightarrow \mathcal {P}\left( \mathrm {F} \right) \) maps each relation R between states of a program to the corresponding set of dependency constraints: \(\alpha _\mathrm {F} (R) {\mathop {=}\limits ^{{\tiny \mathrm{def}}}}\left\{ L \rightsquigarrow x \mid x \in \mathrm {X} _L \wedge \forall i \in \mathrm {X} _H:\textsc {unused}_{i,x}(R) \right\} \), where \(\textsc {unused}_{i,x}\) is the relational abstraction of \(\textsc {unused}_i\) (cf. Eq. 3) in which we compare only the result stored in the variable x (i.e., we compare \(\sigma [\omega ](o)\) and \(\sigma '[\omega ](o)\), instead of \(\sigma [\omega ]\) and \(\sigma '[\omega ]\) as in Eq. 3).

Theorem 5

A terminating program does not use any of its input variables if the image via \(\gamma _\rightsquigarrow \circ \gamma _\mathrm {F} \) of its non-interference abstraction \(\varLambda _\mathrm {F} \) is a subset of \(\mathcal {N} \):

$$\begin{aligned} \gamma _\rightsquigarrow (\gamma _\mathrm {F} (\varLambda _\mathrm {F})) \subseteq \mathcal {N} \Rightarrow P \models \mathcal {N} \end{aligned}$$

Proof

Let us assume that \(\gamma _\rightsquigarrow (\gamma _\mathrm {F} (\varLambda _\mathrm {F})) \subseteq \mathcal {N} \). By definition of \(\gamma _\mathrm {F} \) (cf. Eq. 17), since the program is terminating, we have that \(\varLambda _\rightsquigarrow \subseteq \gamma _\mathrm {F} (\varLambda _\mathrm {F})\) and, by monotonicity of the concretization function \(\gamma _\rightsquigarrow \) (cf. Eq. 11), we have that \(\gamma _\rightsquigarrow (\varLambda _\rightsquigarrow ) \subseteq \gamma _\rightsquigarrow (\gamma _\mathrm {F} (\varLambda _\mathrm {F}))\). Thus, since \(\gamma _\rightsquigarrow (\gamma _\mathrm {F} (\varLambda _\mathrm {F})) \subseteq \mathcal {N} \), we have that \(\gamma _\rightsquigarrow (\varLambda _\rightsquigarrow ) \subseteq \mathcal {N} \). The conclusion follows from Theorem 4.    \(\square \)

Note that the termination of the program is necessary for the proof of Theorem 5. Indeed, for a non-terminating program, we have that \(\varLambda _\rightsquigarrow \not \subseteq \gamma _\mathrm {F} (\varLambda _\mathrm {F})\) (since \(\varLambda _\rightsquigarrow \) includes relational abstractions of infinite traces that are missing from \(\gamma _\mathrm {F} (\varLambda _\mathrm {F})\)) and thus we cannot conclude the proof.

This result shows that the non-interference analysis \(\varLambda _\mathrm {F} \) is an abstraction of the dependency semantics \(\varLambda _\rightsquigarrow \) presented earlier. However, we remark that the same result applies to all other instances in this important class of analysis [5, 25, etc.], which are therefore subsumed by our framework.

9 Strongly Live Variable Abstraction

Strongly live variable analysis [20] is a variant of the classic live variable analysis [32] performed by compilers to determine, for each program point, which variables may be potentially used before they are assigned to. A variable is strongly live if it is used in an assignment to another strongly live variable, or if is used in a statement other than an assignment. Otherwise, a variable is considered faint.

Strongly live variable analysis \(\varLambda _\mathrm {X} \) is a backward analysis in the complete lattice \(\langle \mathcal {P}\left( \mathrm {X} \right) , \subseteq , \cup , \cap , \emptyset , \mathrm {X} \rangle \), where \(\mathrm {X} \) is the set of all program variables. The transfer function \(\varTheta _\mathrm {X} [\![{s} ]\!]:\mathcal {P}\left( \mathrm {X} \right) \rightarrow \mathcal {P}\left( \mathrm {X} \right) \) for each statement s is defined as:

$$\begin{aligned} \varTheta _\mathrm {X} [\![{\texttt {skip}} ]\!](S)&{\mathop {=}\limits ^{{\tiny \mathrm{def}}}}S \\ \varTheta _\mathrm {X} [\![{x = e} ]\!](S)&{\mathop {=}\limits ^{{\tiny \mathrm{def}}}}{\left\{ \begin{array}{ll} (S \setminus \left\{ x\right\} ) \cup \textsc {vars} (e) &{} x \in S \\ S &{} \mathrm {otherwise} \end{array}\right. } \\ \varTheta _\mathrm {X} [\![{\texttt {if}~b:s_1~\texttt {else}:s_2} ]\!](S)&{\mathop {=}\limits ^{{\tiny \mathrm{def}}}}\textsc {vars} (b) \cup \varTheta _\mathrm {X} [\![{s_1} ]\!](S) \cup \varTheta _\mathrm {X} [\![{s_2} ]\!](S) \\ \varTheta _\mathrm {X} [\![{\texttt {while}~b:s} ]\!](S)&{\mathop {=}\limits ^{{\tiny \mathrm{def}}}}\textsc {vars} (b) \cup \varTheta _\mathrm {X} [\![{s} ]\!](S) \\ \varTheta _\mathrm {X} [\![{s_1~s_2} ]\!](S)&{\mathop {=}\limits ^{{\tiny \mathrm{def}}}}\varTheta _\mathrm {X} [\![{s_1} ]\!] \circ \varTheta _\mathrm {X} [\![{s_2} ]\!](S) \end{aligned}$$

where \(\textsc {vars} (e)\) is the set of variables in the expression e. For input data usage, the initial set of strongly live variables contains the output variables of the program.

Example 9

Let us consider again the program P shown in Example 7. The strongly live variable analysis begins from the set \(\left\{ \texttt {passing}\right\} \) containing the output variable \(\texttt {passing}\). At line 3, the set of strongly live variables is \(\left\{ \texttt {math}, \texttt {bonus}\right\} \) since \(\texttt {bonus}\) is used in an assignment to the strongly live variable \(\texttt {passing}\), and \(\texttt {math}\) is used in the condition of the \(\texttt {if}\) statement. Finally, at line 1, the set of strongly live variables is \(\left\{ \texttt {english}, \texttt {math}, \texttt {bonus}\right\} \) because \(\texttt {english}\) is used in the condition of the \(\texttt {if}\) statement at line 2. Thus, strongly live variable analysis is able to conclude that the input variable \(\texttt {science}\) is unused. However, it is not precise enough to determine that the variable \(\texttt {english}\) is also unused.    \(\blacksquare \)

The imprecision of the analysis derives from the fact that it does not capture implicit flows of information precisely (cf. Sect. 8) but only over-approximates their presence. Thus, the analysis is unable to detect when a conditional statement, for instance, modifies only variables that have no impact on the outcome of a program; a situation likely to arise due to a programming error, as shown in the previous example. However, in virtue of this imprecise treatment of implicit flows, we can show that strongly live variable analysis is sound for input data usage, even for non-terminating programs.

We define the concretization function \(\gamma _\mathrm {X} :\mathcal {P}\left( X\right) \rightarrow \mathcal {P}\left( \mathcal {P}\left( \varSigma \times \varSigma _\bot \right) \right) \) as:

$$\begin{aligned} \gamma _\mathrm {X} (S) {\mathop {=}\limits ^{{\tiny \mathrm{def}}}}\left\{ R \in \varSigma \times \varSigma _\bot \mid \forall i \in \mathrm {X} \setminus S:\textsc {unused}_i(R) \right\} \end{aligned}$$
(18)

where we abuse notation and use \(\textsc {unused}_i\) (cf. Eq. 3) to also denote its dependency abstraction (cf. Eq. 10). We now show that strongly live variable analysis is sound for proving that a program does not use the faint variables.

Theorem 6

A program does not use a subset J of its input variables if the image via \(\gamma _\rightsquigarrow \circ \gamma _\mathrm {X} \) of its strongly live variable abstraction \(\varLambda _\mathrm {X} \) is a subset of \(\mathcal {N} _J\):

$$\begin{aligned} \gamma _\rightsquigarrow (\gamma _\mathrm {X} (\varLambda _\mathrm {X})) \subseteq \mathcal {N} _J \Rightarrow P \models \mathcal {N} _J \end{aligned}$$

Proof

Let us assume that \(\gamma _\rightsquigarrow (\gamma _\mathrm {X} (\varLambda _\mathrm {X})) \subseteq \mathcal {N} _J\). By definition of \(\gamma _\mathrm {X} \) (cf. Eq. 18), we have that \(\varLambda _\rightsquigarrow \subseteq \gamma _\mathrm {X} (\varLambda _\mathrm {X})\) and, by monotonicity of \(\gamma _\rightsquigarrow \) (cf. Eq. 11), we have that \(\gamma _\rightsquigarrow (\varLambda _\rightsquigarrow ) \subseteq \gamma _\rightsquigarrow (\gamma _\mathrm {X} (\varLambda _\mathrm {X}))\). Thus, since \(\gamma _\rightsquigarrow (\gamma _\mathrm {X} (\varLambda _\mathrm {X})) \subseteq \mathcal {N} _J\), we have that \(\gamma _\rightsquigarrow (\varLambda _\rightsquigarrow ) \subseteq \mathcal {N} _J\). The conclusion follows from Theorem 4.    \(\square \)

This result shows that also strongly live variable analysis is subsumed by our framework as it is an abstraction of the dependency semantics \(\varLambda _\rightsquigarrow \).

10 Syntactic Dependency Abstractions

In the following, we derive a more precise data usage analysis based on syntactic dependencies between program variables. For simplicity, the analysis does not take program termination into account, but we discuss possible solutions at the end of the section. Due to space limitations, we only provide a terse description of the abstraction and refer to [36] for further details.

Fig. 5.
figure 5

Hasse diagram for the complete lattice .

In order to capture implicit dependencies from variables appearing in boolean conditions of conditional and while statements, we track when the value of a variable is used or modified in a statement based on the level of nesting of the statement in other statements. More formally, each program variable maps to a value in the complete lattice shown in Fig. 5: the values \(U\) (used) and \(N\) (not-used) respectively denote that a variable may be used and is not used at the current nesting level; the values \(B\) (below) and \(W\) (over written) denote that a variable may be used at a lower nesting level, and the value W additionally indicates that the variable is modified at the current nesting level.

A variable is used (i.e., maps to \(U\)) if it is used in an assignment to another variable that is used in the current or a lower nesting level (i.e., a variable that maps to \(U\) or \(B\)). We define the operator \(\textsc {assign} [\![{x = e} ]\!]\) to compute the effect of an assignment on a map \(m:\mathrm {X} \rightarrow \textsc {usage}\), where X is the set of all variables:

$$\begin{aligned} \textsc {assign} [\![{x = e} ]\!](m) {\mathop {=}\limits ^{{\tiny \mathrm{def}}}}\lambda y. {\left\{ \begin{array}{ll} W&{} y = x \wedge y \not \in \textsc {vars} (e) \wedge m(x) \in \left\{ U, B\right\} \\ U&{} y \in \textsc {vars} (e) \wedge m(x) \in \left\{ U, B\right\} \\ m(y) &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(19)

The assigned variable is overwritten (i.e., maps to \(W\)), unless it is used in e.

Another reason for a variable to be used is if it appears in the boolean condition e of a statement that uses another variable or modifies another used variable (i.e., there exists a variable x that maps to \(U\) or \(W\)):

$$\begin{aligned} \textsc {filter} [\![{e} ]\!](m) {\mathop {=}\limits ^{{\tiny \mathrm{def}}}}\lambda y. {\left\{ \begin{array}{ll} U&{} y \in \textsc {vars} (e) \wedge \exists x \in \mathrm {X} :m(x) \in \left\{ U, W\right\} \\ m(y) &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(20)

We maintain a stack of these maps that grows or shrinks based on the level of nesting of the currently analyzed statement. More formally, a stack is a tuple \(\langle m_0, m_1, \dots , m_k \rangle \) of mutable length k, where each element \(m_0, m_1, \dots , m_k\) is a map from \(\mathrm {X} \) to \(\textsc {usage}\). In the following, we use \(\mathrm {Q} \) to denote the set of all stacks, and we abuse notation by writing \(\textsc {assign} [\![{x = e} ]\!]\) and \(\textsc {filter} [\![{e} ]\!]\) to also denote the corresponding operators on stacks:

$$\begin{aligned} \textsc {assign} [\![{x = e} ]\!](\langle m_0, m_1, \dots , m_k \rangle )&{\mathop {=}\limits ^{{\tiny \mathrm{def}}}}\langle \textsc {assign} [\![{x = e} ]\!](m_0), m_1, \dots , m_k \rangle \\ \textsc {filter} [\![{e} ]\!](\langle m_0, m_1, \dots , m_k \rangle )&{\mathop {=}\limits ^{{\tiny \mathrm{def}}}}\langle \textsc {filter} [\![{e} ]\!](m_0), m_1, \dots , m_k \rangle \end{aligned}$$

The operator \(\textsc {push} \) duplicates the map at the top of the stack and modifies the copy using the operator \(\textsc {inc} \), to account for an increased nesting level:

$$\begin{aligned} \begin{aligned} \textsc {push} (\langle m_0, m_1, \dots , m_k \rangle )&{\mathop {=}\limits ^{{\tiny \mathrm{def}}}}\langle \textsc {inc} (m_0), m_0, m_1, \dots , m_k \rangle \\ \textsc {inc} (m)&{\mathop {=}\limits ^{{\tiny \mathrm{def}}}}\lambda y. {\left\{ \begin{array}{ll} B&{} m(y) \in \left\{ U\right\} \\ N&{} m(y) \in \left\{ W\right\} \\ m(y) &{} \text {otherwise} \end{array}\right. } \end{aligned} \end{aligned}$$
(21)

A used variable (i.e., mapping to \(U\)) becomes used below (i.e., now maps to \(B\)), and a modified variable (i.e., mapping to \(W\)) becomes unused (i.e., now maps to \(N\)). The dual operator \(\textsc {pop} \) combines the two maps at the top of the stack:

$$\begin{aligned} \begin{aligned} \textsc {pop} (\langle m_0, m_1, \dots , m_k \rangle )&{\mathop {=}\limits ^{{\tiny \mathrm{def}}}}\langle \textsc {dec} (m_0, m_1), \dots , m_k \rangle \\ \textsc {dec} (m, k)&{\mathop {=}\limits ^{{\tiny \mathrm{def}}}}\lambda y. {\left\{ \begin{array}{ll} k(y) &{} m(y) \in \left\{ B, N\right\} \\ m(y) &{} \text {otherwise} \end{array}\right. } \end{aligned} \end{aligned}$$
(22)

where the \(\textsc {dec} \) operator restores the value a variable y mapped to before increasing the nesting level (i.e., k(y)) if it has not changed since (i.e., if the variable still maps to \(B\) or \(N\)), and otherwise retains the new value y maps to.

We can now define the data usage analysis \(\varLambda _\mathrm {Q} \), which is a backward analysis on the lattice . The partial order and the least upper bound \(\sqcup _\mathrm {Q} \) are the pointwise lifting, for each element of the stack, of the partial order and least upper bound between maps from \(\mathrm {X} \) to \(\textsc {usage}\) (which in turn are the pointwise lifting of the partial order and least upper bound \(\sqcup _\textsc {usage}\) of the \(\textsc {usage}\) lattice, cf. Fig. 5). We define the transfer function \(\varTheta _\mathrm {Q} [\![{s} ]\!]:\mathrm {Q} \rightarrow \mathrm {Q} \) for each statement s in our simple programming language as follows:

The initial stack contains a single map, in which the output variables map to the value \(U\), and all other variables map to \(N\). We exemplify the analysis below.

Fig. 6.
figure 6

Data usage analysis of the last statement of the program shown in Example 7. Stack elements are separated by \(\mid \) and, for brevity, variables mapping to \(N\) are omitted.

Example 10

Let us consider again the program P shown in Example 7. The initial stack begins with a single map m, in which the output variable \(\texttt {passing}\) maps to \(U\) and all other variables map to \(N\).

At line 4, before analyzing the body of the conditional statement, a modified copy of m is pushed onto the stack: this copy maps \(\texttt {passing}\) to \(B\), meaning that \(\texttt {passing}\) is only used in a lower nesting level, and all other variables still map to \(N\) (cf. Eq. 21). As a result of the assignment (cf. Eq. 19), \(\texttt {passing}\) is overwritten (i.e., maps to \(W\)), and bonus is used (i.e., maps to \(U\)). Since the body of the conditional statement modifies a used variable and uses another variable, the analysis of its boolean condition makes \(\texttt {math}\) used as well (cf. Eq. 20). Finally, the maps at the top of the stack are merged and the result maps \(\texttt {math}\), \(\texttt {bonus}\), and \(\texttt {passing}\) to \(U\), and all other variables to \(N\) (cf. Eq. 22). The analysis is visualized in Fig. 6.

The stack remains unchanged at line 3 and line 2, since the statement at line 3 is identical to line 4 and the body of the conditional statement at line 2 does not modify any used variable and does not use any other variable. Finally, at line 1 the variable \(\texttt {passing}\) is modified (i.e., it now maps to \(W\)), while \(\texttt {math}\) and \(\texttt {bonus}\) remain used (i.e., they map to \(U\)). Thus, the analysis is precise enough to conclude that the input variables \(\texttt {english}\) and \(\texttt {science}\) are unused.    \(\blacksquare \)

Note that, similarly to the non-interference analysis presented in Sect. 8, the data usage analysis \(\varLambda _\mathrm {Q} \) does not consider non-termination. Indeed, for the program shown in Example 8, the analysis does not capture that the input variable \(\texttt {english}\) is used, even though the termination of the program depends on its value. We define the concretization function \(\gamma _\mathrm {Q} :\mathrm {Q} \rightarrow \mathcal {P}\left( \mathcal {P}\left( \varSigma \times \varSigma \right) \right) \) as:

$$\begin{aligned} \gamma _\mathrm {Q} (\langle m_0, \dots , m_k \rangle ) {\mathop {=}\limits ^{{\tiny \mathrm{def}}}}\left\{ R \in \varSigma \times \varSigma \mid \forall i \in \mathrm {X} :m_0(i) \in \left\{ N\right\} \Rightarrow \textsc {unused}_i(R) \right\} \end{aligned}$$
(23)

where again we write \(\textsc {unused}_i\) (cf. Eq. 3) to also denote its dependency abstraction. We now show that \(\varLambda _\mathrm {Q} \) is sound for proving that a program does not use a subset of its input variables, if the program is terminating.

Theorem 7

A terminating program does not use a subset J of its input variables if the image via \(\gamma _\rightsquigarrow \circ \gamma _\mathrm {Q} \) of its abstraction \(\varLambda _\mathrm {Q} \) is a subset of \(\mathcal {N} _J\):

$$\begin{aligned} \gamma _\rightsquigarrow (\gamma _\mathrm {Q} (\varLambda _\mathrm {Q})) \subseteq \mathcal {N} _J \Rightarrow P \models \mathcal {N} _J \end{aligned}$$

Proof

Let us assume that \(\gamma _\rightsquigarrow (\gamma _\mathrm {Q} (\varLambda _\mathrm {Q})) \subseteq \mathcal {N} _J\). Since the program is terminating, we have that \(\varLambda _\rightsquigarrow \subseteq \gamma _\mathrm {Q} (\varLambda _\mathrm {Q})\), by definition of the concretization function \(\gamma _\mathrm {Q} \) (cf. Eq. 23). Then, by monotonicity of \(\gamma _\rightsquigarrow \) (cf. Eq. 11), we have that \(\gamma _\rightsquigarrow (\varLambda _\rightsquigarrow ) \subseteq \gamma _\rightsquigarrow (\gamma _\mathrm {Q} (\varLambda _\mathrm {Q}))\). Thus, since \(\gamma _\rightsquigarrow (\gamma _\mathrm {Q} (\varLambda _\mathrm {Q})) \subseteq \mathcal {N} _J\), we have that \(\gamma _\rightsquigarrow (\varLambda _\rightsquigarrow ) \subseteq \mathcal {N} _J\). The conclusion follows from Theorem 4.    \(\square \)

In order to take termination into account, one could map each variable appearing in the guard of a loop to the value \(U\). Alternatively, one could run a termination analysis [3, 33, 34], along with the data usage analysis, and only map to \(U\) variables appearing in the loop guard of a possibly non-terminating loop.

11 Piecewise Abstractions

The static analyses presented so far can be used only to detect unused data stored in program variables. However, realistic data science applications read and manipulate data organized in data structures such as arrays, lists, and dictionaries. In the following, we demonstrate that having expressed the analyses as abstract domains allows us to easily lift the analyses to such a scenario. In particular, to detect unused chunks of the input data, we combine the more precise data usage analysis presented in the previous section with the array content abstraction proposed by Cousot et al. [16]. Due to space limitations, we provide only an informal description of the resulting abstract domain and refer to [36] for further details and examples. The analyses presented in earlier sections can be similarly combined with the array abstraction for the same purpose.

We extend our small programming language introduced in Sect. 7 with integer variables, arithmetic and boolean comparison expressions, and arrays:

$$\begin{aligned} e&::= \dots \mid a[e] \mid \texttt {len}(a) \mid e~\oplus ~e \mid e~\bowtie ~e&{\text {(expressions)}} \\ s&::= \dots \mid a[e] = e&{\text {(statements)}} \end{aligned}$$

where \(\oplus \) and \(\bowtie \) respectively range over arithmetic and boolean comparison operators, a ranges over array variables, and \(\texttt {len}(a)\) denotes the length of a.

Piecewise Array Abstraction. The array abstraction [16] divides an array into consecutive segments, each segment being a uniform abstraction of the array content in that segment. The bounds of the segments are specified by sets of side-effect free expressions restricted to a canonical normal form, all having the same (concrete) value. The abstraction is parametric in the choice of the abstract domains used to manipulate sets of expressions and to represent the array content within each segment. For our analysis, we use the octagon abstract domain [31] for the expressions, and the \(\textsc {usage}\) lattice presented in the previous section (cf. Fig. 5) for the segments. Thus, an array a is abstracted, for instance, as \(\left\{ 0, i\right\} N\left\{ j + 1\right\} ?~U\left\{ \texttt {len}(a)\right\} \), where the symbol ? indicates that the segment \(\left\{ 0, i\right\} N\left\{ j + 1\right\} \) might be empty. The abstraction indicates that all array elements (if any) from index i (which is equal to zero) to index j (the bound \(j + 1\) is exclusive) are unused, and all elements from \(j + 1\) to \(\texttt {len}(a) - 1\) may be used. Let \(\mathrm {A} \) be the set of all such array abstractions. The initial segmentation of an array \(a \in \mathrm {A} \) is a single segment with unused content (i.e., \(\left\{ 0\right\} N\left\{ \texttt {len}(a)\right\} ?\)).

For our analysis, we augment the array abstraction with new backward assignment and filter operators. The operators \(\textsc {assign} _\mathrm {A} [\![{a[i] = e} ]\!]\) and \(\textsc {filter} _\mathrm {A} [\![{e} ]\!]\) split and fill segments to take into account assignments and accesses to array elements that influence the program outcome. For instance, an assignment to a[i] with an expression containing a used variable modifies the segmentation \(\left\{ 0\right\} N\left\{ \texttt {len}(a)\right\} ?\) into \(\left\{ 0\right\} N\left\{ i\right\} ?~U\left\{ i+1\right\} N\left\{ \texttt {len}(a)\right\} ?\), which indicates that the array element at index i is used by the program. An access a[i] in a boolean condition guarding a statement that uses or modifies another used variables is handled analogously. Instead, the operator \(\textsc {assign} _\mathrm {A} [\![{x = e} ]\!]\) modifies the segmentation of an array by replacing each occurrence of the assigned variable x with the canonical normal form of the expression e. For instance, an assignment \(i = i + 1\) modifies the segmentation \(\left\{ 0\right\} N\left\{ i\right\} ?~U\left\{ i+1\right\} N\left\{ \texttt {len}(a)\right\} ?\) into \(\left\{ 0\right\} N\left\{ i+1\right\} ?~U\left\{ i+2\right\} N\left\{ \texttt {len}(a)\right\} ?\). If e cannot be precisely put into a canonical normal form, the operator replaces the assigned variable with an approximation of e as an integer interval [13] computed using the underlying numerical domain, and possibly merges segments together as a result of the approximation. For instance, a non-linear assignment \(i = i * j\) approximated as \(i = [0,1]\) modifies the segmentation \(\left\{ 0\right\} N\left\{ i\right\} ?~U\left\{ i+1\right\} N\left\{ \texttt {len}(a)\right\} ?\) into \(\left\{ 0\right\} U\left\{ 2\right\} N\left\{ \texttt {len}(a)\right\} ?\), which loses the information that the initial segment of the array is unused.

When merging control flows, segmentations are compared or joined by means of a unification algorithm [16], which finds the coarsest common refinement of both segmentations. Then, the comparison or the join \(\sqcup _\mathrm {A} \) is performed pointwise for each segment using the corresponding operators of the underlying abstract domain chosen to abstract the array content. For our analysis, we adapt and refine the originally proposed unification algorithm to take into account the knowledge of the numerical domain chosen to abstract the segment bounds. We refer to [36] for further details. A widening \(\triangledown _\mathrm {A} \) limits the number of segments to enforce termination of the analysis.

Piecewise Data Usage Analysis. We can now map each scalar variable to an element of the \(\textsc {usage}\) lattice and each array variable to an array segmentation in \(\mathrm {A} \), and use the data usage analysis \(\varLambda _\mathrm {Q} \) presented in the previous section to identify unused input data stored in variables and portions of arrays.

Fig. 7.
figure 7

Another program to check if a student has passed a number of exams based on their grades stored in the array \(\texttt {grades}\). The programmer has made a mistake at line 2 that causes the program to ignore the grade stored at index 0 in \(\texttt {grades}\).

Fig. 8.
figure 8

Data usage analysis of the loop statement of the program shown in Example 11. Stack elements are separated by \(\mid \) and, for brevity, only array variables are shown.

Example 11

Let us consider the program shown in Fig. 7 where the array variable \(\texttt {grades}\) and the variable \(\texttt {passing}\) are the input and output variables, respectively. The initial stack contains a single map in which \(\texttt {passing}\) maps to \(U\), all other scalar variables map to \(N\), and \(\texttt {grades}\) maps to \(\left\{ 0\right\} N\left\{ \texttt {len}(\texttt {grades})\right\} ?\), indicating that all elements of the array (if any) are unused.

At line 6, the assignment modifies the variable \(\texttt {passing}\) (i.e., \(\texttt {passing}\) now maps to \(W\)) and uses the variable \(\texttt {failed}\) (i.e., \(\texttt {failed}\) now maps to \(U\)), while every other variable remains unchanged.

The result of the analysis of the loop statement at line 3 is shown in Fig. 8. The analysis of the loop begins by pushing (cf. Eq. 21) a map onto the stack in which \(\texttt {passing}\) becomes unused (i.e., maps to \(N\)) and \(\texttt {failed}\) is used only in a lower nesting level (i.e., maps to \(B\)), and every other variable still remains unchanged. At the first iteration of the analysis of the loop body, the assignment at line 4 uses \(\texttt {failed}\) and thus the access \(\texttt {grades}[\texttt {i}]\) at line 3 creates a used segment in the segmentation for \(\texttt {grades}\), which becomes \(\left\{ 0\right\} N\left\{ \texttt {i}\right\} ?~U\left\{ \texttt {i}+1\right\} ~N\left\{ \texttt {len}(\texttt {grades})\right\} ?\). At the second iteration, the \(\textsc {push} \) operator turns the used segment \(\left\{ \texttt {i}\right\} U\left\{ \texttt {i}+1\right\} \) into \(\left\{ \texttt {i}\right\} B\left\{ \texttt {i}+1\right\} \), and the assignment to \(\texttt {i}\) modifies the segment into \(\left\{ \texttt {i}+1\right\} B\left\{ \texttt {i}+2\right\} \) (while the segmentation in the second stack element becomes \(\left\{ 0\right\} N\left\{ \texttt {i}+1\right\} ?~U\left\{ \texttt {i}+2\right\} ~N\left\{ \texttt {len}(\texttt {grades})\right\} ?\)). Then, the access to the array at line 3 creates again a used segment \(\left\{ \texttt {i}\right\} U\left\{ \texttt {i}+1\right\} \) (in the first segmentation) and the analysis continues with the result of the \(\textsc {pop} \) operator (cf. Eq. 22): \(\left\{ 0\right\} N\left\{ \texttt {i}\right\} ?~U\left\{ \texttt {i}+1\right\} ?~U\left\{ \texttt {i}+2\right\} ?~N\left\{ \texttt {len}(\texttt {grades})\right\} ?\). After widening, the last two segments are merged into a single segment, and the analysis of the loop terminates with \(\left\{ 0\right\} N\left\{ \texttt {i}\right\} ?~U\left\{ \texttt {i}+1\right\} ?~U\left\{ \texttt {len}(\texttt {grades})\right\} ?\).

Finally, the analysis of the assignment at line 2 produces the segmentation \(\left\{ 0\right\} N\left\{ 1\right\} ?~U\left\{ 2\right\} ?~U\left\{ \texttt {len}(\texttt {grades})\right\} ?\), which correctly indicates that the first element of the array \(\texttt {grades}\) (if any) is unused by the program.    \(\blacksquare \)

Implementation. The analyses presented in this and in the previous section are implemented in the prototype static analyzer lyra and are available onlineFootnote 3.

The implementation is in python and, at the time of writing, accepts programs written in a limited subset of python without user-defined classes. A type inference is run before the analysis of a program. The analysis is performed backwards on the control flow graph of the program with a standard worklist algorithm [32], using widening at loop heads to enforce termination.

12 Related Work

The most directly relevant work has been discussed throughout the paper. The non-interference analysis proposed by Assaf et al. [6] (cf. Sect. 8) is similar to the logic of Amtoft and Banerjee [5] and the type system of Hunt and Sands [25]. The data usage analysis proposed in Sect. 10 is similar to dependency analyses used for program slicing [37] (e.g., [24]). Both analyses as well as strongly live variable analysis (cf. Sect. 9) are based on the syntactic presence of a variable in the definition of another variable. To overcome this limitation, one should look further for semantic dependencies between values of program variables. In this direction, Giacobazzi, Mastroeni, and others [19, 22, 29] have proposed the notion of abstract dependency. However, note that an analysis based on abstract dependencies would over-approximate the subset of the input variables that are unused by a program. Indeed, the absence of an abstract dependency between variables (e.g., a dependency between the parity of the variables [19, 29]) does not imply the absence of a (concrete) dependency between the variables (i.e., a dependency between the values of the variables). Thus, such an analysis could not be used to prove that a program does not use a subset of its input variables, but would be used to prove that a program uses a subset of its input variables.

Semantics formulations using sets of sets of traces have already been proposed in the literature [6, 28]. Mastroeni and Pasqua [28] lift the hierarchy of semantics developed by Cousot [12] to sets of sets of traces to obtain a hierarchy of semantics suitable for verifying general program properties (i.e., properties that are not subset-closed, cf. Sect. 7). However, none of the semantics that they proposed is suitable for input data usage: all semantics in the hierarchy are abstractions of a semantics that contains sets with both finite and infinite traces and thus, unlike our outcome semantics (cf. Sect. 5), cannot be used to reason about terminating and non-terminating outcomes of a program. Similarly, as observed in [28], the semantics proposed by Assaf et al. [6] can be used to verify only subset-closed properties. Thus, it cannot be used for input data usage.

Finally, to the best of our knowledge, our work is the first to aim at detecting programming errors in data science code using static analysis. Closely related are [7, 10] which, however, focus on spreadsheet applications and target errors in the data rather than the code that analyzes it. Recent work [2] proposes an approach to repair bias in data science code. We believe that our work can be applied in this context to prove absence of bias, e.g., by showing that a program does not use gender information to decide whether to hire a person.

13 Conclusion and Future Work

In this paper, we have proposed an abstract interpretation framework to automatically detect input data that remains unused by a program. Additionally, we have shown that existing static analyses based on dependencies are subsumed by our unifying framework and can be used, with varying degrees of precision, for proving that a program does not use some of its input data. Finally, we have proposed a data usage analysis for more realistic data science applications that store input data in compound data structures such as arrays or lists.

As part of our future work, we plan to use our framework to guide the design of new, more precise static analyses for data usage. We also want to explore the complementary direction of proving that a program uses its input data by developing an analysis based on abstract dependencies [19, 22, 29] between program variables, as discussed above. Additionally, we plan to investigate other applications of our work such as provenance or lineage analysis [9] as well as proving absence of algorithmic bias [2]. Finally, we want to study other programming errors related to data usage such as accidental data duplication.