JUNIPR: a framework for unsupervised machine learning in particle physics
Abstract
In applications of machine learning to particle physics, a persistent challenge is how to go beyond discrimination to learn about the underlying physics. To this end, a powerful tool would be a framework for unsupervised learning, where the machine learns the intricate high-dimensional contours of the data upon which it is trained, without reference to pre-established labels. In order to approach such a complex task, an unsupervised network must be structured intelligently, based on a qualitative understanding of the data. In this paper, we scaffold the neural network’s architecture around a leading-order model of the physics underlying the data. In addition to making unsupervised learning tractable, this design actually alleviates existing tensions between performance and interpretability. We call the framework Junipr: “Jets from UNsupervised Interpretable PRobabilistic models”. In this approach, the set of particle momenta composing a jet is clustered into a binary tree that the neural network examines sequentially. Training is unsupervised and unrestricted: the network could decide that the data bears little correspondence to the chosen tree structure. However, when there is a correspondence, the network’s output along the tree has a direct physical interpretation. Junipr models can perform discrimination tasks, through the statistically optimal likelihood-ratio test, and they permit visualizations of discrimination power at each branching in a jet’s tree. Additionally, Junipr models provide a probability distribution from which events can be drawn, providing a data-driven Monte Carlo generator. As a third application, Junipr models can reweight events from one (e.g. simulated) data set to agree with distributions from another (e.g. experimental) data set.
1 Introduction
Machine learning models based on deep neural networks have revolutionized information processing over the last decade. Such models can recognize objects in images [1, 2, 3], perform language translation [4, 5], transcribe spoken language [6], and even speak written text [7] at approaching human level. The truly revolutionary aspect of this progress is the generality of deep neural networks: a broad diversity of network architectures can be created from basic building blocks that allow for efficient calculation of gradients via backpropagation, and thus efficient optimization through stochastic gradient descent [8]. These methods are arbitrarily expressive and can model extremely high-dimensional data.
The architecture of a neural network should be designed to process information efficiently, from the input data all the way through to the network’s final output. Indeed, it empirically seems to be the case that networks that process information evenly layer-by-layer perform very well. One example of this empirical result is that deep convolutional networks for image processing seem to perform sequentially more abstract operations as a function of depth [1]. Similarly, recurrent networks perform well on time series data, as their recurrent layers naturally describe step-by-step evolution in time [9].
The power and generality of deep neural networks has been leveraged across the sciences, and in particular in particle physics. The simplest architecture explored has been the fully-connected network, which has successfully been applied in a wide variety of contexts, such as in identifying and splitting clusters from multiple particles in the pixel detector [10], in b-tagging [11], and in \(\tau \)-identification [12]. In these basic applications, the neural network optimizes its use of some finite number of relevant physical observables for the task at hand.^{1} One drawback of such an approach is that the neural network is limited by the observables it is given. In fact, for these applications, other multivariate methods such as boosted decision trees often have comparable performance using the same inputs, but train faster and can be less sensitive to noise [17, 18].
As an alternative to feeding a neural network a set of motivated observables, one can feed it raw information. By doing so, one allows the network to take advantage of useful features that physicists have yet to discover. One way of preprocessing the raw data in a fairly unbiased way is through the use of jet images, which contain as pixel intensities the energy deposited by jet constituents in calorimeter cells [19]. Jet images invite the use of techniques from image recognition to discriminate jets of different origins. In [19], the pixel intensities in the two-dimensional jet image were combined into a vector, and a Fisher linear discriminant was then used to find a plane in the high-dimensional space that maximally separates two different jet classes. Treating a two-dimensional jet image as an unstructured collection of pixel intensities, however, ignores the spatial locality of the problem, i.e. that neighboring pixels should have related intensities. Convolutional neural networks (CNNs), which boast reduced complexity by leveraging this spatially local structure, have since been adopted instead, and they generally outperform fully-connected networks due to their efficient feature detection. In the first applications of CNNs to jet images, on boosted W detection [20] and quark/gluon discrimination [21], it was indeed found that simple CNNs could generally outperform previous techniques. Since then, a number of studies have aimed to optimize various discrimination tasks using CNNs [22, 23, 24, 25, 26, 27].
While the two-dimensional detector image acts as a natural representation of a jet, especially from an experimental standpoint, the 4-momenta of individual jet constituents provide a more fundamental representation for the input to a neural network. One complication in transitioning from the jet image to its list of momenta is that, while the image is a fixed-size representation, the list of momenta will have different sizes for different jets. To avoid this problem, one could truncate the list of momenta in the jet to a fixed size, and zero-pad jets smaller than this size [28]. Alternatively, there are network architectures, namely recursive (RecNNs) and recurrent neural networks (RNNs), that handle variable length inputs naturally. With such methods, one also has the freedom to choose the order in which constituent momenta are fed into the network. In [29], a RecNN was used to build a fixed-size representation of the jet, and the authors explored various ways of ordering the momenta as input to the network: by jet clustering algorithms, by transverse momentum, and randomly. The resulting representation of the jet was then fed to a fully-connected neural network for boosted W tagging. RecNNs and RNNs have also been used in similar ways for quark/gluon discrimination [30], top tagging [31], and jet charge [32]. See also [33, 34] for jet flavor classification using tracks.
To date, the majority of applications of machine learning to particle physics employ supervised machine learning techniques. Supervised learning is the optimization of a model to map input to output based on labeled input-output pairs in the training data. These training examples are typically simulated by Monte Carlo generators, in which case the labels come from the underlying physical processes being generated. Most of the classification studies mentioned above employ this style of supervised learning, and similar techniques have also been utilized for regression tasks such as pileup subtraction [22]. Alternatively, training data can be organized in mixed samples, each containing different proportions of the different underlying processes. In this case, labels correspond to the mixed samples, and learning is referred to as weakly supervised. While full and weak supervision are very similar as computational techniques, the distinction is exceptionally important in particle physics, where the underlying physical processes are unobservable in real collider data. Early studies of weakly supervised learning in particle physics show very promising results: performance comparable to fully supervised methods was found both with low-dimensional inputs [35, 36] (a few physical observables) and with very high-dimensional inputs [37] (jet images).
With supervised learning, there is a notion of absolute accuracy: since every training example is labeled with the desired output, the network predicts this output either correctly or incorrectly. This is in contrast to unsupervised learning, where the machine learns underlying structure that is unlabeled in the training data. Without outputlabeled training examples, there is no notion of absolute accuracy. Several recent studies have employed unsupervised learning techniques in particle physics. In [38], borrowing concepts from topic modelling in text documents, the authors extract observable distributions of underlying quark and gluon jets from two mixed samples. In [39, 40, 41], generative adversarial networks (GANs) are used to efficiently generate realistic jet images and calorimeter showers.
In this work, we explore another approach to unsupervised machine learning in particle physics, in which a deep neural network learns to compute the relative differential cross section of each data point under consideration, or equivalently, the probability distribution generating the data. The power of having access to the probability distribution underlying the data should not be underestimated. For example, likelihood ratios would provide optimal discriminants [42], and sampling from the probability distribution would provide completely data-driven simulations.
In this paper, we introduce a framework named Junipr: “Jets from UNsupervised Interpretable PRobabilistic models”. We also present a basic implementation of this framework using a deep neural network. This network directly computes the general probability distribution underlying particle collider data using unsupervised learning.
The task of learning the probability distribution underlying collider data comes with challenges due to the complexity of the data. Some past studies have aimed to process collider information efficiently by using neural network architectures inspired by physics techniques already in use [29, 30, 31, 32, 33, 43]. In this paper, we take this idea one step further. We scaffold the neural network architecture around a leading-order description of the physics underlying the data, from first input all the way to final output. Specifically, we base the Junipr framework on algorithmic jet clustering trees. The tree structure is used, both in processing input information, and in decomposing the network’s output. In particular, Junipr’s output is organized into meaningful probabilities attached to individual nodes in a jet’s clustering tree. In addition to reducing the complexity and increasing the efficiency of the corresponding neural network, this approach also forces the machine to speak a language familiar to physicists, thus enabling its users to interpret the underlying physics it has learned. Indeed, one common downside associated with machine learning techniques in physics is that, though they provide powerful methods to accomplish the tasks learned in training, they do little to clarify the underlying physics that underpins their success. Our approach minimizes this downside.
Let us elaborate on the tree-based architecture used for Junipr’s implementation. In particle physics, events at colliders are dominated by the production of collimated collections of particles known as jets. The origin of jets and many of their properties can be understood through the fundamental theory of strong interactions, quantum chromodynamics (QCD). One insight from QCD is that jets have an inherently fractal structure, inherited from the approximate scale invariance of the fundamental theory. The fractal structure is made precise through the notion of factorization, which states that the dynamics in QCD stratify according to soft, collinear, and hard physics [44, 45, 46, 47, 48], with each sector being separately scale invariant. To capture this structure efficiently in Junipr, we use a kind of factorized architecture, with a dense network to describe local branchings (well-suited for collinear factorization), and a global RNN superstructure general enough to encode soft coherence and any factorization-violating effects.
One might naively expect this setup to require knowledge of the sequence of splittings that created the jet. Although there is a sequence of splittings in parton-shower simulations, the splittings are only a semiclassical approximation used to model the intensely complex and essentially incalculable distribution of final state particles. Real data is not labelled with any such sequence. In fact, there are many possible sequences which could produce the same event, and the cross section for the event is given by the square of the quantum mechanical sum of all such amplitudes, including effects of virtual particles. A proxy for this fictitious splitting history is a clustering history that can be constructed in a deterministic way using a jet-clustering algorithm, such as the \(k_t\) algorithm [49, 50] or the Cambridge/Aachen (C/A) algorithm [51, 52]. There is no correct algorithm: each is just a different way to process the momenta in an event. Indeed, there seems to be useful information in the multiple different ways that the same event can be clustered [53, 54, 55]. Any of these algorithms, or any algorithm at all that encodes the momenta of an event into a binary tree, can be used to scaffold a neural network in the Junipr approach.
For practical purposes, Junipr is implemented with respect to a fixed jet clustering algorithm. Without a fixed algorithm, the probability of the final-state particles constructed through \(1\rightarrow 2\) branchings would require marginalization over all possible clustering histories – an extremely onerous computational task. In principle, fixing the algorithm used to implement Junipr should be inconsequential for its output, namely the probability distribution over final-state momenta, as these momenta are independent of clustering algorithm. To reiterate, the Junipr approach does not require the chosen clustering algorithm to agree with the underlying data-generation process; this is demonstrated in Sects. 5.2 and 5.3 below. On the other hand, the sequence of probabilities assigned to each branching in a clustering tree certainly depends on the algorithm used to define the tree. For example, the same final probability \(P=10^{-22}\) could be reached with one clustering algorithm through the sequence \(P = 10^{-5}\cdot 10^{-6}\cdot 10^{-8} \cdot 10^{-3}\), or with another algorithm through \(P= 10^{-15}\cdot 10^{-2}\cdot 10^{-1} \cdot 10^{-4}\). The key idea is that, if an algorithm is chosen which does correspond to a semiclassical parton shower, the resulting sequence of probabilities may be understandable. This provides avenues for users to interpret what physics the machine learns, and we expect that dissecting Junipr will be useful in such cases. We will demonstrate this throughout the paper.
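The algorithm-independence of the total probability can be illustrated with the numbers quoted above: the per-branching factors differ between clustering algorithms, but their product does not. A minimal Python sketch, using the illustrative probabilities from the text (in practice one sums log-probabilities for numerical stability rather than multiplying tiny numbers):

```python
import math

# Hypothetical per-branching probabilities assigned by two different
# clustering algorithms to the same final state (the illustrative
# numbers from the text, not real model output).
algo_A = [1e-5, 1e-6, 1e-8, 1e-3]
algo_B = [1e-15, 1e-2, 1e-1, 1e-4]

def total_probability(branch_probs):
    """Multiply the per-node probabilities along the clustering tree,
    working in log space for numerical stability."""
    return math.exp(sum(math.log(p) for p in branch_probs))

# Both algorithms assign the same total probability to the jet,
# since the final-state momenta are algorithm-independent:
# total_probability(algo_A) == total_probability(algo_B) == 1e-22
```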
It is worth emphasizing one fundamental aspect of our approach for clarity. The Junipr framework yields a probabilistic model, not a generative model. The probabilistic model allows us to directly compute the probability density of an individual jet, as defined by its set of constituent particle momenta. To be precise, this is the probability density for those particular momenta to arise in an event, conditioned on the event selection criteria used to select the training data. As a complementary example of this, shower deconstruction [56, 57] provides a theory-driven approach to probabilistic modeling in particle physics, in which probabilities are calculated using QCD rather than a neural network. In contrast, a generative model would output an example jet, taking random noise as input to seed the generation process. Given a distribution of input seeds, the jets output from a generative model should follow the same distribution as the training data. While this means that the probability distribution underlying the data is internally encoded in a generative model, this underlying distribution is hidden from the user. Examples of generative models in particle physics include Monte Carlo event generators and, more recently, GANs used to generate jet images and detector simulations [39, 40, 41].
The direct access to the probability distribution that is enabled by a probabilistic model comes with several advantages. If two different probabilistic models are trained on two different samples of jets, they can be used to compute likelihood ratios that distinguish between the two samples. Likelihood ratios provide theoretically optimal discriminants [42], which is indeed a major motivation for Junipr’s probabilistic approach. One can also sample from a probabilistic model in order to generate events, though generative models are better-suited for this application [39, 40, 41]. In addition, one can use a probabilistic model to reweight events generated by an imperfect simulator, so that the reweighted events properly agree with data.
In this paper, as a proof-of-concept, we use simulated \(e^+e^-\) data to train a basic implementation of the Junipr framework described above. We have not yet attempted to optimize all of this implementation’s hyperparameters; however, we do find that a very simple architecture with no fine tuning is adequate. This is confirmed by its impressive discrimination power and its effective predictivity for a broad class of observables, but more rigorous testing is needed to determine whether this approach can provide state-of-the-art results on the most pressing physics problems.
The general probabilistic model, its motivation, and a specific neural network implementation of it are discussed in Sect. 2. A comprehensive discussion of training the model, including the data used and potential subtleties in extending the model are covered in Sect. 3. Results on discrimination, generation, and reweighting are presented in Sect. 4. We provide robustness tests and some conceptually interesting results related to factorization in Sect. 5, including the counterintuitive anti-\(k_t\) shower generator. There are many ways to generalize our approach, as well as many applications that we do not fully explore in this work. We leave a discussion of some of these possible extensions to Sect. 6, where we conclude.
2 Unsupervised learning in jet physics
To establish the framework clearly and generally, Sect. 2.1 begins by describing Junipr as a general probabilistic model, independent of the specific parametric form taken by the various functions it involves. From this perspective, such a probabilistic model could be implemented in many different ways. Section 2.2 then describes the particular neural network implementation of Junipr used in this paper, which has a simple but QCD-customized architecture and minimal hyperparameter tuning.
2.1 General probabilistic model
An unstructured model of the above form would ignore the fact that we know jet evolution is well-described by a semiclassical sequence of \(1\rightarrow 2\) splittings, due to factorization theorems [44, 45, 46, 47, 48]. A model that ignores factorization would be much more opaque to interpretation, and have many more parameters than needed due to its unnecessary neutrality. Thus, we propose a model that describes a given configuration of final-state momenta using sequential \(1\rightarrow 2\) splittings. Such a sequence is defined by a jet clustering algorithm, which assigns a clustering tree to any set of final-state momenta, so that a sequential decomposition of the probability distribution can be performed without loss of generality. We imagine fixing a specific algorithm to define the trees, so that there is no need to marginalize over all possible trees in computing a probability, a computation that would be intractable. While a deterministic clustering algorithm cannot directly describe the underlying quantum-mechanical parton evolution, that is not the goal for this model. With the algorithm set, the model as shown in Fig. 1 becomes that shown in Fig. 2.
We will now formalize this discussion into explicit equations. For the rest of this section we assume that the clustering tree is determined by a fixed jet algorithm (e.g. any of the generalized \(k_t\) algorithms [58, 59]). The particular algorithm chosen is theoretically inconsequential to the model, as the same probability distribution over final states will be learned for any choice. Practically speaking, however, certain algorithms may have advantages over others. We will discuss the choice of clustering algorithm further in Sects. 5.2 and 5.3.

the “initial state” consists of a single momentum: \(k^{(1)}_1 = p_1 + \cdots + p_n\);

at subsequent steps \(\{k^{(t)}_1,\ldots , k^{(t)}_{t}\}\) is obtained from \(\{k^{(t-1)}_1,\ldots , k^{(t-1)}_{t-1}\}\) by a single momentum-conserving \(1 \rightarrow 2\) branching;

after the final branching, the state is the physical jet: \(\{k^{(n)}_1,\ldots , k^{(n)}_{n}\} = \{p_1,\ldots , p_n\}\).

\(P_\text {end}\big (0\,\big |\, h^{(t)}\big )\): probability over binary values for whether or not the tree ends;

\(P_\text {mother}\big (m^{(t)}\,\big |\, h^{(t)}\big )\): probability over \(m\in \{1,\ldots , t\}\) indexing candidate mother momenta;

\(P_\text {branch}\big (k_{d_1}^{(t+1)}, k_{d_2}^{(t+1)} \,\big |\, k_m^{(t)}, h^{(t)}\big )\): probability over possible \(k_m \rightarrow k_{d_1}, k_{d_2}\) branchings.
With these choices, we force the hidden representation \(h^{(t)}\) to encode all global information about the tree, since it must predict whether the tree ends, which momentum branches next, and the branching pattern. In fact, providing \(P_\text {branch}\) with the momenta that directly participate in the \(1\rightarrow 2\) branching means that \(h^{(t)}\) only needs to encode global information. In Sect. 5.1 we show that the global structure stored in \(h^{(t)}\) is crucial for the model to predict the correct branching patterns.
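The decomposition into \(P_\text {end}\), \(P_\text {mother}\), and \(P_\text {branch}\) amounts to a walk along the clustering tree, multiplying one factor of each kind per branching and a final end-probability. The following Python sketch is illustrative only: the callables and the `tree` format are hypothetical interfaces, not the paper's implementation.

```python
import math

def jet_log_probability(tree, p_end, p_mother, p_branch, init_hidden, update):
    """Evaluate log P(jet) by walking the clustering tree sequentially.
    `tree` is a list of (mother_index, daughters) branchings; the four
    callables are caller-supplied stand-ins for the model components."""
    h = init_hidden
    logp = 0.0
    for mother_idx, daughters in tree:
        logp += math.log(p_end(False, h))             # the tree continues
        logp += math.log(p_mother(mother_idx, h))     # which momentum branches
        logp += math.log(p_branch(daughters, mother_idx, h))  # how it branches
        h = update(h, daughters)                      # advance the hidden state
    logp += math.log(p_end(True, h))                  # the tree ends
    return logp

# Toy usage with constant probabilities (purely illustrative):
toy_tree = [(0, ("d1", "d2")), (1, ("d3", "d4"))]
lp = jet_log_probability(
    toy_tree,
    p_end=lambda ends, h: 0.2 if ends else 0.8,
    p_mother=lambda m, h: 0.5,
    p_branch=lambda d, m, h: 0.25,
    init_hidden=None,
    update=lambda h, d: h,
)
# exp(lp) = (0.8 * 0.5 * 0.25)**2 * 0.2 = 0.002
```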
2.2 Neural network implementation
For a neural network based implementation of the model defined by Eqs. (2.2) and (2.3), we use an RNN with hidden state \(h^{(t)}\) augmented by dense neural networks for each of the three probability distributions in Eq. (2.3). The recurrent structure of this implementation is shown in Fig. 3, which emphasizes how the RNN’s hidden representation \(h^{(t)}\) keeps track of the global state of the jet, by sequentially reading in the momenta that branched most recently.
For \(P_\text {end}\) we use a fully-connected network with \(h^{(t)}\) as input, a single hidden layer of size 100 with ReLU activation, and a sigmoid output layer. We use the same setup for \(P_\text {mother}\), the only difference being that the output layer is a softmax over the t candidate mother momenta, ordered by energy. These choices are generic and not highly tuned. We found that Junipr works well for a very general set of architectures and sizes, so we stick with this simple setup.
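A minimal numpy sketch of these two output heads, assuming a hidden layer of 100 ReLU units as described. The weights are random stand-ins for learned parameters, and, purely for brevity here, the two heads share one example hidden layer (in the actual implementation each head is a separate network):

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 100  # size of the RNN hidden state h^(t)

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Illustrative randomly initialized weights; in JUNIPR these are learned.
W1 = rng.normal(scale=0.1, size=(100, HIDDEN))
b1 = np.zeros(100)
w_end = rng.normal(scale=0.1, size=100)

def p_end(h):
    """P_end(end | h): one hidden layer of 100 ReLU units, sigmoid output."""
    return sigmoid(w_end @ relu(W1 @ h + b1))

def p_mother(h, t, W_out):
    """P_mother(m | h): same hidden-layer shape, softmax over the t
    candidate mothers (one output row per candidate, ordered by energy)."""
    return softmax(W_out[:t] @ relu(W1 @ h + b1))

h = rng.normal(size=HIDDEN)
W_out = rng.normal(scale=0.1, size=(30, 100))  # room for up to 30 candidates
end_prob = p_end(h)                 # a single number in (0, 1)
mother_probs = p_mother(h, 5, W_out)  # normalized over the 5 candidates
```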
There are two separate approaches one could take to model the branching function \(P_\text {branch}\). Firstly, the variables x could be treated as discrete, with \(P_\text {branch}\) outputting a softmax probability over discrete cells representing different x values. Secondly, one could treat x as a continuous variable and use an “energy model” of the form \( P_\text {branch} \sim {e^{-E(x)} / Z}\,, \) where Z is a normalizing partition function. In this work we predominantly adopt the former approach, as it is much faster, and most distributions are insensitive to the discretization of x. However, we do train an energy model to show that models with continuous x are possible, which we discuss in Sect. 3.4.
In the discrete case, we bin the possible values of x into a four-dimensional grid with ten bins per dimension, so that the entire grid has \(10^4\) cells. For a given value of x, we place a 1 in the bin corresponding to that value, and we place 0’s everywhere else. This one-hot encoding of the possible values of x allows us to use a softmax function at the top layer of the neural network describing \(P_\text {branch}\) (see Fig. 4). Furthermore, we use a dense network with a single hidden layer of size 100 and ReLU activation for \(P_\text {branch}\), just as we did for \(P_\text {end}\) and \(P_\text {mother}\). The hidden units in this network receive \(h^{(t)}\) as input, as well as the mother momentum \(k_m^{(t)}\).
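The discretization can be sketched as mapping a branching \(x=(z,\theta ,\phi ,\delta )\) to a cell index in the flattened \(10^4\)-dimensional softmax target. This sketch uses linear binning with made-up coordinate ranges; the paper actually bins linearly in the transformed coordinates of Eq. (3.2):

```python
import numpy as np

N_BINS = 10  # bins per dimension, giving 10**4 cells in total

def cell_index(x, lows, highs, n_bins=N_BINS):
    """Map a branching x = (z, theta, phi, delta) to its flat cell index.
    `lows`/`highs` are assumed per-dimension ranges (hypothetical here);
    the paper bins in transformed logarithmic coordinates instead."""
    x = np.asarray(x, dtype=float)
    frac = (x - lows) / (highs - lows)
    bins = np.clip((frac * n_bins).astype(int), 0, n_bins - 1)
    # Flatten the 4D bin index (row-major) into one integer in [0, 10^4)
    return int(np.ravel_multi_index(bins, (n_bins,) * 4))

def one_hot(idx, size=N_BINS**4):
    """Softmax training target: a 1 in the cell containing x, 0 elsewhere."""
    v = np.zeros(size)
    v[idx] = 1.0
    return v

lows = np.array([0.0, 0.0, 0.0, 0.0])
highs = np.array([0.5, np.pi, 2 * np.pi, np.pi])  # illustrative ranges only
idx = cell_index([0.1, 0.3, 1.0, 0.2], lows, highs)
target = one_hot(idx)
```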
Thus we have a neural network implementation of Eqs. (2.2) and (2.3), with a representation of the evolving global jet state stored in \(h^{(t)}\), and with fully-connected networks describing \(P_\text {end}\), \(P_\text {mother}\), and \(P_\text {branch}\). As defined above, the model has a single matrix with \(10^6\) parameters, mapping the branching function’s 100-dimensional hidden layer to its \(10^4\)-dimensional output layer, and has \(6\times 10^4\) parameters elsewhere. For future studies, we recommend starting with the same architecture, with all hidden layers having 100 nodes each, and then varying the hyperparameters to optimize for the task at hand. One might refer to our implementation as Junipr\(_{\emptyset }\), as one can imagine many alternative implementations within the Junipr framework that may prove useful in future applications. We will continue to use the term Junipr for brevity, to refer both to the framework and to the basic implementation described here.
3 Training and validation
We now describe how to train the model outlined in Sect. 2.2. We begin by discussing the training data used, followed by our general approach to training and validation. Finally we discuss an alternative model choice that allows higher resolution on the particle momenta.
3.1 Training data
To enable proof-of-concept demonstrations of Junipr’s various applications, we train the implementation described in Sect. 2.2 using jets simulated in Pythia v8.226 [63, 64] and clustered using FastJet v3.2.2 [59]. We simulated 600k hemisphere jets in Pythia using the process \(e^+e^- \rightarrow q\bar{q}\) at a center-of-mass energy of 1 TeV, with hemispheres defined in FastJet using the exclusive \(k_t\) algorithm [49, 50], and with an energy window of 450–550 GeV imposed on the jets. To create the deterministic trees that Junipr requires, we reclustered the jets using the C/A clustering algorithm [51, 52], with \(E_\text {sub}=1\) GeV and \(R_\text {sub}=0.1\). The nonzero values of \(E_\text {sub}\) and \(R_\text {sub}\) make the input to Junipr formally infrared-and-collinear safe, but this is by no means necessary. Furthermore, our approach is formally independent of the reclustering algorithm chosen. We demonstrate this by showing results using an absurd reclustering algorithm inspired by a 2D printer in Sect. 5.2, as well as for anti-\(k_t\) [58] reclustering in Sect. 5.3.
Thus we have 600k quark jets with \(E_\text {jet}\sim 500\) GeV and \(R_\text {jet}\sim \pi /2\). We use 500k of these jets for training, with 10k set aside as a test set to monitor overfitting, and we use the remaining validation set of 100k jets to make the plots in this paper.
In the applications of Sect. 4, we also make use of several other data sets produced according to the above specifications, with small but important changes. We list these modifications here for completeness. In one case, quark jets from \(e^+e^- \rightarrow q\bar{q}\) were required to lie in a very tight mass window of 90.7–91.7 GeV. A sample of boosted Z jets from \(e^+e^- \rightarrow ZZ\) events was also produced with the same mass cut. And finally, another sample of quark jets was produced, as detailed above, but with the value of \(\alpha _s(m_Z)\) in the final state shower changed from Pythia’s default value of 0.1365 to 0.11.
3.2 Approach to training
Schedule  5 epochs  5 epochs  5 epochs  5 epochs  5 epochs  5 epochs
Learning rate  \(10^{-2}\)  \(10^{-3}\)  \(10^{-4}\)  \(10^{-3}\)  \(10^{-4}\)  \(10^{-5}\)
Batch size  10  10  10  100  100  100
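The schedule above amounts to a simple staged training loop, sketched here with a hypothetical `train_one_epoch` callback (the learning rate is annealed, then reset when the batch size increases):

```python
# The training schedule from the table: six 5-epoch stages as
# (epochs, learning_rate, batch_size) triples.
SCHEDULE = [
    (5, 1e-2, 10),
    (5, 1e-3, 10),
    (5, 1e-4, 10),
    (5, 1e-3, 100),
    (5, 1e-4, 100),
    (5, 1e-5, 100),
]

def run_schedule(train_one_epoch):
    """Run the staged schedule; `train_one_epoch(lr, batch_size)` is a
    caller-supplied training step (hypothetical interface)."""
    for epochs, lr, batch_size in SCHEDULE:
        for _ in range(epochs):
            train_one_epoch(lr, batch_size)
```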
We wrote Junipr in Theano^{2} [65] and trained it on 16-core CPU servers using the SherlockML technical data science platform. Training Junipr on 500k jets according to the above schedule took an average of 4 days.
3.3 Validation of model components
Junipr is constructed as a probabilistic model for jet physics by expanding \(P_\text {jet}\) as a product over steps t in the jet’s clustering tree, as shown in Eq. (2.2). Each step involves three components: the probability \(P_\text {end}\) that the tree will end, the probability \(P_\text {mother}\) that a given momentum will be the next mother to branch, and the probability \(P_\text {branch}\) over the daughter momenta of the branching, as shown in Eq. (2.3). We now validate each of Junipr’s components using our validation set of 100k previously unseen Pythia jets. In this section, we present histograms of actual outcomes in the Pythia validation set (i.e. frequency distributions) as well as Junipr’s probabilistic output when evaluated on the jets in this data set (i.e. marginalized probability distributions) to check for agreement.
3.4 Increasing the branching function resolution
We begin by briefly discussing increasing the resolution of the branching function over discrete x, the case described in Sect. 2.2. The first thing to note is that with a softmax over four-dimensional x, the size of the matrix multiplication required in a dense network is quartic in the number of bins used for each dimension. We generically use ten bins for each of \(z,\theta ,\phi ,\delta \) resulting in an output size of \(10^4\). (In fact we use ten linearly spaced bins in the transformed coordinates of Eq. (3.2), and this can be seen on the logarithmic axes of Fig. 9, but this detail is not conceptually important.) Given this quartic scaling, simply increasing the number of discrete x cells quickly becomes prohibitively computationally expensive. Potential solutions to this problem include: (i) using a hierarchical softmax [66, 67], and (ii) simply interpolating between the discrete bins of the model.
In a hierarchical softmax, a low-resolution probability is predicted first, say with \(5^4\) cells, then another \(5^4\)-celled distribution is predicted inside the chosen low-resolution cell. In principle, this gives \(25^4\) resolution at only twice the computational time required for \(5^4\) resolution. We briefly implemented the hierarchical softmax, and preliminary tests found it to work efficiently, but perhaps with a decrease in training stability. We chose not to pursue the hierarchical softmax further in this work, primarily because we have not seen the need for resolution much higher than \(10^4\) discrete x cells.
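The two-level factorization \(P(\text{cell}) = P(\text{coarse})\,P(\text{fine}\mid \text{coarse})\) can be sketched as follows, in one dimension per level for brevity (illustrative interface; the efficiency gain comes from the fact that sampling or evaluating one cell needs only two small softmaxes rather than one softmax over all fine cells):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def hierarchical_probs(logits_coarse, logits_fine):
    """Two-level hierarchical softmax: P(cell) = P(coarse) * P(fine | coarse).
    `logits_fine[c]` holds the fine logits conditioned on coarse cell c
    (a hypothetical interface). Here we assemble the full distribution
    over all k*k fine cells to show it is properly normalized."""
    p_coarse = softmax(logits_coarse)
    return np.concatenate([p_coarse[c] * softmax(logits_fine[c])
                           for c in range(len(p_coarse))])

rng = np.random.default_rng(1)
k = 5
probs = hierarchical_probs(rng.normal(size=k), rng.normal(size=(k, k)))
# `probs` is a normalized distribution over k*k = 25 fine cells
```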
Due to its ease of use, we do employ linear interpolation between the discrete bins in our baseline model with resolution \(10^4\). This comes at no extra training cost, and removes most of the effects of discretization on the observable distributions generated by sampling from Junipr; see Sect. 4.2.
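A one-dimensional sketch of such interpolation, assuming unit-normalized bin probabilities (illustrative only, not the exact scheme applied per dimension in the paper): convert each bin probability to a density and interpolate linearly between bin centers.

```python
import numpy as np

def interpolated_density(bin_probs, edges, x):
    """Smooth a discrete binned distribution by linear interpolation
    between bin centers (1D analogue; hypothetical helper, not the
    paper's exact implementation)."""
    widths = np.diff(edges)
    centers = 0.5 * (edges[:-1] + edges[1:])
    density = bin_probs / widths          # per-bin probability density
    return float(np.interp(x, centers, density))

# Toy example: three unit-width bins holding probabilities 0.2, 0.5, 0.3
edges = np.array([0.0, 1.0, 2.0, 3.0])
bin_probs = np.array([0.2, 0.5, 0.3])
# At a bin center the density equals that bin's probability / width;
# between centers it varies linearly.
```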
To close this section, we note that in most cases, we expect the discretized branching function with ten bins per dimension of x to be sufficient, especially if one performs a linear interpolation on the output cells. This simple case is certainly faster to train and does not require the technique described here to avoid biased gradient estimates.
4 Applications and results
A direct and powerful application of the Junipr framework, enabled by having access to separate probabilistic models of different data sources, is in discrimination based on likelihood ratios. We discuss discrimination in Sect. 4.1, along with a highly intuitive way of visualizing it. In contrast, an instinctive but indirect use of Junipr as a probabilistic model is in sampling new jets from it. We discuss the observable distributions generated through sampling in Sect. 4.2. However, sampling from a probabilistic model is often inefficient (e.g. slower than Pythia) compared to evaluating probabilities of jets directly. In Sect. 4.3 we discuss reweighting samples from one simulator to match the distributions of another. In principle, this could be used to tweak Pythia samples to match observed collider data simply by reweighting.
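The reweighting application reduces to importance weights \(w = P_\text{target}/P_\text{source}\) computed from two trained probabilistic models. A minimal sketch with hypothetical model callables returning log-probabilities:

```python
import math

def reweight(jets, logp_source, logp_target):
    """Assign each source-model jet the weight P_target / P_source, so
    that weighted histograms of source samples reproduce target-model
    distributions. `logp_*` stand in for trained probabilistic models
    (hypothetical callables)."""
    return [math.exp(logp_target(j) - logp_source(j)) for j in jets]

# Toy example: importance weights between two coin-flip "models"
logp_src = lambda j: math.log(0.5)
logp_tgt = lambda j: math.log(0.8 if j == 0 else 0.2)
weights = reweight([0, 1], logp_src, logp_tgt)   # [0.8/0.5, 0.2/0.5]
```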
4.1 Likelihood ratio discrimination
Discrimination based on the likelihood ratio theoretically provides the most statistically powerful discriminant between two hypotheses [42]. Moreover, our setup takes into account all the momenta that define a specific type of jet. Note also that for the task of pairwise discrimination between N jet types, this unsupervised approach requires training N probabilistic models, whereas a supervised learning approach would require training \(N(N-1)/2\) classifiers. Thus, we expect likelihood-ratio discrimination using Junipr to provide a powerful tool.
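To illustrate the counting, here is a toy sketch in which `toy_log_likelihood` stands in for a trained Junipr model's log-probability (the function and its numbers are invented for illustration only):

```python
import itertools

# With one unsupervised model per jet type, any pairwise discriminant is
# a difference of log-likelihoods, so N trained models cover all
# N*(N-1)/2 pairwise discrimination tasks.
def toy_log_likelihood(jet, jet_type):
    # placeholder for log P_model(jet); a trained model would go here
    return -len(jet) * {"quark": 1.0, "gluon": 1.2, "Z": 0.8}[jet_type]

def pairwise_discriminant(jet, type_a, type_b):
    """Neyman-Pearson test statistic log[P_a(jet) / P_b(jet)]."""
    return toy_log_likelihood(jet, type_a) - toy_log_likelihood(jet, type_b)

types = ["quark", "gluon", "Z"]
pairs = list(itertools.combinations(types, 2))  # 3 pairs from just 3 models
```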
We note further that we do not even require pure samples of the two underlying processes between which we would like to discriminate [35]. Thus, it would be feasible to discriminate based solely on real collider data. In our Z/quark example above, we would simply train one copy of Junipr on a sample of predominantly boosted-Z jets, and train another copy on predominantly quark jets, and the likelihood ratio of those two models would still be theoretically optimal for Z/quark discrimination.
In order to get a first look at the potential of likelihood-ratio discrimination using Junipr, we continue with the Z/quark example discussed above. We use Pythia to simulate \(e^+e^- \rightarrow q\bar{q}\) and \(e^+e^- \rightarrow ZZ\) events at a center-of-mass energy of 1 TeV. We impose a very tight mass window, 90.7–91.7 GeV, on the jets in each data set, so that no discrimination power can be gleaned from the jet mass. More details on the generation of the data sets were given in Sect. 3.1. We admit that a more compelling example of discrimination power would be for quark and gluon jets at hadron colliders, but we leave a proper treatment of that important case to future work. The toy scenario studied here serves both to prove that the probabilities output by Junipr are meaningful, and that likelihood-ratio discrimination using unsupervised probabilistic models is a promising application of the Junipr framework.
Indeed, visualizing jets as in Fig. 13 can provide a number of insights. Unsurprisingly, we see for the quark jet (on the top) that the likelihood ratio of the first branching is rather extreme, at \(10^{3.7}\), since it is unlike the energy-balanced first branching associated with boosted-Z jets. However, we see that almost all subsequent branchings are also unlike those expected in boosted-Z jets, and they combine to provide comparable discrimination power to the first branching alone. Many effects probably contribute to this separation power at later branchings, including that quark jets often gain their mass throughout their evolution instead of solely at the first branching, and that the quark jet is color-connected to other objects in the global event. Such effects have proven to be useful for discrimination in other contexts [69].
Similarly, considering the boosted-Z jet on the bottom of Fig. 13 shows that significant discrimination power comes not only from the first branching, but also from subsequent splittings, as the boosted-Z jet evolves as a color-singlet \(q \bar{q}\) pair. Note the presence of the predictive secondary emissions sent from one quark subjet toward the other. This is reminiscent of the pull observable, which has proven useful for discrimination in other contexts [70]. More generally, the importance of the energy distribution, opening angles, multiplicity, and branching pattern in high-performance discrimination can be understood from such pictures.
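Because the Junipr probability factorizes over the branchings of the clustering tree, the jet-level log-likelihood ratio underlying these pictures is simply a sum of per-branching contributions, which is what allows the discrimination power to be drawn branching by branching. A sketch with invented placeholder probabilities:

```python
import math

# Per-branching probabilities under the two models (illustrative
# placeholders; trained models would supply these along the tree).
branch_probs_q = [0.30, 0.20, 0.25, 0.15]  # quark model at each branching
branch_probs_Z = [0.05, 0.10, 0.20, 0.12]  # Z model at the same branchings

# Each term can be attached to its node in the tree, as in Fig. 13.
per_branch = [math.log(pq / pz)
              for pq, pz in zip(branch_probs_q, branch_probs_Z)]
total_log_ratio = sum(per_branch)  # equals log of the jet-level ratio
```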
4.2 Generation from JUNIPR
We now turn to a more familiar approach to jet physics, but a somewhat less appropriate usage of Junipr models: sampling new jets from the learned probability distribution to generate traditional observable distributions. We include this application here, not only to demonstrate this capability, but also to further validate the distribution learned by Junipr during unsupervised training.
Sampling from Junipr is relatively efficient; one simply samples from the low-dimensional distributions at each step t and feeds those samples forward as input to subsequent steps. In this way, one generates a full jet in many steps, as detailed in Fig. 14.
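Schematically, the generation loop looks like the following; all three model calls are invented placeholders for the trained network outputs, and the kinematics are heavily simplified:

```python
import numpy as np

rng = np.random.default_rng(1)

def p_end(state):
    """Placeholder for the network's end probability at this step."""
    return min(1.0, 0.1 * len(state))

def sample_mother(state):
    """Placeholder: pick the momentum that branches, here uniformly."""
    return rng.integers(len(state))

def sample_branching(state, mother):
    """Placeholder for sampling the branching variables x = (z, theta, phi, delta)."""
    return rng.random(4)

def generate_jet(seed_momentum, max_steps=50):
    state = [seed_momentum]
    for _ in range(max_steps):
        if rng.random() < p_end(state):
            break  # shower ends
        mother = sample_mother(state)
        x = sample_branching(state, mother)
        state.append(x)  # in Junipr proper, the mother is replaced by two daughters
    return state

jet = generate_jet(np.ones(4))
```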
However, there are two reasons why we do not consider Junipr to be built for generation. (These drawbacks could be avoided with a generative model; see [39, 40, 41].) The first is simply that sampling from probability distributions is generally inefficient. As we just showed, it turns out that Junipr is relatively easy to sample from, due to its sequential structure and the fact that distributions are low-dimensional at each t step. Despite this, sampling jets from Junipr is still much slower than generation with, for example, Pythia.
The second reason is more fundamental. With a sequential model structured as Junipr is, probability distributions at late t steps in generation are highly sensitive to the draws made at earlier t steps. Very small defects in the probability distributions at early steps cause feedback in the model that amplifies those errors. Furthermore, as a partially generated jet becomes more unrepresentative of the training data, the resulting probability distributions used at later steps are less trained, which can result in a runaway effect. All of this is to say that, for the purpose of generating jets, Junipr’s accuracy at early t steps is disproportionately important. This is in tension with the training method undertaken in Sect. 3.2, namely the maximization of the log-likelihood, which prioritizes all branchings equally. Thus, we should expect that some observable distributions generated by sampling jets from Junipr might agree worse with the validation set of Pythia data than otherwise expected. We mention in passing that this second drawback could be mitigated by reweighting jets after generation, as detailed in Sect. 4.3 below.
We consider this disagreement to be both expected and not a limitation of Junipr’s potential. Indeed, in the next section we will show how to overcome this issue, by generating samples consistent with Junipr’s learned probabilistic model, without ever sampling from it. In particular, the disagreement in Fig. 17 will be rectified in Fig. 18.
4.3 Reweighting Monte Carlo events
Another application of the Junipr framework is to reweight events. For example, suppose we trained Junipr on data from the Large Hadron Collider (LHC) to yield a probabilistic model \(P_\text {LHC}\). Then one could generate a sample of new events using a relatively accurate Monte Carlo simulator, train another instance of Junipr on that sample to yield \(P_\text {sim}\), and finally reweight the simulated events by \(P_\text {LHC} / P_\text {sim}\) evaluated on an event-by-event basis. This process yields a sample of events that is theoretically equivalent to the LHC data used in training \(P_\text {LHC}\). The advantage of such an approach is that Junipr can correct the simulated events on different levels, for example using the data reclustered in \(R_\text {sub}=0.1\) subjets as we have done in this paper. However, the full simulated event has the complete hadron distributions and can thereby be interfaced with a detector simulation. This is in many ways a simpler approach than trying to improve the simulation directly through the dark art of Monte Carlo tuning.
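The mechanics of likelihood-ratio reweighting can be sketched with one-dimensional Gaussians standing in for the two trained models (everything here is a placeholder; the real weights would come from two trained Junipr models evaluated on full jets):

```python
import numpy as np

rng = np.random.default_rng(2)

def log_p(x, mu, sigma):
    """Log density of a Gaussian, standing in for a trained model."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

# "Simulated" events drawn from N(0, 1); "target" model is N(0.5, 1).
x_sim = rng.normal(0.0, 1.0, size=200_000)
weights = np.exp(log_p(x_sim, 0.5, 1.0) - log_p(x_sim, 0.0, 1.0))

# Weighted averages over the simulated sample now approximate
# expectations under the target model.
reweighted_mean = np.average(x_sim, weights=weights)
```

Here the weighted mean lands near the target value 0.5 even though the unweighted sample is centered at 0.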
As with the likelihood-ratio discrimination in Sect. 4.1, here we will show results in a toy scenario as a proof-of-principle. Ideally a model trained on LHC data, with all related complications, would be used to reweight Monte Carlo jets to make the simulated data indiscernible from LHC data; we leave a proper study of this to future work.
In Fig. 18 we demonstrate that this reweighting procedure works as intended in a toy scenario, in which Pythia jets generated with \(\alpha _s=0.11\) are reweighted to match jets generated with \(\alpha _s=0.1365\). We check this for both the 2-subjettiness and 3-subjettiness ratio observables, as well as the jet shape observable. On the left side of Fig. 18, one can see that in all cases, the \(\alpha _s=0.11\) distribution is clearly different from the \(\alpha _s=0.1365\) distribution. On the right side of Fig. 18, one finds that the two distributions come into relatively good agreement once the \(\alpha _s=0.11\) jets are reweighted by \(P_{\alpha _s = 0.1365} / P_{\alpha _s = 0.11}\). This also provides further confirmation that Junipr learns subtle correlations between constituent momenta inside jets.
Note that it was the 2-subjettiness ratio observable that Junipr struggled to predict well through direct sampling (see Fig. 17), whereas when reweighting another set of samples, Junipr matches the data well on this observable (see top-right of Fig. 18). This corroborates the discussion in Sect. 4.2 concerning the difficulties in sampling directly from Junipr.
Before closing this section, let us reiterate one point mentioned above. For the procedure of reweighting events to be practical, the weights used should not be radically different from unity, meaning that the two distributions generating the two samples should not be too different. If this condition is not satisfied, then away from the limit of infinite statistics, a few events with very large weights could vastly overpower the rest of the events, leading to a choppy reweighted distribution with large statistical uncertainties. To avoid this problem in the toy scenario explored in this section, we found it necessary to discard roughly \(0.1\%\) of the jets in the \(\alpha _s = 0.11\) sample which were outliers with \(P_{\alpha _s = 0.1365} / P_{\alpha _s = 0.11} > 100\). These outliers were uncorrelated with the observables shown, and we believe they resulted from imperfections in the trained model. It is clear that much more needs to be understood about the application of reweighting, but this would perhaps be more effectively done in the context of a specific task of interest involving LHC data.
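A sketch of this safeguard, with placeholder log-normal weights standing in for the trained likelihood ratios (the effective-sample-size diagnostic is an addition of ours, not part of the procedure above):

```python
import numpy as np

rng = np.random.default_rng(3)

# Placeholder weights with a heavy-ish tail, standing in for
# P_target / P_sim evaluated event by event.
weights = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)

CUTOFF = 100.0  # discard events with extreme weights
keep = weights < CUTOFF
kept_weights = weights[keep]
discarded_fraction = 1.0 - keep.mean()

# Effective sample size: how many unweighted events the reweighted
# sample is statistically "worth"; a few huge weights drive this down.
ess = kept_weights.sum() ** 2 / (kept_weights ** 2).sum()
```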
5 Factorization and JUNIPR
In the previous section, we showed some preliminary but very exciting results for likelihood-ratio discrimination and for the generation of observables by reweighting simulated jets. Both of these applications require access to an unsupervised probabilistic model. Next we discuss some of the more subtle internal workings of Junipr, which are intimately related to the underlying physics of factorization.
In particular, we show in Sect. 5.1 that the hidden representation \(h^{(t)}\) indeed stores important global information about intermediate states of jets. We then discuss the clustering-algorithm independence of Junipr by considering two distinct clustering algorithms: a “printer” algorithm in Sect. 5.2, where momenta are processed left-to-right and top-to-bottom as if by an inkjet printer; and the anti-\(k_t\) algorithm in Sect. 5.3, which allows us to present another counterintuitive result, the anti-\(k_t\) shower generator.
5.1 The encoding of global information
We have constructed Junipr so that all global information about the jet is contained in the RNN’s hidden state \(h^{(t)}\). Only the branching function \(P_\text {branch}\) receives the local \(1\rightarrow 2\) branching information in addition to \(h^{(t)}\). This forces \(h^{(t)}\) to contain all the information needed to predict when the shower should end, \(P_\text {end}\), to predict which momentum should branch next, \(P_\text {mother}\), and to inform the branching function \(P_\text {branch}\) of the relevant global structure. As the primary feature vector for all three of these distinct tasks, \(h^{(t)}\) must learn an effective representation of the jet at evolution step t.
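Schematically, with single linear layers standing in for the actual network heads (the shapes and weights below are illustrative placeholders, not the real Junipr architecture):

```python
import numpy as np

rng = np.random.default_rng(4)

H, N_MOM, X_DIM, N_CELLS = 32, 6, 8, 10**4

h_t = rng.normal(size=H)                    # RNN hidden state at step t
local_branch_info = rng.normal(size=X_DIM)  # local 1->2 branching inputs

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

W_end = rng.normal(size=H)
W_mother = rng.normal(size=(N_MOM, H))
W_branch = rng.normal(size=(N_CELLS, H + X_DIM))

# P_end and P_mother see only the shared hidden state h_t ...
p_end = sigmoid(W_end @ h_t)
p_mother = softmax(W_mother @ h_t)
# ... while P_branch additionally receives the local branching inputs.
p_branch = softmax(W_branch @ np.concatenate([h_t, local_branch_info]))
```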
While we do not show the corresponding results from our baseline (global) model in Fig. 19 to avoid clutter, the agreement with Pythia is essentially perfect, as one would expect from the similar check performed in Fig. 9. This confirms the success of the jet representation \(h^{(t)}\) in supplying the branching function \(P_\text {branch}\) with important information about the global structure.
5.2 Clustering algorithm independence
Another subtle aspect of Junipr is its theoretical clustering algorithm independence. In principle, the model as described in Sect. 2.1 is indeed independent of the chosen algorithm, which is fixed simply to avoid a sum over all possible trees consistent with the final-state momenta. That is, for each clustering procedure chosen by the user, a different model is learned, but one that describes the same probability distribution over final-state momenta, at least formally.
However, it is not guaranteed that a given neural-network implementation of Junipr will work well for every clustering algorithm. We have chosen an architecture that stores the global jet physics in the RNN’s hidden state \(h^{(t)}\) and the local \(1 \rightarrow 2\) branching physics in the branching function \(P_\text {branch}\). This architecture is motivated by the factorizing structure of QCD, and thus Junipr will most easily learn jet trees that are most similar to QCD – our primary reason for predominantly using the C/A algorithm. Consequently, though the model described in Sect. 2.1 is formally independent of clustering algorithm, the particular implementation adopted in Sect. 2.2 may weakly depend on the chosen algorithm by virtue of the ease with which it can learn the data.
5.3 Anti-\(k_t\) shower generator
Reassured by the results of the previous section, we next consider Junipr trained on Pythia jets reclustered with anti-\(k_t\) [58]. Like the printer algorithm, anti-\(k_t\) does not approximate the natural collinear structure of QCD. Unlike the printer algorithm, however, anti-\(k_t\) is a very commonly used tool. For the latter reason we explore anti-\(k_t\) jets here.
To see to what extent an anti-\(k_t\) implementation of Junipr relies on the global information stored in \(h^{(t)}\), we trained two models on Pythia-generated quark jets clustered with anti-\(k_t\) (see Sect. 3.1 for more details on the training data used). One model, \(P_\text {anti}\), has the baseline architecture outlined in Sect. 2. The other, \(P_\text {anti-local}\), is a local branching model like the one used in Sect. 5.1, in which the global representation \(h^{(t)}\) is withheld as input to the branching function.
In Fig. 23 (bottom) we show a jet sampled from \(P_\text {anti}\). In this case, though the tree itself does not encode the collinear structure of emissions, one can see that the emission directions are highly correlated with one another, demonstrating the success of the jet representation \(h^{(t)}\) in tracking the global branching pattern. In Fig. 23 (top) we show for comparison a jet sampled from \(P_\text {anti-local}\), in which the branching function does not receive \(h^{(t)}\). In the latter case, all correlation between the emission directions is lost. This shows that the global representation \(h^{(t)}\) is crucial for a successful anti-\(k_t\) branching model.
In Fig. 24 we show the 2-dimensional distribution over jet mass and constituent multiplicity, as well as the 2-subjettiness distribution, generated with \(P_\text {anti}\). One can see that the former distribution is consistent with the distribution generated by Pythia in Fig. 16. Mild disagreement between \(P_\text {anti}\)’s 2-subjettiness distribution and Pythia ’s can be seen on the right side of Fig. 24. This is on par with the agreement obtained by sampling from the C/A model in Fig. 17.
In Sect. 5.1 we saw that the RNN’s hidden state \(h^{(t)}\) manages the global information in Junipr’s neural network architecture. This is an efficient and natural way to characterize QCD-like jets, and therefore also C/A clustering trees. Though Junipr is formally independent of jet algorithm (i.e. in the infinite-capacity and perfect-training limit), we might expect Junipr’s performance to degrade somewhat when paired with clustering algorithms that require significantly more information to be stored in \(h^{(t)}\). This was explored in Sects. 5.2 and 5.3 using two separate non-QCD-like clustering algorithms, namely the “printer” and anti-\(k_t\) algorithms. Despite these clustering algorithms being unnatural choices for Junipr, we were able to demonstrate conceptually interesting and novel results, such as the anti-\(k_t\) shower generator. This further demonstrates that Junipr can continue to function well, even when the clustering algorithm chosen for implementation bears little resemblance to the underlying physical processes that generate the data.
6 Conclusions and outlook
In this paper, we have introduced Junipr as a framework for unsupervised machine learning in particle physics. The framework calls for a neural network architecture designed to efficiently describe the leading-order physics of \(1\rightarrow 2\) splittings, alongside a representation of the global jet physics. This requires the momenta in a jet to be clustered into a binary tree. The choice of clustering algorithm is not essential to Junipr’s performance, but choosing an algorithm that has some correspondence with an underlying physical model, such as the angular-ordered parton shower in quantum chromodynamics, gives improved performance and allows for interpretability of the network. At Junipr’s core is a recurrent neural network with three interconnected components. It moves along the jet’s clustering tree, evaluating the likelihood of each branching. More generally, Junipr is a function that acts on a set of 4-momenta in an event to compute their relative differential cross section, i.e. the probability density for this event to occur, given the event selection criteria used to select the training sample. One of the appealing features of Junipr is its interpretability: it provides a deconstruction of the probability density into contributions from each point in the clustering history.
There are many promising applications of Junipr, and we have only been able to touch on a few proof-of-concept tests in this introductory work. One exciting use case is discrimination. In contrast to supervised models which directly learn to discriminate between two samples, Junipr learns the features of the samples separately. It then discriminates by comparing the likelihood of a given event with respect to alternative models of the underlying physics. The resulting likelihood ratio provides theoretically optimal statistical power. As an example, we showed that Junipr can discriminate between boosted Z bosons and quark jets (in a very tight mass window around \(m_Z\)) in \(e^+e^-\) events when trained on the two samples separately. With Junipr, it is not only possible to perform powerful discrimination using unsupervised learning, but the discrimination power can be visualized over the entire clustering tree of each jet, as in Fig. 13. This opens new avenues for physicists to gain intuition about the physics underlying high-performance discrimination. Such studies might even inspire the construction of new calculable observables.
Another exciting potential application of Junipr is the reweighting of Monte Carlo events, in order to improve agreement with real collider data. A proof-of-concept of this idea was given in Fig. 18, where jets generated with one Pythia tune were reweighted to match jets generated with another. The reason this application is important is that current Monte Carlo event generators do an excellent job of simulating events on average, but are limited by the models and parameters within them. It may be easier to correct for systematic bias in event generation by a small reweighting factor appropriate for a particular data sample, rather than by trying to isolate and improve faulty components of the model. In this context, Junipr can be thought of as providing small but highly granular tweaks to simulations in order to improve agreement with data.
The Junipr framework was used here to compute the likelihood that a given set of particle momenta will arise inside a jet. One can also imagine more general models that act on all the momenta in an entire event, including particle identification tags, or even low-level detector data. A particularly interesting direction would be to consider applying Junipr to heavy ion collisions, in which the medium where the jets are produced and evolve is not yet well understood. In this case, comparing the probabilities in data to those of simulation could give insights into how to improve the simulations, or more optimistically, to improve our understanding of the underlying physics.
Acknowledgements
We benefited from interesting discussions with D. Barber, E. Bernton, A. Botev, Y.T. Chien, K. Cranmer, R. Elayavalli, M. Freytsis, B. Gaujac, R. Habib, P. Komiske, E. Metodiev, B. Nachman, and J. Thaler. AA, CF, and MDS are supported in part by the Department of Energy under contract DE-SC0013607. Support for AA and CF was provided in part by the Harvard Data Science Initiative. All compute costs were covered by ASI Data Science through their machine learning platform SherlockML.
References
1. A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks, in Advances in Neural Information Processing Systems (2012), pp. 1097–1105
2. K. He, X. Zhang, S. Ren, J. Sun, Deep Residual Learning for Image Recognition (2016), pp. 770–778. arXiv:1512.03385
3. G. Huang, Z. Liu, K.Q. Weinberger, Densely Connected Convolutional Networks (2017). arXiv:1608.06993
4. D. Bahdanau, K. Cho, Y. Bengio, Neural Machine Translation by Jointly Learning to Align and Translate (2014). arXiv:1409.0473
5. Y. Wu, M. Schuster, Z. Chen, Q.V. Le, M. Norouzi, W. Macherey et al., Google’s Neural Machine Translation System: Bridging the Gap Between Human and Machine Translation. arXiv:1609.08144
6. A. Graves, N. Jaitly, Towards End-to-End Speech Recognition with Recurrent Neural Networks (2014)
7. A. Van Den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves et al., Wavenet: A Generative Model for Raw Audio (2016). arXiv:1609.03499
8. D.E. Rumelhart, G.E. Hinton, R.J. Williams, Learning representations by back-propagating errors. Nature 323, 533 (1986)
9. T. Mikolov, M. Karafiát, L. Burget, J. Černockỳ, S. Khudanpur, Recurrent neural network based language model, in Eleventh Annual Conference of the International Speech Communication Association (2010)
10. ATLAS Collaboration, G. Aad et al., A neural network clustering algorithm for the ATLAS silicon pixel detector. JINST 9, P09009 (2014). arXiv:1406.7690
11. ATLAS Collaboration, G. Aad et al., Performance of \(b\)-jet identification in the ATLAS experiment. JINST 11, P04008 (2016). arXiv:1512.01094
12. CMS Collaboration, S. Chatrchyan et al., Performance of tau-lepton reconstruction and identification in CMS. JINST 7, P01001 (2012). arXiv:1109.6034
13. K. Datta, A. Larkoski, How much information is in a jet? JHEP 06, 073 (2017). arXiv:1704.08249
14. K. Datta, A.J. Larkoski, Novel jet observables from machine learning. JHEP 03, 086 (2018). arXiv:1710.01305
15. H. Luo, M.X. Luo, K. Wang, T. Xu, G. Zhu, Quark jet versus gluon jet: deep neural networks with high-level features. arXiv:1712.03634
16. P.T. Komiske, E.M. Metodiev, J. Thaler, Energy flow polynomials: a complete linear basis for jet substructure. arXiv:1712.07124
17. J. Gallicchio, J. Huth, M. Kagan, M.D. Schwartz, K. Black, B. Tweedie, Multivariate discrimination and the Higgs + W/Z search. JHEP 04, 069 (2011). arXiv:1010.3698
18. ATLAS Collaboration, Identification of hadronically-decaying W bosons and top quarks using high-level features as input to boosted decision trees and deep neural networks in ATLAS at \(\sqrt{s} = 13\) TeV, Technical Report ATL-PHYS-PUB-2017-004, CERN, Geneva (2017)
19. J. Cogan, M. Kagan, E. Strauss, A. Schwarztman, Jet-images: computer vision inspired techniques for jet tagging. JHEP 02, 118 (2015). arXiv:1407.5675
20. L. de Oliveira, M. Kagan, L. Mackey, B. Nachman, A. Schwartzman, Jet-images: deep learning edition. JHEP 07, 069 (2016). arXiv:1511.05190
21. P.T. Komiske, E.M. Metodiev, M.D. Schwartz, Deep learning in color: towards automated quark/gluon jet discrimination. JHEP 01, 110 (2017). arXiv:1612.01551
22. P.T. Komiske, E.M. Metodiev, B. Nachman, M.D. Schwartz, Pileup mitigation with machine learning (PUMML). JHEP 12, 051 (2017). arXiv:1707.08600
23. G. Kasieczka, T. Plehn, M. Russell, T. Schell, Deep-learning top taggers or the end of QCD? JHEP 05, 006 (2017). arXiv:1701.08784
24. W. Bhimji, S.A. Farrell, T. Kurth, M. Paganini, Prabhat, E. Racah, Deep neural networks for physics analysis on low-level whole-detector data at the LHC, in 18th International Workshop on Advanced Computing and Analysis Techniques in Physics Research (ACAT 2017), Seattle, WA, USA, August 21–25, 2017. arXiv:1711.03573
25. ATLAS Collaboration, Quark versus gluon jet tagging using jet images with the ATLAS detector, Technical Report ATL-PHYS-PUB-2017-017, CERN, Geneva (2017)
26. S. Macaluso, D. Shih, Pulling Out All the Tops with Computer Vision and Deep Learning. arXiv:1803.00107
27. Y.T. Chien, R. Kunnawalkam Elayavalli, Probing heavy ion collisions using quark and gluon jet substructure. arXiv:1803.03589
28. J. Pearkes, W. Fedorko, A. Lister, C. Gay, Jet Constituents for Deep Neural Network Based Top Quark Tagging. arXiv:1704.02124
29. G. Louppe, K. Cho, C. Becot, K. Cranmer, QCD-Aware Recursive Neural Networks for Jet Physics. arXiv:1702.00748
30. T. Cheng, Recursive Neural Networks in Quark/Gluon Tagging. arXiv:1711.02633
31. S. Egan, W. Fedorko, A. Lister, J. Pearkes, C. Gay, Long Short-Term Memory (LSTM) Networks with Jet Constituents for Boosted Top Tagging at the LHC. arXiv:1711.09059
32. K. Fraser, M.D. Schwartz, Jet Charge and Machine Learning. arXiv:1803.08066
33. D. Guest, J. Collado, P. Baldi, S.C. Hsu, G. Urban, D. Whiteson, Jet flavor classification in high-energy physics with deep neural networks. Phys. Rev. D 94, 112002 (2016). arXiv:1607.08633
34. ATLAS Collaboration, Identification of jets containing \(b\)-hadrons with recurrent neural networks at the ATLAS experiment, Technical Report ATL-PHYS-PUB-2017-003, CERN, Geneva (2017)
35. E.M. Metodiev, B. Nachman, J. Thaler, Classification without labels: learning from mixed samples in high energy physics. JHEP 10, 174 (2017). arXiv:1708.02949
36. T. Cohen, M. Freytsis, B. Ostdiek, (Machine) learning to do more with less. JHEP 02, 034 (2018). arXiv:1706.09451
37. P.T. Komiske, E.M. Metodiev, B. Nachman, M.D. Schwartz, Learning to Classify from Impure Samples. arXiv:1801.10158
38. E.M. Metodiev, J. Thaler, On the Topic of Jets. arXiv:1802.00008
39. L. de Oliveira, M. Paganini, B. Nachman, Learning particle physics by example: location-aware generative adversarial networks for physics synthesis. Comput. Softw. Big Sci. 1, 4 (2017). arXiv:1701.05927
40. M. Paganini, L. de Oliveira, B. Nachman, Accelerating science with generative adversarial networks: an application to 3D particle showers in multilayer calorimeters. Phys. Rev. Lett. 120, 042003 (2018). arXiv:1705.02355
41. M. Paganini, L. de Oliveira, B. Nachman, CaloGAN: simulating 3D high energy particle showers in multilayer electromagnetic calorimeters with generative adversarial networks. Phys. Rev. D 97, 014021 (2018). arXiv:1712.10321
42. J. Neyman, E.S. Pearson, IX. On the problem of the most efficient tests of statistical hypotheses. Philos. Trans. R. Soc. Lond. A Math. Phys. Eng. Sci. 231, 289–337 (1933)
43. A. Butter, G. Kasieczka, T. Plehn, M. Russell, Deep-learned top tagging with a Lorentz layer. arXiv:1707.08966
44. S. Coleman, R. Norton, Singularities in the physical region. Nuovo Cim. 38, 438–442 (1965)
45. J.C. Collins, D.E. Soper, G.F. Sterman, Factorization for short distance hadron–hadron scattering. Nucl. Phys. B 261, 104 (1985)
46. J.C. Collins, D.E. Soper, G.F. Sterman, Soft gluons and factorization. Nucl. Phys. B 308, 833 (1988)
47. I. Feige, M.D. Schwartz, Hard-soft-collinear factorization to all orders. Phys. Rev. D 90, 105020 (2014)
48. I. Feige, M.D. Schwartz, An on-shell approach to factorization. Phys. Rev. D 88, 065021 (2013)
49. S. Catani, Y.L. Dokshitzer, M.H. Seymour, B.R. Webber, Longitudinally invariant \(K_t\) clustering algorithms for hadron hadron collisions. Nucl. Phys. B 406, 187–224 (1993)
50. S.D. Ellis, D.E. Soper, Successive combination jet algorithm for hadron collisions. Phys. Rev. D 48, 3160–3166 (1993). arXiv:hep-ph/9305266
51. Y.L. Dokshitzer, G.D. Leder, S. Moretti, B.R. Webber, Better jet clustering algorithms. JHEP 08, 001 (1997). arXiv:hep-ph/9707323
52. M. Wobisch, T. Wengler, Hadronization corrections to jet cross-sections in deep inelastic scattering, in Monte Carlo Generators for HERA Physics, Proceedings, Workshop, Hamburg, Germany, 1998–1999, pp. 270–279 (1998). arXiv:hep-ph/9907280
53. S.D. Ellis, A. Hornig, T.S. Roy, D. Krohn, M.D. Schwartz, Qjets: a non-deterministic approach to tree-based jet substructure. Phys. Rev. Lett. 108, 182003 (2012). arXiv:1201.1914
54. D. Kahawala, D. Krohn, M.D. Schwartz, Jet sampling: improving event reconstruction through multiple interpretations. JHEP 06, 006 (2013). arXiv:1304.2394
55. L. Mackey, B. Nachman, A. Schwartzman, C. Stansbury, Fuzzy jets. JHEP 06, 010 (2016). arXiv:1509.02216
56. D.E. Soper, M. Spannowsky, Finding physics signals with shower deconstruction. Phys. Rev. D 84, 074002 (2011). arXiv:1102.3480
57. D.E. Soper, M. Spannowsky, Finding physics signals with event deconstruction. Phys. Rev. D 89, 094005 (2014). arXiv:1402.1189
58. M. Cacciari, G.P. Salam, G. Soyez, The anti-\(k_t\) jet clustering algorithm. JHEP 04, 063 (2008). arXiv:0802.1189
59. M. Cacciari, G.P. Salam, G. Soyez, FastJet user manual. Eur. Phys. J. C 72, 1896 (2012). arXiv:1111.6097
60. I. Goodfellow, Y. Bengio, A. Courville, Deep Learning (MIT Press, New York, 2016)
61. K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk et al., Learning phrase representations using RNN encoder–decoder for statistical machine translation, in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734, Association for Computational Linguistics (2014). https://doi.org/10.3115/v1/D14-1179
62. S. Hochreiter, J. Schmidhuber, Long short-term memory. Neural Comput. 9, 1735–1780 (1997)
63. T. Sjostrand, S. Mrenna, P.Z. Skands, PYTHIA 6.4 physics and manual. JHEP 05, 026 (2006). arXiv:hep-ph/0603175
64. T. Sjöstrand, S. Ask, J.R. Christiansen, R. Corke, N. Desai, P. Ilten et al., An introduction to PYTHIA 8.2. Comput. Phys. Commun. 191, 159–177 (2015). arXiv:1410.3012
65. Theano Development Team, R. Al-Rfou, G. Alain, A. Almahairi, C. Angermueller, D. Bahdanau, N. Ballas et al., Theano: a Python framework for fast computation of mathematical expressions (2016). arXiv:1605.02688
66. F. Morin, Y. Bengio, Hierarchical probabilistic neural network language model, in AISTATS’05, pp. 246–252 (2005)
67. A. Mnih, G.E. Hinton, A scalable hierarchical distributed language model, in Advances in Neural Information Processing Systems, vol. 21, ed. by D. Koller, D. Schuurmans, Y. Bengio, L. Bottou (Curran Associates Inc., 2009), pp. 1081–1088
68. J. Thaler, K. Van Tilburg, Identifying boosted objects with \(N\)-subjettiness. JHEP 03, 015 (2011). arXiv:1011.2268
69. Y.T. Chien, A. Emerman, S.C. Hsu, S. Meehan, Z. Montague, Telescoping Jet Substructure. arXiv:1711.11041
70. J. Gallicchio, M.D. Schwartz, Seeing in color: jet superstructure. Phys. Rev. Lett. 105, 022001 (2010). arXiv:1001.5027
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Funded by SCOAP\(^{3}\)