Parameterized neural networks for high-energy physics
Abstract
We investigate a new structure for machine learning classifiers built with neural networks and applied to problems in high-energy physics by expanding the inputs to include not only measured features but also physics parameters. The physics parameters represent a smoothly varying learning task, and the resulting parameterized classifier can smoothly interpolate between them and replace sets of classifiers trained at individual values. This simplifies the training process and gives improved performance at intermediate values, even for complex problems requiring deep learning. Applications include tools parameterized in terms of theoretical model parameters, such as the mass of a particle, which allow for a single network to provide improved discrimination across a range of masses. This concept is simple to implement and allows for optimized interpolatable results.
Keywords
Parameterized network · Nuisance parameter · Deep neural network · Single network · Stochastic gradient descent

1 Introduction
Neural networks have been applied to a wide variety of problems in high-energy physics [1, 2], from event classification [3, 4] to object reconstruction [5, 6] and triggering [7, 8]. Typically, however, these networks are applied to solve a specific isolated problem, even when this problem is part of a set of closely related problems. An illustrative example is the signal-background classification problem for a particle with a range of possible masses. The classification tasks at different masses are related, but distinct. Current approaches require the training of a set of isolated networks [9, 10], each of which are ignorant of the larger context and lack the ability to smoothly interpolate, or the use of a single signal sample in training [11, 12], sacrificing performance at other values.
In this paper, we describe the application of the ideas in Ref. [13] to a new neural network strategy, a parameterized neural network in which a single network tackles the full set of related tasks. This is done by simply extending the list of input features to include not only the traditional set of event-level features but also one or more parameters that describe the larger scope of the problem, such as a new particle’s mass. The approach can be applied to any classification algorithm; however, neural networks provide a smooth interpolation, while tree-based methods may not.
A single parameterized network can replace a set of individual networks trained for specific cases, as well as smoothly interpolate to cases where it has not been trained. In the case of a search for a hypothetical new particle, this greatly simplifies the task – by requiring only one network – as well as making the results more powerful – by allowing them to be interpolated between specific values. In addition, parameterized networks may outperform isolated networks by generalizing from the full parameter-dependent dataset.
In the following, we describe the network structure needed to apply a single parameterized network to a set of smoothly related problems and demonstrate the application for theoretical model parameters (such as new particle masses) in a set of examples of increasing complexity.
2 Network structure and training
A typical network takes as input a vector of features, \(\bar{x}\), where the features are based on event-level quantities. After training, the resulting network is then a function of these features, \(f(\bar{x})\). In the case that the task at hand is part of a larger context, described by one or more parameters \(\bar{\theta }\), it is straightforward to construct a network that uses both sets of inputs, \(\bar{x}\) and \(\bar{\theta }\), and operates as a function of both: \(f(\bar{x},\bar{\theta })\). For a given set of inputs \(\bar{x}_0\), a traditional network evaluates to a real number \(f(\bar{x}_0)\). A parameterized network, however, provides a result that is parameterized in terms of \(\bar{\theta }\): \(f(\bar{x}_0,\bar{\theta })\), yielding different output values for different choices of the parameters \(\bar{\theta }\); see Fig. 1.
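The structural change described above can be seen in a minimal numpy sketch (not the paper's implementation; the weights and layer sizes here are arbitrary for illustration): the only difference from a traditional classifier is that the input vector is the concatenation of the event features \(\bar{x}\) and the parameter \(\theta\), so the same event yields different outputs for different parameter values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and randomly chosen weights for a one-hidden-layer
# network; the "+ 1" in the input width accommodates the parameter theta.
n_features, n_hidden = 3, 4
W1 = rng.normal(size=(n_hidden, n_features + 1))
b1 = rng.normal(size=n_hidden)
W2 = rng.normal(size=n_hidden)
b2 = 0.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def f(x, theta):
    """Parameterized network output f(x, theta), a value in (0, 1)."""
    inp = np.concatenate([x, [theta]])        # append theta to the features
    hidden = sigmoid(W1 @ inp + b1)
    return sigmoid(W2 @ hidden + b2)

x0 = np.array([0.5, -1.2, 0.3])               # one fixed event x_0
out_low = f(x0, theta=0.0)                    # f(x_0, theta = 0)
out_high = f(x0, theta=1.0)                   # f(x_0, theta = 1)
```

Because theta enters the network exactly like any other input feature, evaluating the trained function at a new theta requires no retraining, which is the basis of the interpolation property discussed below.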
3 Toy example
As a demonstration for a simple toy problem, we construct a parameterized network which has a single input feature x and a single parameter \(\theta \). The network, with one hidden layer of three nodes and sigmoid activation functions, is trained using labeled examples, where examples with label 0 are drawn from a uniform background and examples with label 1 are drawn from a Gaussian with mean \(\theta \) and width \(\sigma =0.25\). Training samples are generated with \(\theta =-2,-1,0,1,2\); see Fig. 2a.
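The toy setup can be reproduced with scikit-learn's `MLPClassifier` as a stand-in for the paper's implementation (an assumption; the sample sizes and optimizer defaults here are ours): one hidden layer of three logistic units, with inputs \((x, \theta)\).

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
thetas = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
n_per = 2000                                    # examples per class per theta

X_parts, y_parts = [], []
for theta in thetas:
    sig_x = rng.normal(loc=theta, scale=0.25, size=n_per)  # label 1: Gaussian
    bkg_x = rng.uniform(-4.0, 4.0, size=n_per)             # label 0: uniform
    x = np.concatenate([sig_x, bkg_x])
    X_parts.append(np.column_stack([x, np.full(x.size, theta)]))
    y_parts.append(np.concatenate([np.ones(n_per), np.zeros(n_per)]))

X = np.vstack(X_parts)
y = np.concatenate(y_parts)

# One hidden layer of three sigmoid nodes, as in the text.
clf = MLPClassifier(hidden_layer_sizes=(3,), activation="logistic",
                    max_iter=500, random_state=0)
clf.fit(X, y)

# At theta = 0, x near 0 should look signal-like and x far away background-like.
p_sig = clf.predict_proba(np.array([[0.0, 0.0]]))[0, 1]
p_bkg = clf.predict_proba(np.array([[3.0, 0.0]]))[0, 1]
```

The trained classifier can then be queried at intermediate values such as \(\theta = 0.5\), where it saw no signal examples, which is the behavior illustrated in Fig. 2.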
4 1D physical example
We first explore the performance in a one-dimensional case. The single event-level feature of the network is \(m_{WWbb}\), the reconstructed resonance mass, calculated using techniques described in Ref. [14]. Specifically, we assume resolved top quarks in each case, for simplicity. Events are simulated at the parton level with madgraph5 [15], using pythia [16] for showering and hadronization and delphes [17] with the ATLAS-style configuration for detector simulation. Figure 4a shows the distribution of reconstructed masses for the background process as well as several values of \(m_X\), the mass of the hypothetical X particle. Clearly the nature of the discrimination problem is distinct at each mass, though similar across masses.
Several traditional strategies could be used for this set of related classification tasks:
Train a single neural network at one intermediate value of the mass and use it for all other mass values, as was done in Refs. [11, 12]. This approach gives the best performance at the mass used in the training sample, but performance degrades at other masses.

Train a single neural network using an unlabeled mixture of signal samples and use it for all other mass values. This approach may reduce the loss in performance away from the single mass value used in the previous approach, but it also degrades the performance near that mass point, as the signal is smeared.

Train a set of neural networks for a set of mass values, as done in Refs. [9, 10]. This approach gives the best signal-background classification performance at each of the trained mass values. However, performance degrades for mass values away from the ones used in training. Most importantly, this approach leads to discontinuities in selection efficiencies across masses, and interpolation of the observed limits is not possible, as the degradation of the performance away from the training points is not defined.
We note that Ref. [18] previously applied a similar idea with the same goal of improving the interpolation among model parameters. However, in that study the application of boosted decision trees (BDTs) led to a marked decrease in sensitivity at each point compared to isolated algorithms trained at specific values, and no demonstration was made of the ability to interpolate complex problems in high-dimensional spaces.
Our parameterized neural networks are implemented using the multilayer perceptron in PyLearn2 [19], with outputs treated with a regressor method and logistic activation function. Input and output data are preprocessed via a scikit-learn [20] pipeline (i.e. scaled so that inputs and outputs lie in the range [0, 1]). Each neural network has one hidden layer of three nodes and is trained using Nesterov’s method for stochastic gradient descent [21]. The learning rate was initialized at 0.01, the learning momentum was set to 0.9, and the minibatch size was set to treat each point individually (i.e. a minibatch size of 1). The training samples have approximately 100k examples per mass point.
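An analogous setup can be sketched in scikit-learn rather than PyLearn2 (an assumption, not the authors' code; note also that `MLPRegressor`'s output unit is linear rather than logistic, so this only approximates the regressor-with-logistic-output used in the paper): min-max scaling of the inputs, one hidden layer of three logistic units, and SGD with Nesterov momentum 0.9, learning rate 0.01, and minibatch size 1.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(2)
n = 400
# Toy stand-in data: a narrow "signal" peak and a flat "background",
# both tagged with a single parameter value theta = 0.5.
x = np.concatenate([rng.normal(0.5, 0.1, n // 2),
                    rng.uniform(0.0, 1.0, n // 2)])
theta = np.full(n, 0.5)
X = np.column_stack([x, theta])
y = np.concatenate([np.ones(n // 2), np.zeros(n // 2)])

model = Pipeline([
    ("scale", MinMaxScaler()),                 # map inputs to [0, 1]
    ("net", MLPRegressor(hidden_layer_sizes=(3,), activation="logistic",
                         solver="sgd", learning_rate_init=0.01,
                         momentum=0.9, nesterovs_momentum=True,
                         batch_size=1,         # one example per update
                         max_iter=30, random_state=0)),
])
model.fit(X, y)
preds = model.predict(X)
```

Training a regressor on 0/1 targets in this way approximates the class probability, which is what is thresholded to build the classifiers compared below.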
The critical test is the signal-background classification performance. To measure the ability of the network to perform well at interpolated values of the parameter – values at which it has seen no training data – we compare the performance of a single fixed network trained at a specific value \(m_{X}^0\) to a parameterized network trained at the other available values, excluding \(m_{X}^0\). For example, Fig. 4 compares a single network trained at \(m_{X}^0=750\) GeV to a parameterized network trained with data at \(m_{X}=500,1000,1250,1500\) GeV. The parameterized network’s input parameter is set to the true value of the mass \(m_X^0\), and it is applied to data generated at that mass; recall that it saw no examples at \(m_X^0\) in training. Its performance matches or nearly matches that of the single network trained at that value, validating the ability of the single parameterized network to interpolate between mass values without any appreciable loss of statistical performance. Clearly, however, such arguments cannot be applied to extrapolation beyond the boundaries of the training examples. Moreover, we recommend that similar hold-out tests be performed to check the quality of the parameterized network on a case-by-case basis.
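The hold-out protocol can be sketched with synthetic stand-in data (the real samples are simulated collisions, so the data-generating functions below are purely illustrative): train on masses 500, 1000, 1250, and 1500 GeV, then evaluate at the held-out 750 GeV with the input parameter set to the true mass.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)

def make_data(mass_gev, n):
    """Toy stand-in for the simulated samples at a given mass."""
    m = mass_gev / 1000.0                       # work in TeV
    sig = rng.normal(m, 0.1, n)                 # resonance-like feature
    bkg = rng.uniform(0.2, 2.0, n)              # smooth background feature
    x = np.concatenate([sig, bkg])
    X = np.column_stack([x, np.full(x.size, m)])  # (feature, mass parameter)
    y = np.concatenate([np.ones(n), np.zeros(n)])
    return X, y

train_masses = [500, 1000, 1250, 1500]          # 750 GeV is held out
parts = [make_data(mm, 1500) for mm in train_masses]
X_train = np.vstack([p[0] for p in parts])
y_train = np.concatenate([p[1] for p in parts])

clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=300, random_state=0)
clf.fit(X_train, y_train)

# Evaluate at the mass value never seen in training, with the network's
# parameter input set to that true value.
X_test, y_test = make_data(750, 1500)
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
```

A large AUC at the held-out mass, compared against a network trained directly at that mass, is the quantitative form of the interpolation check described above.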
5 High-dimensional physical example
The preceding examples serve to demonstrate the concept in one-dimensional cases where the variation of the output on both the parameters and features can be easily visualized. In this section, we demonstrate that the parameterization of the problem, and the interpolation power that it provides, can also be achieved in high-dimensional cases.
The events are characterized by a set of low-level features:

the leading lepton momenta,

the momenta of the four leading jets,

the b-tagging information for each jet,

the missing transverse momentum magnitude and angle,

the number of jets,

as well as a set of high-level features:

the mass (\(m_{\ell \nu }\)) of the \(W\rightarrow \ell \nu \),

the mass (\(m_{jj}\)) of the \(W\rightarrow qq'\),

the mass (\(m_{jjj}\)) of the \(t\rightarrow Wb\rightarrow bqq'\),

the mass (\(m_{j\ell \nu }\)) of the \(t\rightarrow Wb\rightarrow \ell \nu b\),

and the mass (\(m_{WWbb}\)) of the hypothetical \(X\rightarrow t\bar{t}\).
The parameterized deep neural network models were trained on GPUs using the Blocks framework [25, 26, 27]. Seven million examples were used for training and one million were used for testing, with 50 % background and 50 % signal. The architectures contain five hidden layers of 500 rectified linear units each, with a logistic output unit. Parameters were initialized from a Gaussian distribution with mean zero and width 0.1, and updated using stochastic gradient descent with minibatches of size 100 and a momentum of 0.5. The learning rate was initialized to 0.1 and decayed by a factor of 0.89 every epoch. Training was stopped after 200 epochs.
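The architecture and learning-rate schedule described above can be sketched in plain numpy rather than the Blocks/Theano stack actually used (the input width of 22 below is illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(4)
n_inputs = 22                                     # illustrative input width

# Five hidden layers of 500 units plus a single output unit; weights are
# initialized from a Gaussian with mean 0 and width 0.1, as in the text.
layer_sizes = [n_inputs] + [500] * 5 + [1]
weights = [rng.normal(0.0, 0.1, size=(m, n))
           for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(m) for m in layer_sizes[1:]]

def forward(x):
    """Forward pass: rectified linear hidden layers, logistic output."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(0.0, W @ h + b)            # rectified linear units
    z = weights[-1] @ h + biases[-1]
    return 1.0 / (1.0 + np.exp(-z))               # logistic output unit

out = forward(rng.normal(size=n_inputs))

# Learning rate: starts at 0.1 and decays by a factor of 0.89 per epoch,
# for the stated 200 epochs.
learning_rates = [0.1 * 0.89 ** epoch for epoch in range(200)]
```

The schedule drops the learning rate by more than five orders of magnitude over 200 epochs, so late training makes only fine adjustments to the weights.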
Conversely, Fig. 8 compares the performance of the parameterized network to a single network trained at \(m_X=1000\) GeV when applied across the mass range of interest, which is a common application case. This demonstrates the loss of performance incurred by some traditional approaches and recovered in this approach. Similarly, we see that a single network trained on an unlabeled mixture of signal samples from all masses has reduced performance at each mass value tested.
In previous work, we have shown that deep networks such as these do not require the addition of high-level features [28, 29] but are capable of learning the necessary functions directly from the low-level four-vectors. Here we extend that by repeating the study above without the use of the high-level features; see Fig. 7. Using only the low-level features, the parameterized deep network achieves essentially indistinguishable performance for this particular problem and training sets of this size.
6 Discussion
We have presented a novel structure for neural networks that allows for a simplified and more powerful solution to a common use case in high-energy physics and demonstrated improved performance in a set of examples with increasing dimensionality for the input feature space. While these examples use a single parameter \(\theta \), the technique is easily applied to higher-dimensional parameter spaces.
Parameterized networks can also provide optimized performance as a function of nuisance parameters that describe systematic uncertainties, where typical networks are optimal only for a single specific value used during training. This allows statistical procedures that make use of profile likelihood ratio tests [30] to select the network corresponding to the profiled values of the nuisance parameters [13].
Datasets used in this paper containing millions of simulated collisions can be found in the UCI Machine Learning Repository [31] at http://archive.ics.uci.edu/ml/datasets/HEPMASS.
Acknowledgments
We thank Tobias Golling, Daniel Guest, Kevin Lannon, Juan Rojo, Gilles Louppe, and Chase Shimmin for useful discussions. KC is supported by the US National Science Foundation Grants PHY-0955626, PHY-1205376, and ACI-1450310. KC is grateful to UC-Irvine for their hospitality while this research was initiated and the Moore and Sloan foundations for their generous support of the data science environment at NYU. We thank Yuzo Kanomata for computing support. We also wish to acknowledge a hardware grant from NVIDIA, NSF Grant IIS-1550705, and a Google Faculty Research award to PB.
References
1. B.H. Denby, Neural networks and cellular automata in experimental high-energy physics. Comput. Phys. Commun. 49, 429–448 (1988)
2. C. Peterson, T. Rognvaldsson, L. Lonnblad, JETNET 3.0: a versatile artificial neural network package. Comput. Phys. Commun. 81, 185–220 (1994)
3. P. Abreu et al., Classification of the hadronic decays of the Z0 into b and c quark pairs using a neural network. Phys. Lett. B 295, 383–395 (1992)
4. H. Kolanoski, Application of artificial neural networks in particle physics. Nucl. Instrum. Meth. A 367, 14–20 (1995)
5. C. Peterson, Track finding with neural networks. Nucl. Instrum. Meth. A 279, 537 (1989)
6. G. Aad et al., A neural network clustering algorithm for the ATLAS silicon pixel detector. JINST 9, P09009 (2014)
7. L. Lonnblad, C. Peterson, T. Rognvaldsson, Finding gluon jets with a neural trigger. Phys. Rev. Lett. 65, 1321–1324 (1990)
8. B. Denby et al., Neural networks for triggering. IEEE Trans. Nucl. Sci. 37, 248–254 (1990)
9. T. Aaltonen et al., Evidence for a particle produced in association with weak bosons and decaying to a bottom–antibottom quark pair in Higgs boson searches at the Tevatron. Phys. Rev. Lett. 109, 071804 (2012)
10. S. Chatrchyan et al., Combined results of searches for the standard model Higgs boson in \(pp\) collisions at \(\sqrt{s}=7\) TeV. Phys. Lett. B 710, 26–48 (2012)
11. G. Aad et al., Search for \(W^{\prime } \rightarrow t\bar{b}\) in the lepton plus jets final state in proton–proton collisions at a centre-of-mass energy of \(\sqrt{s}=8\) TeV with the ATLAS detector. Phys. Lett. B 743, 235–255 (2015)
12. S. Chatrchyan et al., Search for \(Z'\) resonances decaying to \(t\bar{t}\) in dilepton+jets final states in \(pp\) collisions at \(\sqrt{s}=7\) TeV. Phys. Rev. D 87(7), 072002 (2013)
13. K. Cranmer, J. Pavez, G. Louppe, Approximating likelihood ratios with calibrated discriminative classifiers (2015). arXiv:1506.02169
14. G. Aad et al., Search for a multi-Higgs-boson cascade in \(W^+W^- b\bar{b}\) events with the ATLAS detector in pp collisions at \(\sqrt{s} = 8\) TeV. Phys. Rev. D 89(3), 032002 (2014)
15. J. Alwall, M. Herquet, F. Maltoni, O. Mattelaer, T. Stelzer, MadGraph 5: going beyond. JHEP 1106, 128 (2011)
16. T. Sjöstrand, S. Mrenna, P. Skands, PYTHIA 6.4 physics and manual. JHEP 0605, 026 (2006)
17. J. de Favereau et al., DELPHES 3, a modular framework for fast simulation of a generic collider experiment. JHEP 1402, 057 (2014)
18. V.M. Abazov et al., Search for the Higgs boson in lepton, tau and jets final states. Phys. Rev. D 88(5), 052005 (2013)
19. I.J. Goodfellow, D. Warde-Farley, P. Lamblin, V. Dumoulin, M. Mirza, R. Pascanu, J. Bergstra, F. Bastien, Y. Bengio, Pylearn2: a machine learning research library (2013). arXiv:1308.4214
20. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
21. Y. Nesterov et al., Gradient methods for minimizing composite objective function. Technical report, UCL (2007)
22. A.L. Read, Linear interpolation of histograms. Nucl. Instrum. Meth. A 425, 357–360 (1999)
23. K. Cranmer, G. Lewis, L. Moneta, A. Shibata, W. Verkerke, HistFactory: a tool for creating statistical models for use with RooFit and RooStats (2012)
24. M. Baak, S. Gadatsch, R. Harrington, W. Verkerke, Interpolation between multi-dimensional histograms using a new non-linear moment morphing method. Nucl. Instrum. Meth. A 771, 39–48 (2015)
25. B. van Merriënboer, D. Bahdanau, V. Dumoulin, D. Serdyuk, D. Warde-Farley, J. Chorowski, Y. Bengio, Blocks and Fuel: frameworks for deep learning (2015). arXiv:1506.00619
26. F. Bastien, P. Lamblin, R. Pascanu, J. Bergstra, I.J. Goodfellow, A. Bergeron, N. Bouchard, Y. Bengio, Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop (2012)
27. J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, Y. Bengio, Theano: a CPU and GPU math expression compiler, in Proceedings of the Python for Scientific Computing Conference (SciPy), Austin, TX (2010)
28. P. Baldi, P. Sadowski, D. Whiteson, Searching for exotic particles in high-energy physics with deep learning. Nature Commun. 5, 4308 (2014)
29. P. Baldi, P. Sadowski, D. Whiteson, Enhanced Higgs boson to \(\tau^+\tau^-\) search with deep learning. Phys. Rev. Lett. 114(11), 111801 (2015)
30. G. Cowan, K. Cranmer, E. Gross, O. Vitells, Asymptotic formulae for likelihood-based tests of new physics. Eur. Phys. J. C 71, 1554 (2011)
31. P. Baldi, K. Cranmer, T. Faucett, P. Sadowski, D. Whiteson, UCI machine learning repository (2015). http://archive.ics.uci.edu/ml/datasets/HEPMASS
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Funded by SCOAP^{3}