1 Introduction

For every experimental analysis at the CERN Large Hadron Collider (LHC), selection criteria, widely referred to as cuts, are necessary to reduce the data-recording rate to a technically feasible level and to discriminate between events that are interesting and irrelevant for a specific physics question. At the experimental level, it is important to distinguish between two classes of cuts: those imposed at the trigger level, and those imposed offline. Events failing the former are not recorded at all, so that the information is lost, whereas events failing the latter are merely not considered for the final analysis. This distinction is less important when an analysis is reinterpreted using any sample of events other than real observed data, notably events generated by Monte Carlo simulations of the collisions expected under a given (new) physics model. In this case, both types of cuts simply amount to conditions determining whether a given generated event is considered in the analysis or not. The reinterpretation of an analysis does, however, in general require an ex novo implementation of the full set of cuts.

Several frameworks [1–3] have recently been released with this aim, based on simulated collisions that include an approximate modeling of the detector response. Although the description of the detector is highly simplified compared to the full ATLAS or CMS software, public fast detector simulations, most notably the Delphes program [4], have been found to provide reasonably accurate results. Detector effects can instead be mimicked by scale factors derived from unfolded data, as done in Rivet [5]. While this is an excellent approach for Standard Model measurements, detector unfolding is not yet practicable for beyond the Standard Model searches in general. Yet another approach, which does not rely on event simulation, uses results published by the experimental collaborations in the context of so-called Simplified Model Spectra, as in the works of Refs. [6, 7]. While much faster, this approach is however less general.

In the present work, we focus on the expert mode of the MadAnalysis 5 program [1, 8], dedicated to the implementation of any analysis based on a cut-and-count flow (in contrast to analyses relying on multivariate techniques) and to the investigation of the associated effects on any Monte Carlo event sample. The implementation of an analysis is facilitated by the large number of predefined functions and methods included in the SampleAnalyzer library shipped with the package, but it is often complicated in cases where one has several sub-analyses, which we refer to as regions (with a nod to the terms signal and control regions commonly used in searches for physics beyond the Standard Model). The complication arises from the internal format previously handled by SampleAnalyzer, which assumed the existence of a single region. While this assumption is convenient for prospective studies, i.e., the design of new analyses, it is rarely fulfilled by existing analyses that one may want to recast. In order to allow the user to both design and recast analyses, we have consequently extended the SampleAnalyzer internal format to support analyses with multiple regions defined by different sets of cuts. We have also expanded the code with extra methods and routines to facilitate the implementation of more complex analyses by the user.

In the context of analyses which effectively contain sub-analyses, a further useful classification of cuts can be made: namely, into those which are common to (shared by) different regions, and those which are not, the latter serving to define the different sub-analyses themselves. Figure 1 schematically illustrates an analysis containing four regions, which are defined by two region-specific cuts imposed after two common cuts. Some thought is required concerning the best way to capture in an algorithm the set of selection requirements shown in Fig. 1. For the common cuts (cuts 1 and 2 in the figure) this is clear: if the selection condition is failed, the event is vetoed (i.e., we ignore it and move on to analyzing the next event). Thereafter we have two conditions to check (cuts 3 and 4), but they apply to different regions. In terms of pseudo-code, the most obvious, although not the most efficient, method for implementing these third and fourth cuts is

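(the original listing is not reproduced here; a C++-like pseudo-code sketch, with region labels of our own choosing, reads)

  // the event has already survived cuts 1 and 2, common to all regions
  if (condition3)
  {
    // count the event in the region defined by cut 3 alone
    if (condition4)
    {
      // count the event in the region defined by cuts 3 and 4
    }
  }
  if (condition4)
  {
    // count the event in the region defined by cut 4 alone
    // (second, duplicated check of condition 4)
  }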
Fig. 1

Schematic illustration of the definition of different regions in which a given event can be counted (or not), based on different combinations of selection cuts

One important drawback of this naive approach is the duplication of the check of the fourth condition. In the simple style of implementation above, this is unavoidable: condition 4 must be checked both inside and outside the scope of condition 3. With the two region-specific cuts considered here, there is only one such clumsy duplication present in the code. However, as the number of such cuts grows, the situation rapidly gets worse. For instance, growing the decision tree shown in Fig. 1 to include \(N\) region-specific cuts, combined in all possible ways to define \(2^N\) regions, would deepen the nesting of the above pseudo-code and lead to \(2^N -(N+1)\) unnecessary duplications of checks (already four duplicated checks for \(N=3\)). Moreover, each of these needs to be carefully implemented by the user in the correct scope, a task becoming less and less straightforward for large values of \(N\).

Ideally, the algorithm should be structured so that no check is unnecessarily duplicated; such an organization of the cuts is one of the new features of the latest version of SampleAnalyzer, the C++ core of the MadAnalysis 5 program. Both can be obtained from the MadAnalysis 5 website, https://launchpad.net/madanalysis5, and all the features described in this paper are available from version 1.1.10 of the code onwards. This document supersedes the previous version of the manual for the expert mode of the program [1].

The remainder of this paper is organized as follows. In Sect. 2, we recall the basic functionalities of MadAnalysis 5 for the implementation of physics analyses in the expert mode of the program, which has been extended according to the needs of our users, and we introduce the new features of the SampleAnalyzer kernel. Two concrete examples are then provided in Sect. 3: the first is the reimplementation of a CMS search for supersymmetric partners of the top quark in events with a single lepton and missing energy [9], and the second is the design of a monotop analysis in which the monotop system decays in the hadronic mode [10]. Our work is summarized in Sect. 4.

2 The expert mode of MadAnalysis  5

In this section we present the manner in which physics analyses are implemented in the expert mode of MadAnalysis  5 . We begin with the creation of an analysis template (Sect. 2.1), followed by the addition of several analyses to the same template (Sect. 2.2). We then discuss the methods and classes allowing effective implementation of an analysis (Sect. 2.3), its compilation and execution (Sect. 2.4) and finally the structure of the output files (Sect. 2.5).

2.1 Creation of an analysis template

In the expert mode of the program, the user is asked to write his/her analysis in C++, using all the classes and methods of the SampleAnalyzer library. To begin implementing a new analysis, the user is recommended to use the Python interpreter of MadAnalysis  5 to create a working directory. This is achieved by starting MadAnalysis  5 with the command

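(schematically; the -E flag selecting the expert mode is our assumption and should be checked against the version in use)

  ./bin/ma5 -E <mode>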

where the value of <mode> refers to an analysis of events generated at the parton level (-P), hadron level (-H) or reconstructed level (-R). It is then enough to follow the instructions displayed on the screen: the user is asked for the names of the working directory and of his/her analysis, which we denote by name in the rest of Sect. 2. The directory that has been created contains three subdirectories: the Input, Output and Build directories.

Use of the Input directory is optional. It has been included in the analysis template in order to have a unique structure for both the normal and expert modes of MadAnalysis  5 . In the normal mode, its purpose is to collect text files with the lists of paths to the event samples to analyze. The Output directory has been conceived to store the results of each execution of the analysis. The Build directory includes a series of analysis-independent files organized into several sub-directories, together with files to be modified by the user.

At the root of the Build directory, one finds a bash script together with its tcsh counterpart. These scripts appropriately set the environment variables necessary for the compilation and execution of an analysis within the MadAnalysis 5 framework. They are initiated by typing in a (bash or tcsh) shell the respective commands

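(assuming the standard script names shipped with the template)

  source setup.sh     # bash
  source setup.csh    # tcsh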

A Makefile is also available so that the standard commands

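(presumably the usual targets; the original listing is not reproduced)

  make
  make clean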

can be used to (re)compile the analysis (see Sect. 2.4). The final executable is obtained from two pieces: a library and the main program. The library originates from the merging of the SampleAnalyzer library with the analysis of the user, and is stored in the subdirectory Build/Lib. The main program is located in the Build/Main subdirectory and has a simple structure. It first initializes the analysis, then runs it over all events (possibly collected into several files) and finally produces the results in the previously mentioned Output directory.

The Build directory moreover contains the SampleAnalyzer subdirectory, which stores the source and header files associated with the analysis being implemented (Analyzer/name.cpp and Analyzer/name.h), together with a Python script, newAnalyzer.py, dedicated to the implementation of several analyses within a single working directory. The Analyzer subdirectory additionally includes a list of all analyses implemented in the current working directory (analysisList.h). More information about those files is provided in the next subsections.

2.2 Merging several analyses in a single working directory

In Sect. 2.1, we have explained how to create a working directory containing a single (empty) analysis that is called, in our example, name. The analysis itself is implemented by the user in a pair of files name.cpp and name.h, which should be consistently referred to in the file analysisList.h. In addition, the main program (the file Build/Main/main.cpp) takes care of initializing and executing the analysis. The structure of this analysis provides guidelines for the implementation of any other analysis—newname for the sake of the example—in the same working directory. This new analysis has to be written in the two files newname.cpp and newname.h (stored in the Build/SampleAnalyzer/Analyzer directory) and referred to in the analysisList.h file. The main program also needs to be modified in order to initialize and execute the new analysis, in addition to the first analysis (name).

All these tasks have been automated (with the exception of the implementation of the analysis itself) so that the user is only required to run the Python script newAnalyzer.py by typing in a shell the command

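(schematically)

  ./newAnalyzer.py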

from the Build/SampleAnalyzer directory.

2.3 Implementing an analysis in the MadAnalysis  5 framework

2.3.1 General features

As briefly sketched in the previous subsections, the implementation of a specific analysis within the MadAnalysis 5 framework consists of providing the C++ header and source files name.h and name.cpp.

The header file contains the declaration of a class dedicated to the analysis under consideration. This class is defined as a child class inheriting from the base class AnalysisBase, and includes, in addition to constructor and destructor methods, three functions to be implemented by the user (in the source file name.cpp) that define the analysis itself. The first of these, dubbed Initialize, is executed just once, prior to the reading of the user's set of events. In particular, it enables one both to declare selection regions and to associate them with a series of cuts and histograms. It returns a boolean quantity indicating whether the initialization procedure has been completed properly. If not, the execution of the main program is stopped. The second method, named Execute, is the core of the analysis and is applied to each simulated event provided by the user. Among other things, it takes care of the application of the selection cuts and the filling of the various histograms. This function returns a boolean quantity that can be used according to the needs of the user, although it is by default not employed. Finally, the last function, a function of void type called Finalize, is called once all events have been read and analyzed. Moreover, the user is allowed to define his/her own set of functions and variables according to his/her purposes.

The splitting of the analysis into regions, the application of the selection criteria, and the filling of histograms are all controlled through the automatically initialized object Manager()—a pointer to an instance of the class RegionSelectionManager. The member methods of this class are listed in Table 1 and will be detailed in the next subsections, in which we also provide guidelines for the implementation of the functions Initialize, Execute and Finalize in the C++ source file name.cpp.

Table 1 Methods of the RegionSelectionManager class

2.3.2 Initialization of an analysis

When the analysis is executed from a shell, the program first calls the Initialize method before starting to analyze one or several event samples.

Prior to the declaration of regions, histograms and cuts, we first encourage the user to add an electronic signature to the analysis being implemented and to have the program display it on the screen. Although this is neither mandatory nor standardized, it improves the traceability of a given analysis and provides information to the community about who has implemented the analysis and which reference works have been used. In particular, for analyses that are made public, we strongly recommend including at least the names and e-mail addresses of the authors, a succinct description of the analysis and the related experimental notes or publications. Taking the example of the CMS stop search in monoleptonic events [9] presented in Sect. 3, an electronic signature could be

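(an illustrative sketch in place of the original listing; the bracketed fields are placeholders to be filled by the analysis author)

  INFO << "  <><><><><><><><><><><><><><><><><><><><><><><>" << endmsg;
  INFO << "  <>  Analysis: CMS stop search, single lepton <>" << endmsg;
  INFO << "  <>            (see Ref. [9])                 <>" << endmsg;
  INFO << "  <>  Recast by: <names> (<e-mail addresses>)  <>" << endmsg;
  INFO << "  <>  DOI of this code: <DOI, if available>    <>" << endmsg;
  INFO << "  <>  Developed for: <reference to the         <>" << endmsg;
  INFO << "  <>    associated physics publication>        <>" << endmsg;
  INFO << "  <><><><><><><><><><><><><><><><><><><><><><><>" << endmsg;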

where the last three lines refer to the Digital Object Identifier [11] of the analysis code (if available) and the physics publication for which this analysis reimplementation has been developed. The sample of code above also introduces the INFO message service of the SampleAnalyzer framework, which is presented in Sect. 2.3.7.

As already mentioned, each analysis region must be properly declared within the Initialize function. This is achieved by making use of the AddRegionSelection method of the RegionSelectionManager class (see Table 1). This declaration requires provision of a name (as a string) which serves as a unique identifier for this region within both the code itself (to link the region to cuts and histograms) and the output files that will be generated by the program. For instance, the declaration of two regions, dedicated to the analysis of events with a missing transverse energy \(/\!\!\!{E}_T > 200\) GeV and 300 GeV could be implemented as

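(a sketch consistent with the AddRegionSelection method of Table 1)

  Manager()->AddRegionSelection("MET>200");
  Manager()->AddRegionSelection("MET>300");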

As shown in these lines of code, the declaration of the two regions is handled by the Manager() object, an instance of the RegionSelectionManager class that is automatically included with any given analysis. As a result, two new regions are created and the program internally assigns the intuitive identifiers "MET>200" and "MET>300" to the respective regions.

Once all regions have been declared, the user can continue with the declaration of cuts and histograms. As for regions, each declaration requires a name (a string) acting as an identifier in the code and in the output. Histogram declaration additionally requires the number of bins (an integer number) and the lower and upper bounds defining the range of the \(x\) axis (two floating-point numbers). Both histograms and cuts must also be associated with one or more regions. In the case of cuts, this reflects the conceptual structure of the analysis: each individual region is defined by its unique set of cuts. In the case of histograms, this enables one to establish the distribution of a particular observable after some region-specific cuts have been applied. The association of both types of objects with their regions follows a similar syntax, using an optional argument in their declaration. This argument is either a string or an array of strings, each being the name of one of the previously declared regions. If this argument is absent, the cut/histogram is automatically associated with all regions. This feature can be used, for example, for preselection cuts, i.e., requirements common to all regions.

As an illustrative example, the code

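(a sketch in place of the original listing, using the AddCut method of Table 1)

  // associated with all declared regions, the region argument being omitted
  Manager()->AddCut("1lepton");

  // explicitly associated with the two regions
  std::string SRs[] = {"MET>200", "MET>300"};
  Manager()->AddCut("MET>200 GeV", SRs);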

would create two preselection cuts, "1lepton" and "MET>200 GeV", and assign them to the two previously declared regions "MET>200" and "MET>300". Although both cuts are associated with both regions, for illustrative purposes we have shown two methods of doing this—using the syntax for automatically linking to all regions (here there are two) and explicitly stating both regions. As a second example, we consider the declaration of a histogram of 20 bins representing the transverse momentum distribution of the leading lepton, \(p_T(\ell _1)\), in the range \([50,500]\) GeV. In the case where the user chooses to associate it with the second region only, the line

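(a sketch in place of the original listing, using the AddHisto method of Table 1)

  Manager()->AddHisto("ptl1", 20, 50., 500., "MET>300");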

should be added to the analysis code.

Finally, the Initialize method can also be used for the initialization of one or several user-defined variables that have been previously declared in the header file name.h.

2.3.3 Using general information on Monte Carlo samples

Simulated events can be classified into two categories: Monte Carlo events, either at the parton or at the hadron level, and reconstructed events after object reconstruction. Contrary to reconstructed event samples, Monte Carlo samples in general contain global information on the generation process, such as the cross section, the nature of the parton density set that has been used, etc. In the MadAnalysis 5 framework, these pieces of information are collected in the form of instances of the SampleFormat class and can be retrieved by means of the methods provided in Table 2.

Table 2 Methods of the SampleFormat class

The function Execute takes, as its first argument, a SampleFormat object associated with the currently analyzed sample. In this way, if the sample is encoded in the Lhe [15, 16], StdHep [17] or HepMc [18] format, the user can access most of the available information passed by the event generator. In contrast, the other event formats supported by MadAnalysis 5, namely the Lhco [19] and (Root-based [20]) Delphes 3 [4] formats, do not include any information of this kind, so that the first argument of the Execute function is a null pointer. If the user needs such information, it has to be included by hand.

For instance, assuming that an event sample containing \(N = 10000\) events (\(N\) being stored as a double-precision number in the nev variable) is analyzed, the weight of each event could be calculated (and stored in the wgt variable for further use within the analysis) by means of the code sample

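(a sketch in place of the original listing; following Table 2, the xsection() method is assumed to return the total cross section in pb)

  // event weight normalizing the sample to the luminosity lumi (in pb^-1)
  double wgt = MySample.mc()->xsection() * lumi / nev;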

The MySample object is an instance of the SampleFormat class associated with the sample being analyzed, and we require the results to be normalized to 20 fb\(^{-1}\) of simulated collisions (the luminosity being stored in pb\(^{-1}\) in the lumi variable). For efficiency purposes, such a computation should be performed once and for all at the time of the initialization of the analysis, and not each time an event is analyzed. The variable wgt is then promoted to a member of the analysis class being implemented.

2.3.4 Internal data format for event handling

In the SampleAnalyzer framework, both Monte Carlo and reconstructed events are internally handled as instances of a class named EventFormat. At the time of execution of the analysis on a specific event, the Execute function receives such an EventFormat object as its second argument. The properties of this object reflect those of the current event and can be retrieved via the two methods

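(schematically)

  event.mc();
  event.rec();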

which return a pointer to an MCEventFormat object encompassing information at the Monte Carlo event level, and a pointer to a RecEventFormat object specific for managing information at the reconstructed event level, respectively.

Focusing first on Monte Carlo events, the properties of all initial-state, intermediate-state and final-state particles can be retrieved by means of the MCEventFormat class (see Table 3). Particles are encoded as instances of the MCParticleFormat class whose associated methods are shown in Table 4. Additionally, general event information, such as the values for the gauge couplings or the factorization scale used, is also available if properly stored in the event file. Finally, the MCEventFormat class also contains specific methods for the computation of four global event observables: the amount of (missing) transverse energy \(E_T\) (\(/\!\!\!{E}_T\)) and of (missing) transverse hadronic energy \(H_T\) (\(/\!\!\!{H}_T\)). These quantities are calculated from the transverse momentum of the final-state particles according to

$$\begin{aligned} E_T &= \sum_{\text{visible particles}} \big|\, \mathbf{p}_T\, \big|, \qquad H_T = \sum_{\text{hadronic particles}} \big|\, \mathbf{p}_T \,\big|, \\ /\!\!\!{E}_T &= \big|\, /\!\!\!{\mathbf{p}}_T \,\big| = \bigg| -\sum_{\text{visible particles}} \mathbf{p}_T \,\bigg|, \qquad /\!\!\!{H}_T = \big|\, /\!\!\!{\mathbf{H}}_T \,\big| = \bigg| -\sum_{\text{hadronic particles}} \mathbf{p}_T \,\bigg|, \end{aligned}$$
(1)

once the user has defined, in the initialization part of the analysis, which particles are invisible and which ones are hadronizing (by means of the configuration functions described in Sect. 2.3.8). However, the definitions of Eq. (1) may not be appropriate if the user wants to include only specific visible/hadronic particles in the sums. In this case, he/she should implement the desired definitions within the Execute function of the analysis. The entire set of properties that can be employed to analyze a Monte Carlo event is shown in Table 3.

Table 3 Methods of the MCEventFormat class
Table 4 Methods of the MCParticleFormat class

For example, the selection of all the final-state electrons and positrons that are present in an event and whose transverse momentum is larger than 50 GeV could be implemented as

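(a sketch in place of the original listing; the statuscode(), pdgid() and momentum() accessors are assumed to follow Table 4)

  std::vector<const MCParticleFormat*> electrons;
  for (unsigned int i = 0; i < event.mc()->particles().size(); i++)
  {
    const MCParticleFormat* prt = &event.mc()->particles()[i];
    if (prt->statuscode() != 1) continue;      // final-state particles only
    if (std::abs(prt->pdgid()) == 11)          // electrons and positrons
    {
      if (prt->momentum().Pt() > 50) electrons.push_back(prt);
    }
  }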

The first line of the code above indicates the declaration of a vector, dubbed electrons, of pointers to (constant) MCParticleFormat objects that contain the selected electrons. With the next block of C++ commands, we loop over all the event particles (the for loop) and store the current particle into a temporary variable prt. We then discard non-final-state particles, which have a status code different from one (the first if statement). Finally, we fill the electrons vector with all electrons and positrons (with a Particle Data Group code equal to \(\pm 11\), as shown in the second if statement) whose transverse momentum is greater than 50 GeV (the third if statement).

We next present the methods that have been designed for the analysis of reconstructed events and which are part of the RecEventFormat class. This class contains functions (see Table 5) allowing access to two series of containers, the first ones gathering final state objects of a given nature and the second ones collecting specific generator-level (or equivalently parton-level) objects. All these containers can be further employed within an analysis so that the properties of the different objects can be retrieved and subsequently used, e.g., for cuts and histograms. All the available methods associated with reconstructed objects have been collected in Table 6, while we recall that the MCParticleFormat class has been described in Table 4 (necessary for the handling of generator-level objects). In the case where some pieces of information (either specific properties of a given particle species or a given container itself) are absent from the event file, the related methods return null results.

Table 5 Methods of the RecEventFormat class
Table 6 Methods giving access to the properties of the reconstructed objects represented as instances of the RecLeptonFormat, RecJetFormat, RecPhotonFormat, RecTauFormat, RecTrackFormat and RecParticleFormat classes

Finally, as for the MCEventFormat class, specific functions (see Table 5) have been implemented to access the (missing) transverse energy and (missing) hadronic transverse energy of the event. While the value of the \(/\!\!\!{E}_T\) variable is taken from the event file and not calculated on the fly, the other variables are computed from the information on the reconstructed objects,

$$\begin{aligned} E_T &= \sum_{\text{jets, charged leptons, photons}} \big|\, \mathbf{p}_T\, \big|, \qquad H_T = \sum_{\text{jets}} \big|\, \mathbf{p}_T\, \big|, \\ /\!\!\!{H}_T &= \big|\, /\!\!\!{\mathbf{H}}_T \,\big| = \bigg| -\sum_{\text{jets}} \mathbf{p}_T\, \bigg|. \end{aligned}$$
(2)

As an example, we show how an isolation requirement on final-state muons can be implemented. To do this we define an isolation variable \(I_\mathrm{rel}\) as the amount of transverse energy, relative to the transverse momentum of the muon, present in a cone of radius \(R = 0.4\) centered on the muon. We constrain this quantity to satisfy \(I_\mathrm{rel} < 20\,\%\). A possible corresponding sample of C++ code is

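(a sketch in place of the original listing; the isolCones() container and its deltaR() and sumET() accessors are assumptions based on Table 6 and Ref. [21])

  std::vector<const RecLeptonFormat*> MyMuons;
  for (unsigned int i = 0; i < event.rec()->muons().size(); i++)
  {
    const RecLeptonFormat* Muon = &event.rec()->muons()[i];

    // calorimetric transverse energy in the cone of radius R = 0.4
    double ET_cone = 0.;
    for (unsigned int j = 0; j < Muon->isolCones().size(); j++)
      if (std::fabs(Muon->isolCones()[j].deltaR() - 0.4) < 1e-3)
        ET_cone = Muon->isolCones()[j].sumET();

    double Irel = ET_cone / Muon->momentum().Pt();
    if (Irel < 0.2) MyMuons.push_back(Muon);
  }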

With those lines of code, we start by declaring the MyMuons variable, a vector of pointers to RecLeptonFormat objects, that will refer to the reconstructed muons tagged as isolated. Then, we proceed with a for-loop dedicated to the computation of the \(I_\mathrm{rel}\) variable for each of the final state muons. In the case where \(I_\mathrm{rel}\) is smaller than 20 %, the muon is added to the MyMuons container. In more detail, this for-loop works as follows. The current muon is stored in a temporary variable called Muon. The calculation of \(I_\mathrm{rel}\) relies, first, on the amount of calorimetric energy in a cone of radius \(R=0.4\) centered on the muon and second, on the transverse momentum of the current muon. The first of these two quantities is evaluated via the isolCones() method of the RecLeptonFormat class (see Table 6) whereas the second one is derived from the muon four-momentum (obtained from the momentum() method of the RecLeptonFormat class). In the example above, we assume that information on muon isolation associated with several cone sizes is available, including the choice \(R=0.4\). The second for-loop that has been implemented selects the desired value of \(R\). The subsequent computation of the \(I_\mathrm{rel}\) quantity is immediate. We refer to Ref. [21] for more detailed examples on this topic, in cases where event simulation is based on a modified version of Delphes 3 properly handling such a structure for the isolation information.

2.3.5 Applying cuts and filling histograms

The cuts of the analysis, having been declared in the Initialize function (see Sect. 2.3.2), are applied in the Execute function by means of the RegionSelectionManager method ApplyCut (see Table 1). Its two arguments consist of a boolean quantity governing the cut condition (i.e., indicating whether the current event satisfies this cut) and a string which should be the name of one of the declared cuts.

This method starts by cycling through all regions associated with this cut. For each region, it checks whether the region is still surviving all cuts applied so far by evaluating an internal boolean variable. If a given region is found to be already failing one of the preceding cuts (indicated by the internal surviving variable having the value false), the ApplyCut method continues with the next region associated with the considered cut. On the other hand if the region is surviving, the cut-flow for this region is updated according to the cut condition (the boolean argument of the ApplyCut method) and the internal surviving variable will be kept as true or changed to false as appropriate. The aforementioned internal boolean variables indicating the survival of each region should all be initialized to true when starting to analyze a given event. This is achieved by adding, at the beginning of the Execute function,

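(schematically, with the InitializeForNewEvent method of Table 1)

  Manager()->InitializeForNewEvent(MyWeight);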

where MyWeight is a floating-point number representing the weight of the event. The weight is used when histograms are filled and cut-flow charts calculated, and can be modified within the analysis by making use of the SetCurrentEventWeight method of the RegionSelectionManager class.

The analysis manager also stores internally the total number of surviving regions, which is updated when a specific region fails a cut. This enables the ApplyCut method to determine and return, after cycling through the associated RegionSelection instances, a boolean quantity which is set to false in the case where not a single surviving region remains. The output of the ApplyCut method is thus equal to the truth value of the statement "there is at least one region in the analysis, not necessarily one of those associated with this specific cut, which is still passing all cuts so far". When it switches from true to false, the present event should no longer be analyzed, and one should move on to the next event. It is therefore recommended, for efficiency purposes, to always call the ApplyCut method in the following schematic manner,

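(a sketch in place of the original listing; condition denotes the boolean cut requirement)

  if (!Manager()->ApplyCut(condition, "cut name")) return true;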

with the return command terminating the analysis of the current event if all regions are failing the set of cuts applied so far.

Since cuts keep some events and reject others, the distribution of an observable is affected by the placement of its histogram-filling command within the full sequence of cuts. Moreover, since each region has its own unique set of cuts (by definition), the distribution of any observable is in general different for any two regions. It is nevertheless meaningful to consider a single histogram as associated with multiple regions, provided it is filled before any cuts are made that distinguish the regions. As an example, a typical format for processing an event would be a set of common preselection cuts, then the filling of various histograms (which are thus associated with all regions), then the application of the region-specific cuts (possibly followed by some further histogramming).

In MadAnalysis 5, this is handled by the histogram-filling method of the RegionSelectionManager class, FillHisto, which takes as arguments a string and a floating-point number. The string should be the name of one of the declared histograms, and the floating-point number represents the value of the histogrammed observable for the event under consideration. This method can be called as in

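(schematically)

  Manager()->FillHisto("ptl1", val);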

where "ptl1" is the name of the considered histogram (continuing with the example from Sect. 2.3.2) and val is the value of the observable of interest, namely the transverse momentum of the leading lepton in our case. The FillHisto method begins by verifying whether each of the regions associated with this histogram is surviving all cuts applied so far (via the internal surviving variable above-mentioned). In the case where all the associated regions are found surviving (failing) the cuts, the histogram is (not) filled. If a mixture of surviving and non-surviving regions is found, the program stops and displays an error message to the screen, as this situation implies that the histogram filling command has been called after at least one cut yields a distinction among the associated regions. This indicates an error in the design of the analysis.

2.3.6 Finalizing an analysis

Once all the events have been processed, the program calls the function Finalize. The user can make use of it for drawing histograms or deriving cut-flow charts as indicated in the manual of older versions of the program [1]; however, from the version of MadAnalysis 5 introduced in this paper onwards, the Finalize function no longer needs to be implemented. Output files written according to the Saf format (see Sect. 2.5) are automatically generated.

2.3.7 Message services

The C++ core of MadAnalysis 5 includes a class of functions dedicated to the display of text on the screen at the time of the execution of the analysis. Whereas only two distinct levels of message are accessible via the standard C++ streamers (std::cout and std::cerr for normal and error messages), the SampleAnalyzer library enables the user to print messages classified into four categories. In this way, information (the INFO function), warning (the WARNING function), error (the ERROR function) and debugging (the DEBUG function) messages can be displayed as in the following sample of code,

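(a sketch in place of the original listing; messages are terminated by the endmsg keyword)

  INFO    << "This is some information"    << endmsg;
  WARNING << "This is a warning"           << endmsg;
  ERROR   << "This is an error"            << endmsg;
  DEBUG   << "This is a debugging message" << endmsg;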

Additionally, warning and error messages provide information on the line number of the analysis code that is at the source of the message. The effect of a given message service can finally be modified by means of the methods presented in Table 7.

Table 7 Methods associated with a given message service. The available services are INFO, WARNING, ERROR and DEBUG

2.3.8 Physics services

The SampleAnalyzer core includes a series of built-in functions aiming to facilitate the writing of an analysis from the user viewpoint. More precisely, these functions are dedicated to particle identification and observable calculation, and have been grouped into several subcategories accessible through the C++ pointer PHYSICS. All the available methods are listed in Table 8, and we provide, in the rest of this section, a few more details together with some illustrative examples.

Table 8 Physics service methods

As mentioned in Sect. 2.3.4, MadAnalysis  5 can compute the (missing) transverse energy and (missing) hadronic transverse energy associated with a given Monte Carlo event. This calculation however relies on a correct identification of the invisible and hadronizing particles. This information must be provided by means of the mcConfig() category of physics services, as for instance, in

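(a sketch in place of the original listing)

  PHYSICS->mcConfig().AddInvisibleId(1000039);  // the gravitino yields missing energy
  PHYSICS->mcConfig().AddHadronicId(5);         // the bottom quark hadronizes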

These intuitive lines of code indicate to the program that the gravitino (whose Particle Data Group identifier is 1000039) yields missing energy and that the bottom quark (whose Particle Data Group identifier is 5) will eventually hadronize.

An important category of methods shipped with the physics services consists of functions dedicated to the identification of particles and to the probing of their nature (invisible, hadronizing, etc.). They are collected within the Id structure attached to the PHYSICS object. For instance (see Table 8 for the other methods),

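(schematically)

  PHYSICS->Id->IsInvisible(prt)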

allows one to test the (in)visible nature of the particle referred to by the pointer prt. Also, basic isolation tests on RecLeptonFormat objects can be performed when analyzing reconstructed events. Including in the analysis

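(schematically)

  PHYSICS->Id->IsIsolatedMuon(muon, event)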

yields a boolean value related to the (non-)isolated nature of the reconstructed lepton muon, event being here a RecEventFormat object. Two isolation algorithms can be employed. By default, the program verifies that no reconstructed jet lies in a cone of radius \(R=0.5\) centered on the lepton. The value of \(R\) can be modified via the recConfig() category of physics services,

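(a sketch in place of the original listing; the method name is an assumption on our part)

  PHYSICS->recConfig().UseDeltaRIsolation(dR);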

where dR is a floating-point variable with the chosen cone size. The user can instead require the program to tag leptons as isolated when both the sum of the transverse momenta of all tracks in a cone (of radius fixed at the level of the detector simulation) centered on the lepton is smaller than a specific threshold and when the amount of calorimetric energy in this cone, calculated relative to the sum of the transverse momenta of all tracks in the cone, is smaller than another threshold. This uses the information provided by the sumPT_isol() and ET_PT_isol() methods of the RecLeptonFormat class (see Table 6) and can be activated by implementing

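(a sketch in place of the original listing; the method name is an assumption on our part)

  PHYSICS->recConfig().UseSumPTIsolation(sumpt, et_pt);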

where sumpt and et_pt are the two mentioned thresholds. For more sophisticated isolation tests, such as those based on the information encompassed in IsolationConeType objects possibly provided for reconstructed jets, leptons and photons (see Sect. 2.3.4), it is left to the user to manually implement the corresponding routines in his/her analysis.

In addition to identification routines, physics services include built-in functions allowing one to compute global event observables, such as several transverse variables that are accessible through the Transverse structure attached to the PHYSICS object. More information on the usage of these methods is provided in Table 8.

2.3.9 Sorting particles and objects

In most analyses, particles of a given species are identified according to an ordering in their transverse momentum or energy. In contrast, the vectors of particles returned after the reading of an event are in general unordered and therefore need to be sorted. This can be achieved by means of sorting routines that can be called following the schematic form:

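(schematically, the sorting facilities being assumed accessible through a SORTER service pointer)

  SORTER->sort(parts, crit);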

In this line of code, parts is a vector of (Monte Carlo or reconstructed) objects and crit consists of the ordering criterion. The allowed choices for the latter are ETAordering (ordering in pseudorapidity), ETordering (ordering in transverse energy), Eordering (ordering in energy), Pordering (ordering in the norm of the three-momentum), PTordering (ordering in the transverse momentum), PXordering (ordering in the \(x\)-component of the three-momentum), PYordering (ordering in the \(y\)-component of the three-momentum) and PZordering (ordering in the \(z\)-component of the three-momentum). The objects are always sorted in terms of decreasing values of the considered observable.

2.4 Compiling and executing the analysis

In Sect. 2.1, we pointed out that the Build subdirectory of the analysis template contains a Makefile script ready to be used. In this way, the only task left to the user after having implemented his/her analysis is to launch this script in a shell, directly from the Build directory. This leads first to the creation of a library, stored in the Build/Lib subdirectory, which includes all the analyses implemented by the user as well as the set of classes and methods of the SampleAnalyzer kernel. Next, this library is linked to the main program and an executable named MadAnalysis5Job is generated (and stored in the Build directory).

The program can be run by issuing in a shell the command

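(schematically)

  ./MadAnalysis5Job <inputfile>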

where \(\mathtt{{<}}\) inputfile \(\mathtt{{>}}\) is a text file with a list of paths to all event files to analyze. All implemented analyses are sequentially executed and the results, generated according to the Saf format (see Sect. 2.5), are stored in the Output directory.

2.5 The structure of the output of an analysis

As indicated in the previous section, the program stores, after its execution, the results of the analysis or analyses implemented by the user in the Output subdirectory of the working directory. First, a subdirectory with the same name as the input file (<inputfile> in the schematic example of Sect. 2.4) is created. If a directory with this name already exists, the code uses it without deleting its content. It contains a Saf file (updated if already existing) with global information on the analyzed event samples, organized following an Xml-like syntax:

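(a sketch in place of the original listing; the exact field names and ordering may differ)

  <SampleGlobalInfo>
    # xsection   xsec error   nevents   sum of weights+   sum of weights-
      0.0        0.0          0         0.0               0.0
  </SampleGlobalInfo>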

where we have set the numerical values to zero for the sake of the illustration. In reality, these values are extracted from the event file that is read; they are kept equal to zero if unavailable. In addition, the format includes header and footer tags (SAFheader and SAFfooter), omitted here for brevity.

Secondly, a subdirectory specific to each of the executed analyses is created within the <inputfile> directory. The name of this subdirectory is the name of the associated analysis followed by an integer number chosen in such a way that the directory name is unique. This directory contains a Saf file with general information on the analysis (name.saf, name denoting a generic analysis name), a directory with histograms (Histograms) and a directory with cut-flow charts (Cutflows).

In addition to a header and a footer, the name.saf file, still encoded according to an Xml-like structure, contains a list with the names of the regions that have been declared in the analysis implementation. They are embedded in a RegionSelection Xml structure, as in

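(schematically)

  <RegionSelection>
    "MET>200"
    "MET>300"
  </RegionSelection>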

when taking the example of Sect. 2.3.2.

The Histograms subdirectory contains a unique Saf file with, again in addition to a possible header and footer, all the histograms implemented by the user. The single histogram declared in Sect. 2.3.2 would be encoded in the Saf format as in the following self-explanatory lines of code:

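(a sketch in place of the original listing; the exact layout of the Description, Statistics and Data blocks may differ)

  <Histo>
    <Description>
      "ptl1"
      # nbins   xmin    xmax
        20      50.0    500.0
      # associated region(s)
        "MET>300"
    </Description>
    <Statistics>
      0 0.0    # e.g. number of entries and sum of event weights
    </Statistics>
    <Data>
      0.0      # first bin
      ...
      0.0      # last bin
    </Data>
  </Histo>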

where the dots stand for the other bins that we have omitted for brevity. Again, for the sake of the example we have set all values to zero.

Finally, the Cutflows directory contains one Saf file for each of the declared regions, the filename being the name of the region followed by the saf extension. Each of these files contains the cut-flow chart associated with the considered region, encoded by means of two types of Xml tags. The first one is only used for the initial number of events (InitialCounter), whereas the second one is dedicated to each of the applied cuts. Taking the example of the first of the two regions declared in Sect. 2.3.2, the MET_gr_200.saf file (the > symbol in the region name having been replaced by _gr_) would read

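(a sketch in place of the original listing; the exact counter fields may differ, and the values are again set to zero for illustration)

  <InitialCounter>
    "Initial number of events"
    # nentries   sum of weights   sum of weights^2
      0          0.0              0.0
  </InitialCounter>
  <Counter>
    "1lepton"
    # nentries   sum of weights   sum of weights^2
      0          0.0              0.0
  </Counter>
  ...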

which is again self-explanatory.

3 Illustrative examples

In this section, we show two examples of analyses making use of the new features of MadAnalysis 5 introduced in the previous section. First, we focus in Sect. 3.1 on the reinterpretation of a CMS search for stops in 8 TeV events with a single lepton, jets and missing energy [9]. Second, we investigate in Sect. 3.2 the implementation of a recent phenomenological analysis dedicated to the study of monotop systems decaying in the hadronic mode [10].

3.1 Recasting a CMS search for supersymmetric partners of the top quark

We present an implementation of the CMS cut-based strategy for probing stops in the single lepton and missing energy channel, as presented in Ref. [9]. The analysis contains 16 overlapping signal regions that share a set of common preselection cuts and are distinguished by extra requirements. We refer to Refs. [9, 21, 26] for more details on the cuts, the analysis regions and their reimplementation, which strictly obeys the syntax introduced in Sect. 2. For simplicity, we consider below only one of the signal regions, defined by the following cuts.

  • We select events with a single tightly-isolated electron (muon) with a transverse momentum \(p_T^\ell >30\) GeV (25 GeV) and a pseudorapidity satisfying \(|\eta ^\ell |<1.4442\) (2.1). Isolation is enforced by constraining the sum of the transverse momenta of all tracks in a cone of radius \(R=0.3\) centered on the lepton to be smaller than min(5 GeV, \(0.15\, p_T^\ell \)).

  • Events featuring in addition a loosely-isolated lepton with a transverse momentum \(p_T^\ell > 5\) GeV are vetoed. Isolation is enforced by constraining the sum of the transverse momenta of all tracks in a cone of \(R=0.3\) centered on the lepton to be smaller than \(0.20 p_T^\ell \).

  • Events featuring an isolated track of transverse momentum \(p_T^{\mathrm{track}} > 10\) GeV and whose electric charge is opposite to the one of the primary lepton are vetoed. Isolation is enforced by constraining the sum of the transverse momenta of all tracks in a cone of radius \(R=0.3\) centered on the track to be smaller than \(0.10 \, p_T^\mathrm{track}\).

  • Events with reconstructed hadronic taus of transverse momentum greater than 20 GeV are vetoed.

  • Jets present in a cone of radius \(R=0.4\) centered on a lepton are discarded. Four central jets with a transverse momentum \(p_T^j > 30\) GeV and a pseudorapidity \(|\eta ^j| < 2.4\) are then required, with at least one of them being \(b\)-tagged.

  • The transverse mass \(M_T\) reconstructed from the lepton and the missing transverse momentum is constrained to be larger than 120 GeV.

  • The signal region considers events with at least 300 GeV of missing transverse energy, the preselection threshold common to all the analysis regions being 100 GeV. The missing transverse momentum is also required to be separated from the two hardest jets (\(\Delta \phi > 0.8\), \(\phi \) denoting the azimuthal angle with respect to the beam direction).

  • The final-state configuration is required to contain a hadronically decaying top quark (by means of a \(\chi ^2\)-fit based on the reconstructed objects).

  • The \(M_{T2}^W\) variable must be greater than 200 GeV.

For the sake of the example, we include below a snippet of code describing the implementation of the veto on the presence of isolated tracks. We start by creating a variable (Tracks) containing all the tracks whose transverse momentum is larger than 10 GeV and pseudorapidity satisfies \(|\eta ^\mathrm{track}|<2.1\):

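(a sketch in place of the original listing; the tracks() container is assumed to follow Table 5)

  std::vector<const RecTrackFormat*> Tracks;
  for (unsigned int i = 0; i < event.rec()->tracks().size(); i++)
  {
    const RecTrackFormat* track = &event.rec()->tracks()[i];
    if (track->momentum().Pt() > 10. &&
        std::fabs(track->momentum().Eta()) < 2.1) Tracks.push_back(track);
  }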

Next, we iterate over this container and update a boolean variable noIsolatedTrack (initialized to true) to false if an isolated track of opposite charge (compared to the primary lepton charge, stored in the LeptonCharge variable) is found:

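(a sketch in place of the original listing; the charge() accessor is an assumption based on Table 6)

  bool noIsolatedTrack = true;
  for (unsigned int i = 0; i < Tracks.size(); i++)
  {
    // scalar sum of the pT of all other tracks in a cone of radius R = 0.3
    double sumPT = 0.;
    for (unsigned int j = 0; j < event.rec()->tracks().size(); j++)
    {
      const RecTrackFormat* trk = &event.rec()->tracks()[j];
      if (trk == Tracks[i]) continue;
      if (trk->momentum().DeltaR(Tracks[i]->momentum()) < 0.3)
        sumPT += trk->momentum().Pt();
    }

    // isolated track with a charge opposite to that of the primary lepton
    if (sumPT < 0.10 * Tracks[i]->momentum().Pt() &&
        Tracks[i]->charge() * LeptonCharge < 0) noIsolatedTrack = false;
  }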

It is then sufficient to implement the verification of the cut condition as explained in Sect. 2.3.5,

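(schematically)

  if (!Manager()->ApplyCut(noIsolatedTrack, "VetoIsTr")) return true;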

where we assume that a cut "VetoIsTr" has been declared in the analysis initialization method.

The validation of our reimplementation relies on the comparison of the results obtained with MadAnalysis 5 with those of the CMS analysis note. We start from parton-level event samples describing the supersymmetric signal in the context of specific benchmark scenarios, which have been provided by the CMS collaboration. We then make use of the Pythia 6 program [28] for parton showering and hadronization, and of the previously mentioned modified version of the Delphes 3 package [4, 21] for the simulation of the detector response, based on the CMS detector description of Ref. [29]. Our final numbers of events are normalized to a signal cross section that includes the merging of the next-to-leading order result with resummed predictions at the next-to-leading logarithmic accuracy [30], \(\sigma = 0.0140\) pb, and to an integrated luminosity of 19.5 fb\(^{-1}\). More details on the validation procedure can be found in Refs. [21, 26].

As an example, Table 9 shows the cut-flow chart for a benchmark point denoted by \(\tilde{t} \rightarrow t \tilde{\chi }^0_1 \ (650/50)\) (using the naming scheme of the original CMS analysis). In this scenario, a pair of top squarks whose mass is equal to 650 GeV is produced, and each squark then decays with a 100 % branching ratio into a top quark and the lightest neutralino. The mass of the latter is fixed to 50 GeV. We present both the output of MadAnalysis  5 and the CMS expectation [31] and observe that an agreement at the level of 20 % has been obtained. This order of magnitude is not surprising as we are comparing a fast simulation made with Delphes to the full simulation of CMS.

Table 9 Cut-flow chart for the benchmark point \(\tilde{t} \rightarrow t \tilde{\chi }^0_1 \ (650/50)\) in the signal region investigated in Sect. 3.1. We present predictions after each of the (pre)selection cuts detailed in the text. The small statistical uncertainties are omitted for brevity. We present number of events for 19.5 fb\(^{-1}\) of simulated collisions, after normalizing the total production rate to 0.0140 pb

The CMS analysis of Ref. [9] contains, in addition to a cut-based analysis strategy, a second strategy relying on advanced multivariate techniques. One of the key variables of this analysis is a quantity denoted by \(H_T^\mathrm{ratio}\), defined as the fraction of the total scalar sum of the jet transverse energies (including only jets with transverse momentum larger than 30 GeV and pseudorapidity \(|\eta | < 2.4\)) that lies in the same hemisphere as the missing momentum. For illustrative purposes, we present below a way to fill a histogram representing this variable. The definition of \(H_T\) appropriate here being different from Eq. (2), we recalculate it and store the value in the variable HT. The amount of hadronic energy located in the same hemisphere as the missing momentum is stored in the variable HTsameHemisphere. Both calculations rely on the Jets container, a collection of relevant jets.

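(a sketch in place of the original listing; the MET() accessor is assumed to follow Table 5)

  double HT = 0., HTsameHemisphere = 0.;
  for (unsigned int i = 0; i < Jets.size(); i++)
  {
    double et = Jets[i]->momentum().Et();
    HT += et;

    // same hemisphere as the missing momentum: azimuthal distance below pi/2
    if (std::fabs(Jets[i]->momentum().DeltaPhi(
          event.rec()->MET().momentum())) < M_PI/2.) HTsameHemisphere += et;
  }
  double HTratio = (HT > 0.) ? HTsameHemisphere / HT : 0.;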

A histogram declared as "HTratio" at the level of the initialization of the analysis can then be filled using the syntax introduced in Sect. 2.3.5,

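(schematically)

  Manager()->FillHisto("HTratio", HTratio);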

The distribution that is extracted from the corresponding output Saf file is shown in Fig. 2, alongside the CMS result presented in Ref. [9]. The two are in good agreement.

Fig. 2

Distribution in the \(H_T^\mathrm{ratio}\) variable for the \(\tilde{t} \rightarrow t \tilde{\chi }^0_1 \, (650/50)\) scenario investigated in Sect. 3.1. Only the first four cuts of Table 9 have been applied. The solid black line has been obtained with MadAnalysis  5 , while the dashed blue line is the CMS result

3.2 Designing a phenomenological analysis for probing hadronic monotop states

The LHC sensitivity to the observation of a monotop state, a topology in which a single, hadronically decaying top quark is produced in association with missing energy, has recently been investigated by means of a phenomenological analysis relying on a cut-and-count technique [10]. It exploits the presence of three final-state jets (including a \(b\)-tagged jet) compatible with a top quark decay and lying in a hemisphere different from that of the (large) missing transverse momentum. In more detail, events are selected as follows.

  • Selected events are required to contain two or three light (non-\(b\)-tagged) jets (allowing for one from initial or final state radiation) with a transverse momentum greater than 30 GeV, as well as one \(b\)-tagged jet with a transverse momentum larger than 50 GeV. The pseudorapidity of each jet must satisfy \(|\eta ^j| < 2.5\) and the ratio of their hadronic to electromagnetic calorimetric energy must be above 30 %.

  • Events featuring isolated charged leptons with a transverse momentum \(p_T^\ell > 10\) GeV and a pseudorapidity \(|\eta ^\ell | < 2.5\) are vetoed. Lepton isolation is enforced by imposing that the sum of the transverse momenta of all tracks in a cone of \(R=0.4\) centered on the lepton is smaller than \(0.2 p_T^\ell \).

  • At least 250 GeV of missing transverse energy is required.

  • We select the pair of light jets whose invariant mass is the closest to the \(W\)-boson mass. This quantity is then constrained to lie in the \([50,105]\) GeV range.

  • The missing momentum is constrained to be well separated from the momentum of the reconstructed top quark (\(\Delta \phi \in [1,5]\), the azimuthal angular distance being normalized in the \([0,2\pi ]\) range).

  • The missing momentum is constrained to be well separated from the momentum of the hardest jet (\(\Delta \phi \in [0.5, 5.75]\), the azimuthal angular distance being normalized in the \([0,2\pi ]\) range).

  • The reconstructed top mass is constrained to lie in the \([140,195]\) GeV range.

As in Sect. 3.1, all these cuts can be easily implemented following the syntax of Sect. 2, so that we again restrict ourselves to the presentation of a few illustrative samples of code.

First, we show how to select the jet candidates relevant for the analysis and store them in a container named TheJets,

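(a sketch in place of the original listing; the HEoverEE() accessor is an assumption based on Table 6)

  std::vector<const RecJetFormat*> TheJets;
  for (unsigned int i = 0; i < event.rec()->jets().size(); i++)
  {
    const RecJetFormat* jet = &event.rec()->jets()[i];
    if (jet->momentum().Pt() > 30. &&
        std::fabs(jet->momentum().Eta()) < 2.5 &&
        jet->HEoverEE() > 0.3) TheJets.push_back(jet);
  }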

In those self-explanatory lines of code, we have not yet split the jets into \(b\)-tagged and non-\(b\)-tagged ones.

Second, we focus on the reconstruction of the \(W\)-boson and the top quark. Assuming that all selected light jets are stored in a vector named ljets and that the \(b\)-tagged jet is represented by the variable bjet, a possible implementation of a code deriving the four-momenta of the reconstructed \(W\)-boson and top quark would be

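(a sketch in place of the original listing; the value used for the W-boson mass is an input of our choice)

  // select the light-jet pair whose invariant mass is closest to mW
  double mW = 80.4;     // W-boson mass in GeV
  double best = 1e9;
  TLorentzVector w;
  for (unsigned int i = 0; i < ljets.size(); i++)
    for (unsigned int j = i+1; j < ljets.size(); j++)
    {
      TLorentzVector cand = ljets[i]->momentum() + ljets[j]->momentum();
      if (std::fabs(cand.M() - mW) < best)
      {
        best = std::fabs(cand.M() - mW);
        w = cand;
      }
    }

  // reconstructed top quark: W-boson candidate plus the b-tagged jet
  TLorentzVector top = w + bjet->momentum();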

In the lines above, the double for-loop derives the pair of light jets that form the system which is the most compatible with a \(W\)-boson. The four-momentum of the reconstructed \(W\)-boson is saved as an instance of the TLorentzVector class (named w) and the four-momentum of the reconstructed top quark is then derived by adding the four-momentum of the \(b\)-tagged jet (stored in the top variable). The reconstructed top mass could then be histogrammed via the standard command,

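(schematically)

  Manager()->FillHisto("Mtreco", top.M());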

where a histogram named "Mtreco" has been initialized appropriately. Moreover, the selection cut on this variable could be implemented via

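(schematically)

  if (!Manager()->ApplyCut(top.M() > 140. && top.M() < 195., "Mtop"))
    return true;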

assuming that the cut named "Mtop" has been correctly initialized.

We apply the above analysis in the context of the SII.v monotop scenario of Ref. [10]. In this setup, the monotop system arises from the flavor-changing interaction of an up quark with a novel invisible vector boson whose mass has been fixed to either 400 GeV or 600 GeV. Using the publicly available FeynRules [32, 33] monotop model [34], we generate a UFO library [35] that we link to the MadGraph 5 event generator [36], which is used to simulate parton-level events including the decay of the top quark. We then perform parton showering, hadronization and the simulation of the detector response as in Sect. 3.1. From our analysis implementation, we derive, in the context of the two considered new physics scenarios, the distribution in the reconstructed top mass \(M_{bjj}\) shown in Fig. 3 and the cut-flow charts of Table 10.

Fig. 3

Distribution in the reconstructed top mass \(M_{bjj}\) for the two new physics scenarios investigated in Sect. 3.2, SII.v-400 and SII.v-600 (the numbers indicating the choice for the invisible state mass in GeV). All selection cuts except the one on \(M_{bjj}\) have been applied. The curves are normalized to 20 fb\(^{-1}\) of simulated LHC collisions at a center-of-mass energy of 8 TeV

Table 10 Cut-flow charts for the two considered monotop scenarios SII.v-400 and SII.v-600 (the numbers indicating the choice for the invisible state mass in GeV). We present the predicted number of events after each of the cuts detailed in the text, for an integrated luminosity of 20 fb\(^{-1}\) of LHC collisions at a center-of-mass energy of 8 TeV

4 Conclusion

We have presented a major extension of the expert mode of the MadAnalysis  5 package. Both designing a prospective new physics analysis and recasting an experimental search featuring multiple signal regions can now be achieved in a user-friendly fashion that relies on a powerful handling of regions, histogramming and selection cuts. We have illustrated the strengths of our approach with two examples. First, we have reimplemented a CMS search for stops in events with a single lepton and missing energy. We have shown that predictions of MadAnalysis  5 agree with the CMS expectation at the level of 20 % for a specific benchmark scenario. Second, we have implemented a phenomenological study for estimating the LHC sensitivity to hadronically decaying monotop systems that has been employed in a recent publication.