1 Introduction

The observation of a lepton-flavour-violating (LFV) process involving charged leptons would be a major breakthrough in understanding the matter content of the universe and would support the hypothesis of leptogenesis [1]. In particular, LFV processes involving both a \(\tau \) lepton and a muon are seen as most promising for such an observation, given the current measurements of neutrino oscillations [2]. In the Standard Model (SM), such processes have a vanishingly small branching fraction, e.g. \(\mathrm{Br}(\tau \rightarrow 3\mu )<10^{-14}\) [3], while a number of models beyond the SM predict it to be of the order of \(10^{-10}\)–\(10^{-8}\) [4–6]. The current limits on branching fractions of neutrinoless \(\tau \) lepton decays are of the order of a few times \(10^{-8}\) [7–10], for Z boson LFV decays they are about \(10^{-5}\) [2, 11, 12], and for the LFV decay of a Higgs boson to a \(\tau \) lepton and a muon they are about 1 % [13, 14]. The main experimental obstacle to improving the sensitivity with \(\tau \) leptons is the small number of \(\tau \) leptons produced world-wide.

In this article, a search for neutrinoless \(\tau \) lepton decays to three muons is performed with 20.3 fb\(^{-1}\) of pp collision data collected with the ATLAS detector in 2012 at 8 TeV centre-of-mass energy. The search is focused on a particular source of \(\tau \) leptons, namely \(W\rightarrow \tau \nu \) decays with subsequent \(\tau \rightarrow 3\mu \) decay. In such events, \(\tau \) leptons are produced with a transverse momentum (\(p_\mathrm{T}\)) mostly in the range of \({\sim }25{-}50\) GeV. Due to the relativistic boost of the \(\tau \) lepton, the muons from the \(\tau \) LFV decay are produced in close geometrical proximity to each other but isolated from other energetic particles in the event. The tau-neutrino from the W boson decay appears as missing transverse momentum (\(E_\mathrm{T}^\mathrm{miss}\)) in the detector and together with the transverse momentum of the three muons (\(p_\mathrm{T}^{3\mu }\)) gives a transverse mass, \(m_\mathrm{T}=\sqrt{2p_\mathrm{T}^{3\mu }E_\mathrm{T}^\mathrm{miss} (1-\cos \Delta \phi )}\), compatible with the W boson decay, where \(\Delta \phi \) is the angle between the directions of the \(p_\mathrm{T}^{3\mu }\) and the \(E_\mathrm{T}^\mathrm{miss}\). The unique signature in the detector is three muons with invariant mass equal to the mass of the \(\tau \) lepton and with a significant missing transverse momentum that is on average back-to-back with the three muons in the transverse plane. Since no energetic jet is expected in the majority of W boson production events, very little hadronic activity is predicted beyond that from the soft underlying event or multiple simultaneous pp collisions (pile-up). A large fraction of such \(\tau \) leptons decay sufficiently far from the W production vertex to give a fully reconstructable additional vertex. This allows the selection of three muons originating from a vertex which is displaced from the primary interaction vertex.
The background events usually contain one or two muons originating from the decay of hadrons, including decays in flight, while the remaining tracks are hadrons mis-measured as muons, originating from e.g. a pile-up jet or a pion punching through the calorimeter. The dominant background is due to muons originating from decays of b- or c-hadrons (heavy flavour, HF). Although such decays are typically accompanied by jets of particles produced in the direction opposite to the HF jets, in a fraction of the events the associated jet is lost or mis-measured, mimicking the signal \(E_\mathrm{T}^\mathrm{miss}\). A small light-flavour multi-jet contribution is also present while the contribution from leptonic decays of vector bosons is negligible.

The analysis strategy is as follows. Events with three muons associated with a common vertex are selected. A loose event selection is applied to collect a high-quality sample of candidate events satisfying \(|m_{3\mu }-m_\tau |\lesssim 1\) GeV. The characteristics of the loose sample of events are then analysed with a boosted decision tree (BDT). The BDT input variables are chosen so that the BDT output and the three-muon mass are uncorrelated in the mass range used in the analysis. A tight selection, following an initial cut on the BDT output, is applied to separate the signal from the background. After the optimal cut on the BDT output is found, a search is performed for an excess of events at the \(\tau \) lepton mass above the expected background level.

The branching fraction is calculated as

\[\mathrm{Br}(\tau \rightarrow 3\mu ) = \frac{N_\mathrm{s}}{(\mathcal {A}_\mathrm{s}\times \epsilon _\mathrm{s})\, N_{W\rightarrow \tau \nu }}, \qquad (1)\]

where \(N_\mathrm{s}\) is the number of observed events above the expected background level in a narrow region around the \(\tau \) lepton mass, \(\mathcal {A}_\mathrm{s}\times \epsilon _\mathrm{s}\) is the detector acceptance times efficiency for the signal, and \(N_{W\rightarrow \tau \nu }\) is the number of \(\tau \) leptons produced via the \(W\rightarrow \tau \nu \) channel (additional contributions to the \(\tau \) lepton yield are estimated to be less than 3 %).

2 The ATLAS detector

The ATLAS experiment [15] at the LHC is a multi-purpose particle detector with a forward-backward symmetric cylindrical geometry and a near \(4\pi \) coverage in solid angle. It consists of an inner tracking detector surrounded by a thin superconducting solenoid providing a 2 T axial magnetic field, electromagnetic and hadronic calorimeters, and a muon spectrometer. The inner tracking detector (ID) covers the pseudorapidity range \(|\eta | < 2.5\). It consists of silicon pixel, silicon microstrip, and transition radiation tracking detectors. Lead/liquid-argon (LAr) sampling calorimeters provide electromagnetic (EM) energy measurements with high granularity. A hadronic (iron/scintillator-tile) calorimeter covers the central pseudorapidity range (\(|\eta | < 1.7\)). The endcap and forward regions are instrumented with LAr calorimeters for EM and hadronic energy measurements up to \(|\eta | = 4.9\).

The muon spectrometer (MS) comprises separate trigger and high-precision tracking chambers measuring the deflection of muons in a magnetic field generated by superconducting air-core toroids. The magnets’ bending power is in the range from 2.0 to 7.5 T m. The muon tracking chambers cover the region \(|\eta | < 2.7\) with three layers of monitored drift tubes, complemented by cathode-strip chambers in the forward region, where the background is highest. The muon trigger system covers the range \(|\eta | < 2.4\) with resistive-plate chambers in the barrel, and thin-gap chambers in the endcap regions.

A three-level trigger system is used to select events. The first-level trigger is implemented in hardware and uses a subset of the detector information to reduce the accepted rate to at most 75 kHz. This is followed by two software-based trigger levels that together reduce the accepted event rate to 400 Hz on average.

During the data-taking period, there were no dedicated triggers implemented for this analysis. A combination of seven muon triggers is used, where all triggers are constructed from at least two trigger objects. A detailed discussion of the trigger is given in Sect. 4.

3 Simulation and data samples

The results presented here are based on proton–proton collision data at a centre-of-mass energy of \(\sqrt{s} = 8\) TeV, collected by the ATLAS detector at the LHC during 2012. Data samples corresponding to an integrated luminosity of 20.3 fb\(^{-1}\) are used. Selected data events are required to have all relevant components of the ATLAS detector in good working condition.

The Monte Carlo (MC) simulated \(W\rightarrow \tau \nu \rightarrow (3\mu )\nu \) signal sample is produced by the Pythia8 [16] event generator (version 8.175) using the AU2 [17] set of tuned parameters and the MSTW2008LO parton distribution function (PDF) set [18]. This signal sample is modelled using \(W\rightarrow \tau \nu \) production where the \(\tau \) lepton is forced to decay isotropically into three muons as in previous searches for this mode [7–10]. The detector response is modelled using GEANT4 [19, 20]. The number of \(\tau \) leptons produced in the 2012 dataset via the \(W\rightarrow \tau \nu \) channel, appearing in Eq. (1), is estimated by scaling the ATLAS measurement of the \(W\rightarrow \ell \nu \) cross-section at \(\sqrt{s}=7\) TeV [21] to 8 TeV using the ratio of the 8 TeV to 7 TeV NNLO cross-section calculations (\(\sigma _\mathrm{theory}^\mathrm{8~TeV} = 12.18\pm 0.61\) nb and \(\sigma _\mathrm{theory}^\mathrm{7~TeV} = 10.46\pm 0.52\) nb) and multiplying by the 8 TeV integrated luminosity. The result is \(N_{W\rightarrow \tau \nu } =(2.41\pm 0.08) \times 10^8\), taking into account the uncertainty reported in Ref. [21] and the uncertainty in the 7 and 8 TeV luminosities. For the selection applied in the analysis, the contamination from other sources of \(\tau \) leptons, such as \(Z\rightarrow \tau \tau \) or HF processes, is less than 3 % and is therefore neglected. The background is estimated using data as discussed in Sect. 5.5.
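The normalisation described above is simple arithmetic and can be sketched as follows. The sketch uses only the theory cross-sections, the luminosity and the \(\tau \) lepton yield quoted in the text; it inverts the quoted yield to show the 7 TeV cross-section it implies, rather than assuming the measured value of Ref. [21]. The function names are illustrative, the form of Eq. (1) is assumed to be \(N_\mathrm{s}\) divided by the product of acceptance times efficiency and \(N_{W\rightarrow \tau \nu }\), and any acceptance-times-efficiency value passed in is a placeholder rather than a result of this analysis.

```python
# Theory cross-sections quoted in the text, in nb.
SIGMA_THEORY_8TEV_NB = 12.18
SIGMA_THEORY_7TEV_NB = 10.46
# 20.3 fb^-1 expressed in nb^-1 (1 fb^-1 = 1e6 nb^-1).
LUMI_8TEV_NB_INV = 20.3e6
# Number of tau leptons from W -> tau nu quoted in the text.
N_W_TAU_NU = 2.41e8


def energy_scale_factor():
    """Ratio of the 8 TeV to 7 TeV theory cross-sections."""
    return SIGMA_THEORY_8TEV_NB / SIGMA_THEORY_7TEV_NB


def n_w_tau_nu(sigma_measured_7tev_nb):
    """Scale a measured 7 TeV cross-section to 8 TeV and multiply by luminosity."""
    return sigma_measured_7tev_nb * energy_scale_factor() * LUMI_8TEV_NB_INV


def implied_sigma_7tev_nb(n_tau=N_W_TAU_NU):
    """Invert the scaling: 7 TeV cross-section implied by a given yield."""
    return n_tau / (LUMI_8TEV_NB_INV * energy_scale_factor())


def branching_fraction(n_signal, acc_times_eff, n_tau=N_W_TAU_NU):
    """Assumed form of Eq. (1): N_s / (A x eff * N_W->tau nu)."""
    if acc_times_eff <= 0:
        raise ValueError("acceptance times efficiency must be positive")
    return n_signal / (acc_times_eff * n_tau)
```

With these inputs the scale factor is about 1.164 and the implied 7 TeV cross-section is about 10.2 nb.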

4 Trigger and reconstruction

To maximise the signal acceptance times efficiency, events are required to pass at least one of seven triggers. These are six multi-muon triggers and one dimuon plus \(E_\mathrm{T}^\mathrm{miss}\) trigger. The software-based trigger thresholds used for the muons range from 4 to 18 GeV in transverse momentum while the \(E_\mathrm{T}^\mathrm{miss}\) threshold is 30 GeV. The trigger efficiency for simulated signal events within the muon-trigger acceptance (three generator-level muons with \(p_\mathrm{T} >2.5\) GeV and \(|\eta |<2.4\)) is \({\sim } 31~\%\) for the combination of all triggers used in the analysis. To evaluate the trigger performance in the region where the muons have a small angular separation, as is typical for the signal, a tag-and-probe study is performed using data events containing high-momentum \(J/\psi \rightarrow \mu \mu \) candidates. For this study, the data are collected using a single-muon baseline trigger with a \(p_\mathrm{T}\) threshold of 18 GeV. Single-muon efficiencies are measured separately for the different thresholds which define the six multi-muon triggers. Each multi-muon trigger efficiency is calculated as the product of the single-muon efficiencies. Correction factors are applied to account for the limited performance of the trigger system in identifying a pair of muons as two muon-trigger objects. At small angular separations (\(\Delta R \lesssim 0.2\)), where most of the signal is expected and where these limitations are most pronounced, these corrections must be taken into account. These factors are measured from the efficiency to identify two independent muon-trigger objects for different \(\Delta R\) values between the tag- and the probe-muon. The total efficiency of the seven triggers is calculated considering correlations between any of the triggers. The trigger efficiency, measured from the data, is compared to the one measured in simulated \(J/\psi \) events for the seven different triggers separately and jointly. 
Agreement between data and MC simulation was found to be within 11 % for all relevant values of \(\Delta R\) and \(p_\mathrm{T}\), where the largest difference comes from events where the \(\Delta R\) separation is smallest. The systematic uncertainty on the signal acceptance times efficiency due to the trigger is therefore taken to be 11 %.
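The factorised trigger efficiency described above can be illustrated with a short sketch: a multi-muon efficiency is taken as the product of single-muon efficiencies, multiplied by a correction factor that depends on the \(\Delta R\) between a pair of muons. The function names, the binning and all numerical values used with them are hypothetical. The combination of the seven triggers, which requires their correlations, is not attempted here.

```python
def pair_correction(delta_r, bins):
    """Look up a correction factor for a muon pair.

    `bins` is a list of (upper Delta R edge, factor) tuples sorted by edge.
    Above the last edge no correction is applied (factor 1.0).
    """
    for upper_edge, factor in bins:
        if delta_r < upper_edge:
            return factor
    return 1.0


def multi_muon_trigger_efficiency(single_muon_effs, pair_corrections=()):
    """Product of single-muon efficiencies times pair correction factors."""
    efficiency = 1.0
    for eff in single_muon_effs:
        if not 0.0 <= eff <= 1.0:
            raise ValueError("single-muon efficiency must lie in [0, 1]")
        efficiency *= eff
    for correction in pair_corrections:
        efficiency *= correction
    return efficiency
```

For example, with placeholder single-muon efficiencies of 0.8 and 0.7 and a placeholder correction of 0.5 for a close-by pair, the dimuon trigger efficiency is 0.28.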

The approach for measuring the muon reconstruction efficiency is similar to that used to measure the trigger efficiency. While the trigger efficiency is measured with respect to muon reconstruction as the baseline, the reconstruction efficiency is measured with respect to ID tracking, which in turn is close to \(100~\%\) efficient [22]. Small deviations from the assumed value for ID tracking efficiency have a negligible impact on this measurement. The tag-and-probe procedure is performed using muons as tags and ID tracks as probes. The baseline sample for the reconstruction efficiency measurement includes a large number of non-muon tracks, which must be subtracted. This is done in bins of probe-track \(p_\mathrm{T}\) (\(p_\mathrm{T}^\mathrm{trk}\)) and bins of the angular separation between the tag-muon and the probe-track, \(\Delta R_{\mu + \mathrm{trk}}\). To describe the \(J/\psi \) peak and the background, a small range in tag-muon plus probe-track invariant mass, \(m_{\mu +\mathrm{trk}}\in [2600,3500]\) MeV, is fit to a double Gaussian function plus an exponential function and a second-order polynomial. In each \(p_\mathrm{T}^\mathrm{trk}\) or \(\Delta R_{\mu + \mathrm{trk}}\) bin, the ratio of the \(J/\psi \) peak component integral to the full shape (\(J/\psi \) plus background) integral is used as a weight to correct the \(p_\mathrm{T}^\mathrm{trk}\) or \(\Delta R_{\mu + \mathrm{trk}}\) shape itself. This is done separately for the probe-track distributions (denominators) and the muon-matched probe-track distributions (numerators). The ratio of the above two weighted distributions is defined as the reconstruction efficiency per \(p_\mathrm{T}^\mathrm{trk}\) or \(\Delta R_{\mu + \mathrm{trk}}\) bin. The efficiency measured with this approach in data is compared with the one from simulation and the difference at small \(\Delta R_{\mu + \mathrm{trk}}\) results in an uncertainty of 13.1 % per event.
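As a minimal sketch of the background-subtracted efficiency in a single bin, assume the fit has already been performed and its \(J/\psi \) peak and background integrals are available. The fit itself is not reproduced, and the function and argument names are illustrative.

```python
def signal_fraction(peak_integral, background_integral):
    """Fraction of the fitted shape attributed to the J/psi peak."""
    total = peak_integral + background_integral
    if total <= 0:
        raise ValueError("total fitted integral must be positive")
    return peak_integral / total


def reconstruction_efficiency(n_matched, f_matched, n_probes, f_probes):
    """Efficiency in one bin of probe-track pT or Delta R.

    n_probes / n_matched : probe tracks / muon-matched probe tracks in the bin
    f_probes / f_matched : J/psi signal fractions used as weights
    """
    denominator = n_probes * f_probes
    if denominator <= 0:
        raise ValueError("weighted denominator must be positive")
    return (n_matched * f_matched) / denominator
```

Weighting numerator and denominator separately matters because the muon-matched sample is usually much purer than the full probe-track sample.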

5 Analysis procedure

The analysis procedure is divided into four steps. First, events containing three high-quality muon objects with a combined invariant mass of less than 2.5 GeV are selected. These muons are required to originate from a common vertex. Second, a loose selection is applied to this sample to obtain a background sample that can be used to train the BDT, which is constructed using the TMVA toolkit [23]. The loose selection cuts (using a number of vertex quantities as well as kinematic quantities) are chosen to obtain a large background sample for training, while rejecting background that is kinematically inconsistent with the signal. Before training the BDT, the data events are divided into three regions based on the three-muon mass value. These are the blinded region (which includes the signal region), a sideband region and a BDT training region as defined in Table 1. Third, a tight selection (tightening the loose selection with a few additional cuts) is applied while simultaneously placing an initial loose cut on the BDT score, denoted by \(x{>}x_0\). The \(x{>}x_0\) cut removes background-like events having a very low BDT score, while the tight selection further reduces the background in the blinded and sideband regions. Fourth, the background rejection as a function of the BDT cut is studied using data events in the sideband region passing the tight \({+}x{>}x_0\) selection. This allows the final cut on the BDT score, denoted by \(x{>}x_1\), to be optimised. The statistical analysis is performed for the tight \({+}x{>}x_1\) selection.

Table 1 The different three-muon mass ranges used in the analysis

The signal region (SR) is defined as an interval around the \(\tau \) lepton mass with a half-width corresponding to twice the resolution of the three-muon mass, \(\sigma _s = 32\) MeV, as obtained from the signal MC sample. The analysis was blinded in a slightly wider region to allow variation of the signal region definition. The signal MC sample is divided into two independent samples. One signal sample is used for the BDT training while the second signal sample is used for estimating the signal acceptance times efficiency. The background in the signal region is estimated from a fit to the three-muon mass distribution in the sidebands (SB) using the tight \({+}x{>}x_0\) selection. This estimate is then scaled down to the final BDT score cut, \(x_1\), using a fit to the BDT shape as explained below.
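The signal-region definition can be written out explicitly. The sketch below assumes a \(\tau \) lepton mass of 1776.86 MeV (an approximate world-average value, not quoted in the text) and treats the interval edges as inclusive, which is also an assumption.

```python
M_TAU_MEV = 1776.86   # assumed tau lepton mass in MeV, not taken from the text
SIGMA_S_MEV = 32.0    # three-muon mass resolution quoted in the text


def signal_region(m_tau=M_TAU_MEV, sigma=SIGMA_S_MEV, n_sigma=2.0):
    """Return (low, high) edges of the interval m_tau +/- n_sigma * sigma."""
    half_width = n_sigma * sigma
    return (m_tau - half_width, m_tau + half_width)


def in_signal_region(m_3mu_mev):
    """True if a three-muon mass (MeV) lies inside the signal region."""
    low, high = signal_region()
    return low <= m_3mu_mev <= high
```

Under these assumptions the signal region spans roughly 1713 to 1841 MeV.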

5.1 Object selection

Muons are selected to have a transverse momentum greater than 2.5 GeV and are required to pass stringent requirements on the track quality and the associated hits in both the ID and the MS. Only combined ID+MS measurements of track parameters are used. Several matching criteria [22] are imposed to reject non-muon tracks (e.g. tracks from hadron decays in flight). The performance of muon identification is validated in two dedicated dimuon control regions. One region is populated with muons from \(J/\psi \rightarrow \mu \mu \) decay (in \(2850 < m_{2\mu } < 3350\) MeV), while the second region has an enhanced fraction of non-muon tracks (in events with \(m_{2\mu } < 750\) MeV).

Events with at least three selected muons are considered. All possible three-muon combinations are used as inputs to a vertex fit. The primary vertex (PV) is also refitted after removing the three tracks. Due to the \(\tau \) lepton lifetime, the three-muon vertex is often separated from the PV. The characteristics of the separation between the three-muon vertex and the PV are therefore used to distinguish signal from background. In particular, the two projections of the three-muon vertex displacement with respect to the PV in the transverse plane are used: \(L_{xy} {=}L_\mathrm{T}\cos \theta _{xy}\) and \(a^{0}_{xy} {=}L_\mathrm{T}\sin \theta _{xy}\), where \(L_\mathrm{T}\) is the transverse component of the vector connecting the PV and the three-muon vertex and \(\cos \theta _{xy}{=}\frac{\vec {L}_\mathrm{T}\cdot \vec {p}_\mathrm{T}^{3\mu }}{L_\mathrm{T} p_\mathrm{T}^{3\mu }}\). The three-muon vertex fit probability, \(p\)-value, is also used (as calculated from the vertex fit \(\chi ^2\) and degrees of freedom). After fitting all possible vertices, exactly one three-muon candidate is allowed per event, satisfying \(m_{3\mu } <2500\) MeV and \(|Q_{3\mu } |=1\), where \(Q_{3\mu }\) is the sum of the charges of the three-muon tracks.
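The two displacement projections follow directly from the definitions above. In this sketch the vertex positions and the three-muon transverse momentum are passed as two-component tuples; the function name is illustrative, and \(\sin \theta _{xy}\) is taken as non-negative (with \(\theta _{xy}\) between 0 and \(\pi \)), which is an assumption since the text does not state a sign convention.

```python
import math


def vertex_displacement_projections(pv_xy, vertex_xy, pt_xy):
    """Return (L_xy, a0_xy) for a three-muon vertex.

    pv_xy, vertex_xy : (x, y) of the primary and three-muon vertices
    pt_xy            : (px, py) of the three-muon system
    """
    lx = vertex_xy[0] - pv_xy[0]
    ly = vertex_xy[1] - pv_xy[1]
    l_t = math.hypot(lx, ly)
    p_t = math.hypot(pt_xy[0], pt_xy[1])
    if l_t == 0.0 or p_t == 0.0:
        raise ValueError("displacement and transverse momentum must be non-zero")
    cos_theta = (lx * pt_xy[0] + ly * pt_xy[1]) / (l_t * p_t)
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding
    sin_theta = math.sqrt(1.0 - cos_theta ** 2)  # assumed non-negative
    return l_t * cos_theta, l_t * sin_theta
```

A negative \(L_{xy}\) corresponds to a vertex displaced opposite to the three-muon momentum direction.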

Jets are used to separate the signal from the multi-jet backgrounds (predominantly HF), where more hadronic activity is expected. The jets are reconstructed from topological clusters formed in the calorimeter using the anti-\(k_t\) algorithm [24] with a radius parameter \(R=0.4\). The jets are calibrated to the hadronic energy scale using energy- and \(\eta \)-dependent correction factors derived from simulation and with residual corrections from in situ measurements. A detailed description of the jet energy scale measurement and its systematic uncertainties can be found in Ref. [25]. Jets found within a cone of \(\Delta R = 0.2\) around a selected three-muon candidate are removed. Jets are required to have \(p_\mathrm{T} >30\) GeV and \(|\eta |<2.8\); only the leading jet satisfying these criteria is considered. There is no veto of events with more than one jet satisfying these criteria. The leading jet and the three-muon momenta are summed vectorially to define \(\vec {\Sigma } = \vec {p}_\mathrm{jet} + \vec {p}_{3\mu }\) with \(\Sigma _\mathrm{T}\) being the magnitude of its transverse component. For events where there are no jets satisfying these criteria (the majority of events for the signal), \(\vec {\Sigma }\) is simply \(\vec {p}_{3\mu }\).

The \(E_\mathrm{T}^\mathrm{miss}\) is calculated as the negative vector sum of the transverse momenta of all high-\(p_\mathrm{T}\) objects reconstructed in the event, as well as a term for other activity in the calorimeter [26]. Clusters associated with electrons, hadronic \(\tau \) lepton decays and jets are calibrated separately, with other clusters calibrated at the EM energy scale. This \(E_\mathrm{T}^\mathrm{miss}\) is denoted hereafter by \(E_\mathrm{T,cal}^\mathrm{miss} \). In addition, a track-based missing transverse momentum (\(E_\mathrm{T,trk}^\mathrm{miss} \)) is calculated as the negative vector sum of the transverse momenta of tracks with \(|\eta |<2.5\), \(p_\mathrm{T} >500\) MeV and associated with the primary vertex. Both the calorimeter-based and track-based measurements of the \(E_\mathrm{T}^\mathrm{miss}\) are used.

Several kinematic variables are defined from the reconstructed objects listed above. Two transverse masses are defined using the three-muon transverse momentum (\(p_\mathrm{T}^{3\mu }\)) as \(m_\mathrm{T} = \sqrt{2p_\mathrm{T}^{3\mu } E_\mathrm{T}^\mathrm{miss} (1-\cos \Delta \phi _{3\mu })}\), where \(\Delta \phi _{3\mu }\) is the angle between the \(E_\mathrm{T}^\mathrm{miss}\) and \(p_\mathrm{T}^{3\mu }\) directions in the transverse plane. In these definitions, \(E_\mathrm{T}^\mathrm{miss}\) can be either \(E_\mathrm{T,cal}^\mathrm{miss} \) or \(E_\mathrm{T,trk}^\mathrm{miss} \) to obtain \(m_\mathrm{T}^\mathrm{cal}\) or \(m_\mathrm{T}^\mathrm{trk}\) respectively. The \(\Delta \phi _{3\mu }\) terms are \(\Delta \phi _{3\mu }^\mathrm{cal}\) or \(\Delta \phi _{3\mu }^\mathrm{trk}\) respectively. Similarly, the \(\Delta \phi _{\Sigma _\mathrm{T}}\) variable is the angle between the \(E_\mathrm{T}^\mathrm{miss}\) and \(\Sigma _\mathrm{T}\) directions in the transverse plane. This adds two additional angles, \(\Delta \phi _{\Sigma _\mathrm{T}}^\mathrm{cal}\) and \(\Delta \phi _{\Sigma _\mathrm{T}}^\mathrm{trk}\) for \(E_\mathrm{T,cal}^\mathrm{miss} \) and \(E_\mathrm{T,trk}^\mathrm{miss} \) respectively. These \(\Delta \phi _{\Sigma _\mathrm{T}}\) variables provide good separation when a hard jet is found and thus \(\Sigma _\mathrm{T}\) deviates from \(p_\mathrm{T}^{3\mu }\) in magnitude and direction.
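The kinematic variables defined above can be computed as in the following sketch. The same functions apply to the calorimeter-based and track-based \(E_\mathrm{T}^\mathrm{miss}\); the function names and the tuple representation of momenta are illustrative choices.

```python
import math


def delta_phi(phi1, phi2):
    """Opening angle between two azimuthal directions, in [0, pi]."""
    diff = (phi1 - phi2) % (2.0 * math.pi)
    return diff if diff <= math.pi else 2.0 * math.pi - diff


def transverse_mass(pt_3mu, met, dphi):
    """m_T = sqrt(2 * pT * ETmiss * (1 - cos(dphi)))."""
    return math.sqrt(2.0 * pt_3mu * met * (1.0 - math.cos(dphi)))


def sigma_t(p_3mu, p_leading_jet=None):
    """Transverse magnitude of the three-muon plus leading-jet momentum sum.

    Momenta are (px, py, pz) tuples. Without a selected jet the sum is
    just the three-muon momentum.
    """
    sx, sy = p_3mu[0], p_3mu[1]
    if p_leading_jet is not None:
        sx += p_leading_jet[0]
        sy += p_leading_jet[1]
    return math.hypot(sx, sy)
```

For a back-to-back configuration with \(p_\mathrm{T}^{3\mu } = E_\mathrm{T}^\mathrm{miss} = 40\) GeV, the transverse mass is 80 GeV, close to the W boson mass.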

5.2 Loose event selection

After the three-muon candidates are formed from the selected muons, a loose event selection is performed, maintaining a signal efficiency of about 80 % while rejecting about 95 % of the background. This loose selection includes cuts on the displacement of the vertex from the PV, requirements on the three-muon kinematics and on the presence of other tracks (track isolation), and requirements on quantities involving \(E_\mathrm{T,cal}^\mathrm{miss} \) and \(E_\mathrm{T,trk}^\mathrm{miss} \). The loose selection comprises the following requirements:

  • The \(L_{xy}\) significance, \(S(L_{xy}) {=}L_{xy}/\sigma _{L_{xy}} \), must satisfy \(-10{<}S(L_{xy}) {<}50\), where \(\sigma _{L_{xy}}\) is the uncertainty in the \(L_{xy}\).

  • The \(a^{0}_{xy}\) significance, \(S(a^{0}_{xy}) {=}a^{0}_{xy}/\sigma _{a^{0}_{xy}} \), must satisfy \(S(a^{0}_{xy}) {<}25\), where \(\sigma _{a^{0}_{xy}}\) is the uncertainty in \(a^{0}_{xy}\).

  • The three-muon track-fit probability product, \(\mathcal {P}_\mathrm{trks} = p_1\times p_2\times p_3\) (where \(p_i\) is the track fit \(p\)-value of track i), must satisfy \(\mathcal {P}_\mathrm{trks} >10^{-9}\).

  • The three-muon transverse momentum must satisfy \(p_\mathrm{T}^{3\mu } >10\) GeV.

  • The missing transverse energies, \(E_\mathrm{T,cal}^\mathrm{miss} \) and \(E_\mathrm{T,trk}^\mathrm{miss} \), must both satisfy \(10<E_\mathrm{T}^\mathrm{miss} < 250\) GeV.

  • The transverse masses, \(m_\mathrm{T}^\mathrm{cal}\) and \(m_\mathrm{T}^\mathrm{trk}\), must both satisfy \(m_\mathrm{T}>20\) GeV.

  • The three-muon track isolation is obtained from the sum of the \(p_\mathrm{T}\) of all tracks with \(p_\mathrm{T}^\mathrm{trk}>500\) MeV in a cone of \(\Delta R^{3\mu }_\mathrm{max} +0.20\) (and \(\Delta R^{3\mu }_\mathrm{max} +0.30\)) around the three-muon momentum while excluding its constituent tracks; it must satisfy \(\Sigma p_\mathrm{T}^\mathrm{trk}(\Delta R^{3\mu }_\mathrm{max} +0.20)/p_\mathrm{T}^{3\mu } <0.3\) (and \(\Sigma p_\mathrm{T}^\mathrm{trk}(\Delta R^{3\mu }_\mathrm{max} +0.30)/p_\mathrm{T}^{3\mu } <1\)). The largest separation, \(\Delta R^{3\mu }_\mathrm{max} \), between any pair of the three-muon tracks is on average 0.07 for the signal.

The loose cuts on the significances, \(S(L_{xy})\) and \(S(a^{0}_{xy})\), are applied to allow the three-muon vertex to be separated from the PV, while still being compatible with the \(\tau \) lepton lifetime. The requirement on \(\mathcal {P}_\mathrm{trks}\) imposes a goodness-of-fit criterion on the three-muon candidate. This value is based on an examination of signal-like events found in the sideband region in the data. As this is not the only quality requirement imposed on the individual muon objects, it is kept loose in this part of the selection. The efficiency for this cut to select signal events is \({\sim } 98~\%\), while it rejects \({\sim } 13~\%\) of the background events. The kinematic and the isolation variables are very effective in separating the W boson properties of the signal from the HF and the light-flavour multi-jet background, which tend to be non-isolated and with low values of \(p_\mathrm{T}\), \(E_\mathrm{T}^\mathrm{miss}\) and \(m_\mathrm{T}\). The associated cuts remain very loose in this part of the selection to ensure that the sample sizes are large enough for the BDT training.
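The loose selection listed above amounts to a set of independent threshold tests, sketched here. The container class and its field names are hypothetical; the thresholds are those given in the list, with momenta, energies and masses in GeV.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ThreeMuonCandidate:
    """Hypothetical container for the loose-selection inputs (GeV units)."""
    s_lxy: float    # L_xy significance
    s_a0xy: float   # a0_xy significance
    p_trks: float   # product of the three track-fit p-values
    pt_3mu: float   # three-muon transverse momentum
    met_cal: float  # calorimeter-based missing transverse momentum
    met_trk: float  # track-based missing transverse momentum
    mt_cal: float   # calorimeter-based transverse mass
    mt_trk: float   # track-based transverse mass
    iso20: float    # sum pT(tracks) / pT(3mu) in cone dR_max + 0.20
    iso30: float    # sum pT(tracks) / pT(3mu) in cone dR_max + 0.30


def passes_loose_selection(c):
    """Apply the loose requirements listed in the text."""
    return (
        -10.0 < c.s_lxy < 50.0
        and c.s_a0xy < 25.0
        and c.p_trks > 1e-9
        and c.pt_3mu > 10.0
        and 10.0 < c.met_cal < 250.0
        and 10.0 < c.met_trk < 250.0
        and c.mt_cal > 20.0
        and c.mt_trk > 20.0
        and c.iso20 < 0.3
        and c.iso30 < 1.0
    )
```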

5.3 Multivariate analysis

The events passing the loose selection described above are used as input to the BDT training. There are 6649 events passing the loose selection in the signal MC sample (out of \(10^5\)), where 6000 of these events are used for the BDT training and the rest are used for testing the BDT output. Similarly, the number of data events passing the loose selection in the training region is 4672, where 4000 of these events are used for the BDT training. The BDT input variables include kinematic distributions, track and vertex quality discriminants, vertex geometry parameters, and isolation. The following variables (sorted by their importance ranking) are used as inputs to the BDT:

  1. The calorimeter-based transverse mass, \(m_\mathrm{T}^\mathrm{cal}\).

  2. The track-based missing transverse momentum, \(E_\mathrm{T,trk}^\mathrm{miss} \).

  3. The isolation variable, \(\Sigma p_\mathrm{T}^\mathrm{trk}(\Delta R^{3\mu }_\mathrm{max} +0.20)/p_\mathrm{T}^{3\mu } \).

  4. The transverse component of the vector sum of the three-muon and leading jet momenta, \(\Sigma _\mathrm{T}\).

  5. The track-based transverse mass, \(m_\mathrm{T}^\mathrm{trk}\).

  6. The difference between the \(E_\mathrm{T,cal}^\mathrm{miss} \) and \(E_\mathrm{T,trk}^\mathrm{miss} \) directions, \(\Delta \phi _\mathrm{trk}^\mathrm{cal}\).

  7. The calorimeter-based missing transverse momentum, \(E_\mathrm{T,cal}^\mathrm{miss}\).

  8. The track-based missing transverse momentum balance \(p_\mathrm{T}^{3\mu }/E_\mathrm{T,trk}^\mathrm{miss}-1\).

  9. The difference between the three-muon and \(E_\mathrm{T,cal}^\mathrm{miss}\) directions, \(\Delta \phi _{3\mu }^\mathrm{cal}\).

  10. Three-muon vertex fit probability, \(p\)-value.

  11. The three-muon vertex fit \(a^{0}_{xy}\) significance, \(S(a^{0}_{xy})\).

  12. The track fit probability product, \(\mathcal {P}_\mathrm{trks}\).

  13. The three-muon transverse momentum, \(p_\mathrm{T}^{3\mu }\).

  14. The number of tracks associated with the PV (after refitting the PV while excluding the three-muon tracks), \(N_\mathrm{trk}^\mathrm{PV}\).

  15. The three-muon vertex fit \(L_{xy}\) significance, \(S(L_{xy})\).

  16. The calorimeter-based missing transverse momentum balance, \(p_\mathrm{T}^{3\mu }/E_\mathrm{T,cal}^\mathrm{miss}-1\).

This configuration was found to give the optimal balance between background rejection and signal efficiency.

The \(\Sigma _\mathrm{T}\) variable is introduced to avoid vetoing events with at least one jet fulfilling the requirements listed in Sect. 5.1. Although the majority of signal events do not have jets, it is found that keeping such events increases the signal acceptance times efficiency by \({\sim } 15~\%\) and also ultimately leads to better rejection power, owing to the significantly larger training and sideband samples. The variables \(\Delta \phi _\mathrm{trk}^\mathrm{cal}\), \(\Delta \phi _{3\mu }^\mathrm{cal}\), \(p_\mathrm{T}^{3\mu }/E_\mathrm{T,trk}^\mathrm{miss}-1\), \(p_\mathrm{T}^{3\mu }/E_\mathrm{T,cal}^\mathrm{miss}-1\) and \(N_\mathrm{trk}^\mathrm{PV}\) are complementary to the \(E_\mathrm{T}^\mathrm{miss}\)-related variables used in the loose selection as well as here. These variables are also very effective in distinguishing the \(W\rightarrow \tau \nu \) production of the signal from the HF and light-flavour multi-jet background. The vertex \(p\)-value is a variable complementary to the \(S(L_{xy})\) and \(S(a^{0}_{xy})\) variables used in the loose selection as well as here. The HF and light-flavour multi-jet backgrounds have mostly random combinations of selected muon objects which do not originate from the same vertex. This variable peaks at very low values for the background while for the signal it is distributed uniformly and thus provides excellent separation.

After training the BDT with data events from the training region and signal MC events from the first signal MC sample, the BDT response is calculated for the data events in the sidebands and for events in the second signal sample. The BDT score, x, ranges between \(-\)1 and \(+\)1. Events with a very low BDT score, within \(-1\le x\le -0.9\), are removed from further consideration, defining \(x_0\equiv -0.9\).

In order to assess potential modelling problems in the signal MC sample, the BDT input distributions and the BDT response are validated against single-muon data. These data contain mainly \(W\rightarrow \mu \nu \) events with a small fraction (\({<}10~\%\)) of background. The single-muon selection is formulated to be as close as possible to the main analysis selection where the differences are mostly driven by the different triggers used (one single-muon trigger with no isolation requirement and with a threshold of 24 GeV is used in the validation) and the exclusion of variables which do not have equivalents in the \(W\rightarrow \mu \nu \) case, e.g. the three-muon vertex variables. The training samples used for this validation study, for both data and signal, are the same samples as used in the main analysis, constructed with the same loose selection as described in the previous section. All input variables are used for the training, excluding the \(p\)-value, \(S(L_{xy})\), \(S(a^{0}_{xy})\) and \(\mathcal {P}_\mathrm{trks}\), which cannot be calculated in a single-muon (\(W\rightarrow \mu \nu \)) selection. The resulting BDT setup is hereafter referred to as “partial BDT”. After training the partial BDT, the response is tested on the second signal sample and on the single-muon data, using the single-muon selection and where the three muon objects in the signal sample are treated as one object (muon). The \(N_\mathrm{trk}^\mathrm{PV}\) distribution of the signal sample is also modified by subtracting two tracks to reflect the difference with respect to a single-muon selection. The responses in data and simulation are compared and are found to agree within 10 % throughout most of the phase-space for all variables. The ratio of the partial BDT responses for the single-muon data and signal MC events is used as an event weight while applying the full selection and calculating the weighted signal acceptance times efficiency as described in the next sections.
The difference between the weighted and unweighted signal acceptance times efficiency is found to be 4 % and is taken as a modelling uncertainty.
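The comparison of weighted and unweighted acceptance times efficiency can be sketched as below. The normalisation is an assumption made for illustration: the weighted value is taken as the sum of the per-event weights of selected signal MC events divided by the number of generated events, and the unweighted value uses unit weights. The function names and the example weights are hypothetical.

```python
def acceptance_times_efficiency(selected_weights, n_generated):
    """Sum of weights of selected events over the number generated."""
    if n_generated <= 0:
        raise ValueError("number of generated events must be positive")
    return sum(selected_weights) / n_generated


def modelling_uncertainty(selected_weights, n_generated):
    """Relative difference between weighted and unweighted values."""
    weighted = acceptance_times_efficiency(selected_weights, n_generated)
    unit_weights = [1.0] * len(selected_weights)
    unweighted = acceptance_times_efficiency(unit_weights, n_generated)
    if unweighted == 0.0:
        raise ValueError("no selected events")
    return abs(weighted - unweighted) / unweighted
```

With four selected events carrying placeholder weights 1.0, 0.9, 1.1 and 0.84 out of 100 generated events, the relative difference is 4 %.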

Any variable which may bias the BDT response by only selecting events very close to the \(\tau \) lepton mass is not included in the BDT input list. The distribution of the three-muon mass has been examined in several bins of x above \(x_0\) using both the loose and the tight samples, where no hint of potential peaking background around the \(\tau \) lepton mass has been found. In addition, the shape of the three-muon mass distribution has been found to be insensitive to the BDT cut, as expected given the small correlation coefficient between x and \(m_{3\mu }\), which is found to be about \(-\)0.05.
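The linear correlation coefficient between the BDT score and the three-muon mass can be estimated with the standard Pearson formula. The sketch below is a plain implementation operating on two equal-length lists; it is not necessarily how the value quoted above was obtained.

```python
import math


def pearson_correlation(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    if s_xx == 0.0 or s_yy == 0.0:
        raise ValueError("correlation undefined for a constant sequence")
    return s_xy / math.sqrt(s_xx * s_yy)
```

A coefficient near zero indicates only the absence of a linear relationship, which is why the mass distribution is also inspected in bins of x.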

5.4 Tight event selection

Additional tight cuts are applied after the BDT training and the application of the \(x{>}x_0\) cut on the BDT score. The following requirements are tightened or added:

  • A number of the loose requirements are tightened, namely \(\mathcal {P}_\mathrm{trks} >8\times 10^{-9}\), \(m_\mathrm{T}^\mathrm{cal} >45\) GeV, \(m_\mathrm{T}^\mathrm{trk} >45\) GeV and \(1{<}S(L_{xy}) {<}50\).

  • The three-muon vertex fit probability must satisfy \(p\mathrm{-value} >0.2\).

  • The angle between the \(\Sigma _\mathrm{T}\) and \(E_\mathrm{T,cal}^\mathrm{miss} \) (\(E_\mathrm{T,trk}^\mathrm{miss} \)) directions is required to be \(\Delta \phi _{\Sigma _\mathrm{T}}^\mathrm{cal} >2\) (\(\Delta \phi _{\Sigma _\mathrm{T}}^\mathrm{trk} >2\)).

  • The same-charge two-muon mass, \(m_\mathrm{SS}\), and the opposite-charge two-muon masses, \(m_\mathrm{OS1}\) and \(m_\mathrm{OS2}\), satisfy \(m_\mathrm{SS} >300\) MeV, \(m_\mathrm{OS1} >300\) MeV and \(m_\mathrm{OS2} >300\) MeV, where \(m_\mathrm{OS1}\) (\(m_\mathrm{OS2}\)) is the mass of the opposite-charge muon pair with the highest (second highest) scalar sum of \(p_\mathrm{T}\) among the three muons.

  • The event is rejected if \(|m_\mathrm{OS}-m_\omega |<50\) MeV or \(|m_\mathrm{OS}-m_\phi |<50\) MeV, provided that any of the \(p_\mathrm{T}^{3\mu }\), the \(E_\mathrm{T,cal}^\mathrm{miss} \) or the \(E_\mathrm{T,trk}^\mathrm{miss} \) is lower than 35 GeV.

  • The event is rejected if \(|m_\mathrm{OS}-m_\phi |<50\) MeV and \(|m_{3\mu }-m_{D_s} |<100\) MeV.

In the above notation, \(m_\mathrm{OS}\) is \(m_\mathrm{OS1}\) or \(m_\mathrm{OS2}\) and \(m_\omega \), \(m_\phi \) and \(m_{D_s}\) are the masses of the \(\omega \), \(\phi \) and \(D_s\) mesons respectively, taken from Ref. [2].
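
The tight requirements listed above can be summarised as a single predicate. The following is a minimal sketch, not the analysis code: the `Candidate` container and its field names are hypothetical, the meson masses are approximate values hard-coded for illustration (the text takes them from Ref. [2]), and masses are assumed to be in MeV while momenta, missing transverse momenta and transverse masses are in GeV.

```python
from dataclasses import dataclass, replace

# Approximate meson masses in MeV; hard-coded here as an illustrative assumption.
M_OMEGA, M_PHI, M_DS = 782.65, 1019.46, 1968.3


@dataclass
class Candidate:  # hypothetical container for one three-muon candidate
    p_trks: float          # P_trks as defined in the text
    mt_cal: float          # GeV
    mt_trk: float          # GeV
    s_lxy: float           # S(L_xy)
    p_value: float         # three-muon vertex fit p-value
    dphi_sigma_cal: float  # angle between Sigma_T and calorimeter-based E_T^miss
    dphi_sigma_trk: float  # angle between Sigma_T and track-based E_T^miss
    m_ss: float            # MeV
    m_os1: float           # MeV
    m_os2: float           # MeV
    m_3mu: float           # MeV
    pt_3mu: float          # GeV
    met_cal: float         # GeV
    met_trk: float         # GeV


def passes_tight(c: Candidate) -> bool:
    """Apply the tight cuts in the order they are listed in the text."""
    def os_near(mass, window):
        # True if either opposite-charge pair mass lies within the window.
        return any(abs(m - mass) < window for m in (c.m_os1, c.m_os2))

    tightened = c.p_trks > 8e-9 and c.mt_cal > 45 and c.mt_trk > 45 and 1 < c.s_lxy < 50
    vertex = c.p_value > 0.2
    angles = c.dphi_sigma_cal > 2 and c.dphi_sigma_trk > 2
    pair_masses = min(c.m_ss, c.m_os1, c.m_os2) > 300
    soft = min(c.pt_3mu, c.met_cal, c.met_trk) < 35
    resonance_veto = (os_near(M_OMEGA, 50) or os_near(M_PHI, 50)) and soft
    ds_veto = os_near(M_PHI, 50) and abs(c.m_3mu - M_DS) < 100
    return (tightened and vertex and angles and pair_masses
            and not resonance_veto and not ds_veto)


# Hypothetical candidate values, for illustration only.
good = Candidate(1e-6, 60, 60, 5, 0.5, 2.5, 2.5, 600, 900, 700, 1777, 40, 40, 40)
print(passes_tight(good))                                       # True
print(passes_tight(replace(good, m_os1=1020.0, pt_3mu=30.0)))  # False: phi veto at low pT
```

Note that the \(\omega /\phi \) veto only acts when at least one of the three kinematic quantities is below 35 GeV, so a candidate with a pair mass near the \(\phi \) but with hard kinematics and a three-muon mass far from the \(D_s\) is kept.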

The requirement on the three-muon vertex fit probability is applied in order to ensure a high-quality fit. The cuts on \(\Delta \phi _{\Sigma _\mathrm{T}}^\mathrm{cal}\) (\(\Delta \phi _{\Sigma _\mathrm{T}}^\mathrm{trk}\)) are applied in order to further suppress the HF and multi-jet background where the three-muon candidate is typically produced within or near a jet.

The first two-muon mass requirement is applied to suppress candidates originating from one prompt muon object and two muons from a converted photon. The second requirement on the two-muon masses is applied to prevent the low-mass mesons, \(\rho /\omega \) and \(\phi \), from entering the region close to the \(\tau \) lepton mass when combined with an additional track. In the selected three-muon event sample, these resonances appear as two clear peaks in the mass distribution of oppositely charged muon pairs in data. Since the resonances lie in the middle of the signal distribution, the additional low-\(p_\mathrm{T}\) or low-\(E_\mathrm{T}^\mathrm{miss}\) condition ensures that these can still be distinguished from the signal, so that the veto removes the resonances while still maintaining a high enough signal efficiency. Finally, the last requirement is applied to remove a potential \(D_s \rightarrow \pi +\phi (\mu \mu )\) contamination from the high-mass sideband. The cuts listed above comprise the tight selection; the tight \({+}x{>}x_0\) selection is used to estimate the background for any cut on x above \(x_0\).

Figure 1 shows the three-muon mass distribution and the BDT response distribution. Figures 2 and 3 show the distributions of the BDT inputs sorted by the separation rank as reported by TMVA during the BDT training. Figure 4 shows the distributions of the complementary variables which are used in the loose or tight selection but not in the BDT.

Fig. 1

The three-muon mass distribution in a and the BDT score in b. The BDT score distribution of the data is shown for the sideband region. The loose data are shown as hollow circles, while the loose signal MC events are shown as light solid grey area. The tight \({+}x{>}x_0\) data are shown as the solid black circles, while the tight \({+}x{>}x_0\) signal MC events are shown as the dark solid grey area. The area of the signal MC shapes is normalised to the area of the loose data shapes and the relative normalisation difference between the loose and the tight \({+}x{>}x_0\) MC signal distributions prior to the normalisation is maintained. For illustration, the signal is not constrained to the SR

Fig. 2

The BDT inputs ranked 1–8. \(m_\mathrm{T}^\mathrm{cal}\) in a, \(E_\mathrm{T,trk}^\mathrm{miss} \) in b, \(\Sigma p_\mathrm{T}^\mathrm{trk}(\Delta R^{3\mu }_\mathrm{max} +0.20)/p_\mathrm{T}^{3\mu } \) in c, \(\Sigma _\mathrm{T}\) in d, \(m_\mathrm{T}^\mathrm{trk}\) in e, \(\Delta \phi _\mathrm{trk}^\mathrm{cal}\) in f, \(E_\mathrm{T,cal}^\mathrm{miss} \) in g and \(p_\mathrm{T}^{3\mu }/E_\mathrm{T,trk}^\mathrm{miss}-1\) in h. The loose data in the sidebands are shown as hollow circles, while the loose signal MC events are shown as light solid grey area. The tight \({+}x{>}x_0\) data in the sidebands are shown as the solid black circles, while the tight \({+}x{>}x_0\) signal MC events are shown as the dark solid grey area. The area of the signal MC shapes is normalised to the area of the loose data shapes and the relative normalisation difference between the loose and the tight \({+}x{>}x_0\) MC signal distributions prior to the normalisation is maintained. For illustration, the signal is not constrained to the SR

Fig. 3

The BDT inputs ranked 9–16. \(\Delta \phi _{3\mu }^\mathrm{cal}\) in a, \(p\)-value in b, \(S(a^{0}_{xy})\) in c, \(\mathcal {P}_\mathrm{trks}\) in d, \(p_\mathrm{T}^{3\mu }\) in e, \(N_\mathrm{trk}^\mathrm{PV}\) in f, \(S(L_{xy})\) in g and \(p_\mathrm{T}^{3\mu }/E_\mathrm{T,cal}^\mathrm{miss}-1\) in h. The loose data in the sidebands are shown as hollow circles, while the loose signal MC events are shown as light solid grey area. The tight \({+}x{>}x_0\) data in the sidebands are shown as the solid black circles, while the tight \({+}x{>}x_0\) signal MC events are shown as the dark solid grey area. The area of the signal MC shapes is normalised to the area of the loose data shapes and the relative normalisation difference between the loose and the tight \({+}x{>}x_0\) MC signal distributions prior to the normalisation is maintained. For illustration, the signal is not constrained to the SR

Fig. 4

The complementary variables used in the loose or tight selection but not as inputs for the BDT. \(\Delta \phi _{\Sigma _\mathrm{T}}^\mathrm{cal}\) in a, \(\Delta \phi _{\Sigma _\mathrm{T}}^\mathrm{trk}\) in b, \(\Sigma p_\mathrm{T}^\mathrm{trk}(\Delta R^{3\mu }_\mathrm{max} +0.30)/p_\mathrm{T}^{3\mu } \) in c, \(m_\mathrm{SS}\) in d, \(m_\mathrm{OS1}\) in e and \(m_\mathrm{OS2}\) in f. The loose data in the sidebands are shown as hollow circles, while the loose signal MC events are shown as light solid grey area. The tight \({+}x{>}x_0\) data in the sidebands are shown as the solid black circles, while the tight \({+}x{>}x_0\) signal MC events are shown as the dark solid grey area. The area of the signal MC shapes is normalised to the area of the loose data shapes and the relative normalisation difference between the loose and the tight \({+}x{>}x_0\) MC signal distributions prior to the normalisation is maintained. For illustration, the signal is not constrained to the SR

5.5 Background estimation

The events passing the tight \({+}x{>}x_0\) selection are used to estimate the expected number of background events in the signal region for higher cuts on x as described below.

The signal MC and sideband data BDT responses are shown in Fig. 5 after the tight \({+}x{>}x_0\) selection. The distinct shapes illustrate the power of the method in separating the background from the signal. The analytical function also shown in Fig. 5 is the result of a fit to the sideband data, excluding the blinded region, using an unbinned maximum-likelihood estimator. The fit function used is \(a_0 + a_1(x+1)^{a_2}+a_3(x+1)^{a_4}\), where \(a_i\) are the free fit parameters. The parameter \(a_2\) is required to be negative while the others are required to be non-negative. This function can exhibit rising behaviour at both ends of the x distribution (\(x\rightarrow \pm 1\)) and it is used to scale the quantities measured in \(x{>}x_0\) to the corresponding quantities in \(x{>}x_1\) as explained below.
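
The behaviour of this fit function can be checked with a short sketch. The parameter values below are hypothetical and chosen only to show the rise towards both ends of the x range; they are not the fitted values.

```python
def bdt_shape(x, a0, a1, a2, a3, a4):
    """Fit function a0 + a1 (x+1)^a2 + a3 (x+1)^a4 with the sign constraints of the text."""
    if a2 >= 0:
        raise ValueError("a2 is required to be negative")
    if min(a0, a1, a3, a4) < 0:
        raise ValueError("a0, a1, a3 and a4 are required to be non-negative")
    return a0 + a1 * (x + 1) ** a2 + a3 * (x + 1) ** a4


# Hypothetical parameters, for illustration only.
params = (1.0, 0.5, -0.5, 0.2, 6.0)
low_end = bdt_shape(-0.99, *params)   # dominated by the a1 term (negative exponent)
middle = bdt_shape(0.0, *params)
high_end = bdt_shape(1.0, *params)    # dominated by the a3 term (positive exponent)
print(low_end, middle, high_end)
```

The negative exponent \(a_2\) drives the rise as \(x\rightarrow -1\), while a sufficiently large non-negative \(a_4\) drives the rise as \(x\rightarrow +1\).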

Fig. 5

The distribution of the BDT score of the data in the sideband region (SB) for the tight \({+}x{>}x_0\) selection. The line shows the result of a fit to the BDT score distribution, while the hatched area shows the uncertainty in the fit due to the SB range definition, the \(x_0\) cut location and the fit function choice. The solid grey area shows the signal shape (obtained from MC simulation), normalised to the area of the data for the tight \({+}x{>}x_0\) selection. For illustration, the signal is not constrained to the SR

The three-muon mass distribution of the tight \({+}x{>}x_0\) data is fit simultaneously in the two sidebands to a second-order polynomial in \(m_{3\mu }\) while excluding the blinded region. This is also done with an unbinned maximum-likelihood estimator. The integral of the resulting fit function in the signal region gives the expected number of background events, \(N_\mathrm{b}(x_0)\), in the signal region before applying the final \(x_1\) cut. The statistical uncertainty of \(N_\mathrm{b}(x_0)\) is calculated by scaling the statistical error in the number of events in the sidebands according to the ratio of the analytical integrals in the signal region and in the sidebands. Figure 6 shows the three-muon mass distribution in the sidebands for the tight \({+}x{>}x_0\) selection as black points together with the fit result. The signal is also shown for reference, scaled up arbitrarily to match the scale of the data.
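
The scaling from the sidebands to the signal region can be sketched as follows. This is an illustration under stated assumptions: the fitted polynomial is taken to be normalised to the number of sideband events, the statistical error on that number is taken as its square root, and the region boundaries, polynomial coefficients and event count used in the example are hypothetical.

```python
import math


def poly2_integral(lo, hi, c0, c1, c2):
    """Analytic integral of c0 + c1*m + c2*m^2 between lo and hi."""
    def antiderivative(m):
        return c0 * m + c1 * m ** 2 / 2 + c2 * m ** 3 / 3
    return antiderivative(hi) - antiderivative(lo)


def background_in_signal_region(n_sb, coeffs, sidebands, signal_region):
    """Return (N_b, stat. uncertainty) in the signal region.

    n_sb          : number of events observed in the two sidebands
    coeffs        : (c0, c1, c2) of the fitted second-order polynomial
    sidebands     : list of (lo, hi) mass ranges in MeV
    signal_region : (lo, hi) mass range in MeV
    """
    sb_integral = sum(poly2_integral(lo, hi, *coeffs) for lo, hi in sidebands)
    sr_integral = poly2_integral(*signal_region, *coeffs)
    scale = sr_integral / sb_integral
    return scale * n_sb, scale * math.sqrt(n_sb)


# Hypothetical inputs: a flat shape, 25 sideband events, illustrative mass ranges.
n_b, n_b_stat = background_in_signal_region(
    n_sb=25,
    coeffs=(1.0, 0.0, 0.0),
    sidebands=[(1450.0, 1690.0), (1870.0, 2110.0)],
    signal_region=(1713.0, 1841.0),
)
print(n_b, n_b_stat)
```

Because the same scale factor multiplies both the yield and its error, the relative statistical uncertainty in the signal region equals that of the sideband count.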

Fig. 6

The three-muon mass distribution in the range [1450, 2110] MeV shown for the tight \({+}x{>}x_0\) selection by solid black circles and for the tight \({+}x{>}x_1\) selection by the solid red square. The sideband and signal regions are indicated by the arrows. The tight \({+}x{>}x_0\) data are fit in the two sidebands simultaneously, excluding the events in the blinded region. The hatched area shows the uncertainty in the fit due to the SB range definition, the \(x_0\) cut location and the fit function choice. The solid grey area shows the signal shape (obtained from MC simulation), normalised to the area of the data for the tight \({+}x{>}x_0\) selection

For any \(x_1\) cut value above \(x\sim 0.6\), where most of the signal is expected, the estimated \(N_\mathrm{b}(x_0)\) in the signal region can then be scaled down according to the ratio of the integrals of the BDT analytical function above and below this cut. This ratio is denoted hereafter by \(\mathcal {R}(x_1)\). The extrapolation procedure can be written as \(N_\mathrm{b}(x_1) =\mathcal {R}(x_1)\times N_\mathrm{b}(x_0) \), where \(N_\mathrm{b}(x_1)\) is estimated in the signal region for \(x{>}x_1\).
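
A minimal sketch of this extrapolation is given below. It assumes one particular reading of the ratio, namely the integral of the fitted BDT function above \(x_1\) divided by its integral above \(x_0\), and uses the analytic integral of the fit function \(a_0 + a_1(x+1)^{a_2}+a_3(x+1)^{a_4}\). All numerical inputs are hypothetical.

```python
def shape_integral(lo, hi, a0, a1, a2, a3, a4):
    """Analytic integral of a0 + a1 (x+1)^a2 + a3 (x+1)^a4 over [lo, hi].

    Assumes lo > -1 and a2 != -1 (a4 >= 0 never hits the singular exponent).
    """
    def power_term(coef, expo):
        return coef * ((hi + 1) ** (expo + 1) - (lo + 1) ** (expo + 1)) / (expo + 1)
    return a0 * (hi - lo) + power_term(a1, a2) + power_term(a3, a4)


def extrapolate_background(n_b_x0, x0, x1, params, x_max=1.0):
    """Scale N_b(x0) to N_b(x1) with the ratio of integrals above x1 and above x0."""
    ratio = shape_integral(x1, x_max, *params) / shape_integral(x0, x_max, *params)
    return ratio, ratio * n_b_x0


# Hypothetical inputs: a flat BDT shape, x0 = -0.9 and 5 background events at x > x0.
flat = (1.0, 0.0, -0.5, 0.0, 1.0)
ratio, n_b_x1 = extrapolate_background(n_b_x0=5.0, x0=-0.9, x1=0.933, params=flat)
print(ratio, n_b_x1)
```

For a flat shape the ratio reduces to \((1-x_1)/(1-x_0)\), which provides a simple cross-check of the integral code.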

5.6 Uncertainties and optimisation

The sources of systematic uncertainty associated with the extrapolation procedure in the background estimation are the BDT and sideband fit function choice, the definition of the sideband ranges and the definition of \(x_0\). To estimate this uncertainty, each of these definitions and choices is varied individually while calculating \(N_\mathrm{b}(x_0)\) and the ratio of BDT-function integrals used in the extrapolation. For each fit function (BDT and sideband), different parameterisations are considered. In addition, to construct the variation of the tight \({+}x{>}x_0\) sample with which the two fits are performed, nine different sideband range variations and ten different \(x_0\) variations are used. The fits, and consequently also the extrapolation procedure, are found to be stable against these variations. The dominant uncertainty component is the impact on this ratio of varying the sideband range definition. The differences from the nominal values of the ratio and of \(N_\mathrm{b}(x_0)\) are summed in quadrature and are translated to uncertainties in \(N_\mathrm{b}(x_1)\). The systematic uncertainty associated with the extrapolation procedure used to obtain \(N_\mathrm{b}(x_1)\) increases with \(x_1\) from \({\sim } 45~\%\) at \(x_1=0.6\) to \({\sim } 80~\%\) at \(x_1\simeq 1\). The statistical uncertainty of \(N_\mathrm{b}(x_1)\) is \({\sim } 19~\%\), independent of \(x_1\).

The systematic uncertainty in the signal acceptance times efficiency has contributions from reconstruction (13.1 %), trigger (11 %) and MC modelling (4 %) as discussed in the previous sections. In addition, there is a small (2.1 %) contribution due to jet and \(E_\mathrm{T}^\mathrm{miss}\) calibration. The number of \(\tau \) leptons produced via the \(W\rightarrow \tau \nu \) channel and its uncertainty (3.9 %) are estimated as described in Sect. 3. These uncertainties are independent of \(x_1\) in the range of interest.
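
As a simple arithmetic illustration, the relative uncertainties quoted above can be combined in quadrature, assuming they are uncorrelated. The totals computed here are derived only from the listed components and are not values quoted in the text.

```python
import math


def quadrature_sum(values):
    """Combine uncorrelated relative uncertainties (in %) in quadrature."""
    return math.sqrt(sum(v * v for v in values))


# Components quoted in the text, in per cent.
acceptance_efficiency = quadrature_sum([13.1, 11.0, 4.0, 2.1])  # reco, trigger, modelling, jet/ETmiss
signal_total = quadrature_sum([acceptance_efficiency, 3.9])     # adding the W -> tau nu yield
print(round(acceptance_efficiency, 2), round(signal_total, 2))
```

The reconstruction and trigger terms dominate the sum: removing the three smaller components changes the total by only a few per cent of its value.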

The BDT cut is optimised by minimising the expected upper limit on the branching fraction given in Eq. (1), where \(N_\mathrm{s}\) becomes the upper limit on the number of observed events above the expected background level in a narrow region around the \(\tau \) lepton mass. The procedure is performed by varying \(x_1\) between 0.6 and 1.0 in steps of 0.001 while extracting \(N_\mathrm{b}(x_1)\) and its associated errors as explained above. To obtain the upper limit on \(N_\mathrm{s}\) for each \(x_1\) cut, a single-bin counting experiment is performed using the HistFitter [27] statistical framework, supplied with \(N_\mathrm{b}(x_1)\) and its uncertainties. For compatibility with previous searches, the limit on \(N_\mathrm{s}\) and on \(\mathrm{Br}(\tau \rightarrow 3\mu )\) is given at 90 % confidence level (CL). In each iteration, the signal acceptance times efficiency is calculated for the specific \(x_1\) cut using a signal sample that is different from the one used for the BDT training.
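
To give a feeling for the single-bin counting experiment, the sketch below computes an upper limit on the signal yield for a Poisson counting experiment. It is a strong simplification and not the HistFitter procedure: it assumes a CLs-style construction and ignores all nuisance parameters (no systematic or statistical uncertainty on the background), and the function names are hypothetical.

```python
import math


def poisson_cdf(n, mu):
    """P(N <= n) for a Poisson distribution with mean mu."""
    term = total = math.exp(-mu)
    for k in range(1, n + 1):
        term *= mu / k
        total += term
    return total


def cls_upper_limit(n_obs, b, cl=0.90):
    """Upper limit on s from CLs = P(N<=n_obs | s+b) / P(N<=n_obs | b) = 1 - cl."""
    alpha = 1.0 - cl

    def cls(s):
        return poisson_cdf(n_obs, s + b) / poisson_cdf(n_obs, b)

    lo, hi = 0.0, 50.0           # CLs decreases monotonically with s
    for _ in range(100):         # bisection
        mid = 0.5 * (lo + hi)
        if cls(mid) > alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Illustration: zero observed events and a small expected background.
limit = cls_upper_limit(n_obs=0, b=0.193)
print(limit)
```

In this simplified setting the limit for zero observed events is \(\ln 10\approx 2.3\) signal events at 90 % CL, independent of the expected background; including the uncertainties as nuisance parameters, as done in the analysis, modifies this value.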

During the iterative optimisation process, the extrapolation of the number of events in the sideband region to high \(x_1\) cuts using the BDT shape is tested against a cut-and-count procedure. The two procedures are found to agree very well within the uncertainties, and the extrapolation procedure gives a more conservative result throughout the examined \(x_1\) range. The resulting optimal cut is at \(x_1=0.933\).

6 Results

Figure 6 shows the three-muon mass distributions in the full mass range, including the blinded region, for the tight \({+}x{>}x_1\) selection in red squares. Only one event with a three-muon mass of 1860 MeV survives the selection in the full mass range (sideband and blinded regions). This event is found in the range between the signal region and the right sideband region and it does not affect the background estimation or the observation in the signal region.

The event counts entering the different regions at the different steps of the analysis for signal and data are given in Table 2.

Table 2 The event count for the different steps of the analysis in the sideband and signal regions. The signal sample used to evaluate the signal acceptance times efficiency has \(2\times 10^5\) events

The signal acceptance times efficiency is calculated from the second signal MC sample after applying the full tight \({+}x{>}x_1\) selection. This selection corresponds to . With this selection, the expected background yield is \(N_\mathrm{b}(x_1) =0.193 \pm 0.131_\mathrm{syst} \pm 0.037_\mathrm{stat}\). The systematic uncertainty on the signal acceptance times efficiency is dominated by the uncertainties in the reconstruction and trigger efficiency measurements. The systematic uncertainty on \(N_\mathrm{b}(x_1)\) is dominated by the uncertainty in the extrapolation of the background from the tight \({+}x{>}x_0\) selection to the tight \({+}x{>}x_1\) selection.
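
The relative sizes of the two uncertainty components of the expected background yield follow directly from the numbers above, as the short calculation below shows. The quadrature combination of the two components is included only as an illustration and is not a value quoted in the text.

```python
import math

n_b, syst, stat = 0.193, 0.131, 0.037   # expected background and its uncertainties

rel_syst = syst / n_b            # relative systematic uncertainty
rel_stat = stat / n_b            # relative statistical uncertainty
total = math.hypot(syst, stat)   # illustrative quadrature combination

print(f"syst: {rel_syst:.0%}, stat: {rel_stat:.0%}, total: {total:.3f}")
```

The relative statistical uncertainty of about 19 % and the relative systematic uncertainty of about 68 % are consistent with the ranges quoted for the extrapolation procedure in Sect. 5.6.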

The systematic uncertainties in \(N_\mathrm{b}\) are taken into account when calculating the limit on the number of signal events, \(N_\mathrm{s}\), via one nuisance parameter. The systematic uncertainties in the signal acceptance times efficiency and in the number of \(\tau \) leptons produced via the \(W\rightarrow \tau \nu \) channel are summed in quadrature and taken into account as the uncertainty in the signal via one nuisance parameter when calculating the limit. The expected (median) limit on the branching fraction for \(N_\mathrm{o} =N_\mathrm{b}(x_1) \) is \(3.94\times 10^{-7}\) at 90 % CL. No events are observed in the signal region and the observed limit on the branching fraction is therefore \(3.76\times 10^{-7}\) at 90 % CL.

7 Conclusions and outlook

This article presents a search with the ATLAS detector for neutrinoless \(\tau \rightarrow 3\mu \) decays using 20.3 fb\(^{-1}\) of 2012 LHC pp collision data, utilising \(\tau \) leptons produced in \(W\rightarrow \tau \nu \) decays. No events are observed in the signal region for the final selection while \(0.193 \pm 0.131_\mathrm{syst} \pm 0.037_\mathrm{stat}\) background events are expected. This results in an observed (expected) upper limit of \(3.76\times 10^{-7}\) (\(3.94\times 10^{-7}\)) on \(\mathrm{Br}(\tau \rightarrow 3\mu )\) at 90 % CL. Although this limit is not yet competitive with searches performed at B-factories [7, 8] and at LHCb [9], it demonstrates the potential of LHC data collected by ATLAS as a probe of lepton flavour violation in \(\tau \) lepton decays. This analysis utilises single \(\tau \) lepton production in an environment very different from B-factories, which rely on \(\tau \) lepton pair production in \(e^+e^-\) collisions. The method and sample presented here were used to improve the ATLAS muon trigger and reconstruction of low-\(p_\mathrm{T}\), collimated muons relevant to the \(\tau \rightarrow 3\mu \) search. The analysis is limited by the number of \(W\rightarrow \tau \nu \) decays and by the systematic uncertainty, which depends on the size of the data sample. With the much larger data sets anticipated at Run 2 of the LHC, the sensitivity of ATLAS to lepton-flavour-violating decays will be improved significantly.