A High-Throughput Kalman Filter for Modern SIMD Architectures
The Kalman filter is a critical component of the reconstruction process of subatomic particle collision in high-energy physics detectors. At the LHCb detector in the Large Hadron Collider this reconstruction must be performed at an average rate of 30 million times per second. As a consequence of the ever-increasing collision rate and upcoming detector upgrades, the data rate that needs to be processed in real time is expected to increase by a factor of 40 in the next five years. In order to keep pace, processing and filtering software must take advantage of latest developments in hardware technology.
In this paper we present a cross-architecture SIMD parallel algorithm and implementation of a low-rank Kalman filter. We integrate our implementation in production code and validate the numerical results in the context of physics reconstruction. We also compare its throughput across modern multi- and many-core architectures.
Using our Kalman filter implementation we are able to achieve a sustained throughput of 75 million particle hit reconstructions per second on an Intel Xeon Phi Knights Landing platform, a factor 6.81 over the current production implementation running on a two-socket Haswell system. Additionally we show that under the constraints of our Kalman filter formulation we efficiently use the available hardware resources.
Our implementation will allow us to better sustain the required throughput of the detector in the coming years and scale to future hardware architectures. Additionally our work enables the evaluation of other computing platforms for future hardware upgrades.
KeywordsKalman filter Data-intensive parallel algorithms Numerical methods
The authors would like to thank the High-Throughput Computing Collaboration at CERN openlab for fruitful discussions through the process of designing and writing the presented software, and early access to Intel hardware. Thanks to F. Lemaitre for his contribution of the vectorized transposition code, and to O. Bouizi and S. Harald for the low-level code discussions and for providing early results and insight on the Xeon Phi architecture. In addition, thanks to W. Hulsbergen and R. Aaij for the mathematical discussions and data structure design. Finally, thanks to N. Neufeld and A. Riscos Núñez for their guidance and support.
- 1.The LHCb Collaboration: framework TDR for the LHCb upgrade: technical design report. Technical report CERN-LHCC-2012-007. LHCb-TDR-12, April 2012. https://cds.cern.ch/record/1443882
- 2.The LHCb Collaboration: LHCb trigger and online upgrade technical design report. Technical report CERN-LHCC-2014-016. LHCB-TDR-016, May 2014. https://cds.cern.ch/record/1701361
- 4.Mcgee, L.A., Schmidt, S.F.: Discovery of the Kalman filter as a practical tool for aerospace and industry. Technical report, November 1985. https://ntrs.nasa.gov/search.jsp?R=19860003843
- 5.Houtekamer, P.L., Mitchell, H.L.: Data assimilation using an ensemble Kalman filter technique. Mon. Weather Rev. 126(3), 796–811 (1998). http://journals.ametsoc.org/doi/abs/10.1175/1520-0493%281998%29126%3C0796%3ADAUAEK%3E2.0.CO%3B2 CrossRefGoogle Scholar
- 6.Welch, G., Bishop, G.: An introduction to the Kalman filter. Technical report, Chapel Hill, NC, USA (1995)Google Scholar
- 7.Hulsbergen, W.: The global covariance matrix of tracks fitted with a Kalman filter and an application in detector alignment. Nucl. Instrum. Methods Phys. Res. Sec. A: Accel. Spectrom. Detect. Assoc. Equip. 600(2), 471–477 (2009). http://www.sciencedirect.com/science/article/pii/S0168900208017567 CrossRefGoogle Scholar
- 8.Amdahl, G.M.: Validity of the single processor approach to achieving large scale computing capabilities. In: Proceedings of the 18–20 April 1967, Spring Joint Computer Conference, pp. 483–485. AFIPS 1967 (Spring). ACM, New York (1967). http://doi.acm.org/10.1145/1465482.1465560
- 9.Cámpora Pérez, D.H.: LHCb Kalman filter cross-architecture studies (2016)Google Scholar
- 11.Aaij, R., Fontana, M., Le Gac, R., Zacharjasz, E.A., Schwemmer, R., Fitzpatrick, C., Albrecht, J., Grillo, L., Szumlak, T., Yin, H., Couturier, B., Stahl, S., Williams, M.R.J., Vries, D., Andreas, J., Seyfert, P., Wanczyk, J., Esen, S., Neufeld, N., Hasse, C., Vesterinen, M.A., Nikodem, T., Quagliani, R., Polci, F., Dziurda, A., Jones, C.R., Matev, R., De Cian, M., Del Buono, L.: Upgrade trigger: biannual performance update. Technical report, February 2017. https://cds.cern.ch/record/2244312
- 12.Mertens, S.: The easiest hard problem: number partitioning, October 2003. arXiv:cond-mat/0310317
- 13.Gou, C., Kuzmanov, G., Gaydadjiev, G.N.: SAMS multi-layout memory: providing multiple views of data to boost SIMD performance. In: Proceedings of the 24th ACM International Conference on Supercomputing, pp. 179–188. ICS 2010. ACM, New York (2010). http://doi.acm.org/10.1145/1810085.1810111
- 14.Fog, A.: VCL C++ vector class library (2012). http://www.agner.org/optimize
- 15.Karpiński, P., McDonald, J.: A high-performance portable abstract interface for explicit SIMD vectorization. In: Proceedings of the 8th International Workshop on Programming Models and Applications for Multicores and Manycores, PMAM 2017, pp. 21–28. ACM, New York (2017). http://doi.acm.org/10.1145/3026937.3026939
- 17.Schiller, M.: Track reconstruction and prompt \(K^0_S\) production at the LHCb experiment. Dissertation, University of Heidelberg (2011)Google Scholar