FFMK: A Fast and Fault-Tolerant Microkernel-Based System for Exascale Computing

Weinhold, Carsten; Lackorzynski, Adam; Bierbaum, Jan; Küttler, Martin; Planeta, Maksym; Härtig, Hermann; Shiloh, Amnon; Levy, Ely; Ben-Nun, Tal; Barak, Amnon; Steinke, Thomas; Schütt, Thorsten; Fajerski, Jan; Reinefeld, Alexander; Lieber, Matthias; Nagel, Wolfgang E.

doi:10.1007/978-3-319-40528-5_18

Conference paper
First Online: 15 September 2016

Part of the book series: Lecture Notes in Computational Science and Engineering (LNCSE, volume 113)

Abstract

In this paper, we describe the hardware and application-inherent challenges that future exascale systems pose to high-performance computing (HPC) and propose a system architecture that addresses them. This architecture is based on proven building blocks and a few principles: (1) a fast lightweight kernel, supported by a virtualized Linux for tasks that are not performance-critical; (2) decentralized load and health management using fault-tolerant gossip-based information dissemination; (3) a maximally parallel checkpoint store for cheap checkpoint/restart in the presence of frequent component failures; and (4) a runtime that enables applications to interact with the underlying system platform through new interfaces. The paper discusses the vision behind FFMK and the current state of a prototype implementation of the system, which is based on a microkernel and an adapted MPI runtime.
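To make principle (2) more concrete, the following is a minimal sketch of push-style gossip dissemination, in which every node repeatedly forwards what it knows about per-node load to one randomly chosen peer. It is an illustration under stated assumptions, not the algorithm used in the FFMK prototype: the data layout (a per-node dictionary of `(timestamp, load)` records), the function names `gossip_round` and `simulate`, and the uniform peer selection are all hypothetical.

```python
import random


def gossip_round(views, rng):
    """Run one push-gossip round over all nodes.

    views[i] maps node id -> (timestamp, load) as currently known by node i.
    This layout is a hypothetical choice made for illustration only.
    """
    n = len(views)
    # Snapshot the outgoing messages so every send uses pre-round state.
    outgoing = [dict(v) for v in views]
    for sender in range(n):
        # Assumption: the peer is chosen uniformly at random among the others.
        receiver = rng.choice([p for p in range(n) if p != sender])
        for node, (ts, load) in outgoing[sender].items():
            known = views[receiver].get(node)
            # Keep only the freshest record for each node.
            if known is None or ts > known[0]:
                views[receiver][node] = (ts, load)


def simulate(n_nodes, seed=0, max_rounds=1000):
    """Gossip until every node has a record for every other node."""
    rng = random.Random(seed)
    # Initially each node knows only its own (randomly generated) load.
    views = [{i: (0, rng.random())} for i in range(n_nodes)]
    rounds = 0
    while any(len(v) < n_nodes for v in views) and rounds < max_rounds:
        gossip_round(views, rng)
        rounds += 1
    return rounds, views


if __name__ == "__main__":
    rounds, views = simulate(16)
    print(f"all 16 nodes have a complete view after {rounds} rounds")
```

No node acts as a coordinator in this sketch, so the loss of any single node does not stop the remaining nodes from exchanging information; that property is what makes gossip attractive for decentralized load and health management.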

Notes

1. COSMO-SPECS+FD4 has an internal load balancer, which we disabled in the experiments described here.

References

Acun, B., Gupta, A., Jain, N., Langer, A., Menon, H., Mikida, E., Ni, X., Robson, M., Sun, Y., Totoni, E., Wesolowski, L., Kale, L.: Parallel programming with migratable objects: Charm++ in practice. In: Proceedings of the Supercomputing 2014, Leipzig, pp. 647–658. IEEE (2014)
Arnold, D.C., Miller, B.P.: Scalable failure recovery for high-performance data aggregation. In: Proceedings of the IPDPS 2010, Atlanta, pp. 1–11. IEEE (2010)
Barak, A., Guday, S., Wheeler, R.: The MOSIX Distributed Operating System: Load Balancing for UNIX. Lecture Notes in Computer Science, vol. 672. Springer, Berlin/New York (1993)
Barak, A., Margolin, A., Shiloh, A.: Automatic resource-centric process migration for MPI. In: Proceedings of the EuroMPI 2012. Lecture Notes in Computer Science, vol. 7490, pp. 163–172. Springer, Berlin/New York (2012)
Barak, A., Drezner, Z., Levy, E., Lieber, M., Shiloh, A.: Resilient gossip algorithms for collecting online management information in exascale clusters. Concurr. Comput. Pract. Exper. 27 (17), 4797–4818 (2015)
Beckman, P., et al.: Argo: an exascale operating system. http://www.argo-osr.org/. Accessed 20 Nov 2015
Ben-Nun, T., Levy, E., Barak, A., Rubin, E.: Memory access patterns: the missing piece of the multi-GPU puzzle. In: Proceedings of the Supercomputing 2015, Newport Beach, pp. 19:1–19:12. ACM (2015)
Berkeley Lab Checkpoint/Restart. http://ftg.lbl.gov/checkpoint. Accessed 20 Nov 2015
Brightwell, R., Oldfield, R., Maccabe, A.B., Bernholdt, D.E.: Hobbes: composition and virtualization as the foundations of an extreme-scale OS/R. In: Proceedings of the ROSS’13, pp. 2:1–2:8. ACM (2013)
Bronevetsky, G., Marques, D., Pingali, K., Stodghill, P.: Automated application-level checkpointing of MPI programs. ACM Sigplan Not. 38 (10), 84–94 (2003)
Burstedde, C., Ghattas, O., Gurnis, M., Isaac, T., Stadler, G., Warburton, T., Wilcox, L.: Extreme-scale AMR. In: Proceedings of the Supercomputing 2010, Tsukuba, pp. 1–12. ACM (2010)
Cappello, F., Geist, A., Gropp, W., Kale, S., Kramer, B., Snir, M.: Toward exascale resilience: 2014 update. Supercomput. Front. Innov. 1 (1), 5–28 (2014)
Corradi, A., Leonardi, L., Zambonelli, F.: Diffusive load-balancing policies for dynamic applications. IEEE Concurr. 7 (1), 22–31 (1999)
Dongarra, J., et al.: The international exascale software project roadmap. Int. J. High Speed Comput. 25 (1), 3–60 (2011)
EXAHD – An Exa-Scalable Two-Level Sparse Grid Approach for Higher-Dimensional Problems in Plasma Physics and Beyond. http://ipvs.informatik.uni-stuttgart.de/SGS/EXAHD/index.php. Accessed 29 Nov 2015
FFMK Website. http://ffmk.tudos.org. Accessed 20 Nov 2015
Harlacher, D.F., Klimach, H., Roller, S., Siebert, C., Wolf, F.: Dynamic load balancing for unstructured meshes on space-filling curves. In: Proceedings of the IPDPSW 2012, pp. 1661–1669. IEEE (2012)
Kale, L.V., Zheng, G.: Charm++ and AMPI: adaptive runtime strategies via migratable objects. In: Parashar, M., Li, X. (eds.) Advanced Computational Infrastructures for Parallel and Distributed Adaptive Applications, chap. 13, pp. 265–282. Wiley, Hoboken (2009)
Kogge, P., Shalf, J.: Exascale computing trends: adjusting to the “New Normal” for computer architecture. Comput. Sci. Eng. 15 (6), 16–26 (2013)
Lackorzynski, A., Warg, A., Peter, M.: Generic virtualization with virtual processors. In: Proceedings of the 12th Real-Time Linux Workshop, Nairobi (2010)
Lange, J., Pedretti, K., Hudson, T., Dinda, P., Cui, Z., Xia, L., Bridges, P., Gocke, A., Jaconette, S., Levenhagen, M., Brightwell, R.: Palacios and Kitten: new high performance operating systems for scalable virtualized and native supercomputing. In: Proceedings of the IPDPS 2010, Atlanta, pp. 1–12. IEEE (2010)
Levy, E., Barak, A., Shiloh, A., Lieber, M., Weinhold, C., Härtig, H.: Overhead of a decentralized gossip algorithm on the performance of HPC applications. In: Proceedings of the ROSS’14, Munich, pp. 10:1–10:7. ACM (2014)
Lieber, M., Grützun, V., Wolke, R., Müller, M.S., Nagel, W.E.: Highly scalable dynamic load balancing in the atmospheric modeling system COSMO-SPECS+FD4. In: Proceedings of the PARA 2010. Lecture Notes in Computer Science, vol. 7133, pp. 131–141. Springer, Berlin/New York (2012)
Liedtke, J.: On micro-kernel construction. In: Proceedings of the 15th ACM Symposium on Operating Systems Principles (SOSP’95), Copper Mountain Resort, pp. 237–250. ACM (1995)
Lucas, R., et al.: Top ten exascale research challenges. DOE ASCAC subcommittee report. http://science.energy.gov/~/media/ascr/ascac/pdf/meetings/20140210/Top10reportFEB14.pdf (2014). Accessed 20 Nov 2015
Milthorpe, J., Ganesh, V., Rendell, A.P., Grove, D.: X10 as a parallel language for scientific computation: practice and experience. In: Proceedings of the IPDPS 2011, Anchorage, pp. 1080–1088. IEEE (2011)
Moody, A., Bronevetsky, G., Mohror, K., de Supinski, B.: Detailed modeling, design, and evaluation of a scalable multi-level checkpointing system. Technical report LLNL-TR-440491, Lawrence Livermore National Laboratory (LLNL) (2010)
MPI: A message-passing interface standard, version 3.1. http://www.mpi-forum.org/docs (2015). Accessed 20 Nov 2015
MVAPICH: MPI over InfiniBand. http://mvapich.cse.ohio-state.edu/. Accessed 20 Nov 2015
Open Source Molecular Dynamics. http://www.cp2k.org/. Accessed 20 Nov 2015
Ouyang, X., Marcarelli, S., Rajachandrasekar, R., Panda, D.K.: RDMA-based job migration framework for MPI over Infiniband. In: Proceedings of the IEEE CLUSTER 2010, Heraklion, pp. 116–125. IEEE (2010)
Rajachandrasekar, R., Moody, A., Mohror, K., Panda, D.K.: A 1 PB/s file system to checkpoint three million MPI tasks. In: Proceedings of the HPDC’13, New York, pp. 143–154. ACM (2013)
Roitzsch, M., Wachtler, S., Härtig, H.: Atlas: look-ahead scheduling using workload metrics. In: Proceedings of the RTAS 2013, Philadelphia, pp. 1–10. IEEE (2013)
Sato, K., Maruyama, N., Mohror, K., Moody, A., Gamblin, T., de Supinski, B.R., Matsuoka, S.: Design and modeling of a non-blocking checkpointing system. In: Proceedings of the Supercomputing 2012, Venice, pp. 19:1–19:10. IEEE (2012)
Sato, M., Fukazawa, G., Yoshinaga, K., Tsujita, Y., Hori, A., Namiki, M.: A hybrid operating system for a computing node with multi-core and many-core processors. Int. J. Adv. Comput. Sci. 3, 368–377 (2013)
Wang, C., Mueller, F., Engelmann, C., Scott, S.L.: Proactive process-level live migration and back migration in HPC environments. J. Par. Distrib. Comput. 72 (2), 254–267 (2012)
Wende, F., Steinke, T., Reinefeld, A.: The impact of process placement and oversubscription on application performance: a case study for exascale computing. Technical report 15–05, ZIB (2015)
Winkel, M., Speck, R., Hübner, H., Arnold, L., Krause, R., Gibbon, P.: A massively parallel, multi-disciplinary Barnes-Hut tree code for extreme-scale N-body simulations. Comput. Phys. Commun. 183 (4), 880–889 (2012)
Wisniewski, R.W., Inglett, T., Keppel, P., Murty, R., Riesen, R.: mOS: an architecture for extreme-scale operating systems. In: Proceedings of the ROSS’14, Munich, pp. 2:1–2:8. ACM (2014)
XtreemFS – a cloud file system. http://www.xtreemfs.org. Accessed 20 Nov 2015
Xue, M., Droegemeier, K.K., Weber, D.: Numerical prediction of high-impact local weather: a driver for petascale computing. In: Bader, D.A. (ed.) Petascale Computing: Algorithms and Applications, pp. 103–124. Chapman & Hall/CRC, Boca Raton (2008)
Zheng, F., Yu, H., Hantas, C., Wolf, M., Eisenhauer, G., Schwan, K., Abbasi, H., Klasky, S.: Goldrush: resource efficient in situ scientific data analytics using fine-grained interference aware execution. In: Proceedings of the Supercomputing 2013, Eugene, pp. 78:1–78:12. ACM (2013)

Acknowledgements

This research and the work presented in this paper are supported by the German priority program 1648 “Software for Exascale Computing” via the research project FFMK [16]. We also thank the cluster of excellence “Center for Advancing Electronics Dresden” (cfaed). The authors acknowledge the Jülich Supercomputing Centre, the Gauss Centre for Supercomputing, and the John von Neumann Institute for Computing for providing compute time on the JUQUEEN supercomputer.

Author information

Authors and Affiliations

Department of Computer Science, TU Dresden, Dresden, Germany
Carsten Weinhold, Adam Lackorzynski, Jan Bierbaum, Martin Küttler, Maksym Planeta & Hermann Härtig
Department of Computer Science, The Hebrew University of Jerusalem, Jerusalem, Israel
Amnon Shiloh, Ely Levy, Tal Ben-Nun & Amnon Barak
Zuse Institute Berlin, Berlin, Germany
Thomas Steinke, Thorsten Schütt, Jan Fajerski & Alexander Reinefeld
Center for Information Services and HPC, TU Dresden, Dresden, Germany
Matthias Lieber & Wolfgang E. Nagel

Corresponding author

Correspondence to Carsten Weinhold.

Editor information

Editors and Affiliations

Technische Universität München Institut für Informatik, Garching, Bayern, Germany
Hans-Joachim Bungartz
Technische Universität München Institut für Informatik, Garching, Germany
Philipp Neumann
Technische Universität Dresden, Dresden, Germany
Wolfgang E. Nagel

Cite this paper

Weinhold, C. et al. (2016). FFMK: A Fast and Fault-Tolerant Microkernel-Based System for Exascale Computing. In: Bungartz, H.-J., Neumann, P., Nagel, W. (eds) Software for Exascale Computing - SPPEXA 2013-2015. Lecture Notes in Computational Science and Engineering, vol 113. Springer, Cham. https://doi.org/10.1007/978-3-319-40528-5_18

DOI: https://doi.org/10.1007/978-3-319-40528-5_18
Published: 15 September 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-40526-1
Online ISBN: 978-3-319-40528-5
eBook Packages: Mathematics and Statistics (R0)