Extreme-scale earthquake simulations on Sunway TaihuLight
Earthquakes, as one of the most disruptive natural hazards, have been a major research target for generations of scientists. Numerical simulation of earthquakes, as one of the few methods to verify and improve scientists' understanding of the earthquake process, and a key tool in various earthquake engineering applications, has long been both an important and a challenging application on supercomputers. In this paper, we discuss the major challenges in developing an accurate earthquake simulation tool on supercomputers. Based on the discussion, we then demonstrate our efforts on performing extreme-scale earthquake simulations on Sunway TaihuLight, a 125-Pflops machine with over 10 million heterogeneous cores. With systematic approaches to resolving the memory bandwidth constraint, we manage to achieve 8% to 16% efficiency when utilizing the entire machine to simulate the Tangshan and Wenchuan Earthquakes at an unprecedented spatial resolution.
Keywords: Sunway TaihuLight · Computational seismology · Earthquake ground motions · Parallel scalability · Neural nets
The earthquake, for both the tremendous damage it can cause and the complexity involved in understanding its formation and triggering process, is one of the ultimate scientific puzzles that scientists have been working on for generations.
In Chinese history, the study of earthquakes dates back to the famous scholar Zhang Heng (AD 78 to AD 139) (Stein and Wysession 2009), who designed a delicate seismoscope consisting of eight dragons holding balls in their mouths, with eight toads sitting beneath them. When a ball fell from a dragon's mouth into a toad's, the seismoscope indicated an earthquake that had happened thousands of miles away. Although designed nearly two thousand years ago, Zhang's seismoscope already exhibited technical features similar to the modern seismic measurement instruments that originated in the 1880s (Milne 1886).
Since then, science and technology have evolved to provide significantly more accurate devices for recording seismic events across the globe. On the other hand, direct detection of the subsurface still reaches only tens of thousands of meters deep, leaving over 99% of the earth's interior unseen.
Numerical simulation, in contrast, offers a way to study the parts we cannot observe directly:
- as a forward modeling engine, earthquake simulation produces potential scenarios of specific earthquakes happening at specific locations;
- as a test engine, forward simulation verifies, and potentially improves, scientists' hypotheses about earthquake rupture processes, as well as the underlying structure of the earth;
- when coupled with engineering tools, the simulation engine serves as important guidance for earthquake engineering design and policy-making processes.
The resulting software managed to simulate both the Tangshan and the Wenchuan Earthquakes with an unprecedented level of detail, demonstrating great potential to enable more exciting seismology research in the coming years.
2 Simulating earthquakes: challenges
Memory requirements A destructive earthquake normally covers a region spanning hundreds of kilometers. Depending on the size and shape of the fault, the complete domain of the simulation scenario can vary significantly along different dimensions. A general case usually covers a plane area of a few hundred kilometers by a few hundred kilometers, and 50 to 100 km along the vertical axis. Taking the Tangshan Earthquake as an example, we consider a 3D domain of 320 km by 312 km by 40 km. To serve basic earthquake engineering analysis, we need a spatial resolution of 20 m to support a frequency range up to 10 Hz. In such a scenario, the simulation involves 562.5 billion grid points, 2.25 trillion unknowns, and roughly 150 TB of memory space. For the 1-PB memory system of Sunway TaihuLight, further improvement of the spatial resolution would only be possible with the integration of compression schemes (Fu et al. 2017).
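The sizing arithmetic above can be reproduced in a few lines. This is only a sketch: the 32-variables-per-point and 4-bytes-per-value figures below are illustrative assumptions, not the exact layout of the production code.

```c
#include <stdint.h>

/* Grid points along one axis: physical extent (km) over resolution (m). */
static uint64_t axis_points(double extent_km, double dx_m) {
    return (uint64_t)(extent_km * 1000.0 / dx_m);
}

/* Total grid points of a box-shaped domain. */
uint64_t grid_points(double x_km, double y_km, double z_km, double dx_m) {
    return axis_points(x_km, dx_m) * axis_points(y_km, dx_m) *
           axis_points(z_km, dx_m);
}

/* Memory footprint in bytes, assuming `vars` 4-byte values per point. */
uint64_t mem_bytes(uint64_t npoints, int vars) {
    return npoints * (uint64_t)vars * 4u;
}
```

For the generic 300 km by 300 km by 50 km case at 20-m resolution, this gives the 562.5 billion points quoted above; with 32 single-precision variables per point, the raw state alone is already tens of terabytes, before any auxiliary arrays.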
Compute requirements To capture the complex nonlinear behavior of ground motions in earthquakes, the current methods also involve high computational complexity. For each spatial grid point, we normally have 30 to 40 variables and around 500 floating-point arithmetic operations. For a scenario similar to the one above (300 km by 300 km by 50 km in problem size, and 20-m spatial resolution), a complete modeling process (100,000 time steps to finish a simulation of 100 s) amounts to a total of 100 exa floating-point operations. Even on the most powerful systems in the world, such a computation requires weeks to months to accomplish (Fig. 1).
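A two-line estimator makes these figures concrete. The 0.1 sustained-efficiency fraction used in the example is an assumption for illustration, not a measured value; note also that the stencil-only product (562.5 billion points × 500 flops × 100,000 steps ≈ 28 exaflops) is below the quoted 100-exaflop total, which presumably also covers work beyond the single core kernel.

```c
/* Total floating-point operations for a full run. */
double total_flops(double npoints, double flops_per_point, double steps) {
    return npoints * flops_per_point * steps;
}

/* Wall-clock seconds on a machine with `peak` flop/s sustained at
 * fraction `eff` of peak. */
double runtime_s(double work, double peak, double eff) {
    return work / (peak * eff);
}
```

At an assumed 1-Pflop/s peak sustained at 10% efficiency, 100 exaflops of work takes about a million seconds, i.e. roughly 12 days, consistent with the weeks-to-months figure above.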
I/O requirements For simulations at the scale of the full machine, normal I/O operations can also become a tough challenge. A complete earthquake simulation process involves hundreds of thousands of timesteps, which translate to weeks or months of run time even on some of the largest supercomputers in the world. On the other hand, when running at full or half scale of such supercomputers, the MTBF (mean time between failures, usually hardware failures) is typically a day or several days (in our case, when running earthquake simulations on Sunway TaihuLight, we observe an MTBF of 18 to 19 h for full-scale runs, and of 4 to 5 days for half-scale runs). Checkpoints that enable restarting the simulation are therefore needed. When running at the full scale of Sunway TaihuLight, the entire memory space is around 1 petabyte, and storing a minimal set that can restart the simulation requires around 100 terabytes. This size and scale present both capacity and throughput challenges for I/O experts.
Multi-disciplinary brainpower requirements While the above challenges bring exciting research questions for computer science researchers, the biggest challenge in achieving scientifically better earthquake simulation lies in the complexity of the problem and the necessity of involving people from many different domains. While the study of earthquakes and the related interior structure of the earth is a grand scientific problem, developing a computational platform that can enable simulation of the process and analysis of the data is a grand engineering challenge (touching many of the aspects discussed above). Therefore, any progress along this road requires efforts from different domains and different disciplines. The SCEC (Southern California Earthquake Center) program is a good example of forming such a multi-disciplinary team and performing long-term research and development. In our case of modeling earthquakes on Sunway TaihuLight, the supercomputing platform becomes the glue that attracts people from different organizations to contribute to the same scientific goal. In general, however, we are still in great need of people who understand both science and engineering, both earthquakes and supercomputers. Such a shortage of inter-disciplinary brainpower also exists in many other domains, such as climate, life science, and astrophysics.
3 Sunway TaihuLight: the hardware context
While the architectural discussion has already appeared many times in previous literature (Fu et al. 2016a, b, 2017; Yang et al. 2016), we still present the key information here to provide a hardware context for the following technical sections.
While the management processing element (MPE) in each core group (CG) adopts a conventional memory hierarchy with both L1 instruction and data caches, each of the 64 computing processing elements (CPEs) uses a 64-KB local data memory (LDM) as a scratch-pad buffer instead of a data cache. Replacing the L1 data cache with a 64-KB LDM presents a completely different memory hierarchy to programmers and requires, in many cases, a complete rethink of the memory scheme to enable any meaningful utilization of the system.
4 Framework design
While the earthquake simulation framework project on Sunway TaihuLight started only in early 2017, we were fortunate to begin our development from established codes and collaborations with domain scientists. The first part of the framework is the source generator that produces the dynamic rupture source file for the subsequent simulation of the earthquake, detailed in Sect. 4.1. The second part, which is also the more computationally intensive one, is the forward modeling engine that simulates the propagation of seismic waves across large geographic regions, detailed in Sect. 4.2. Our current development is mainly based on two existing codes. One is the AWP-ODC (Cui et al. 2010) developed by SCEC, which is one of the most widely used simulation engines on several world-leading supercomputer facilities. The other is the curved-grid finite-difference method (CG-FDM) (Zhang et al. 2014) developed by researchers from Southern University of Science and Technology (SUSTech).
Based on the existing scientific codes, we build a unified software framework that integrates different functions, as shown in Fig. 1, including different components that range from dynamic rupture source generation, mesh generation, to the most time-consuming wave propagation part.
4.1 Source generator
In this work, we generate the source by dynamic rupture modeling on the non-planar Tangshan fault with the curved-grid finite-difference method (CG-FDM) (Zhang et al. 2014). While keeping the computational efficiency and easy implementation of conventional FDM, the CG-FDM is also flexible in modeling complicated fault models by using general curvilinear grids. Thus, this method can model the rupture dynamics of faults with complex geometry, such as non-planar faults and faults with step-overs, even in the presence of irregular topography. The method has proven to be an efficient tool for dynamic rupture modeling through benchmark problem tests (Harris et al. 2018) and has been used in scenario earthquake simulations (Zhang et al. 2017).
4.2 Forward modeling engine
In our current framework, we include two forward modeling engines to support the wave propagation simulation.
One is the AWP-ODC code, developed by research groups from SCEC (Cui et al. 2010). The other is the CG-FDM code developed by the SUSTech research team (Zhang et al. 2014), which is also used for the source generation part.
The CG-FDM, which was recently designed to solve seismic wave propagation in media with complex geometry, discretizes the computational volume with a collocated grid in curvilinear coordinates. Because of this flexibility in discretization, the CG-FDM can model the response of realistic features of an earthquake, such as topography and complex fault systems.
In contrast, the AWP-ODC solves the wave equations on a staggered grid in Cartesian coordinates. After years of development, AWP-ODC has become a popular, highly efficient, and well-optimized tool.
Our efforts reported in this paper focus on the redesign and tuning of AWP-ODC for Sunway TaihuLight. AWP-ODC, which stands for Anelastic Wave Propagation by Olsen, Day, and Cui, has developed over the years from the original finite-difference code written by Kim Olsen at the University of Utah (Olsen 1994). The code has been a major tool for the Community Modeling Environment (CME) of SCEC, and has been scaled to various parallel computing platforms, such as TeraGrid (Cui et al. 2007), Jaguar (Cui et al. 2010), and Titan (Cui et al. 2013; Roten et al. 2016).
The current version of AWP-ODC has built-in plasticity simulation capabilities for nonlinear effects, which can largely improve the simulation accuracy, but at the cost of more variable arrays and more computation. The next sections present the major approaches we take to achieve AWP-Sunway, a derived version of AWP-ODC that is completely redesigned and tuned for the Sunway TaihuLight system.
5 Parallelization and optimization
5.1 A customized parallelization design
The first challenge in achieving a full-scale application on Sunway TaihuLight is, as mentioned above, to identify a suitable mapping scheme that translates the physics into compute instructions, data movements, and message passing among the 10 million cores in the system.
A large part of the mapping scheme is about decomposition, i.e. how we decompose the large problem into parts that sit in different nodes, and further down, in different cores. One specific complexity in the case of earthquake simulation is that the computational kernels normally involve reading and writing over 20 variable arrays that cover the entire mesh. In such cases, many previous optimization techniques, such as the 3.5D blocking scheme (Nguyen et al. 2010), become impractical due to the extremely high memory volume requirement. As a result, our solution is a customized domain decomposition scheme that exposes enough parallelism for the 10 million cores while minimizing the related memory costs.
2D decomposition for MPI processes: For the storage of all the 3D arrays, we take the z axis (the vertical direction) as the fastest axis, the y axis as the second, and the x axis as the slowest axis. In typical earthquake simulation scenarios, the x and y dimensions (hundreds of kilometers) are significantly larger than the z dimension (tens of kilometers). Therefore, to minimize communication among the different processes, at the first level, instead of taking a 3D approach, we decompose the horizontal plane into \(M_x\) by \(M_y\) partitions, each corresponding to a specific MPI process. With the well-designed MPI scheme inherited from AWP-ODC (Cui et al. 2010), which hides halo communication behind computation, we can, in extreme cases, scale up to 160,000 (400 by 400) MPI processes across the full machine.
Blocking for each CG: At the second level, instead of assigning all the mesh points directly to the different cores within a CG, we add a blocking mechanism along the y and z axes that assigns a suitably sized block to the CG, so as to achieve more efficient utilization of the 64-KB LDM of each CPE. Each CG iterates over these blocks to finish the processing.
2D decomposition for Athreads: We further partition each block into different regions for the CPEs along the y and z dimensions (with each thread iterating along the x direction), so as to achieve fast memory accesses for the different threads.
LDM buffering scheme: For each CPE, we load a suitably sized portion of the computing domain (both the central and the halo parts) into the LDM using DMA operations, and then perform the computation on it. The DMA operations are designed to be asynchronous, so as to overlap with the computation.
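The first level of this hierarchy can be sketched as plain index arithmetic. The helper below is hypothetical, not the actual AWP-Sunway code; conceptually, the same split is then reapplied inside each CG across the 8×8 CPE grid along y and z.

```c
typedef struct { int x0, y0, nx, ny; } Part2D;

/* Even 2D split of an NX-by-NY horizontal plane over Mx*My MPI ranks,
 * with remainder points spread over the lowest-index ranks. */
Part2D rank_part(int NX, int NY, int Mx, int My, int rank) {
    int rx = rank % Mx, ry = rank / Mx;   /* rank's 2D coordinates */
    int bx = NX / Mx, by = NY / My;       /* base block sizes */
    Part2D p;
    p.nx = bx + (rx < NX % Mx ? 1 : 0);
    p.ny = by + (ry < NY % My ? 1 : 0);
    p.x0 = rx * bx + (rx < NX % Mx ? rx : NX % Mx);
    p.y0 = ry * by + (ry < NY % My ? ry : NY % My);
    return p;
}
```

At 20-m resolution, a 320 km by 312 km horizontal plane is 16,000 by 15,600 points, so the extreme 400-by-400 process grid gives each rank a 40-by-39 column of the mesh.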
5.2 Memory-oriented optimization
To break the memory constraints, a key part of the solution is to efficiently utilize the memory hierarchy of the SW26010 processor.
The challenges are clear: a relatively low memory bandwidth, and the absence of a hardware-managed cache on the 64 CPEs. While the absence of an automated cache demands extra effort from programmers to utilize the bandwidth efficiently, the user-controlled 64-KB LDM also brings the option of exploring a customized memory scheme for the given algorithm or application.
Another hardware feature that we can take advantage of is the two instruction issue ports of each CPE: one port is dedicated to compute instructions, while the other handles DMA instructions. Therefore, a large part of the design is to achieve an asynchronous scheme that overlaps the compute and DMA instructions to the maximum possible extent.
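The overlap pattern can be sketched as a double-buffered loop. In this sketch, memcpy stands in for the asynchronous DMA get issued on the CPE's DMA port, so the overlap is only notional here; the tile size and the reduction computed on each tile are illustrative.

```c
#include <string.h>

#define TILE 512   /* elements per LDM tile (illustrative) */

/* Double-buffered processing: while tile i is consumed from one LDM
 * buffer, tile i+1 is fetched into the other.  On the real hardware the
 * fetch would be issued asynchronously and waited on before reuse. */
static double process_tiles(const double *gmem, int ntiles) {
    double ldm[2][TILE];                          /* two scratch buffers */
    double sum = 0.0;
    memcpy(ldm[0], gmem, TILE * sizeof(double));  /* prime buffer 0 */
    for (int i = 0; i < ntiles; i++) {
        int cur = i % 2;
        if (i + 1 < ntiles)                       /* "issue DMA" for next */
            memcpy(ldm[1 - cur], gmem + (i + 1) * TILE,
                   TILE * sizeof(double));
        for (int k = 0; k < TILE; k++)            /* compute on current */
            sum += ldm[cur][k];
    }
    return sum;
}

/* Self-contained demo: 4 tiles of ones should sum to 4 * TILE. */
double double_buffer_demo(void) {
    static double buf[4 * TILE];
    for (int i = 0; i < 4 * TILE; i++) buf[i] = 1.0;
    return process_tiles(buf, 4);
}
```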
One unique feature of the SW26010 is the register communication among the 64 CPEs in each CG, which provides a natural mechanism for data reuse in stencil-like computations. Using register-communication-based halo exchange inside each CG, a CPE thread only needs to load its own central region, and can acquire the halo regions from the neighboring threads through register communication operations. Only the boundary CPE threads, whose halos come from different CGs, still need to initiate DMA loads for the corresponding halo regions.
As a result, tuning the dimension parameters of the blocking scheme (such as the parameters in Fig. 3) becomes another key step toward good bandwidth utilization. We propose a blocking configuration guided by an analytic model that: (1) minimizes the number of DMA loads required for redundant halo-region reads; and (2) maximizes the effective memory bandwidth by using a large chunk size.
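A minimal form of such an analytic model treats each DMA transfer as a fixed startup latency plus a bandwidth term, and caps the block size by the LDM budget. The latency and peak-bandwidth values in the test below are assumed for illustration, not measured SW26010 figures.

```c
/* Effective bandwidth (bytes/s) of one DMA transfer of `bytes`:
 * each transfer pays a fixed startup `latency_s` plus bytes/peak. */
double eff_bw(double bytes, double latency_s, double peak_bps) {
    return bytes / (latency_s + bytes / peak_bps);
}

/* Largest block depth (elements) such that `nvars` arrays, each held in
 * `nbuf` copies (e.g. 2 for double buffering) of `elem_bytes` elements,
 * still fit in the LDM budget. */
int max_block_elems(int ldm_bytes, int nvars, int elem_bytes, int nbuf) {
    return ldm_bytes / (nvars * elem_bytes * nbuf);
}
```

The model captures the trend behind the measurements reported below: a small transfer is dominated by its startup cost, while a larger one amortizes it, hence the push toward large contiguous chunks within the LDM budget.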
Even after adopting the parallelization scheme described above, in most cases the large number of arrays we need to access means the 64-KB LDM can hold only a small portion of each array, leading to small DMA blocks and low DMA read efficiency. Our solution is to fuse arrays that are always accessed together. After the fusion, with only 3 separate arrays to read, we can afford a DMA block size of 432 bytes, improving the memory bandwidth utilization to around 80%. In the extreme case of the dstrqc kernel, the array fusion technique increases the DMA block size from 84 bytes to 512 bytes, improving the effective memory bandwidth from 50.47 to 104.82 GB/s.
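The fusion idea can be illustrated with a hypothetical trio of velocity components: interleaving fields that are always read together turns several short DMA transfers into one contiguous block several times larger. The field names and counts here are illustrative, not the actual AWP-Sunway layout.

```c
#include <stddef.h>

/* Fused (array-of-structures) layout: the three components of one grid
 * point sit side by side, so a run of points is one contiguous region. */
typedef struct { float vx, vy, vz; } Vel3;

/* DMA block size for a tile of `n` points, fused vs. separate arrays. */
size_t fused_block_bytes(size_t n) { return n * sizeof(Vel3);  }
size_t split_block_bytes(size_t n) { return n * sizeof(float); }
```

With three 4-byte fields fused, a 36-point tile happens to yield a 432-byte block, matching the figure above, versus 144 bytes per array when the fields are kept separate; the production code's fused layout may of course differ.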
5.3 On-the-fly compression
After systematic optimization of both the compute and memory aspects, we arrive at a design that comes close to exhausting the system's hardware capabilities. As the Sunway TaihuLight system, like many other supercomputers in the world, has an unbalanced ratio between compute and memory capacity, our method for further improvement is to shift the balance point slightly: trading a portion of the compute cycles to store and move data items in memory in a compressed form.
Considering the features of different variables, we propose three different lossy compression schemes, with different levels of complexities and information loss, but a fixed compression ratio from 32-bit to 16-bit numbers.
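The three schemes themselves are not detailed in this excerpt. As one simple instance of a fixed 32-to-16-bit scheme, truncating an IEEE float to its top 16 bits (sign, full 8-bit exponent, 7 mantissa bits, i.e. a bfloat16-style format) preserves the dynamic range while bounding the relative error by 2^-7; this is an illustrative stand-in, not necessarily one of the paper's schemes.

```c
#include <stdint.h>
#include <string.h>

/* Keep the top 16 bits of an IEEE-754 float: sign, full 8-bit exponent,
 * 7 mantissa bits (bfloat16-style truncation). */
uint16_t compress16(float x) {
    uint32_t u;
    memcpy(&u, &x, sizeof u);   /* type-pun via memcpy (well-defined) */
    return (uint16_t)(u >> 16);
}

/* Restore by zero-filling the dropped low mantissa bits. */
float decompress16(uint16_t h) {
    uint32_t u = (uint32_t)h << 16;
    float x;
    memcpy(&x, &u, sizeof x);
    return x;
}
```

Truncation always rounds toward zero; a production scheme would more likely round to nearest, and for variables with limited dynamic range a block-scaled 16-bit integer encoding would lose less precision.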
While the scheme can halve both the memory capacity and the bandwidth requirements, the challenge shifts to the efficiency of the scheme: the simulation must become faster even though significant additional complexity is introduced to accommodate the compression operations.
With a carefully designed blocking scheme and re-scheduling of the compute and DMA instructions, we finally manage to improve the computing performance by another 24% (processing the same scenario 24% faster after adding the compression scheme), and to enable scenarios that require twice as much memory space (more details in Fu et al. (2017)).
Table: The largest-scale results on Sunway TaihuLight (40,000 SW26010 CPUs) compared with previous works (Cui et al. 2010; Cui et al. 2013; Roten et al. 2016).
Using the optimized simulation software, we are able to perform a series of simulations of the 1976 Tangshan earthquake over a problem domain of 320 km by 312 km by 40 km, with the spatial resolution increasing from 500 m to 8 m, supporting a frequency range up to 18 Hz. To the best of our knowledge, this is the first nonlinear plasticity earthquake simulation performed at such a scale, and with such a high frequency and resolution. The plasticity ground-motion simulation of the Tangshan earthquake allows us, for the first time, to quantitatively estimate the hazards of the Tangshan earthquake in the affected area, and to provide guidance for designing proper seismic engineering standards for buildings in North China.
7 Conclusion and outlook
In this paper, based on our recent experience on Sunway TaihuLight, we summarize the major challenges for performing extreme-scale earthquake simulations on today's leadership supercomputers. One key message is that memory bandwidth and capacity are the major constraints that stop scientists from performing larger or higher-resolution simulations faster. As a result, memory-related design strategies and tuning techniques become a major focus of our work. Using a customized parallelization scheme and a set of memory-oriented optimization methods, even on TaihuLight's relatively modest memory system (a byte-to-flop ratio only 1/5 that of Titan), we achieve a 15.2-Pflops nonlinear earthquake simulation using the 10,400,000 cores of Sunway TaihuLight, up to 12.2% of the peak. Our compression scheme expands the computational performance to 18.9 Pflops (15% of the peak), and enables us to support 18-Hz, 8-m simulations, a big jump over the previous state of the art.
While this is exciting progress enabled by cutting-edge supercomputer systems, we are still far from the complete simulation system that scientists would ultimately demand. On the science side, current large-scale simulations usually focus on only one part of the earth, such as the scenario simulation of a specific earthquake in a specific region introduced in this paper. Many other efforts focus on different scales (city-oriented scenario simulation) and different parts of the earth (geodynamic simulations that focus on the mantle and the core). A more complete simulation platform would need to couple these processes across different spatial and temporal scales for a more accurate picture. On the engineering side, we would also need a coupled system covering not only the ground motion, but also the behavior of buildings, hills, and other elements that could be affected. On both frontiers, there are interesting directions that will demand decades of effort to decipher these grand scientific challenges or to achieve major engineering breakthroughs. Along the way, both hardware and software supercomputing technologies will remain an important foundation.
This work was supported in part by the National Key R&D Program of China (Grant No. 2017YFA0604500), by the National Natural Science Foundation of China (Grant No. 51761135015), and by Center for High Performance Computing and System Simulation, Pilot National Laboratory for Marine Science and Technology (Qingdao).
Compliance with ethical standards
Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
- Bao, H., Bielak, J., Ghattas, O., Kallivokas, L.F., O'Hallaron, D.R., Shewchuk, J.R., Xu, J.: Earthquake ground motion modeling on parallel computers. In: Proceedings of the 1996 ACM/IEEE Conference on Supercomputing, IEEE Computer Society, p. 13 (1996)
- Cui, Y., Moore, R., Olsen, K., Chourasia, A., Maechling, P., Minster, B., Day, S., Hu, Y., Zhu, J., Majumdar, A., et al.: Enabling very-large scale earthquake simulations on parallel machines. In: International Conference on Computational Science, pp. 46–53. Springer (2007)
- Cui, Y., Olsen, K.B., Jordan, T.H., Lee, K., Zhou, J., Small, P., Roten, D., Ely, G., Panda, D.K., Chourasia, A., et al.: Scalable earthquake simulation on petascale supercomputers. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), IEEE, pp. 1–20 (2010)
- Cui, Y., Poyraz, E., Olsen, K.B., Zhou, J., Withers, K., Callaghan, S., Larkin, J., Guest, C., Choi, D., Chourasia, A., et al.: Physics-based seismic hazard analysis on petascale heterogeneous supercomputers. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ACM, p. 70 (2013)
- Fu, H., He, C., Chen, B., Yin, Z., Zhang, Z., Zhang, W., Zhang, T., Xue, W., Liu, W., Yin, W., et al.: 18.9-Pflops nonlinear earthquake simulation on Sunway TaihuLight: enabling depiction of 18-Hz and 8-m scenarios. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ACM, p. 2 (2017)
- Fu, H., Liao, J., Xue, W., Wang, L., Chen, D., Gu, L., Xu, J., Ding, N., Wang, X., He, C., et al.: Refactoring and optimizing the community atmosphere model (CAM) on the Sunway TaihuLight supercomputer. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC16), IEEE, pp. 969–980 (2016)
- Komatitsch, D., Tsuboi, S., Ji, C., Tromp, J.: A 14.6 billion degrees of freedom, 5 teraflops, 2.5 terabyte earthquake simulation on the Earth Simulator. In: Proceedings of the 2003 ACM/IEEE Conference on Supercomputing, IEEE, pp. 4–4 (2003)
- Milne, J.: Earthquakes and Other Earth Movements, vol. 56. D. Appleton and Company, New York (1886)
- Nguyen, A., Satish, N., Chhugani, J., Kim, C., Dubey, P.: 3.5-D blocking optimization for stencil computations on modern CPUs and GPUs. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), IEEE, pp. 1–13 (2010)
- Olsen, K.B.: Simulation of three-dimensional wave propagation in the Salt Lake basin. Ph.D. thesis, Department of Geology and Geophysics, University of Utah (1994)
- Roten, D., Cui, Y., Olsen, K.B., Day, S.M., Withers, K., Savran, W.H., Wang, P., Mu, D.: High-frequency nonlinear earthquake simulations on petascale heterogeneous supercomputers. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, p. 82. IEEE Press (2016)
- Stein, S., Wysession, M.: An Introduction to Seismology, Earthquakes, and Earth Structure. Wiley, New York (2009)
- Yang, C., Xue, W., Fu, H., You, H., Wang, X., Ao, Y., Liu, F., Gan, L., Xu, P., Wang, L., et al.: 10M-core scalable fully-implicit solver for nonhydrostatic atmospheric dynamics. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, IEEE Press, p. 6 (2016)