Help Your Busy Neighbors: Dynamic Multicasts over Static Topologies

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10659)

Abstract

Acknowledged multicasts, e.g. for software-based TLB invalidation, are a performance-critical aspect of runtime environments for many-core processors. Their latency and peak throughput depend strongly on the topology used to propagate the events and to collect the acknowledgements. Assuming that interrupt latency is inevitable, previous work focused on very simple flat topologies. However, the emergence of simultaneous multi-threading with locally shared caches enables interrupt-free multicasts. This paper therefore explores and re-evaluates the design space for dynamic multicast groups that combine shared memory with active messages and helping mechanisms. We expect this new approach to considerably improve the scalability of acknowledged multicasts on many-core processors.

Keywords

Multicast, Shared memory, Many-core, TLB shootdown

Notes

Acknowledgments

This work was supported by the German Research Foundation (DFG) under grant no. NO 625/7-2. We thank our students Martin Messer and Stefan Hertrampf for supporting the implementation and evaluation.

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. Brandenburg University of Technology Cottbus-Senftenberg, Cottbus, Germany
