The Processing-in-Memory Paradigm: Mechanisms to Enable Adoption

  • Saugata Ghose
  • Kevin Hsieh
  • Amirali Boroumand
  • Rachata Ausavarungnirun
  • Onur Mutlu
Chapter

Abstract

Performance improvements from DRAM technology scaling have lagged behind the improvements from logic technology scaling for many years. As application demand for main memory continues to grow, DRAM-based main memory is becoming an increasingly large system bottleneck in terms of both performance and energy consumption. A major reason for poor memory performance and energy efficiency is memory’s inability to perform computation. Instead, data stored within DRAM must be moved into the CPU before any computation can take place. This data movement is costly, as it incurs high latency and consumes significant energy to transfer the data across the pin-limited memory channel. Moreover, the data moved to the CPU is often not reused, and thus does not benefit from being cached within the CPU, which makes it difficult to amortize the overhead of data movement.

Modern 3D-stacked DRAM architectures provide an opportunity to avoid unnecessary data movement between memory and the CPU. These multi-layer architectures include a logic layer, where compute logic can be integrated underneath multiple layers of DRAM cell arrays (i.e., the memory layers) within the same chip. Architects can take advantage of the logic layer to perform processing-in-memory (PIM), or near-data processing, where some of the computation is moved from the CPU to the logic layer underneath the memory layer. In a PIM architecture, the logic layer within DRAM has access to the high internal bandwidth available within 3D-stacked DRAM (which is much greater than the bandwidth available in the narrow memory channel between DRAM and the CPU). Thus, PIM architectures can effectively free up valuable bandwidth on the bandwidth-limited memory channel while at the same time reducing system energy consumption.

A number of important issues arise when we add compute logic to DRAM. In particular, logic within DRAM does not have low-latency access to common CPU structures that are essential for modern application execution, such as the virtual memory mechanisms, e.g., the translation lookaside buffer (TLB) or the page table walker, and the cache coherence mechanisms, e.g., the coherence directory. To ease the widespread adoption of PIM, we ideally would like to maintain traditional virtual memory abstractions and the shared memory programming model. This requires efficient mechanisms that can provide logic in DRAM with access to virtual memory and cache coherence without having to communicate frequently with the CPU, as off-chip communication between the CPU and DRAM consumes much of the limited bandwidth that PIM aims to avoid using. To this end, we propose and evaluate two general-purpose solutions that can be used by PIM architectures to minimize unnecessary off-chip communication. The first, IMPICA, is an efficient in-memory accelerator for pointer chasing, which can handle address translation entirely within DRAM. The second, LazyPIM, provides coherence support without the need to continually communicate with the CPU. We show that both of these mechanisms provide a significant benefit for a number of important memory-intensive applications, thereby both improving performance and reducing energy consumption.
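
To make the data movement argument concrete, the following is a minimal toy model, not the IMPICA design itself. It counts the bytes that would cross the memory channel when a linked list is traversed by the CPU versus by logic sitting next to the data. The 64-byte transfer size, the `Node` layout, the dict standing in for DRAM, and the function names `build_list`, `cpu_lookup`, and `pim_lookup` are all illustrative assumptions; address translation and coherence, which the chapter identifies as the hard problems, are deliberately left out.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

CACHE_LINE_BYTES = 64  # assumed transfer granularity on the memory channel


@dataclass
class Node:
    key: int
    value: int
    next_addr: Optional[int]  # "pointer" to the next node, None at the tail


def build_list(n: int, base: int = 0x1000) -> Dict[int, Node]:
    """Lay out an n-node linked list in a dict that stands in for DRAM."""
    memory: Dict[int, Node] = {}
    for i in range(n):
        addr = base + i * CACHE_LINE_BYTES
        next_addr = addr + CACHE_LINE_BYTES if i < n - 1 else None
        memory[addr] = Node(key=i, value=i * 10, next_addr=next_addr)
    return memory


def cpu_lookup(memory: Dict[int, Node], head: int, key: int) -> Tuple[Optional[int], int]:
    """CPU-side traversal: every visited node crosses the memory channel."""
    bytes_moved = 0
    addr: Optional[int] = head
    while addr is not None:
        node = memory[addr]              # dependent load: needs the previous node
        bytes_moved += CACHE_LINE_BYTES  # one line fetched per node visited
        if node.key == key:
            return node.value, bytes_moved
        addr = node.next_addr
    return None, bytes_moved


def pim_lookup(memory: Dict[int, Node], head: int, key: int) -> Tuple[Optional[int], int]:
    """In-memory traversal: only the request and the reply cross the channel."""
    bytes_moved = CACHE_LINE_BYTES  # assumed request packet (head pointer, key)
    addr: Optional[int] = head
    result: Optional[int] = None
    while addr is not None:
        node = memory[addr]          # internal access, not counted as off-chip
        if node.key == key:
            result = node.value
            break
        addr = node.next_addr
    return result, bytes_moved + CACHE_LINE_BYTES  # assumed reply packet


if __name__ == "__main__":
    mem = build_list(100)
    print(cpu_lookup(mem, 0x1000, 99))
    print(pim_lookup(mem, 0x1000, 99))
```

In this model the CPU-side traffic grows with the number of nodes visited, while the in-memory version stays at one request and one reply regardless of list length. That is the intuition behind offloading pointer chasing, under the simplifying assumption that the in-memory logic can resolve every pointer without consulting the CPU.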

Notes

Acknowledgements

We thank all of the members of the SAFARI Research Group, and our collaborators at Carnegie Mellon, ETH Zürich, and other universities, who have contributed to the various works we describe in this chapter. Thanks also go to our research group’s industrial sponsors over the past 9 years, especially Google, Huawei, Intel, Microsoft, NVIDIA, Samsung, Seagate, and VMware. This work was also partially supported by the Intel Science and Technology Center for Cloud Computing, the Semiconductor Research Corporation, the Data Storage Systems Center at Carnegie Mellon University, and NSF grants 1212962, 1320531, and 1409723.

  184. 184.
    A. Rogers, M. C. Carlisle, J.H. Reppy, L.J. Hendren, Supporting dynamic data structures on distributed-memory machines, in TOPLAS (1995)Google Scholar
  185. 185.
    P. Rosenfeld, E. Cooper-Balis, B. Jacob, DRAMSim2: a cycle accurate memory system simulator, in CAL (2011)Google Scholar
  186. 186.
    A. Roth, G.S. Sohi, Effective jump-pointer prefetching for linked data structures, in ISCA (1999)Google Scholar
  187. 187.
    A. Roth, A. Moshovos, G.S. Sohi, Dependence based prefetching for linked data structures, in ASPLOS (1998)Google Scholar
  188. 188.
    SAFARI Research Group, IMPICA (in-memory pointer chasing accelerator) – GitHub repository. https://github.com/CMU-SAFARI/IMPICA/
  189. 189.
    SAFARI Research Group, Ramulator: A DRAM simulator – GitHub repository. https://github.com/CMU-SAFARI/ramulator/
  190. 190.
    SAFARI Research Group, SAFARI software tools – GitHub repository. https://github.com/CMU-SAFARI/
  191. 191.
    SAFARI Research Group, SoftMC v1.0 – GitHub repository. https://github.com/CMU-SAFARI/SoftMC/
  192. 192.
    D. Sanchez, L. Yen, M.D. Hill, K. Sankaralingam, Implementing signatures for transactional memory, in MICRO (2007)Google Scholar
  193. 193.
    SAP SE, SAP HANA. http://www.hana.sap.com/
  194. 194.
    B. Schroeder, E. Pinheiro, W.-D. Weber, DRAM errors in the wild: a large-scale field study, in SIGMETRICS (2009)Google Scholar
  195. 195.
    V. Seshadri, D. Lee, T. Mullins, H. Hassan, A. Boroumand, J. Kim, M.A. Kozuch, O. Mutlu, P.B. Gibbons, T.C. Mowry, Buddy-RAM: improving the performance and efficiency of bulk bitwise operations using DRAM (2016). arXiv:1611.09988 [cs:AR]Google Scholar
  196. 196.
    V. Seshadri, D. Lee, T. Mullins, H. Hassan, A. Boroumand, J. Kim, M.A. Kozuch, O. Mutlu, P.B. Gibbons, T.C. Mowry, Ambit: in-memory accelerator for bulk bitwise operations using commodity DRAM technology, in MICRO (2017)Google Scholar
  197. 197.
    V. Seshadri, Simple DRAM and virtual memory abstractions to enable highly efficient memory systems. Ph.D. dissertation, Carnegie Mellon University, 2016Google Scholar
  198. 198.
    V. Seshadri, O. Mutlu, The processing using memory paradigm: In-DRAM bulk copy, initialization, bitwise AND and OR (2016). arXiv:1610.09603 [cs:AR]Google Scholar
  199. 199.
    V. Seshadri, O. Mutlu, Simple operations in memory to reduce data movement. Adv. Comput. 106, 107–166 (2017)Google Scholar
  200. 200.
    V. Seshadri, Y. Kim, C. Fallin, D. Lee, R. Ausavarungnirun, G. Pekhimenko, Y. Luo, O. Mutlu, M.A. Kozuch, P.B. Gibbons, T.C. Mowry, RowClone: fast and energy-efficient in-DRAM bulk data copy and initialization, in MICRO (2013)Google Scholar
  201. 201.
    V. Seshadri, A. Bhowmick, O. Mutlu, P.B. Gibbons, M.A. Kozuch, T.C. Mowry, The dirty-block index, in ISCA (2014)Google Scholar
  202. 202.
    V. Seshadri, K. Hsieh, A. Boroumand, D. Lee, M.A. Kozuch, O. Mutlu, P.B. Gibbons, T.C. Mowry, Fast bulk bitwise AND and OR in DRAM, CAL (2015)Google Scholar
  203. 203.
    V. Seshadri, T. Mullins, A. Boroumand, O. Mutli, P.B. Gibbons, M.A. Kozuch, T.C. Mowry, Gather-scatter DRAM: in-DRAM address translation to improve the spatial locality of non-unit strided accesses, in MICRO (2015)Google Scholar
  204. 204.
    V. Seshadri, S. Yedkar, H. Xin, O. Mutlu, P.B. Gibbons, M.A. Kozuch, T.C. Mowry, Mitigating prefetcher-caused pollution using informed caching policies for prefetched blocks. ACM TACO 11(4), 51:1–51:22 (2015)Google Scholar
  205. 205.
    A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J.P. Strachan, M. Hu, R.S. Williams, V. Srikumar, ISAAC: a convolutional neural network accelerator with in-situ analog arithmetic in crossbars, in ISCA (2016)Google Scholar
  206. 206.
    J.S. Shapiro, J. Adams, Design evolution of the EROS single-level store, in USENIX ATC (2002)Google Scholar
  207. 207.
    J.S. Shapiro, J.M. Smith, D.J. Farber, EROS: a fast capability system, in SOSP (1999)Google Scholar
  208. 208.
    D.E. Shaw, S.J. Stolfo, H. Ibrahim, B. Hillyer, G. Wiederhold, J. Andrews, The NON-VON database machine: a brief overview. IEEE Database Eng. Bull. 4, 41–52 (1981)Google Scholar
  209. 209.
    J. Shun, G.E. Blelloch, Ligra: a lightweight graph processing framework for shared memory, in PPoPP (2013)Google Scholar
  210. 210.
    J.E. Smith, Decoupled access/execute computer architectures, in ISCA (1982)Google Scholar
  211. 211.
    J.E. Smith, Dynamic instruction scheduling and the astronautics ZS-1, in Computer (1986)Google Scholar
  212. 212.
    J.E. Smith, S. Weiss, N.Y. Pang, A simulation study of decoupled architecture computers, in IEEE TC (1986)Google Scholar
  213. 213.
    Y. Solihin, J. Torrellas, J. Lee, Using a user-level memory thread for correlation prefetching, in ISCA (2002)Google Scholar
  214. 214.
    V. Sridharan, N. DeBardeleben, S. Blanchard, K.B. Ferreira, J. Stearley, J. Shalf, S. Gurumurthi, Memory errors in modern systems: the good, the bad, and the ugly, in ASPLOS (2015)Google Scholar
  215. 215.
    S. Srikantaiah, M. Kandemir, Synergistic TLBs for high performance address translation in chip multiprocessors, in MICRO (2010)Google Scholar
  216. 216.
    S. Srinath, O. Mutlu, H. Kim, Y.N. Patt, Feedback directed prefetching: improving the performance and bandwidth-efficiency of hardware prefetchers, in HPCA (2007)Google Scholar
  217. 217.
    Stanford Network Analysis Project, http://snap.stanford.edu/
  218. 218.
    H.S. Stone, A logic-in-memory computer, in TC (1970)Google Scholar
  219. 219.
    M. Stonebraker, A. Weisberg, The VoltDB main memory DBMS. IEEE Data Eng. Bull. 36, 21–27 (2013)Google Scholar
  220. 220.
    D.B. Strukov, G.S. Snider, D.R. Stewart, R.S. Williams, The missing memristor found. Nature 453, 80 (2008)Google Scholar
  221. 221.
    Z. Sura, A. Jacob, T. Chen, B. Rosenburg, O. Sallenave, C. Bertolli, S. Antao, J. Brunheroto, Y. Park, K. O’Brien, R. Nair, Data access optimization in a processing-in-memory system, in CF (2015)Google Scholar
  222. 222.
    R.M. Tomasulo, An efficient algorithm for exploiting multiple arithmetic units, in IBM JRD (1967)Google Scholar
  223. 223.
    Transaction Processing Performance Council, TPC benchmarks. http://www.tpc.org
  224. 224.
    M. Waldvogel, G. Varghese, J. Turner, B. Plattner, Scalable high speed IP routing lookups, in SIGCOMM (1997)Google Scholar
  225. 225.
    L. Wang, J. Zhan, C. Luo, Y. Zhu, Q. Yang, Y. He, W. Gao, Z. Jia, Y. Shi, S. Zhang, C. Zheng, G. Lu, K. Zhan, X. Li, B. Qiu, BigDataBench: a big data benchmark suite from internet services, in HPCA (2014)Google Scholar
  226. 226.
    M.V. Wilkes, The memory gap and the future of high performance memories, in CAN (2001)Google Scholar
  227. 227.
    P.R. Wilson, Uniprocessor garbage collection techniques, in IWMM (1992)Google Scholar
  228. 228.
    H.-S.P. Wong, S. Raoux, S. Kim, J. Liang, J.P. Reifenberg, B. Rajendran, M. Asheghi, K.E. Goodson, Phase change memory. Proc. IEEE 98, 2201–2227 (2010)Google Scholar
  229. 229.
    H.-S.P. Wong, H.-Y. Lee, S. Yu, Y.-S. Chen, Y. Wu, P.-S. Chen, B. Lee, F.T. Chen, M.-J. Tsai, Metal-oxide RRAM. Proc. IEEE 100, 1951–1970 (2012)Google Scholar
  230. 230.
    L. Wu, R.J. Barker, M.A. Kim, K.A. Ross, Navigating big data with high-throughput, energy-efficient data partitioning, in ISCA (2013)Google Scholar
  231. 231.
    L. Wu, A. Lottarini, T.K. Paine, M.A. Kim, K.A. Ross, Q100: the architecture and design of a database processing unit, in ASPLOS (2014)Google Scholar
  232. 232.
    Y. Wu, Efficient discovery of regular stride patterns in irregular programs, in PLDI (2002)Google Scholar
  233. 233.
    W.A. Wulf, S.A. McKee, Hitting the memory wall: implications of the obvious, CAN (1995)Google Scholar
  234. 234.
    S.L. Xi, O. Babarinsa, M. Athanassoulis, S. Idreos, Beyond the wall: near-data processing for databases, in DaMoN (2015)Google Scholar
  235. 235.
    H. Xin, D. Lee, F. Hormozdiari, S. Yedkar, O. Mutlu, C. Alkan, Accelerating read mapping with FastHASH, in BMC Genomics (2013)Google Scholar
  236. 236.
    H. Xin, J. Greth, J. Emmons, G. Pekhimenko, C. Kingsford, C. Alkan, O. Mutlu, Shifted hamming distance: a fast and accurate SIMD-friendly filter to accelerate alignment verification in read mapping. Bioinformatics 31, 1553–1560 (2015)Google Scholar
  237. 237.
    J. Xue, Z. Yang, Z. Qu, S. Hou, Y. Dai, Seraph: an efficient, low-cost system for concurrent graph processing, in HPDC (2014)Google Scholar
  238. 238.
    C. Yang, A.R. Lebeck, Push vs. pull: data movement for linked data structures, in ICS (2000)Google Scholar
  239. 239.
    H. Yoon, R.A.J. Meza, R. Harding, O. Mutlu, Row buffer locality aware caching policies for hybrid memories, in ICCD (2012)Google Scholar
  240. 240.
    H. Yoon, J. Meza, N. Muralimanohar, N.P. Jouppi, O. Mutlu, Efficient data mapping and buffering techniques for multilevel cell phase-change memories, in ACM TACO (2014)Google Scholar
  241. 241.
    X. Yu, G. Bezerra, A. Pavlo, S. Devadas, M. Stonebraker, Staring into the abyss: an evaluation of concurrency control with one thousand cores, in VLDB (2014)Google Scholar
  242. 242.
    X. Yu, C.J. Hughes, N. Satish, S. Devadas, IMP: indirect memory prefetcher, in MICRO (2015)Google Scholar
  243. 243.
    D.P. Zhang, N. Jayasena, A. Lyashevsky, J.L. Greathouse, L. Xu, M. Ignatowski, TOP-PIM: throughput-oriented programmable processing in memory, in HPDC (2014)Google Scholar
  244. 244.
    J. Zhao, O. Mutlu, Y. Xie, FIRM: fair and high-performance memory control for persistent memory systems, in MICRO (2014)Google Scholar
  245. 245.
    P. Zhou, B. Zhao, J. Yang, Y. Zhang, A durable and energy efficient main memory using phase change memory technology, in ISCA (2009)Google Scholar
  246. 246.
    Q. Zhu, T. Graf, H.E. Sumbul, L. Pileggi, F. Franchetti, Accelerating sparse matrix-matrix multiplication with 3D-stacked logic-in-memory hardware, in HPEC (2013)Google Scholar
  247. 247.
    C.B. Zilles, Benchmark health considered harmful, in CAN (2001)Google Scholar
  248. 248.
    C.B. Zilles, G.S. Sohi, Execution-based prediction using speculative slices, in ISCA (2001)Google Scholar
  249. 249.
    W.K. Zuravleff, T. Robinson, Controller for a synchronous DRAM that maximizes throughput by allowing memory requests and commands to be issued out of order. US Patent No. 5,630,096 (1997)Google Scholar

Copyright information

© Springer International Publishing AG, part of Springer Nature 2019

Authors and Affiliations

  • Saugata Ghose (1)
  • Kevin Hsieh (1)
  • Amirali Boroumand (1)
  • Rachata Ausavarungnirun (1)
  • Onur Mutlu (2)

  1. Carnegie Mellon University, Pittsburgh, USA
  2. ETH Zürich, Zürich, Switzerland
