PIM-Align: A Processing-in-Memory Architecture for FM-Index Search Algorithm


Genomic sequence alignment is the most critical and time-consuming step in genomic analysis. Alignment algorithms generally follow a seed-and-extend model. Acceleration of the extension phase (e.g., the Smith-Waterman algorithm) has been well explored in computing-centric architectures on field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), and graphics processing units (GPUs). The seeding phase is more critical than the extension phase, but it is memory-bound: on conventional systems it suffers from fine-grained random memory accesses and limited parallelism. In this paper, we argue that the processing-in-memory (PIM) concept could be a viable solution to these problems. This paper describes "PIM-Align", an application-driven near-data processing architecture for sequence alignment. To achieve performance proportional to memory capacity by taking advantage of 3D-stacked dynamic random access memory (DRAM) technology, we propose a lightweight message mechanism between different memory partitions and a specialized hardware prefetcher for the memory access patterns of sequence alignment. Our evaluation shows that the proposed architecture can achieve 20x and 1,820x speedups compared with the best available ASIC implementation and software running on a 32-thread CPU, respectively.
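To make the memory behaviour of seeding concrete, the following is a minimal Python sketch of textbook FM-index backward search, the algorithm named in the title. It is not the PIM-Align design or the authors' code; the function names `build_fm_index` and `backward_search`, the `$` sentinel, and the uncompressed occurrence table are assumptions chosen for illustration on a toy string. The list of visited row pairs shows how each pattern symbol triggers two small table lookups at positions that depend on the previous step, which is the fine-grained, data-dependent access pattern described above.

```python
def build_fm_index(text):
    """Build a naive FM-index (C table and full Occ table) for a small text."""
    # Sentinel: assumed absent from the text and smaller than every other symbol.
    text += "$"
    # Naive suffix array; fine for toy inputs, far too slow for a real genome.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the symbol preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # c_table[c]: number of symbols in the text strictly smaller than c.
    c_table, total = {}, 0
    for ch in alphabet:
        c_table[ch] = total
        total += text.count(ch)
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in alphabet}
    for i, b in enumerate(bwt):
        for ch in alphabet:
            occ[ch][i + 1] = occ[ch][i] + (b == ch)
    return c_table, occ, len(bwt)


def backward_search(index, pattern):
    """Return (match count, list of (lo, hi) row pairs used for Occ lookups)."""
    c_table, occ, n = index
    lo, hi = 0, n  # current half-open range of suffix-array rows
    touched = []
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0, touched
        # Two Occ lookups per symbol, at rows decided by the previous iteration.
        touched.append((lo, hi))
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return 0, touched
    return hi - lo, touched


if __name__ == "__main__":
    index = build_fm_index("banana")
    count, touched = backward_search(index, "ana")
    print(count, touched)
```

Each iteration depends on the `lo` and `hi` values produced by the previous one, so a single query cannot overlap its own lookups; in this sketch, parallelism would have to come from processing many independent reads at once.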




Author information



Corresponding author

Correspondence to Xue-Qi Li.


About this article


Cite this article

Li, XQ., Tan, GM. & Sun, NH. PIM-Align: A Processing-in-Memory Architecture for FM-Index Search Algorithm. J. Comput. Sci. Technol. 36, 56–70 (2021). https://doi.org/10.1007/s11390-020-0825-3



  • accelerator design
  • genomic sequence alignment
  • near-memory computing