FARM: A Flexible Accelerator for Recurrent and Memory Augmented Neural Networks


Memory Augmented Neural Networks (MANNs), a class of Deep Neural Networks (DNNs), have recently become prominent owing to their ability to capture long-term dependencies effectively in several Natural Language Processing (NLP) tasks. These networks augment conventional DNNs by incorporating memory and attention mechanisms external to the network to capture relevant information. Several MANN architectures have shown particular benefits in NLP tasks by augmenting an underlying Recurrent Neural Network (RNN) with external memory using attention mechanisms. Unlike conventional DNNs, whose computational time is dominated by multiply-accumulate (MAC) operations, MANNs have more diverse behavior. In addition to MACs, the attention mechanisms of MANNs also include operations such as similarity measurement, sorting, weighted memory access, and pair-wise arithmetic. Due to this greater diversity in operations, MANNs are not trivially accelerated by the same techniques used by existing DNN accelerators. In this work, we present an end-to-end hardware accelerator architecture, FARM, for the inference of RNNs and several variants of MANNs, such as the Differentiable Neural Computer (DNC), the Neural Turing Machine (NTM), and the Meta-learning model. FARM achieves an average speedup of 30x-190x and 80x-100x over CPU and GPU implementations, respectively. To address the remaining memory bottlenecks in FARM, we then propose the FARM-PIM architecture, which augments FARM with in-memory compute support for MAC and content-similarity operations in order to reduce data traversal costs. FARM-PIM offers an additional speedup of 1.5x compared to FARM. Additionally, we consider an efficiency-oriented version of the PIM implementation, FARM-PIM-LP, that trades a 20% performance reduction relative to FARM for a 4x reduction in average power consumption.
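To make the operation mix named in the abstract concrete, the following is a minimal software sketch of a content-based memory read in the style of NTM/DNC attention: a similarity measure between a key and each memory row, a softmax over the scores, a weighted memory access, and a sort to pick the strongest rows. It is an illustration under stated assumptions, not FARM's datapath. The function names (`cosine_similarity`, `content_read`, `top_k_indices`), the sharpening parameter `beta`, and the toy memory contents are all hypothetical and do not come from the paper.

```python
import math
from typing import List, Tuple

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Similarity measure: cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))  # MAC-style accumulation
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)  # small epsilon avoids division by zero


def softmax(scores: Vector) -> Vector:
    """Normalize scores into attention weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def content_read(memory: List[Vector], key: Vector,
                 beta: float = 1.0) -> Tuple[Vector, Vector]:
    """Return (attention weights, read vector) for one content-based read.

    beta is an assumed sharpening factor applied to the similarity scores.
    """
    scores = [beta * cosine_similarity(row, key) for row in memory]
    weights = softmax(scores)
    width = len(memory[0])
    # Weighted memory access: weighted sum of all memory rows.
    read = [sum(w * row[j] for w, row in zip(weights, memory))
            for j in range(width)]
    return weights, read


def top_k_indices(weights: Vector, k: int) -> List[int]:
    """Sorting step: indices of the k most strongly attended rows."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return order[:k]


if __name__ == "__main__":
    toy_memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical contents
    w, r = content_read(toy_memory, key=[1.0, 0.0], beta=5.0)
    print("weights:", w)
    print("read vector:", r)
    print("top-2 rows:", top_k_indices(w, 2))
```

Only the dot products inside `cosine_similarity` and the weighted sum in `content_read` are MAC-dominated. The normalization, exponentiation, and sorting steps are the kind of non-MAC work the abstract says existing DNN accelerators do not trivially cover.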





This work was supported in part by Semiconductor Research Corporation (SRC) Center for Brain-inspired Computing Enabling Autonomous Intelligence (C-BRIC) and Center for Research in Intelligent Storage and Processing in Memory (CRISP).

Author information



Corresponding author

Correspondence to Nagadastagiri Challapalle.


Cite this article

Challapalle, N., Rampalli, S., Jao, N. et al. FARM: A Flexible Accelerator for Recurrent and Memory Augmented Neural Networks. J Sign Process Syst (2020).



  • Neural network
  • Attention mechanism
  • Memory augmentation
  • In-memory computing
  • Hardware accelerator