Quantitative Optimization and Early Cost Estimation of Low-Power Hierarchical-Architecture SRAMs Based on Accurate Cost Models
Abstract
Dedicated low-power SRAMs are frequently used in system-on-chip designs, and their power consumption plays an increasingly crucial role in the overall power budget. However, the wide range of choices regarding capacity, word length and operational modes makes it hard for designers to determine the optimal SRAM architecture. Additionally, many low-power techniques and circuits are frequently used but are not supported by previously proposed cost models. To solve these problems, a cost-model-based quantitative optimization approach is proposed. In particular, a fast and accurate power estimation model is built to aid low-power SRAM design. It closely fits the various complex SRAM circuits and architectures. The quantitative approach provides useful conclusions early in the design phase, guiding further optimizations. The estimation error of the power model is shown to be less than 10 % compared to results based on time-consuming extracted-netlist simulations in a 40-nm CMOS technology.
Keywords
SRAM, Power model, Quantitative parameter optimization
1 Introduction
SRAMs are widely used in many applications, e.g. as caches, due to their fast access speed, but they also contribute significantly to area cost and power consumption. Particularly in system-on-chip design, dedicated SRAMs with optimized architecture and circuits are often used to achieve low power. Optimizing those dedicated SRAMs for the lowest possible power at a given performance is a challenging task because of the complexity of the design space. An attractive approach is to perform a quantitative optimization based on cost models. Such cost models not only support the optimization process but also allow for early cost estimation in the system conception phase. This study focuses on the power cost, given the increasingly significant role of SRAM power consumption.
From the low-power perspective, SRAMs with a hierarchical architecture, as described e.g. in [1, 2, 3, 4], are very attractive choices. A quantitative optimization approach for hierarchical-architecture SRAMs deserves further research.
Besides the partitioning choices, the design space includes a wide variety of energy-efficient circuit techniques. These circuit techniques, such as the ones related to the bit cells, assist circuits and local sense amplifiers (LSA), add overhead and complexity to the overall design. The consumed energy is difficult to evaluate and estimate since it depends on the context in which these circuits are used. Their non-standard features can no longer be characterized using the available models, which only include standard features. Hence a quantitative benchmarking tool is needed to assess these circuits and techniques. Accordingly, their respective power consumption should be estimated and compared under different configurations so that the optimal one can be selected. A pre-characterization methodology is employed for capturing their distinct features and quantifying the analysis. With these two sets of inputs, partitioning parameters and specific circuit techniques, a design-space exploration is carried out while building a cost model for on-chip SRAMs covering area, time and energy (A, T, E). In this way, a Pareto optimization is carried out by trading off the three costs (A, T, E). Finally, the design decisions regarding architecture and circuits are made.

The area can be estimated by a parameterized floor plan estimation and an accumulation of elementary width and height values (e.g. W_{components}, H_{components}).

In the energy model, the elementary energy values of basic circuits (e.g. E_{gate}) are accumulated with the partitioning parameters and switching activity probabilities. The interconnect energy is estimated from the wire length and the energy per unit length (E_{wire_unit}). Finally, the total energy consumption is derived by accumulating the elementary energies (e.g. E_{bl}) of all used circuit components.
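The accumulation described for the energy model can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the model itself: the component names, instance counts, activity factors and all numeric values are hypothetical placeholders. Only the structure (elementary energy times instance count times switching probability, plus wire length times energy per unit length) follows the text.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    elementary_energy_aj: float  # pre-characterized energy per switching event (aJ)
    count: int                   # number of instances, set by the partitioning
    activity: float              # probability that one instance switches per access


def total_energy_aj(components, wire_length_um, e_wire_unit_aj_per_um):
    """Accumulate circuit and interconnect energy for one access (aJ)."""
    circuit = sum(c.elementary_energy_aj * c.count * c.activity for c in components)
    wires = wire_length_um * e_wire_unit_aj_per_um
    return circuit + wires


# Hypothetical values, chosen only to make the arithmetic easy to follow.
example = [
    Component("gate", 100.0, 16, 0.25),  # 100 * 16 * 0.25 = 400 aJ
    Component("bl", 500.0, 8, 1.0),      # 500 * 8 * 1.0 = 4000 aJ
]
energy = total_energy_aj(example, wire_length_um=50.0, e_wire_unit_aj_per_um=2.0)
print(energy)  # 400 + 4000 + 100 = 4500.0
```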

The speed can also be derived by using the cost database (e.g. t_{slope}, W_{wire}, C_{load}, C_{input}) and an elementary delay accumulation along the critical path, which involves long resistance-capacitance interconnects.
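The trade-off among the three costs (A, T, E) amounts to keeping only those design candidates that are not dominated by any other candidate. A minimal sketch of such a Pareto filter is given here. The candidate names and cost triples are hypothetical, and lower values are assumed to be better for all three costs.

```python
def dominates(a, b):
    """True if cost tuple a is no worse than b everywhere and better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(designs):
    """designs maps a name to (area, time, energy); returns non-dominated names."""
    return sorted(
        name
        for name, cost in designs.items()
        if not any(
            dominates(other_cost, cost)
            for other_name, other_cost in designs.items()
            if other_name != name
        )
    )


# Hypothetical candidates in arbitrary units: (area, time, energy).
candidates = {
    "A": (1.0, 2.0, 3.0),
    "B": (1.2, 1.5, 3.5),
    "C": (1.1, 2.5, 3.2),  # worse than A in all three costs
    "D": (0.9, 3.0, 2.8),
}
front = pareto_front(candidates)
print(front)  # ['A', 'B', 'D']
```

The final design decision then reduces to choosing one point on this front according to the given requirements.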
In this contribution, we focus on the energy cost model and the corresponding optimization approach. In the proposed model, only a few necessary basic circuit components (i.e. basic bit cells) need to be characterized. These circuit components can be verified by a number of Monte Carlo simulations to ensure robustness. Moreover, the pre-characterization, including simulation time and model building, requires only a couple of hours, and afterwards the estimation results can be obtained in a few minutes. Therefore, the total effort for the cost model is much lower than for a complete reference design.
2 Hierarchical Architecture
The choice of m and w defines the parasitic wordline capacitances and the wordline structure. Either a non-divided wordline (non-DWL) or a DWL structure can be selected according to the capacity and word length. The parameter n defines the bitline hierarchy and thereby affects the global and local bitline capacitances. In particular, charging and discharging the bitlines contributes significantly to the overall power consumption. The number of cells u in one column unit determines to a large extent the minimum energy consumption for one operation. The parameters n and u must be carefully selected to balance the least frequent use of the LSAs against the minimum switching capacitances.
3 Partitioning Impact Analysis
SRAMs typically include two major contributors to power consumption: the address decoders and the memory matrices. In the hierarchical architecture (Fig. 4), the way the address decoders are divided and combined determines how the memory matrix is partitioned into sub-blocks. A probabilistic estimation approach is employed for estimating the switching activity and power consumption of the address decoder, in particular depending on whether or not a distributed wordline structure is used. The memory matrices, including complex assist and periphery circuits that consume a large portion of the power, were also modeled and characterized. Four basic circuit templates and a power estimation method are proposed to extract and describe the architecture and circuit characteristics of the hierarchical architecture. The specific circuits used within the four circuit templates can be altered without changing the estimation approach itself. Various power reduction techniques, e.g. the precharge schemes in [3, 5] and the circuit techniques in [1, 2, 5], can be pre-characterized and benchmarked in the same configurations, which makes this model well suited to customized SRAM designs.
4 Power Model of Address Decoder
4.1 Basic Circuits of Address Decoder
As shown in Fig. 6, the address decoder includes three predecoders and three distributed decoders. The three predecoders can be decoders with either a large fan-in or a small fan-in, depending on their number of inputs. The other three intermediate decoders are regarded as matrix-like select circuits, which are composed of logic gates distributed in a matrix. A probabilistic method is employed for modeling the underlying switching activities of these logic gates, from which the transition power consumption of the matrix-like select circuit is estimated. The large fan-in decoder is composed of a matrix-like select circuit and two small fan-in decoders. Therefore, if the energies associated with small fan-in decoders and basic gates are available, the energy of the three predecoders and the three distributed decoders can be derived by the probabilistic method. In addition, a realistic topology estimation approach is used to estimate the wire capacitances and area for different (m, n, u) combinations.
Characterization database of basic decoders (TT, 25 °C, 400 MHz, V_{DDH} = 0.9 V)
Parameter               Decoder 2-to-4  Decoder 3-to-8  Decoder 4-to-16
Dynamic energy (aJ)     2105            4400            8460
Static power (nW)       12              31              56
Input capacitance (fF)  1.44            2.44            3.93
Width (µm)              2.73            3.45            4.17
Height (µm)             2.02            3.53            5.63
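To show how such a characterization database can be queried during estimation, the table values are stored here in a small Python lookup structure. The dictionary layout, key names and helper functions are hypothetical. It is also an assumption of this sketch that the tabulated dynamic energy is consumed once per switching event, so that the average dynamic power follows from P = E · f · activity at the 400 MHz clock named in the caption.

```python
# Values copied from the characterization table; layout and key names are illustrative.
DECODER_DB = {
    "2-to-4":  {"e_dyn_aj": 2105, "p_static_nw": 12, "c_in_ff": 1.44, "w_um": 2.73, "h_um": 2.02},
    "3-to-8":  {"e_dyn_aj": 4400, "p_static_nw": 31, "c_in_ff": 2.44, "w_um": 3.45, "h_um": 3.53},
    "4-to-16": {"e_dyn_aj": 8460, "p_static_nw": 56, "c_in_ff": 3.93, "w_um": 4.17, "h_um": 5.63},
}
F_CLK_HZ = 400e6  # clock frequency of the characterization


def footprint_um2(name):
    """Bounding-box area of one decoder from its elementary width and height."""
    entry = DECODER_DB[name]
    return entry["w_um"] * entry["h_um"]


def dynamic_power_nw(name, activity=1.0):
    """Average dynamic power in nW, assuming `activity` switching events per cycle."""
    energy_j = DECODER_DB[name]["e_dyn_aj"] * 1e-18
    return energy_j * F_CLK_HZ * activity * 1e9


for name in DECODER_DB:
    print(name, round(footprint_um2(name), 4), round(dynamic_power_nw(name), 1))
```

With an activity of 1.0, the 2-to-4 decoder evaluates to 842 nW of dynamic power and a footprint of about 5.51 µm².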
4.2 Switching Activity
Energy of NOR gate for all possible input transitions (TT, 25 °C, 400 MHz, V_{DDH} = 0.9 V, V_{DDL} = 0.3 V)
Input transition   00  11  12   13   01   02   03   23
Total energy (aJ)  11  4   400  68   213  320  235  68
Input transition   22  33  21   31   10   20   30   32
Total energy (aJ)  2   3   315  228  705  655  880  228
The equation was verified using several different combinations of rows and columns and shows an estimation error of 5 % compared to extracted-netlist simulation results.
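As an illustration of the probabilistic weighting idea, and not a reproduction of the verified equation, the tabulated NOR-gate energies can be combined with transition probabilities to obtain an expected energy per cycle. The sketch assumes that each two-digit code denotes the previous and the next input state, and it uses a uniform probability over all 16 transitions as a default. Both points are assumptions made here. A real estimate would use the transition probabilities derived from the decoder structure.

```python
# Total energy in aJ per input transition, copied from the table.
NOR_ENERGY_AJ = {
    "00": 11, "11": 4, "12": 400, "13": 68, "01": 213, "02": 320, "03": 235, "23": 68,
    "22": 2, "33": 3, "21": 315, "31": 228, "10": 705, "20": 655, "30": 880, "32": 228,
}


def expected_energy_aj(transition_prob=None):
    """Probability-weighted energy of one NOR gate per cycle (aJ)."""
    if transition_prob is None:
        # Assumption: all 16 transitions are equally likely.
        uniform = 1.0 / len(NOR_ENERGY_AJ)
        transition_prob = {t: uniform for t in NOR_ENERGY_AJ}
    if abs(sum(transition_prob.values()) - 1.0) > 1e-9:
        raise ValueError("transition probabilities must sum to 1")
    return sum(NOR_ENERGY_AJ[t] * p for t, p in transition_prob.items())


print(expected_energy_aj())  # 4335 / 16 = 270.9375
```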
4.3 Energy Cost Related to Interconnects
As technology keeps shrinking, the role of interconnects becomes increasingly significant in the total power budget. In particular, interconnects incur large capacitive loads in the dense SRAM layout. As shown in Fig. 1, the 1^{st} stage predecoders and the 2^{nd} stage decoders are typically placed around the memory matrix. The 3^{rd} stage LWL decoders are distributed into the block columns. Hence, the aspect ratios of the LWL decoders, the local timing circuits and the bit-cell column must be considered together. For estimating the associated interconnect lengths, a floor plan containing the dominating memory matrix and address decoder must be determined in advance.
The wire capacitance per unit length of a metal wire is assumed to be 0.08 fF/µm, an appropriate value for a 40-nm technology. This value is re-evaluated and adjusted when coupling capacitances exist in very dense layouts. Moreover, under the assumption that only the column decoder switches and the row decoder does not, the switched capacitance is determined only by the width (W_{h}), with a switching probability of (C − 1)/(R·C). In case only the row decoder switches and the column decoder does not, the switched capacitance, taking its switching probability into account, is equal to 0.08·H_{h}·(C − 1)/(R·C). If both row and column decoders switch, the switched capacitance is computed considering both width and height. To summarize, wire lengths and capacitances are estimated for the two typical floor plans, which leads to a decision regarding which floor plan to assume.
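The column-only case can be checked numerically. The following sketch reads the switching probability in the text as (C − 1)/(R·C) and confirms it by brute-force enumeration. The enumeration assumes that consecutive addresses are independent and uniformly distributed over an R × C matrix, which is an assumption made here. The function names and the example width of 100 µm are hypothetical.

```python
from itertools import product

C_UNIT_FF_PER_UM = 0.08  # wire capacitance per unit length assumed in the text


def column_only_probability(rows, cols):
    """Closed form: the column changes while the row stays the same."""
    return (cols - 1) / (rows * cols)


def column_only_probability_enumerated(rows, cols):
    """Same probability, counted over all (previous, next) address pairs."""
    addresses = list(product(range(rows), range(cols)))
    hits = sum(
        1
        for (r0, c0), (r1, c1) in product(addresses, addresses)
        if r0 == r1 and c0 != c1
    )
    return hits / len(addresses) ** 2


def switched_capacitance_ff(length_um, probability, c_unit=C_UNIT_FF_PER_UM):
    """Expected switched wire capacitance in fF for a wire of the given length."""
    return c_unit * length_um * probability


p = column_only_probability(4, 8)
assert abs(p - column_only_probability_enumerated(4, 8)) < 1e-12
print(p, switched_capacitance_ff(100.0, p))  # 0.21875 and about 1.75 fF
```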
4.4 Verification of Address Decoder Estimation Model
The dynamic energies of the three predecoders are represented by E_{dyc}(n), E_{dyc}(u) and E_{dyc}(m). The parameters m, n and u denote the input widths of the three decoders, respectively. The energy can be obtained from Table 2 (optionally in combination with small fan-in decoders and matrix-like circuits). For the second stage, the energy figures for the word-row and block decoders are given by E_{matrix}(N, U) + E_{wire}(N, U) and E_{matrix}(M, N) + E_{wire}(M, N). N = 2^{n} and U = 2^{u} represent the number of rows and columns of the matrix circuit. Note that in the 3^{rd} stage a matrix (N·U, M) is applied instead of a matrix (N·U, M·N), since every GWL signal only needs 2^{m} Blockact signals to select the word in that column, cf. Fig. 9. For an address decoder in a non-DWL structure, GWLs are not used. Therefore, the energies E_{matrix}(M, N) and E_{wire}(M, N) are not counted in the total energy.
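How the named terms combine can be sketched as follows. This is only an illustration of the bookkeeping. It covers the first-stage and second-stage terms named in the text and leaves out the third stage. The callables e_dyc, e_matrix and e_wire are hypothetical stand-ins for the characterized models, M = 2^m is assumed by analogy with N and U, and the stub functions in the usage part carry no physical meaning.

```python
def decoder_energy_stage_1_and_2(m, n, u, e_dyc, e_matrix, e_wire, dwl=True):
    """Sum of the first- and second-stage address decoder energy terms."""
    M, N, U = 2 ** m, 2 ** n, 2 ** u  # M = 2**m is an assumption of this sketch

    # First stage: three predecoders with input widths n, u and m.
    total = e_dyc(n) + e_dyc(u) + e_dyc(m)

    # Second stage, word-row decoder.
    total += e_matrix(N, U) + e_wire(N, U)

    # Second stage, block decoder: counted only when a DWL structure is used.
    if dwl:
        total += e_matrix(M, N) + e_wire(M, N)
    return total


# Stub models with made-up behavior, used only to exercise the function.
def stub_dyc(width):
    return 10 * width


def stub_matrix(rows, cols):
    return rows * cols


def stub_wire(rows, cols):
    return rows + cols


with_dwl = decoder_energy_stage_1_and_2(1, 2, 3, stub_dyc, stub_matrix, stub_wire)
without_dwl = decoder_energy_stage_1_and_2(1, 2, 3, stub_dyc, stub_matrix, stub_wire, dwl=False)
print(with_dwl, without_dwl)  # 118 104
```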
5 Power Model of the Memory Matrix
The contribution of the memory matrix to the total memory access energy is dominated by the cycle-based precharge and discharge of long bitlines. For low-power memory matrix designs, assist circuits, bit cells and precharge schemes span a large design space, which complicates the power modeling. Their complex features significantly influence the layout placement and the switching capacitances. Accordingly, the total energy cannot be computed by directly accumulating their individual energies. Additionally, the use of LSAs in [1] leads to a low voltage swing on the global bitlines and a high voltage swing on the local bitlines. The complexity of multiple supply voltages grows at larger scale, which makes it more difficult to estimate the power consumption. As before, the variable partitioning parameters (m, n, u and w) result in different access gates and parasitic capacitances due to different wire lengths. Another challenge is that read, write and standby operations must be considered separately, taking into account the hierarchical bitline structure and the toggling state of the memory cell. To solve these issues, four circuit templates are proposed that act as black boxes for pre-characterization. In this way, a database is generated that depends on the use case (V_{DDH} and V_{DDL}), technology corners, temperature and the characteristics of gates (width) and wires. Finally, the elementary energies of assist circuits, bit cells and vertical global bitlines are separated by our estimation approach. Combined with the partitioning parameters, the power consumed by the overall memory matrix is estimated accurately. Leakage power is estimated in a similar way.
5.1 Four Circuit Templates
Since E_{1} … E_{4} and P_{static1} … P_{static4} are pre-characterized by simulating the extracted netlists of the four circuit templates, the elementary energy values E_{lbl}, E_{pre+ren+lwl+lsa} and E_{vgbl} can be derived. In this way, the dynamic energy for read/write operations and the static power of the four circuit templates are obtained. It is assumed that a toggle condition occurs for each write operation. As before, the simulation configuration is the TT corner, 25 °C, 400 MHz and a 0.9 V supply voltage in a 40-nm CMOS technology. The voltage swing of the vgbl pair was set to 300 mV to guarantee robust operation. The estimation approach is the same for other technology corners, but the pre-characterization must be modified based on a Monte Carlo simulation.
Energy of four circuit templates (TT, 25 °C, 400 MHz, V_{DDH} = 0.9 V, V_{DDL} = 0.3 V)
Circuit template            Cell  Column unit  Row unit  Column
Dynamic energy (pJ), write  4.26  5.25         25.75     12.58
Dynamic energy (pJ), read   2.91  3.71         22.47     5.88
Static power (nW)           3.23  24.37        25.88     195.74
5.2 Verification of Memory Matrix Model
Model estimation errors for the four capacities (TT, 25 °C, 400 MHz, V_{DDH} = 0.9 V, V_{DDL} = 0.3 V)
Capacity        64 × 8 bit  128 × 8 bit  256 × 8 bit  1024 × 8 bit
Dynamic energy  −5 %        −1 %         9 %          3 %
Static power    −4 %        −4 %         −5 %         2 %
Model estimation errors for the four word lengths (TT, 25 °C, 400 MHz, V_{DDH} = 0.9 V, V_{DDL} = 0.3 V)
Word length     64 × 8 bit  64 × 16 bit  64 × 32 bit  64 × 64 bit
Dynamic energy  −5 %        −5 %         −5 %         −7 %
Static power    −4 %        −4 %         −3 %         −6 %
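The tabulated errors can be summarized by their worst-case magnitude, which is the figure that matters for the accuracy bound stated in the abstract. The short check below copies the percentages from the two tables. The dictionary keys are descriptive labels chosen here.

```python
# Estimation errors in percent, copied from the two verification tables.
ERRORS_PERCENT = {
    "dynamic energy vs capacity": [-5, -1, 9, 3],
    "static power vs capacity": [-4, -4, -5, 2],
    "dynamic energy vs word length": [-5, -5, -5, -7],
    "static power vs word length": [-4, -4, -3, -6],
}

# Largest error magnitude per table row and over all entries.
worst_per_row = {row: max(abs(e) for e in errs) for row, errs in ERRORS_PERCENT.items()}
worst_overall = max(worst_per_row.values())

print(worst_per_row)
print(worst_overall)  # 9, i.e. below the 10 % bound
```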
6 Optimization Results
In addition, the power model can be used for optimizing a specific SRAM by determining the optimal parameter combination. As discussed before, many possibilities exist for partitioning the memory matrix and the corresponding address decoder, given the three partitioning parameters (m, n, u), along with many options for circuit implementation. Depending on the optimization criteria, parameter combinations are picked from all possible implementation options. Note that the impact of process variations on leakage power is included in the power model.
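Picking a parameter combination from all possible options can be sketched as an exhaustive search over (m, n, u). The sketch assumes that the three parameters must sum to the number of address bits, which is an assumption made here for illustration. The cost function is a hypothetical placeholder for the power model and has no physical meaning.

```python
def partitions(address_bits):
    """Yield every (m, n, u) of non-negative integers with m + n + u == address_bits."""
    for m in range(address_bits + 1):
        for n in range(address_bits + 1 - m):
            yield m, n, address_bits - m - n


def best_partition(address_bits, cost):
    """Return the (m, n, u) combination with the lowest cost."""
    return min(partitions(address_bits), key=cost)


def toy_cost(params):
    """Placeholder cost: stands in for an evaluation of the power model."""
    m, n, u = params
    return 2 ** m + 2 ** n + 2 ** u


print(len(list(partitions(6))))     # 28 candidate combinations
print(best_partition(6, toy_cost))  # (2, 2, 2) for this placeholder cost
```

Replacing toy_cost with an evaluation of the actual power model, possibly extended by an area term, yields the kind of parameter selection described above.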
7 Conclusion
In this chapter, a new method for power optimization of on-chip SRAMs with a hierarchical architecture was described. The method is based on a power model that includes various energy-efficient circuits and techniques. The probabilistic estimation approach and the circuit templates provide quantified switching activities and pre-characterized customized circuits, respectively. At the same time, the hierarchical architecture with its many partitioning choices is defined by the partitioning parameters. The power model is verified by a variety of extracted-netlist simulations and consistently exhibits good accuracy.
As a quantitative parameter optimization tool, this approach allows a fast and accurate power estimation of SRAMs with various capacities and word lengths. For a hierarchical-architecture SRAM, the impact of partitioning and circuit selection on power and area was evaluated. The optimal architecture and circuits can be identified quickly and accurately, which leads to an SRAM specification with an achievable and attractive power consumption and silicon area. Moreover, this approach allows an easy trade-off between area and power for meeting different design requirements. Furthermore, the power model can also be employed as a customized benchmark for comparing various local circuits using the same architecture. Finally, this approach can easily be extended to other CMOS technologies thanks to its circuit templates and switching activity analysis.
References
1. Sharma, V., et al.: A 4.4 pJ/access 80 MHz, 128 kbit variability resilient SRAM with multi-sized sense amplifier redundancy. IEEE J. Solid-State Circuits 46(10), 2416–2430 (2011)
2. Clerc, S., et al.: A 65 nm SRAM achieving 250 mV retention and 350 mV, 1 MHz, 55 fJ/bit access energy, with bit-interleaved radiation soft error tolerance. In: 2012 Proceedings of the ESSCIRC (ESSCIRC), pp. 313–316. IEEE (2012)
3. Rooseleer, B., Dehaene, W.: A 40 nm, 454 MHz 114 fJ/bit area-efficient SRAM memory with integrated charge pump. In: 2013 Proceedings of the ESSCIRC (ESSCIRC), pp. 201–204. IEEE (2013)
4. Ren, Y., Noll, T.G.: An accurate power estimation model for low-power hierarchical-architecture SRAMs. In: 2013 IFIP/IEEE 21st International Conference on Very Large Scale Integration (VLSI-SoC), pp. 144–149. IEEE (2013)
5. Ren, Y., et al.: Low power 6T-SRAM with tree address decoder using a new equalizer precharge scheme. In: 2012 IEEE International SOC Conference (SOCC), pp. 224–229. IEEE (2012)
6. Muralimanohar, N., et al.: CACTI 6.0: A tool to model large caches. Technical report (2009). http://www.hpl.hp.com/techreports/2009/HPL200985.pdf
7. Liang, X., et al.: Architectural power models for SRAM and CAM structures based on hybrid analytical/empirical techniques. In: IEEE/ACM International Conference on Computer-Aided Design, ICCAD 2007, pp. 824–830. IEEE (2007)
8. Do, M.Q., et al.: Leakage-conscious architecture-level power estimation for partitioned and power-gated SRAM arrays. In: 8th International Symposium on Quality Electronic Design, ISQED 2007, pp. 185–191. IEEE, Washington, DC (2007)
9. Donkoh, E., et al.: A hybrid and adaptive model for predicting register file and SRAM power using a reference design. In: 2012 49th ACM/EDAC/IEEE Design Automation Conference (DAC), pp. 62–67. IEEE (2012)
10. Sun, L., et al.: Low power and robust binary tree SRAM design for embedded systems. In: 2013 International Symposium on Electronic System Design (ISED), pp. 87–92. IEEE (2013)