Quantitative Optimization and Early Cost Estimation of Low-Power Hierarchical-Architecture SRAMs Based on Accurate Cost Models
Abstract
Dedicated low-power SRAMs are frequently used in system-on-chip designs, and their power consumption plays an increasingly crucial role in the overall power budget. However, the wide range of choices regarding capacity, wordlength and operational modes makes it hard for designers to determine the optimal SRAM architecture. Additionally, many low-power techniques and circuits are frequently used but are not supported by previously proposed cost models. To solve these problems, a cost-model-based quantitative optimization approach is proposed. In particular, a fast and accurate power estimation model is built to aid low-power SRAM design. It accommodates the various complex SRAM circuits and architectures. The quantitative approach provides useful conclusions early in the design phase, guiding further optimizations. The estimation error of the power model has been shown to be less than 10 % compared to results based on time-consuming extracted-netlist simulations in a 40-nm CMOS technology.
Keywords
SRAM, Power model, Quantitative parameter optimization
1 Introduction
SRAMs are widely used in many applications, for example as caches, because of their fast access speed, but they also contribute significantly to area cost and power consumption. Particularly in system-on-chip design, dedicated SRAMs with optimized architecture and circuits are often applied to achieve low power. Optimizing these dedicated SRAMs for the lowest possible power at a given performance is a challenging task because of the complexity of the design space. An attractive approach is to perform a quantitative optimization based on cost models. Such cost models not only support the optimization process but also allow for early cost estimation in the system conception phase. This study focuses on the power cost, considering the increasingly significant role of SRAM power consumption.
From the low-power perspective, SRAMs with hierarchical architecture, as described e.g. in [1, 2, 3, 4], are very attractive choices. A quantitative optimization approach for the hierarchical-architecture SRAMs deserves further research.
A block diagram of a conventional on-chip SRAM with a hierarchical architecture
Overall flowchart of the cost estimation environment
Besides the partitioning choices in the design space, there is a wide choice of energy-efficient circuit techniques. These circuit techniques, such as those related to the bit cells, assist circuits and local sense amplifiers (LSA), add overhead and complexity to the overall design. The consumed energy is difficult to evaluate and estimate since it depends on the context in which these circuits are used. Their non-standard features can no longer be characterized using the available models, which only include standard features. Hence a quantitative benchmarking tool is needed to assess these circuits and techniques. Accordingly, their respective power consumption should be estimated and compared under different configurations so that the optimal one can be selected. A pre-characterization methodology is employed for capturing their distinct features and quantifying the analysis. With these two sets of inputs, partitioning parameters and specific circuit techniques, a design-space exploration is carried out while building a cost model for on-chip SRAMs covering area, time and energy (A, T, E). In this way a Pareto optimization is carried out by trading off the three costs (A, T, E). Finally, the design decisions regarding architecture and circuits are made.
Overview of an ATE cost model
- The area can be estimated by a parameterized floor plan estimation and an accumulation of elementary width and height values (e.g. W_components, H_components).
- In the energy model, the elementary energy values of basic circuits (e.g. E_gate) are accumulated with the partitioning parameters and switching activity probabilities. The interconnect energy is estimated from the wire length and the energy per unit length (E_wire_unit). Finally, the total energy consumption is derived by accumulating the elementary energies (e.g. E_bl) of all used circuit components.
- The speed can also be derived by using the cost database (e.g. t_slope, W_wire, C_load, C_input) and an elementary delay accumulation along the critical path involving long resistance-capacitance interconnects.
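The energy accumulation described in the list above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name, the tuple layout of a component and all numeric values in the usage lines are hypothetical. Only the structure follows the text, namely elementary energy weighted by instance count and switching probability, plus wire length times energy per unit length.

```python
from typing import Iterable, Tuple

# (elementary energy in aJ, number of instances, switching probability)
Component = Tuple[float, int, float]


def total_energy_aj(components: Iterable[Component],
                    wire_length_um: float,
                    e_wire_unit_aj_per_um: float) -> float:
    """Accumulate elementary energies weighted by count and activity,
    then add the interconnect energy (length x energy per unit length)."""
    circuit = sum(e * count * p for e, count, p in components)
    wire = wire_length_um * e_wire_unit_aj_per_um
    return circuit + wire


# Hypothetical values, for illustration only (not taken from the paper).
example = total_energy_aj(
    components=[(100.0, 4, 0.5), (50.0, 8, 0.25)],
    wire_length_um=10.0,
    e_wire_unit_aj_per_um=2.0,
)
print(example)
```

Keeping the activity factor next to each elementary energy makes it easy to swap in a different partitioning without touching the pre-characterized values.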
In this contribution, we focus on the energy cost model and the relevant optimization approach. In the proposed model, only a few necessary basic circuit components (i.e. basic bit cells) need to be characterized. These circuit components can be verified by a number of Monte-Carlo simulations to ensure robustness. Moreover, the pre-characterization, including simulation time and model building, only requires a couple of hours, and afterwards the estimation results can be obtained in a few minutes. Therefore, the total effort for the cost model is much lower than for a complete reference design.
2 Hierarchical Architecture
Hierarchical-architecture SRAM organization comprising DWL structure
Specific circuits found in a column of a hierarchical-architecture SRAM macro
The choice of m and w defines the parasitic wordline capacitances and the wordline structure. Either a non-divided wordline (non-DWL) or a DWL structure can be selected according to the capacity and wordlength. The parameter n defines the bitline hierarchy and thereby affects the global and local bitline capacitances. In particular, charging and discharging the bitlines contributes significantly to the overall power consumption. The number of cells u in one column unit determines to a large extent the minimum energy consumption for one operation. The parameters n and u must be carefully selected to trade off infrequent use of the LSAs against minimum switched capacitances.
3 Partitioning Impact Analysis
SRAMs typically include two major contributors to power consumption: the address decoders and the memory matrices. In the hierarchical architecture (Fig. 4), the way of dividing and combining the address decoders determines how the memory matrix is partitioned into sub-blocks. A probabilistic estimation approach is employed for estimating the switching activity and power consumption of the address decoder, especially with regard to whether or not a divided wordline (DWL) structure is used. The memory matrices, including complex assist and periphery circuits which consume a large portion of the power, were also modeled and characterized. Four basic circuit templates and a power estimation method are proposed to extract and describe the architecture and circuit characteristics of the hierarchical architecture. The specific circuits used within the four circuit templates can be altered without changing the estimation approach itself. Various power reduction techniques, e.g. the precharge schemes in [3, 5] and the circuit techniques in [1, 2, 5], can be pre-characterized and benchmarked in the same configurations, which makes this model well suited to customized SRAM designs.
Partitioning parameter possibilities and their impact on SRAM components
4 Power Model of Address Decoder
4.1 Basic Circuits of Address Decoder
As shown in Fig. 6, the address decoder includes three pre-decoders and three distributed decoders. The three pre-decoders can be decoders with either a large fan-in or a small fan-in, depending on their number of inputs. The other three intermediate decoders are regarded as matrix-like select circuits which are composed of logic gates distributed in a matrix. A probabilistic method is employed for modeling the underlying switching activities of these logic gates, from which the transition power consumption of the matrix-like select circuit is estimated. The large fan-in decoder is composed of a matrix-like select circuit and two small fan-in decoders. Therefore, if the energies associated with small fan-in decoders and basic gates are available, the energy of the three pre-decoders and the three distributed decoders can be derived by the probabilistic method. In addition, a realistic topology estimation approach is used to estimate the wire capacitances and area for different (m, n, u) combinations.
Characterization database of basic decoders (TT, 25 °C, 400 MHz, VDDH = 0.9 V)
Parameter | Decoder 2-to-4 | Decoder 3-to-8 | Decoder 4-to-16 |
---|---|---|---|
Dynamic energy (aJ) | 2105 | 4400 | 8460 |
Static power (nW) | 12 | 31 | 56 |
Input Capacitance (fF) | 1.44 | 2.44 | 3.93 |
Width (µm) | 2.73 | 3.45 | 4.17 |
Height (µm) | 2.02 | 3.53 | 5.63 |
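The characterization data in the table above can be held in a small lookup structure from which the estimator reads elementary values. The sketch below is illustrative only: the class, field and function names are hypothetical, and the bounding-box footprint (width times height) is an assumption, since the source lists only the two dimensions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecoderEntry:
    dynamic_energy_aj: float
    static_power_nw: float
    input_cap_ff: float
    width_um: float
    height_um: float

    @property
    def footprint_um2(self) -> float:
        # Bounding-box area; an assumption, the table gives only W and H.
        return self.width_um * self.height_um


# Values copied from the characterization table (TT, 25 degC, 400 MHz, 0.9 V).
# Keys are the number of decoder input bits (2-to-4, 3-to-8, 4-to-16).
DECODER_DB = {
    2: DecoderEntry(2105, 12, 1.44, 2.73, 2.02),
    3: DecoderEntry(4400, 31, 2.44, 3.45, 3.53),
    4: DecoderEntry(8460, 56, 3.93, 4.17, 5.63),
}


def lookup(input_bits: int) -> DecoderEntry:
    """Return the pre-characterized small fan-in decoder for an input width."""
    if input_bits not in DECODER_DB:
        raise KeyError(f"no characterized decoder with {input_bits} inputs")
    return DECODER_DB[input_bits]


print(lookup(3).dynamic_energy_aj, lookup(3).footprint_um2)
```

Input widths outside the table raise an error rather than being extrapolated, because the text states that larger fan-in decoders are composed from small fan-in decoders and a matrix-like select circuit.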
4.2 Switching Activity
Four switching cases and their switching probabilities in a distributed decoder
Energy of NOR gate for all possible input transitions (TT, 25 °C, 400 MHz, VDDH = 0.9 V, VDDL = 0.3 V)
Input transition | Total energy (aJ) |
---|---|
00 | 11 |
11 | 4 |
12 | 400 |
13 | 68 |
01 | 213 |
02 | 320 |
03 | 235 |
23 | 68 |
22 | 2 |
33 | 3 |
21 | 315 |
31 | 228 |
10 | 705 |
20 | 655 |
30 | 880 |
32 | 228 |
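A probability-weighted gate energy can be computed from the NOR-gate table above as sketched below. The transition labels are used as opaque keys exactly as they appear in the table. The uniform distribution in the usage lines is an assumption made only for illustration; the source derives its own switching probabilities for the four switching cases, which are not reproduced here.

```python
# Per-transition energies (aJ) copied from the NOR-gate table.
NOR_ENERGY_AJ = {
    "00": 11, "11": 4, "12": 400, "13": 68,
    "01": 213, "02": 320, "03": 235, "23": 68,
    "22": 2, "33": 3, "21": 315, "31": 228,
    "10": 705, "20": 655, "30": 880, "32": 228,
}


def expected_energy_aj(probabilities):
    """Probability-weighted energy; the probabilities must sum to 1."""
    total_p = sum(probabilities.values())
    if abs(total_p - 1.0) > 1e-9:
        raise ValueError("transition probabilities must sum to 1")
    return sum(p * NOR_ENERGY_AJ[t] for t, p in probabilities.items())


# Assumption for illustration only: all 16 transitions equally likely.
uniform = {t: 1 / 16 for t in NOR_ENERGY_AJ}
print(expected_energy_aj(uniform))
```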
The equation was verified using several different combinations of rows and columns and shows 5 % estimation error compared to extracted-netlist simulation results.
4.3 Energy Cost Related to Interconnects
As technology keeps shrinking, the role of interconnects becomes increasingly significant in the total power budget. In particular, interconnects incur large capacitive loads in the dense SRAM layout. As described in Fig. 1, the 1st stage pre-decoders and the 2nd stage decoders are typically placed around the memory matrix. The 3rd stage LWL decoders are distributed into the block columns. Hence, the aspect ratio of the LWL, local timing circuits and the bit-cell column must be considered together. For estimating the associated interconnect lengths, a floor plan containing the dominating memory matrix and address decoder must be determined in advance.
Two empirical placement orientations for wire capacitance estimation
The wire capacitance per unit length of a metal wire is assumed to be 0.08 fF/µm for a 40-nm technology. This value is re-evaluated and adjusted when coupling capacitances exist in very dense layouts. Moreover, under the assumption that only the column decoder switches and the row decoder does not, the switched capacitances are determined only by the width (W_h) with a switching probability (C-1)/(R·C). If only the row decoder switches and the column decoder does not, the switched capacitances, taking the switching probability into account, are equal to 0.08 · H_h · (C-1)/(R·C). If both row and column decoders are switching, the switched capacitances are computed considering both width and height. To summarize, wire lengths and capacitances are estimated for the two typical floor plans, which leads to a decision regarding which floor plan to assume.
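The two single-decoder cases above can be sketched as small helper functions. The function names are hypothetical. The row-only expression is taken directly from the text; the column-only expression is formed by analogy (unit capacitance times width times the stated probability), which is an assumption. The case where both decoders switch is left out because the text gives no closed-form expression for it.

```python
C_WIRE_FF_PER_UM = 0.08  # unit wire capacitance stated in the text for 40 nm


def switching_probability(rows: int, cols: int) -> float:
    """Switching probability (C-1)/(R*C) as given in the text."""
    return (cols - 1) / (rows * cols)


def switched_cap_column_only_ff(width_um: float, rows: int, cols: int) -> float:
    """Only the column decoder switches: scales with the width W_h.
    Assumed form, by analogy with the row-only expression."""
    return C_WIRE_FF_PER_UM * width_um * switching_probability(rows, cols)


def switched_cap_row_only_ff(height_um: float, rows: int, cols: int) -> float:
    """Only the row decoder switches: 0.08 * H_h * (C-1)/(R*C), as in the text."""
    return C_WIRE_FF_PER_UM * height_um * switching_probability(rows, cols)


# Hypothetical floor-plan dimensions, for illustration only.
print(switched_cap_column_only_ff(width_um=100.0, rows=4, cols=4))
print(switched_cap_row_only_ff(height_um=50.0, rows=4, cols=4))
```

Evaluating both helpers for each candidate floor plan gives the capacitance figures on which the floor-plan decision described above can be based.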
4.4 Verification of Address Decoder Estimation Model
Divided wordline (DWL) structure for address decoder
The dynamic energies of the three pre-decoders are represented by E_dyc(n), E_dyc(u) and E_dyc(m). The parameters m, n and u denote the input widths of the three decoders respectively. The energy can be acquired from Table 2 (optionally in combination with small fan-in decoders and matrix-like circuits). For the second stage, energy figures for word-row and block decoders are given by E_matrix(N,U) + E_wire(N,U) and E_matrix(M,N) + E_wire(M,N). N = 2^n and U = 2^u represent the number of rows and columns of the matrix circuit. Note that in the 3rd stage a matrix (N·U, M) is applied instead of a matrix (N·U, M·N) since every GWL signal only needs 2^m Blockact signals to select the word in that column, cf. Fig. 9. For an address decoder in a non-DWL structure, no GWLs are used. Therefore, the energies E_matrix(M, N) and E_wire(M, N) are not counted in the total energy.
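The accumulation of the pre-decoder and second-stage terms named above can be sketched as follows. This is a partial sketch under stated assumptions: the three energy functions are passed in as hypothetical callables, M = 2^m is assumed by analogy with N and U, the third-stage matrix term is omitted because the text gives no complete expression for it, and all three pre-decoder terms are kept in the non-DWL case.

```python
from typing import Callable


def decoder_energy_two_stages(
    m: int, n: int, u: int,
    e_dyc: Callable[[int], float],
    e_matrix: Callable[[int, int], float],
    e_wire: Callable[[int, int], float],
    dwl: bool = True,
) -> float:
    """Sum the pre-decoder energies and the second-stage matrix/wire energies."""
    M, N, U = 2 ** m, 2 ** n, 2 ** u  # M = 2^m is an assumption
    total = e_dyc(n) + e_dyc(u) + e_dyc(m)
    total += e_matrix(N, U) + e_wire(N, U)
    if dwl:
        # Dropped for a non-DWL structure, where no GWLs are used.
        total += e_matrix(M, N) + e_wire(M, N)
    return total


# Placeholder energy functions, for illustration only (not characterized data).
def stub_dyc(bits):
    return 10.0 * bits


def stub_matrix(rows, cols):
    return float(rows * cols)


def stub_wire(rows, cols):
    return float(rows + cols)


print(decoder_energy_two_stages(1, 2, 3, stub_dyc, stub_matrix, stub_wire, dwl=True))
print(decoder_energy_two_stages(1, 2, 3, stub_dyc, stub_matrix, stub_wire, dwl=False))
```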
Simulation vs. estimation energy for a 1 K non-DWL and a 12-to-4 K DWL decoder
5 Power Model of the Memory Matrix
The contribution of the memory matrix to the total memory access energy is dominated by the cycle-based pre-charge and discharge of long bitlines. For low-power memory matrix designs, assist circuits, bit cells and pre-charge schemes span a large design space, which complicates the power modeling. Their complex features significantly influence the layout placement and the switching capacitances. Accordingly, the total energy cannot be computed by directly accumulating their respective individual energies. Additionally, the use of LSAs in [1] results in a low voltage swing on the global bitlines and a high voltage swing on the local bitlines. The complexity caused by multiple supply voltages becomes more pronounced at larger scale, which makes it more difficult to estimate the power consumption. As before, the variable partitioning parameters (m, n, u and w) result in different access gates and parasitic capacitances due to different wire lengths. Another challenge is that read, write and standby operations must be considered separately, including the hierarchical bitline structure and the memory cell toggling state. To solve these issues, four circuit templates are proposed which act as black boxes for pre-characterization. In this way a database is generated that depends on the use case (VDDH and VDDL), technology corners, temperature and the characteristics of gates (width) and wires. Finally, the elementary energies of assist circuits, bit cells and vertical global bitlines are separated by our estimation approach. Combined with the partitioning parameters, the power consumed by the overall memory matrix is estimated accurately. Leakage power is estimated in a similar way.
5.1 Four Circuit Templates
A single cell circuit template
A column unit circuit template
A row unit circuit template
A column circuit template
Since E_1 … E_4 and P_static1 … P_static4 are pre-characterized by simulating the extracted netlists of the four circuit templates, the elementary energy values E_lbl, E_pre+ren+lwl+lsa and E_vgbl can be derived. In this way the dynamic energy for read/write operations and the static power of the four circuit templates are obtained. It is assumed that a toggle condition occurs for each write operation. As before, the simulation configuration is TT corner, 25 °C, 400 MHz and 0.9 V supply voltage in 40 nm CMOS technology. The voltage swing of the vgbl pair was set to 300 mV to guarantee robust operation. The estimation approach is the same for other technology corners, but the pre-characterization must be modified based on a Monte-Carlo simulation.
Energy of four circuit templates (TT, 25 °C, 400 MHz, VDDH = 0.9 V, VDDL = 0.3 V)
Quantity | Operation | Cell | Column unit | Row unit | Column |
---|---|---|---|---|---|
Dynamic energy (pJ) | Write | 4.26 | 5.25 | 25.75 | 12.58 |
Dynamic energy (pJ) | Read | 2.91 | 3.71 | 22.47 | 5.88 |
Static power (nW) | | 3.23 | 24.37 | 25.88 | 195.74 |
5.2 Verification of Memory Matrix Model
Dynamic power component for a memory matrix (64 words × 8 bit)
Estimation and simulation data comparison for memory matrices with four capacities
Model estimation errors for the four capacities (TT, 25 °C, 400 MHz, VDDH = 0.9 V, VDDL = 0.3 V)
Capacity | 64 × 8 bit | 128 × 8 bit | 256 × 8 bit | 1024 × 8 bit |
---|---|---|---|---|
Dynamic energy | −5 % | −1 % | 9 % | 3 % |
Static power | −4 % | −4 % | −5 % | 2 % |
Estimation and simulation data comparison for memory matrices with four wordlengths
Model estimation errors for the four wordlengths (TT, 25 °C, 400 MHz, VDDH = 0.9 V, VDDL = 0.3 V)
Word lengths | 64 × 8 bit | 64 × 16 bit | 64 × 32 bit | 64 × 64 bit |
---|---|---|---|---|
Dynamic energy | −5 % | −5 % | −5 % | −7 % |
Static power | −4 % | −4 % | −3 % | −6 % |
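As a quick cross-check of the two error tables above against the sub-10 % accuracy stated in the abstract, the tabulated values can be scanned for their worst case. The helper name is hypothetical; the numbers are copied from the tables, with the 64 × 8 bit configuration appearing once per table.

```python
# Estimation errors in percent, copied from the two verification tables
# (first four values: capacities, last four values: wordlengths).
ERRORS_PCT = {
    "dynamic energy": [-5, -1, 9, 3, -5, -5, -5, -7],
    "static power": [-4, -4, -5, 2, -4, -4, -3, -6],
}


def worst_case_error(errors):
    """Largest absolute error over all quantities and configurations."""
    return max(abs(e) for values in errors.values() for e in values)


worst = worst_case_error(ERRORS_PCT)
print(worst, worst < 10)
```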
6 Optimization Results
Dynamic energy vs. address bits and wordlengths for two architectures. The yellow bottom part and the remainder of each bar represent the contributions of the address decoder and the memory matrix respectively (Color figure online).
In addition, the power model can be used for optimizing a specific SRAM by determining the optimal parameter combination. As discussed before, many possibilities exist for partitioning the memory matrix and the corresponding address decoder, given the three partitioning parameters (m, n, u), along with many options for circuit implementations. Depending on the optimization criteria, parameter combinations are picked from all possible implementation options. Note that the impact of process variations on leakage power is included in the power model.
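Picking parameter combinations by trading area against power amounts to keeping the non-dominated candidates. The sketch below shows a generic Pareto filter over (area, power) pairs. It is not the authors' tool: the type and function names are hypothetical, and the candidate values in the usage lines are invented for illustration and are not results from the paper.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str
    area: float   # e.g. in um^2
    power: float  # e.g. in uW


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep candidates not dominated in (area, power); lower is better."""
    front = []
    for c in candidates:
        dominated = any(
            o.area <= c.area and o.power <= c.power
            and (o.area < c.area or o.power < c.power)
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return front


# Invented values, for illustration only.
options = [
    Candidate("A", 1.0, 5.0),
    Candidate("B", 2.0, 3.0),
    Candidate("C", 2.5, 4.0),  # dominated by B
    Candidate("D", 3.0, 1.0),
]
print([c.name for c in pareto_front(options)])
```

The quadratic scan is adequate for the small number of architectures considered per SRAM; a sorted sweep would only matter for much larger candidate sets.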
Area cost vs read power tradeoff for a 1 K-Byte SRAM
Contribution of address decoder and memory matrix to area cost and read power for the 10 possible architectures of a 1 K Byte SRAM
7 Conclusion
In this chapter, a new method for power optimization of on-chip SRAMs with a hierarchical architecture was described. The method is based on a power model that covers various energy-efficient circuits and techniques. The probabilistic estimation approach and the circuit templates provide quantified switching activities and pre-characterized customized circuits, respectively. At the same time, the hierarchical architecture with its many partitioning choices is defined by the partitioning parameters. The power model is verified by a variety of extracted-netlist simulations and consistently exhibits good accuracy.
As a quantitative parameter optimization tool, this approach allows a fast and accurate power estimation of SRAMs with various capacities and wordlengths. For a hierarchical-architecture SRAM, the impact of partitioning and circuit selection on power and area was evaluated. The optimal architecture and circuits can be identified quickly and accurately, which leads to an SRAM specification with an achievable and attractive power consumption and silicon area. Moreover, this approach allows an easy tradeoff between area and power for meeting different design requirements. Furthermore, the power model can also be employed as a customized benchmark for comparing various local circuits using the same architecture. Finally, this approach can easily be extended to other CMOS technologies owing to its circuit templates and switching activity analysis.
References
- 1. Sharma, V., et al.: A 4.4 pJ/access 80 MHz, 128 kbit variability resilient SRAM with Multi-Sized sense amplifier redundancy. IEEE J. Solid-State Circuits 46(10), 2416–2430 (2011)
- 2. Clerc, S., et al.: A 65 nm SRAM achieving 250 mV retention and 350 mV, 1 MHz, 55 fJ/bit access energy, with bit-interleaved radiation soft error tolerance. In: 2012 Proceedings of the ESSCIRC (ESSCIRC), pp. 313–316. IEEE (2012)
- 3. Rooseleer, B., Dehaene, W.: A 40 nm, 454 MHz 114 fJ/bit area-efficient SRAM memory with integrated charge pump. In: 2013 Proceedings of the ESSCIRC (ESSCIRC), pp. 201–204. IEEE (2013)
- 4. Ren, Y., Noll, T.G.: An accurate power estimation model for low-power hierarchical-architecture SRAMs. In: 2013 IFIP/IEEE 21st International Conference on Very Large Scale Integration (VLSI-SoC), pp. 144–149. IEEE (2013)
- 5. Ren, Y., et al.: Low power 6T-SRAM with tree address decoder using a new equalizer precharge scheme. In: 2012 IEEE International SOC Conference (SOCC), pp. 224–229. IEEE (2012)
- 6. Muralimanohar, N., et al.: CACTI 6.0: A tool to model large caches. Technical report (2009). http://www.hpl.hp.com/techreports/2009/HPL-2009-85.pdf
- 7. Liang, X., et al.: Architectural power models for SRAM and CAM structures based on hybrid analytical/empirical techniques. In: IEEE/ACM International Conference on Computer-Aided Design, ICCAD 2007, pp. 824–830. IEEE (2007)
- 8. Do, M.Q., et al.: Leakage-Conscious Architecture-Level power estimation for partitioned and Power-Gated SRAM arrays. In: 8th International Symposium on Quality Electronic Design, ISQED 2007, pp. 185–191. IEEE, Washington, DC (2007)
- 9. Donkoh, E., et al.: A hybrid and adaptive model for predicting register file and SRAM power using a reference design. In: 2012 49th ACM/EDAC/IEEE Design Automation Conference (DAC), pp. 62–67. IEEE (2012)
- 10. Sun, L., et al.: Low power and robust binary tree SRAM design for embedded systems. In: 2013 International Symposium on Electronic System Design (ISED), pp. 87–92. IEEE (2013)