13.1 Basics

  1. Model metal wires in layers

  2. Understand the parallel plate model for wire capacitance

  3. Examine options to reduce wire capacitance

  4. Write the equation for wire resistance

  5. Examine options to reduce wire resistance

  6. Understand the skin effect

  7. Derive a simplified model for skin effect.

Metal wires on an ASIC are deposited using PVD. They are then either dry etched or polished down using CMP (Sects. 7.5 and 7.7). Metal wires are also built in multiple layers, with some technologies offering more wire layers than others (Sect. 8.5). The cross section of a wafer with multiple deposited wire layers is shown in Fig. 13.1.

Fig. 13.1

Cross section showing different wire layers

The word “wire” simply means any long section of conductive material used to carry a signal. For example, consider the layout in Fig. 8.20, where polysilicon is used as a local wire inside a cell to connect the inputs of transistors. Also consider the layout in Fig. 12.11, where a long diffusion wire carries ground while a long polysilicon line carries the word-line.

As we will shortly see, metal wires are significantly better in terms of delay and power than wires made of silicon, but sometimes area considerations imposed by the layout force the designer to use silicon wires. Thus, we have to be able to characterize both metal and silicon wires.

We are concerned with wires because they impact the performance of gates. Wires add capacitance and potentially resistance that significantly change the delay of gates by changing their loading. They also increase the dynamic power consumption of circuits by increasing the switched capacitance as well as dissipating their own power in their resistance.

Figure 13.2 shows a very simple model for a wire running over the substrate. The wire has a length L, width W, and thickness tw. It lies a distance t above the substrate and is separated from it by oxide, mostly thick field oxide.

Fig. 13.2

Simplified model for wire capacitance. The wire is floating in oxide at a distance t above the substrate. L is dictated by design, minimum W by design rules, tw and t by technology

Since the substrate and wire are conductive and the oxide is insulating, there is a capacitance between the wire and the substrate. Notice that as the wire runs its length, it passes over other metal layers, polysilicon, and diffusion. Strictly, its capacitance should be divided among all these layers. However, for the majority of its run, the wire passes over the substrate and the wells, and thus its capacitance can be reduced to a capacitance between the wire layer and the substrate/well.

The substrate and well both lie at signal ground. Thus the wire-to-substrate capacitance lies between the wire and ground. If the height of the wire above the substrate is much smaller than both its length and width, i.e., t ≪ L, W, then the capacitance can be considered a parallel plate capacitance where the upper plate is the wire and the lower plate is the substrate. The value of the capacitance is

$$ C_{\rm wire} = \frac{\varepsilon A}{d} = \frac{{\varepsilon_{0} \varepsilon_{\rm rox} WL}}{t} $$

The height of the wire above the substrate, t, is dictated by the particular process and the metal layer. The relative permittivity of the oxide is a material property. The width of the wire has a lower bound dictated by the design rules (Sect. 8.5). The designer has control over W and L within the limits of DRC; the rest of the capacitance expression is dictated by technology. Thus we can define

$$ C_{\rm wire} = \frac{{\varepsilon_{0} \varepsilon_{\rm rox} }}{t}WL = C_{a} WL $$

where Ca is the wire capacitance per unit area, dictated by the technology. Note, however, that Ca is specific to each metal layer: because it depends on t, higher metal layers have lower Ca. We can also define a capacitance per unit length Cl:

$$ C_{\rm wire} = \frac{{\varepsilon_{0} \varepsilon_{\rm rox} W}}{t}L = C_{l} L $$

This expression is useful when the designer always uses the minimum W dictated by DRC. In such a case, W also becomes “dictated by technology” and L becomes the only design parameter.
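The parallel plate estimate is easy to evaluate numerically. The sketch below computes Cwire along with Ca and Cl; all technology numbers (oxide thickness, wire dimensions) are illustrative assumptions, not values from any specific process:

```python
# Parallel-plate estimate of wire-to-substrate capacitance.
# All technology numbers below are illustrative assumptions.
EPS0 = 8.854e-12        # F/m, permittivity of free space
EPS_ROX = 3.9           # relative permittivity of SiO2

def wire_cap(W, L, t):
    """C = eps0 * eps_rox * W * L / t; dimensions in meters, result in farads."""
    return EPS0 * EPS_ROX * W * L / t

# Assumed example: a 0.2 um wide, 100 um long wire, 1 um above the substrate
W, L, t = 0.2e-6, 100e-6, 1.0e-6
Ca = EPS0 * EPS_ROX / t        # capacitance per unit area (F/m^2)
Cl = EPS0 * EPS_ROX * W / t    # capacitance per unit length (F/m)
print(wire_cap(W, L, t))       # identical to Ca*W*L and to Cl*L
```

For these assumed numbers the wire contributes well under a femtofarad, but note how the result grows linearly with L.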

As we will see in the next section, wire capacitance increases the time-constant of a gate. Thus, it is in our interest to reduce wire capacitance. The equation suggests several ways to do this:

  • Increase t. This is dictated by technology. While it might seem like the height dimension is open and we can always distribute metal layers to greater heights, this is not strictly true. Putting metal layers higher means we also have to separate the metal layers from each other more. This means that when metal layers connect to each other, the vias have to be much deeper, which increases the resistance of the vias and makes them more prone to electromigration and other defects.

  • Reduce W. DRC imposes a lower limit on W. More importantly, reducing W increases resistance, as we will shortly see.

  • Reduce L. Length is dictated by the design. PAR tools (Sect. 8.7) make significant efforts to reduce the length of interconnects, but closure cannot always be guaranteed.

  • Reduce the relative permittivity. This is a viable option, although it imposes technical challenges.

Table 13.1 lists the relative permittivity of select materials. In the context of solid insulators, silicon dioxide has a reasonably low permittivity. However, we can do better. The best relative permittivity is achieved by air, but this would not give structural support for metal layers or protection for the die. Some polymers like polyethylene and polypropylene have better permittivity than silicon dioxide. However, because silicon dioxide can be grown or deposited precisely and easily in CMOS processes, the inertia to keep using oxide, especially native oxide, as an insulator is very strong.

Table 13.1 Select dielectrics and their relative permittivity

Wires also have resistance. Looking back at Fig. 13.2, current flows through the front cross section of the wire. The cross-sectional area through which current flows is Wtw, thus the resistance of the wire is

$$ R = \frac{L}{{\sigma Wt_{w} }} $$

The resistance of the wire has a devastating effect on the delay of gates, more so than capacitance. The following are factors that can help control resistance:

  • Reduce L. As with capacitance, this is dictated by the design and the effort of the PAR tools. Because L affects both C and R in the same way, the imperative to reduce L is even stronger.

  • Increase W. This creates more area for current flow but increases capacitance.

  • Increase tw. To a first order, this helps reduce R without impacting C. We will see in Sect. 13.2 that this is not entirely accurate due to inter-wire capacitance, but increasing tw is still a viable option for controlling wire resistance.

  • Increase conductivity. Although some technical, chemical, and economic factors may limit this option, using materials with higher conductivity is a win-win proposition. This is the main reason most modern interconnects are made of copper instead of the historical aluminum, despite the fact that copper is much harder to pattern (Sect. 7.7).

Table 13.2 shows that even when conductivity is increased by changing the metal, the increase is limited to at most an order of magnitude. As we will see in Sect. 13.3, the problem of resistance is a lot more obvious for silicon wires where conductivity can be four orders of magnitude lower than metals.

Table 13.2 Selected metals and their conductivity. Most modern processes use copper. Legacy CMOS processes used aluminum

Examining the resistance equation, we see that only W and L are within the designer’s control. Thus, we can divide the resistance expression into a technology-controlled portion and a designer-controlled portion:

$$ R = \frac{L}{{\sigma Wt_{w} }} = \frac{1}{{\sigma t_{w} }}\frac{L}{W} = R_{s} \frac{L}{W} $$

The technology-defined portion of the equation Rs is called the square resistance or the sheet resistance. As the equation shows, it is the resistance of any square wire, i.e., a wire where W = L, regardless of the value of W.

Square resistance is sometimes a more useful parameter than conductivity because it also folds in the technology-specific parameter tw. Conductivity is material dependent while square resistance is dictated by both the material and the process. Note that square resistance will be different for different metal layers in the same process if the thickness of metal layers is different.
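A short numeric sketch of sheet resistance, using an assumed copper conductivity and an assumed layer thickness (illustrative values, not from any specific process):

```python
# Sheet (square) resistance: Rs = 1/(sigma * tw); then R = Rs * L/W.
# Conductivity and thickness below are illustrative assumptions.
SIGMA_CU = 5.8e7    # S/m, conductivity of copper
tw = 0.5e-6         # m, assumed metal thickness for this layer

Rs = 1.0 / (SIGMA_CU * tw)   # ohms per square

def wire_res(L, W):
    return Rs * L / W

# Any square wire (W == L) has resistance Rs, regardless of its size:
print(wire_res(1e-6, 1e-6), wire_res(50e-6, 50e-6), Rs)
```

Both square wires evaluate to the same Rs, illustrating why resistance is conveniently counted in “squares.”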

We will shortly see (Sect. 13.2) that the resistance of metal wires is very small; it is usually negligible next to their capacitance and to the resistance of silicon wires. However, in modern chips, even metal wires have resistance that has to be taken into consideration. The cause is a phenomenon called the skin effect.

The skin effect is a phenomenon where AC current fails to flow through the entire cross section of a wire. Instead, the current flows through an outer shell or skin of the wire as shown in Fig. 13.3.

Fig. 13.3

The skin effect. Current tends to flow more intensely toward the perimeter of a metal wire, with the effect more pronounced the higher the frequency of the signal

The skin effect is highly dependent on frequency. The higher the frequency, the thinner the skin through which the current effectively flows. Characterizing this behavior precisely is a complicated electromagnetic problem.

The skin effect does not strictly mean that current only flows in the outer shell. Instead it means that the largest current density flows near the outer border of the wire, decreasing monotonically away from it and becoming minimum near the center of the wire. The current density can be characterized as

$$ J = J_{0} e^{{ - \left( {1 + j} \right)d/\delta }} $$

where J0 is the maximum current density at the surface, d is the depth below the surface, and δ is a parameter indicating the depth at which the current density drops to 1/e of its maximum value at the surface. Simplified models of the skin effect consider current to flow uniformly at depths shallower than δ and no current to flow at depths deeper than δ.

Thus δ, the skin depth, is a critical parameter. It can be estimated as

$$ \delta = \sqrt {\frac{1}{\pi f\mu \sigma }} \sqrt {\sqrt {1 + \left( {\frac{2\pi f\varepsilon }{\sigma }} \right)^{2} } + \frac{2\pi f\varepsilon }{\sigma }} $$

For typical values of permittivity, conductivity, and frequency, the entire second square root is typically near unity. Thus the expression for skin depth most often used in ASICs is approximately

$$ \delta = \sqrt {\frac{1}{\pi f\mu \sigma }} $$
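The simplified expression is straightforward to evaluate. The sketch below computes δ for copper (taking μ ≈ μ0, since copper is non-magnetic) at a few frequencies:

```python
import math

# Simplified skin depth: delta = sqrt(1/(pi*f*mu*sigma)), evaluated
# for copper. mu is taken as mu0 since copper is non-magnetic.
MU0 = 4 * math.pi * 1e-7   # H/m, permeability of free space
SIGMA_CU = 5.8e7           # S/m, conductivity of copper

def skin_depth(f):
    return math.sqrt(1.0 / (math.pi * f * MU0 * SIGMA_CU))

for f in (1e9, 1e10, 1e11):
    print(f, skin_depth(f))
# At 10 GHz the copper skin depth is roughly 0.66 um.
```

At 10 GHz the skin depth is already below a micron, comparable to wire cross-sectional dimensions in modern processes.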

According to the simplified skin effect model in Fig. 13.4, the skin effect has no impact at frequencies where the skin depth is large enough for the peripheral current-carrying shells to overlap. In other words, if δ is at least tw/2 or at least W/2, the current-carrying skins overlap along that dimension, every point of the cross section falls within a skin, and the entire wire carries current, leading to no skin effect.

Fig. 13.4

Simplified skin effect model (right). Consider current to flow uniformly in an outer shell and not at all near the center

Thus, the skin effect is observed only when δ is smaller than both half-dimensions of the cross section:

$$ \delta < \hbox{min} \left\{ {W,t_{w} } \right\}/2 $$

The skin depth is a function of the metal permeability and conductivity. But more critically, it is a function of frequency. If we substitute the expression for skin depth into the inequality above, we can translate it into a condition on the frequency at which the skin effect becomes visible:

$$ \begin{array}{*{20}l} {\sqrt {\frac{1}{\pi f\mu \sigma }} < \hbox{min} \left\{ {W,t_{w} } \right\}/2} \hfill \\ {\frac{1}{\pi f\mu \sigma } < \frac{{(\hbox{min} \left\{ {W,t_{w} } \right\})^{2} }}{4}} \hfill \\ {f > \frac{4}{{\pi \mu \sigma (\hbox{min} \left\{ {W,t_{w} } \right\})^{2} }}} \hfill \\ \end{array} $$

The danger of skin effect is that it significantly reduces the available cross section, thus causing effective resistance to increase. The resistance of a wire suffering from skin effect according to the model in Fig. 13.4 is

$$ \begin{aligned} R & = \frac{L}{{\sigma \left( {2W\delta + 2\left( {t_{w} - 2\delta } \right)\delta } \right)}} \\ R & = \frac{L}{{\sigma \left( {2W\delta + 2t_{w} \delta - 4\delta^{2} } \right)}} \\ \end{aligned} $$
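A small numeric sketch of this shell-model resistance; the dimensions are assumed, and the shell area formula applies only while δ is below both W/2 and tw/2 (otherwise the whole cross section conducts and the DC formula holds):

```python
# Wire resistance under the simplified shell model of Fig. 13.4.
# The shell area formula applies while delta is smaller than both
# W/2 and tw/2; otherwise the whole cross section conducts.
# All dimensions below are illustrative assumptions.
SIGMA_CU = 5.8e7   # S/m, conductivity of copper

def wire_res_skin(L, W, tw, delta, sigma=SIGMA_CU):
    if delta < min(W, tw) / 2:
        area = 2 * W * delta + 2 * (tw - 2 * delta) * delta
    else:
        area = W * tw   # no skin effect: full cross section conducts
    return L / (sigma * area)

# Assumed example: 1 mm long wire with a 1 um x 1 um cross section
R_dc = wire_res_skin(1e-3, 1e-6, 1e-6, delta=1e-3)    # huge delta -> DC value
R_hf = wire_res_skin(1e-3, 1e-6, 1e-6, delta=0.2e-6)  # thin skin
print(R_dc, R_hf)   # the high-frequency resistance is noticeably larger
```

With the assumed 0.2 µm skin depth, the conducting area shrinks and the resistance rises well above its DC value.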

13.2 Lumped C Wires

  1. Model a wire running over a substrate as a parallel plate capacitor

  2. Use the lumped capacitor model to estimate total delay in CMOS

  3. Understand that the skin effect reduces the validity of the lumped capacitor model

  4. Recognize inter-wire capacitance

  5. Understand why inter-wire capacitance increases in modern technology.

Metal wires are used to carry signals over long distances. The outputs of CMOS gates are provided through MOSFET drains, thus through the diffusion layer. The output is through a common PMOS and NMOS drain node. As shown in Fig. 8.20 in Sect. 8.3, the two drains have to be connected through a metal line. When we avoid metal wires, we do so to avoid the overhead of contacting the metal layer. However, as shown in Fig. 8.20, CMOS outputs are in the metal layer in any case, and thus, wires from the output of a CMOS gate to the input of another CMOS gate might as well be in metal layers.

Figure 13.5 shows a CMOS inverter feeding another inverter through an intermediate metal wire. The metal wire has very limited resistance due to the high conductivity of the metal. Thus, we can model the wire as a single, large, lumped capacitor between the wire and ground (Sect. 13.1). The ground represents the substrate and well over which the metal line runs. Notice that along its path the metal line also runs over polysilicon, diffusion, and other metal layers. However, the overwhelming majority of its plate area lies over the substrate/well.

Fig. 13.5

Cascaded CMOS inverters with connecting metal wire

Figure 13.6 shows the delay model of the network in Fig. 13.5 with the lumped capacitance of the metal wire added. The model is not very different from that where wires are considered ideal short circuits (Chap. 3). The effect of the metal wire is to increase the loading capacitance on the first inverter. Thus the time-constant at out1 for an ideal (short circuit) wire versus a lumped capacitance wire is

Fig. 13.6

Model for time-constant with metal wire capacitance (Cwire) taken into consideration

$$ \begin{array}{*{20}c} {\tau_{\rm ideal} = R_{n1} \left( {C_{d1} + C_{g2} } \right)} \\ {\tau_{\rm lumped} = R_{n1} \left( {C_{d1} + C_{g2} + C_{\rm wire} } \right)} \\ \end{array} $$

The wire capacitance Cwire can be calculated as in Sect. 13.1. The time-constants above are for high to low transitions at out1, but the only difference in low to high transitions would be to replace Rn1 with Rp1. The metal wire obviously increases delay by increasing the time-constant through increasing the capacitive load on the gate.
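The two time-constants are easy to compare numerically. All component values below are illustrative assumptions:

```python
# Time-constants with and without the lumped wire capacitance.
# All component values are illustrative assumptions.
Rn1 = 5e3       # ohms, driver channel resistance
Cd1 = 2e-15     # F, self-loading of the driving inverter
Cg2 = 3e-15     # F, gate capacitance of the receiving inverter
Cwire = 5e-15   # F, lumped wire capacitance

tau_ideal = Rn1 * (Cd1 + Cg2)
tau_lumped = Rn1 * (Cd1 + Cg2 + Cwire)
print(tau_ideal, tau_lumped)   # the wire adds exactly Rn1 * Cwire
```

For these assumed values the wire doubles the time-constant, which is why wire capacitance cannot be ignored even when wire resistance can.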

The lumped capacitance model is simple and useful. It allows us to make quick estimates of the impact of wires on delay and power. However, it has some limitations we must be aware of, primary among them: inter-wire capacitance and skin effect.

As discussed in Sect. 13.1, the skin effect reduces the available area for current flow in a wire. At very high frequencies, this can significantly increase the resistance of metal wires, eventually making the lumped capacitance model misleading. In such cases, metal wires have to be modeled similarly to silicon wires (Sect. 13.3).

In Sect. 13.4, we will find that one way to address increasing wire delays in modern technologies is to increase the relative thickness of wires tw. This leads to the second limitation of the parallel plate capacitance model: increasing tw leads to more inter-wire capacitance.

Figure 13.7 shows adjacent wires in a CMOS technology with a large channel length. The wires are much wider than they are thick, thus W ≫ tw. Individual wires are also very widely separated. As shown, every wire has two types of capacitance. The first is the wire-to-substrate capacitance we have been discussing so far. The second is inter-wire capacitance, which exists between any two wires in the same metal layer because their facing metal surfaces are separated by the insulating oxide. The inter-wire plates have area L∙tw, and thus the inter-wire capacitance has the value:

Fig. 13.7

Inter-wire and wire to substrate capacitance in old technology

$$ C_{\rm interwire} = \frac{{\varepsilon_{\rm ox} t_{w} L}}{{L_{s} }} $$

where Ls is the separation between the two wires. This inter-wire capacitance is small and negligible relative to the wire-to-substrate capacitance for two reasons: tw is very small, especially relative to W, and Ls is very large relative to tw. Thus, traditionally we would ignore inter-wire capacitance and use a single lumped wire-to-substrate capacitance.
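The shift between the two regimes can be sketched numerically. The dimensions below are illustrative assumptions: an older process with wide, thin, widely separated wires versus a modern one with thick, closely packed wires:

```python
# Comparing wire-to-substrate and inter-wire capacitance for one wire.
# All dimensions below are illustrative assumptions.
EPS_OX = 3.9 * 8.854e-12   # F/m, permittivity of SiO2

def caps(W, tw, t, Ls, L):
    c_sub = EPS_OX * W * L / t      # to the substrate
    c_int = EPS_OX * tw * L / Ls    # to one neighbouring wire
    return c_sub, c_int

old = caps(W=1.0e-6, tw=0.2e-6, t=1.0e-6, Ls=2.0e-6, L=100e-6)
new = caps(W=0.1e-6, tw=0.3e-6, t=1.0e-6, Ls=0.1e-6, L=100e-6)
print(old)  # old process: inter-wire capacitance is negligible
print(new)  # modern process: inter-wire capacitance dominates
```

With the assumed geometries, the ranking of the two capacitances flips between the two processes, matching the contrast between Figs. 13.7 and 13.8.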

Figure 13.8 shows the situation in a more modern process. The same equation still applies for calculating inter-wire capacitance. However, the value of this capacitance is much more important due to two factors: tw is increased, or at least not scaled as fast as W, for reasons that will become clear in Sect. 13.4. Also, Ls is reduced as modern technologies scale down dimensions, pack wires closer to support increased functionality, and use more permissive design rules.

Fig. 13.8

Inter-wire and wire to substrate capacitance in a modern technology

Wire to substrate capacitance increases delay by increasing the capacitive load on gate outputs. Inter-wire capacitance has a more profound impact: it allows wires to couple to each other. Thus, it allows a transition on a wire to be, at least partially, transferred to another wire. Separation DRC rules usually stipulate that higher metal layers be more widely spaced than lower layers. This is because higher metal layers are also thicker, and thus should be more widely spaced to prevent them from coupling.

13.3 Silicon Wires

  1. Recognize cases where silicon wires have to be used

  2. Understand where diffusion capacitance comes from

  3. Use the Elmore delay method to calculate delay for a distributed RC wire

  4. Use the Elmore method to calculate delay in a loaded and driven wire.

Silicon wires should only be used for very short wires. However, as shown repeatedly in Chap. 12, sometimes we do use silicon over considerable distances to save on area. The main problem with silicon wires is their resistance. The sheet resistance of polysilicon and silicon is 2–3 orders of magnitude higher than commonly used ASIC metals. In this section, we will develop a model to deal with wires where both resistance and capacitance are significant. This can also be applied to high-frequency metal lines where skin effect has a significant impact.

The resistance and capacitance of polysilicon lines can be modeled similarly to metal lines in Sect. 13.2. The resistance of diffusion wires can also be modeled similarly. However, its capacitance to substrate needs a little more consideration.

Figure 13.9 shows a diffusion wire running through the substrate. The capacitance here is a diffusion capacitance of a reverse-biased PN junction. As discussed in Sect. 1.9, this capacitance is nonlinear and difficult to characterize. However, we can still consider it proportional to the length and width of the wire.

Fig. 13.9

Diffusion wire to substrate capacitance

In wires where there is resistance and capacitance, it is very challenging to develop a model to use in circuits. At first glance, it might look like we can calculate the wire capacitance and wire resistance from Sect. 13.1 and then use these values to modify the time-constant of the circuit. However, the resistance and capacitance of the wire are intertwined and cannot be lumped at a single location.

Would the capacitance be lumped at the beginning of the wire or at its end? Or should it be divided between the beginning and the end? The reality is that both the capacitance and the resistance are fully distributed throughout the wire. At every location there is capacitance to the substrate and resistance through the cross section (Fig. 13.10).

Fig. 13.10

RC wire model

The wire resistance, wherever we decide to model it, will exist serially through the wire. This will create an RC ladder structure where the time-constant cannot be calculated in a straightforward manner as in Chap. 3. To overcome this, we introduce the Elmore time-constant, a method to approximate the equivalent time-constant of a network which we cannot reduce to a single capacitance and resistance.

The Elmore time-constant is fairly easy to calculate for tree networks where the feedforward is purely resistive, and there is no feedback. Figure 13.11 shows an example of such a network. There are four capacitive nodes separated by resistances. To find the time-constant at node “out” in response to a transition at node “in”, we calculate the Elmore time-constant.

Fig. 13.11

Sample network for estimating Elmore delay. Elmore delay only works for trees and cannot function where there are loops

The Elmore time-constant is found by calculating a partial time-constant for each capacitance and then adding these partial time-constants. The partial time-constants are calculated by multiplying the capacitance of the node by the value of resistance from the input node to the capacitance node, as long as that resistance is also part of the resistance from the input to the output.

Thus, the four partial time-constants in Fig. 13.11 are

$$ \begin{array}{*{20}c} {\tau_{1} = C_{1} R_{1} } \\ {\tau_{2} = C_{2} R_{1} } \\ {\tau_{3} = C_{3} \left( {R_{1} + R_{3} } \right)} \\ {\tau_{4} = C_{4} \left( {R_{1} + R_{3} + R_{4} } \right)} \\ \end{array} $$

Notice that the resistance multiplying C2 is only R1, since R2 is not part of the input-to-output resistive path. The time-constant at out is thus

$$ \tau = R_{1} \left( {C_{1} + C_{2} + C_{3} + C_{4} } \right) + R_{3} \left( {C_{3} + C_{4} } \right) + R_{4} C_{4} $$

This expression shows another way to calculate the Elmore time-constant: each resistance should be multiplied by those capacitances that see the resistance in the path from input to output. For example, R3 is only present in the path from input to C3 and C4, and is thus multiplied only by these two capacitances.
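Both groupings can be checked numerically for the network of Fig. 13.11. The component values below are assumed for illustration:

```python
# Elmore time-constant for the tree of Fig. 13.11 (values assumed).
# Each capacitance is multiplied by the resistance shared between the
# input->capacitance path and the input->output path.
R1, R2, R3, R4 = 1e3, 4e3, 2e3, 3e3            # ohms
C1, C2, C3, C4 = 1e-15, 2e-15, 1e-15, 2e-15    # farads

tau = (C1 * R1
       + C2 * R1               # R2 is not on the input->out path
       + C3 * (R1 + R3)
       + C4 * (R1 + R3 + R4))

# Regrouped form: each resistance times the capacitances downstream of it
tau_alt = R1 * (C1 + C2 + C3 + C4) + R3 * (C3 + C4) + R4 * C4
print(tau, tau_alt)   # both groupings give the same time-constant
```

Note that R2 never appears in the sum: it carries no current on the input-to-output path, so its branch contributes only its capacitance C2.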

Figure 13.12 shows a resistive wire divided into RC sections. Because the resistance and the capacitance are distributed throughout the wire, the model is more accurate the more sections it is divided into. The model is most accurate when the number of sections is infinite.

Fig. 13.12

Distributed RC sectional model

Now assume the total wire resistance is Rwire and it is divided into N sections, then the resistance of a section is Rw = Rwire/N. Similarly if Cwire is the total wire capacitance then Cw = Cwire/N.

Calculating the Elmore time-constant at the output of the N sections in Fig. 13.12:

$$ \begin{array}{*{20}c} {\tau = R_{w} C_{w} + 2R_{w} C_{w} + 3R_{w} C_{w} + \ldots + NR_{w} C_{w} } \\ {\tau = R_{w} C_{w} \left( {1 + 2 + 3 + \ldots + N} \right)} \\ \end{array} $$

The bracket contains an arithmetic series with first term and common difference both equal to 1:

$$ \tau = R_{w} C_{w} .\frac{{N\left( {N + 1} \right)}}{2} $$

As N tends to infinity:

$$ \tau = R_{w} C_{w} .\frac{N.N}{2} = \frac{{R_{w} N.C_{w} N}}{2} = \frac{{R_{\rm wire} C_{\rm wire} }}{2} $$

This result suggests that a wire with distributed resistance and capacitance can be represented by a lumped resistance and a lumped capacitance, as long as one of the two has its value halved. However, this is misleading: the result is only valid when the wire is driven by a null impedance and drives a null load.
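The convergence toward Rwire·Cwire/2 is easy to observe numerically. The wire totals below are assumed for illustration:

```python
# N-section ladder: tau = Rw*Cw*N(N+1)/2 with Rw = Rwire/N, Cw = Cwire/N.
# As N grows the result converges to Rwire*Cwire/2. Totals are assumed.
Rwire, Cwire = 1e3, 1e-12   # ohms, farads

def tau_ladder(N):
    Rw, Cw = Rwire / N, Cwire / N
    return Rw * Cw * N * (N + 1) / 2

for N in (1, 2, 10, 100, 10000):
    print(N, tau_ladder(N))
# tau_ladder(N) -> Rwire*Cwire/2 = 5e-10 s
```

A single lumped section (N = 1) overestimates the time-constant by a factor of two; even N = 10 is already within 10% of the limit.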

To derive a more realistic result, consider Fig. 13.5, but assume that the wire connecting the two inverters is also resistive. In this case, the delay model is as shown in Fig. 13.13. Rdrv is the drive resistance, which is Rn1 for the high-to-low delay and Rp1 for the low-to-high delay. Cd1 is the total self-loading of the first inverter and lies before the wire. Cg2 is the total gate capacitance of the second inverter and lies at the end of the wire.

Fig. 13.13

Sectional model used to model wire in an inverter pair. This differs from Fig. 13.12 in that the driver has finite impedance and the load is non-null

To find the time-constant at the input of the second inverter, we use the Elmore method:

$$ \begin{array}{*{20}c} {\tau = R_{\rm drv} C_{d1} + \left( {R_{\rm drv} + R_{w} } \right)C_{w} + \left( {R_{\rm drv} + 2R_{w} } \right)C_{w} + \ldots + \left( {R_{\rm drv} + NR_{w} } \right)C_{w} + \left( {R_{\rm drv} + NR_{w} } \right)C_{g2} } \\ {\tau = R_{\rm drv} C_{d1} + R_{\rm drv} NC_{w} + R_{w} C_{w} \left( {1 + 2 + \ldots + N} \right) + R_{\rm drv} C_{g2} + NR_{w} C_{g2} } \\ {\tau = R_{\rm drv} C_{d1} + R_{\rm drv} C_{\rm wire} + \frac{{R_{\rm wire} C_{\rm wire} }}{2} + R_{\rm drv} C_{g2} + R_{\rm wire} C_{g2} } \\ \end{array} $$

We can confirm that for Rwire = 0, the result reduces to that in Sect. 13.2 for the lumped C model.
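The derivation can also be checked numerically by summing the discrete ladder directly and comparing it against the closed form. All component values are assumed:

```python
# Discrete Elmore sum for a driven, loaded N-section wire, checked
# against the closed form derived above. All values are assumed.
Rdrv, Cd1, Cg2 = 2e3, 1e-15, 2e-15   # driver resistance, edge capacitances
Rwire, Cwire = 1e3, 1e-12            # wire totals
N = 100_000
Rw, Cw = Rwire / N, Cwire / N

tau = Rdrv * Cd1
for i in range(1, N + 1):
    tau += (Rdrv + i * Rw) * Cw      # i-th wire section
tau += (Rdrv + N * Rw) * Cg2         # load at the far end of the wire

closed = (Rdrv * Cd1 + Rdrv * Cwire + Rwire * Cwire / 2
          + Rdrv * Cg2 + Rwire * Cg2)
print(tau, closed)   # agree up to the O(1/N) discretization error
```

The residual difference is exactly the Rwire·Cwire/(2N) term dropped when taking N to infinity.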

13.4 Scaling Wires

  1. Derive the scaling behavior of wires

  2. Understand the distinction between local and global wires

  3. Recognize that resistance is the main culprit in the dismal behavior of wire delay scaling

  4. Realize that wire thickness is an open dimension for preserving wire conductance

  5. Distinguish wire thickness for lower local wires and higher global wires.

How significant is wire delay relative to gate delay? Historically, wire delay was very small and was considered negligible relative to gate delay, because gates were large and their capacitances dominated the time-constants of circuits. In modern circuits, wires scale much worse than gates. This has led to a pattern where gate delay consistently decreased with technology while wire delays remained constant or got worse. This trend has made wire delays as important as gate delays, if not more so. It puts a strain on PAR tools and favors designs with less routing congestion.

Consider a gate-to-gate time-constant where the wire is negligible. The time-constant consists of a transistor channel resistance and a capacitance consisting of a gate and a drain component (Sect. 3.5). If the technology advances so that all dimensions are scaled down by K, this is usually accompanied by a much smaller drop in supply voltage, which we will give the symbol U. Thus

$$ \begin{array}{*{20}c} {W,L,t_{\rm ox} \to \frac{W}{K},\frac{L}{K},\frac{{t_{\rm ox} }}{K}} \\ {V_{\rm th} ,V_{\rm DD} \to \frac{{V_{\rm th} }}{U},\frac{{V_{\rm DD} }}{U}} \\ \end{array} $$

Voltage has to scale down slower than dimensions to preserve noise margins. Notice that historically, according to Moore’s law, K has been around 2 every two technology nodes. If voltages had kept up with this scale, we would be using supplies significantly below the noise floor.

Capacitance per unit area has the following scaling behavior:

$$ C_{\rm ox} = \frac{\varepsilon }{{t_{\rm ox} }} \to \frac{K\varepsilon }{{t_{\rm ox} }} = KC_{\rm ox} $$

And thus gate capacitance has the scaling behavior:

$$ C_{\rm gate} = C_{\rm ox} WL \to KC_{\rm ox} .\frac{W}{K}.\frac{L}{K} = \frac{{C_{\rm gate} }}{K} $$

The current flowing through a velocity saturated device has the following scaling behavior:

$$ I = WC_{\rm ox} v_{\rm sat} \left( {V_{\rm gs} - V_{\rm th} - V_{\rm dssat} /2} \right) $$
$$ I \to \frac{W}{K}.KC_{\rm ox} v_{\rm sat} .\frac{{V_{\rm gs} - V_{\rm th} - \frac{{V_{\rm dssat} }}{2}}}{U} = \frac{I}{U} $$

And thus the channel resistance of a MOSFET does not scale

$$ R = \frac{V}{I} \to \frac{V}{U}.\frac{U}{I} = R $$

The scaling behavior of gate time-constant is

$$ \tau = RC_{\rm gate} \to \frac{{RC_{\rm gate} }}{K} = \frac{\tau }{K} $$

which means that the delay of a velocity saturated device scales down at the same rate as dimensions.

If we try to repeat the same analysis for wires, we hit an obstacle. The dimensions of a wire, according to Fig. 13.2, are W, L, and tw; its separation from the substrate is t. We can assume that W, t, and tw all scale down by K, the same ratio that shrinks gate dimensions.

However, the length of the wire does not scale down in any predictable pattern. The length of wires depends on placement and routing, the size of the die, and the complexity of the design. With every technology node, the dimensions of the device decrease, but typically, the size of the die increases. The ratio by which the die size increases is usually smaller than, and independent of, the ratio of device dimension scaling.

We distinguish two extreme cases for wires: long range wires, and short range wires. Short range wires are used to connect devices in the same gate. This could be, for example, the metal section used to connect PMOS and NMOS drains in a CMOS inverter. Long range wires carry the signal across large distances, usually from one side of the die to another. They normally connect modules and subsystems.

The length of short range wires can be assumed to scale the same way as device dimensions, as do W, tw, and t. The length of long range wires actually increases; thus we assume its scaling behavior is

$$ L \to LK_{w} $$

where Kw is the ratio by which the die dimension rises.

For short range wires, the capacitance scaling is

$$ C_{\rm wire} = \frac{\varepsilon WL}{t} \to \varepsilon .\frac{W}{K}.\frac{L}{K}.\frac{K}{t} = \frac{{C_{\rm wire} }}{K} $$

Resistance scaling is

$$ R_{\rm wire} = \frac{L}{{\sigma Wt_{w} }} \to \frac{L}{K}.\frac{K}{W}.\frac{K}{{t_{w} }}.\frac{1}{\sigma } = KR_{\rm wire} $$

The wire time-constant does not scale

$$ \tau = R_{\rm wire} C_{\rm wire} \to KR_{\rm wire} .\frac{{C_{\rm wire} }}{K} = R_{\rm wire} C_{\rm wire} $$

For long range wires, capacitance scaling is

$$ C_{\rm wire} = \frac{\varepsilon WL}{t} \to \varepsilon .\frac{W}{K}.LK_{w} .\frac{K}{t} = \frac{{C_{\rm wire} K_{w} }}{K} $$

And resistance scaling is

$$ R_{\rm wire} = \frac{L}{{\sigma Wt_{w} }} \to LK_{w} .\frac{K}{W}.\frac{K}{{t_{w} }}.\frac{1}{\sigma } = K_{w} K^{2} R_{\rm wire} $$

Leading to time-constant scaling of

$$ \tau = R_{\rm wire} C_{\rm wire} \to K_{w} K^{2} R_{\rm wire} .\frac{{C_{\rm wire} K_{w} }}{K} = KK_{w}^{2} R_{\rm wire} C_{\rm wire} $$

Thus wire delay at best does not scale at all, and at worst it increases linearly with technology and quadratically with the dimension of the die. Compared to the scaling of gate delays, both kinds of wires have their relative delay increase with technology. But long range wires are of particular concern.
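The scaling factors derived above can be tabulated for concrete (illustrative) values of K and Kw:

```python
# Scaling factors from the derivations above. K is the dimension
# scaling ratio, Kw the die growth ratio; both values illustrative.
K, Kw = 2.0, 1.1

tau_gate_factor = 1 / K       # velocity-saturated gate delay shrinks
tau_short_factor = 1.0        # short range wire delay does not scale
tau_long_factor = K * Kw**2   # long range wire delay grows

print(tau_gate_factor, tau_short_factor, tau_long_factor)
```

With these assumed ratios, gate delay halves each node while long range wire delay more than doubles, so the wire-to-gate delay ratio worsens by nearly 5x per node.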

Modern processes use higher conductivity metals and lower permittivity insulators to decrease resistance and capacitance, respectively. However, both options are limited by available materials and are barely enough to keep up with the effects of technology scaling on wires.

If we examine the scaling behavior of wires, we find that resistance is the primary culprit in the dismal behavior of the time-constant. Resistance always increases, and for long range wires it increases quadratically with the technology scaling parameter.

Digging deeper, we find that the problem is that the available area for current flow decreases quadratically because both W and tw decrease by a factor K. W has to scale with technology because wires have to get narrower to allow the same number of wires to fit between gates. In fact, if we use scalable design rules (Sect. 8.5), we find that the minimum width of wires is always proportional to the minimum dimensions of transistors.

However, there is no fundamental limitation on tw. The wire thickness can be scaled independently of W and t (Fig. 13.14). The wire thickness only affects the total height of the oxide stack over the substrate, a dimension we are normally not concerned with.

Fig. 13.14
figure 14

The top figure shows a scaling scheme where all dimensions of the wire scale similarly (except length). The bottom figure shows a scheme where the height (tw) of the wire does not scale or scales slower than the rest of the dimensions

If we assume the wire thickness scales by a factor Kt < K, then the time-constant of a long range wire scales by

$$ \begin{array}{*{20}c} {C_{\rm wire} = \frac{\varepsilon WL}{t} \to \varepsilon .\frac{W}{K}.LK_{w} .\frac{1}{t} = \frac{{C_{\rm wire} K_{w} }}{K}} \\ {R_{\rm wire} = \frac{L}{{\sigma Wt_{w} }} \to LK_{w} .\frac{K}{W}.\frac{{K_{t} }}{{t_{w} }}.\frac{1}{\sigma } = K_{w} K_{t} KR_{\rm wire} } \\ {\tau = R_{\rm wire} C_{\rm wire} \to K_{w} K_{t} KR_{\rm wire} .\frac{{C_{\rm wire} K_{w} }}{K} = K_{t} K_{w}^{2} R_{\rm wire} C_{\rm wire} } \\ \end{array} $$

And for short range wires:

$$ \begin{array}{*{20}c} {C_{\rm wire} = \frac{\varepsilon WL}{t} \to \varepsilon .\frac{W}{K}.\frac{L}{K}.\frac{K}{t} = \frac{{C_{\rm wire} }}{K}} \\ {R_{\rm wire} = \frac{L}{{\sigma Wt_{w} }} \to \frac{L}{K}.\frac{K}{W}.\frac{{K_{t} }}{{t_{w} }}.\frac{1}{\sigma } = K_{t} R_{\rm wire} } \\ {\tau = R_{\rm wire} C_{\rm wire} \to K_{t} R_{\rm wire} .\frac{{C_{\rm wire} }}{K} = \frac{{K_{t} }}{K}.R_{\rm wire} C_{\rm wire} } \\ \end{array} $$

If we choose the extreme case of not scaling down wire thickness at all, i.e., Kt = 1, then long range wire delay becomes dependent only on die dimension. Short range wire delay would then scale down similarly to gate delay.
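The scaling factors derived above can be checked numerically. The sketch below simply encodes the factor expressions for short and long range wires; the numeric values of K and Kw are illustrative assumptions:

```python
# Scaling factors for wire C, R, and tau = RC, as derived above.
# K: technology scaling factor (> 1), Kw: die dimension growth factor,
# Kt: wire thickness scaling factor (Kt = K is full scaling, Kt = 1 unscaled).

def short_range_factors(K, Kt):
    C = 1.0 / K          # C_wire -> C_wire / K
    R = Kt               # R_wire -> Kt * R_wire
    return C, R, R * C   # tau -> (Kt / K) * tau

def long_range_factors(K, Kw, Kt):
    C = Kw / K           # C_wire -> (Kw / K) * C_wire
    R = Kw * Kt * K      # R_wire -> Kw * Kt * K * R_wire
    return C, R, R * C   # tau -> Kt * Kw**2 * tau

# Full scaling (Kt = K): short range tau is unchanged,
# while long range tau grows as K * Kw**2.
K, Kw = 1.4, 1.1
_, _, tau_s = short_range_factors(K, Kt=K)
_, _, tau_l = long_range_factors(K, Kw, Kt=K)
print(tau_s)   # ~1.0: does not scale
print(tau_l)   # K * Kw**2
```

Setting `Kt=1.0` in `long_range_factors` reproduces the claim above: the long range time-constant factor reduces to Kw², depending only on die dimension.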

However, keeping tw unscaled or scaling it slower than other dimensions, combined with the scaling down of W increases the inter-wire capacitance. This is because the distance between wires drops while the lateral area of the wire plates facing each other increases.

Using different metal layers judiciously allows us to utilize the advantages of unscaled tw without suffering from inter-wire capacitance. In most CMOS technologies higher metal layers have larger tw while lower metals tend to be thinner. Long range wires tend to be much fewer than short range wires.

Thus lower, thinner metal layers are used to perform local short range connections. Higher metal layers are used to make the far fewer long range connections. These few wires can then be kept far apart from each other to control inter-wire capacitance. This is illustrated in Fig. 13.15.

Fig. 13.15
figure 15

Different layers for different communications. Lower layer wires are kept thin and are packed closer together. This allows local communication to happen without increased inter-wire capacitance. Higher level metals are wider, thicker, and kept further apart, reducing both their resistance and inter-wire capacitance at the expense of density. Higher level metals are used for infrequent long range routing

13.5 Interchip Communication

  1. Understand the requirements of interchip communication

  2. Understand components of the pad circuitry

  3. Realize the need for ESD protection

  4. Design a circuit for level conversion

  5. Recognize when wires need to be modeled as transmission lines

  6. Compare source and load termination scenarios in terms of signal settling behavior.

All wires on a chip eventually end up at or come from output or input pins. At the output pin, the signal travels over a PCB track to another chip’s input pin. This setup is shown in Fig. 13.16. The nature of signals on chips and on PCB tracks is very different. The PCB is an exposed, unprotected environment, and signals on PCB tracks have to travel huge distances compared to on-chip signals. Thus, signals on the PCB tend to be larger in order to better resist noise and interference. PCB tracks are also very wide to reduce resistive drops over the large distances.

Fig. 13.16
figure 16

PCB interchip communication. The PCB is an insulating substrate. Conductive copper tracks connect chip pins to each other

The wide, long PCB tracks offer considerable capacitive loads to chips. The output pins have to drive this load while transforming the signal into levels more suitable for off-chip communication. Thus pins contain complicated circuitry to manage the transition to and from the outside world.

Figure 13.17 shows a pin pad. The pad is a very large metal square on the periphery or top of the circuit depending on package type (Sect. 14.6). The pad protrudes through the overglass and is connected to the pin by metal wires during packaging, allowing communication to the outside world. As shown in Fig. 13.17, interface circuitry is attached to the pad, interceding between it and the core of the die.

Fig. 13.17
figure 17

Pin pad. The pad is a large metal surface to be contacted while bonding. Interface circuitry allows for signal conditioning. Reverse-biased guard rings isolate the pin circuitry from the core, protecting both from each other

Pads are surrounded by one or more guard rings. A guard ring is a closed loop of semiconductor material of the opposite doping type to its surrounding environment: if the pad circuitry sits in the p substrate, we use n-type rings; if it sits in an n-well or n substrate, we use p-type rings. The rings are biased to form reverse-biased junctions with their environment, isolating the pad from the rest of the circuit. This prevents the large signals in the pads from interfering with the small signals in the core and pad interface, and prevents the high frequency noise of the core from affecting the pad interface.

The pin, pad, and interface have to support the following functions:

  • Drive large off-chip capacitance for output pins

  • Provide electrostatic discharge protection for input pins

  • Provide level conversion for both types of pins

  • Protect the pad from latch-up

Drive off-chip capacitance

This problem stems from the fact that PCB tracks provide huge capacitive loads due to their length and width while die core driver gates are significantly smaller. Thus the problem is how to optimally drive a large capacitance starting from a small gate.

This problem is discussed in detail in Sect. 4.2. The best way to drive such a load is through a sequence of progressively larger inverters. The optimal sizing of each inverter and the optimal number of stages are derived systematically in Sect. 4.2 and can be easily obtained using logical effort.

If we follow the equal fan-out sizing in Sect. 4.2, the inverters in the later stages of the buffer will be very large. These inverters contain transistors with extremely large W, capable of driving the large off-chip capacitance.
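The equal fan-out sizing can be sketched numerically. The numbers below (input capacitance, load capacitance, and a target stage effort of about 4) are illustrative assumptions, not values from the text:

```python
import math

# Illustrative buffer-chain sizing: drive C_load from a gate with input
# capacitance C_in using equal fan-out stages, as in Sect. 4.2.
def buffer_chain(C_in, C_load, target_fanout=4.0):
    F = C_load / C_in                           # total electrical effort
    N = max(1, round(math.log(F) / math.log(target_fanout)))
    f = F ** (1.0 / N)                          # actual per-stage fan-out
    sizes = [f ** i for i in range(N)]          # sizes relative to first stage
    return N, f, sizes

# e.g. an assumed 10 fF core gate driving an assumed 50 pF PCB track load
N, f, sizes = buffer_chain(C_in=10e-15, C_load=50e-12)
print(N, round(f, 2))   # 6 stages, fan-out ~4.13 each
```

The last stage ends up hundreds of times wider than the first, which is why pin drivers need the extreme layout techniques discussed next.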

Figure 13.18 shows how such extreme-sized transistors are laid out. The gate is split into parallel fingers over a shared diffusion, which by itself creates a number of transistors in series. Metal lines then short the alternating source and drain regions together, placing the component transistors in parallel and creating a single equivalent, extremely wide transistor.

Fig. 13.18
figure 18

Large transistor layout

Practical pin driver transistors will consist of several of the wide transistors in Fig. 13.18 connected again in parallel, thus adding two levels of parallelism and creating an even wider transistor.

One curious aspect of the layout of output buffers is the preponderance of substrate/well contacts. While a common guideline is to add one contact per transistor, for output pads, we add as many contacts as the area allows. The reason is detailed later under “latch-up”.

ESD protection

Pins are exposed to the outside world. Static charges build up in humans, printed circuit boards, and probes that may come in contact with the chip pins. Because such bodies (especially humans) are significantly larger than chips, the amount of charge they carry can be enormous.

If this charge is transferred to an input pin, it builds up on the polysilicon gates of the first MOSFETs after the input pad. The large charge causes a huge voltage to form across the small MOSFET gate capacitance. Static charges can sit at kilovolt potentials on human bodies; when transferred to the much smaller gate capacitance of a MOSFET, they can produce even larger voltages.

Modern MOSFET oxides tend to be very thin to manage subthreshold conduction (Sect. 10.4). Thus, their breakdown voltage is very low. Voltages as low as a few dozen volts are usually enough to destroy a MOSFET irreparably.

If left unprotected, and in the absence of perfect grounding, input MOSFETs would almost certainly be destroyed. Therefore, all input pad interfaces must contain electrostatic discharge (ESD) protection circuitry.

The ESD protection circuit is shown in Fig. 13.19. The CMOS inverter is not part of the protection circuit, but rather the first CMOS gate in the die. The two MOSFET gates are the ones we need to protect from charge buildup. The protection is provided by the resistor R and the two diodes D1 and D2.

Fig. 13.19
figure 19

ESD protection circuit. Large positive or negative charges at the gates of the MOSFETs would leak through D1 and D2, respectively. This prevents breakdown fields from building on the gates of the MOSFETs

If enough charge builds up to raise the voltage of the anode of D1 above a certain limit (defined as Vx), then D1 turns on. Current then flows through D1, carrying away the excess charge that would otherwise have built up on the gates. The voltage Vx must be chosen high enough that normal operating voltages, which still have to drive the MOSFET gates, do not turn the diode on, but low enough that the diode turns on before the accumulated charge can break down the oxide.

If the built up static charge is negative, then the voltage of the cathode of D2 would drop. D2 turns on before enough negative charge builds up to break the MOSFET gates. When D2 turns on, it allows this excess charge to leak to Vss.

The resistor R plays an important role. It limits the current that flows when the diodes turn on. Without R, D1 and D2 could momentarily conduct enormous currents because of their low resistance once turned on.

The resistance R should be large enough to limit the current caused by a typical ESD event. But R should not be too high, because any current that flows through R dissipates power: not only ESD current but also switching current during normal operation. Because input pins can potentially carry very large currents, this can be a significant concern.

The resistance R is implemented as a passive semiconductor resistor, usually using the polysilicon layer or the diffusion layer, whichever has the lower conductivity. To increase the resistance, the wire is kept as narrow as the design rules allow, and a serpentine layout is used to increase the effective length, as shown in Fig. 13.20.

Fig. 13.20
figure 20

Layout of ESD resistance R
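The resistance of such a serpentine can be estimated by counting squares. This is an illustrative sketch: the sheet resistance and geometry below are assumed values, not from the text:

```python
# Estimate a serpentine resistance by counting squares:
# R = R_sheet * (L_total / W). All numbers are illustrative assumptions.
def serpentine_resistance(r_sheet, width, segment_len, n_segments):
    total_len = segment_len * n_segments   # ignoring corner corrections
    return r_sheet * total_len / width

# e.g. polysilicon with an assumed 25 ohm/sq sheet resistance, an assumed
# minimum width of 0.2 um, and 20 segments of 10 um each -> 1000 squares
R = serpentine_resistance(r_sheet=25.0, width=0.2e-6,
                          segment_len=10e-6, n_segments=20)
print(R)   # ~25 kOhm
```

This shows why the serpentine shape matters: folding 200 µm of minimum-width wire into a compact area yields a thousand squares of resistance.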

Level conversion

Off-chip signals have to be large to combat the large noise and coupling in such an environment. On-chip signals have to be much smaller to manage power and dielectric breakdown. Interface circuits must transform the level of signals between the two domains. One such level-conversion circuit is shown in Fig. 13.21.

Fig. 13.21
figure 21

Level-conversion circuit

Inputs at point A are between ground and a first supply reference level Vdd1. Outputs at point B use ground for “0” and a second supply level Vdd2 for “1”. The static CMOS inverter between A and B uses the first reference level Vdd1, while the sources of the PMOS transistors M3 and M4 are connected to the second supply Vdd2.

When input A is 0 V, M1 is off. The inverter produces an output Vdd1, which turns M2 on, causing output Y to be 0 V. When input A is Vdd1, the inverter outputs 0 V at B. M2 is off. M1 is on, passing 0 V to X. X causes M4 to turn on, passing Vdd2 to Y.

Thus 0 V passes as 0 V while Vdd1 is transferred to Vdd2. This setup also allows us to transform the signal ground, moving inputs at Vss1 to Vss2. This would happen if the sources of M1 and M2 are connected to Vss2 instead of the common ground of the inverter.

Latch-up protection

Latch-up is a concern in all CMOS circuits and is discussed in detail in Sect. 7.7. In input/output pins, latch-up is an even greater threat than in the core, because pins handle the largest current swings in the circuit. The pad interface circuits can therefore see the largest ground and supply drops, making them particularly susceptible to latch-up.

One way latch-up is prevented is by adding well and substrate contacts. These contacts reduce the resistance on the path to supply and ground, reducing the possibility of positive feedback in the parasitic latch. Pad circuitry uses as many well/substrate contacts as area allows, in contrast to normal CMOS circuitry, which typically uses one per transistor.

Inductive effects

So far we have only considered the impact of capacitance and resistance on wire delay. Wires also have inductance. On-chip, the short length of wires and their straight paths keep inductance low. Inductance becomes significant where structures are larger, particularly at pin pads. But in cutting-edge circuits, inductance can have an impact even on on-chip interconnects, especially long range wires.

Off-chip, PCB tracks have substantial inductance, and their behavior cannot be fully predicted just from capacitance and resistance. The danger of inductance is not mainly in its impact on delay, but in the way it changes signal behavior. One such way is Ldi/dt noise which is explored in Sect. 13.6.

But the main way that inductance impacts wires is that it forces us to treat them as transmission lines. Strictly speaking, all wires are transmission lines. But under certain conditions, we can treat them as RC lines.

Figure 13.22 shows a very simplified view of a transmission line (TL) wire. It is treated as a black box. Signals entering the TL on one end travel through it as an electromagnetic wave. The velocity of this wave can be calculated from the inductance and capacitance per unit length, l and c, as

Fig. 13.22
figure 22

Transmission line with unknown termination

$$ v = \frac{1}{{\sqrt {lc} }} $$

This velocity can also be related to the permittivity and the permeability of the medium and the wire as

$$ v = \frac{{c_{0} }}{{\sqrt {\mu_{r} \varepsilon_{r} } }} $$

where c0 is the speed of light in vacuum. Thus for materials with low permeability and permittivity, the velocity of the signal through the wire is very high. This makes the time of flight of the signal extremely short, allowing us to treat wires as RC structures.

If, however, the material has high inductance, or the signal is of very high frequency, then the limited velocity of the signal becomes a factor in calculating how long it takes the signal to get from one point to another. When should we ignore inductance, and when should it be taken into consideration?

In general, if the frequency of the signal is much smaller than the reciprocal of the time of flight of the signal, we can reduce the wire to RC. Thus, there are several factors that affect the decision: the frequency of the signal, the length of the wire, and the velocity through the medium.

In a wire of length Lw, the time of flight Tf is

$$ T_{f} = \frac{{L_{w} }}{v} = \frac{{L_{w} \sqrt {\mu_{r} \varepsilon_{r} } }}{{c_{0} }} = L_{w} \sqrt {lc} = \sqrt {LC} $$

where L and C are the lumped inductance and capacitance of the length of the wire, respectively. Thus transmission line effects can be ignored only if

$$ t_{\rm pd} \gg \sqrt {LC} $$

where tpd is the typical delay in the circuit. Consider how these numbers translate in chip-to-chip communication. Assume two chips communicate using PCB tracks over a length of 20 cm. If the relative permittivity of the PCB insulator is 4 and the relative permeability of copper is 1, what is the range of chip frequencies for which transmission line effects must be taken into consideration?

The velocity in the material is

$$ v = \frac{{c_{0} }}{{\sqrt {\varepsilon_{r} } }} = \frac{{3\times 10^{8} }}{2} = 1.5\times 10^{8} \,{\rm m/s} $$

The time of flight over the PCB track is

$$ T_{f} = \frac{0.2}{{1.5\times10^{8} }} = 1.33\times10^{ - 9} {\rm s} = 1.33\,{\rm ns} $$

Thus if the operating frequency is not much smaller than 750 MHz (the reciprocal of the time of flight), transmission line effects must be taken into consideration.
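The numbers in this example can be reproduced directly:

```python
# Reproduce the chip-to-chip example: 20 cm PCB track, eps_r = 4, mu_r = 1.
c0 = 3e8                  # speed of light in vacuum, m/s
eps_r, mu_r = 4.0, 1.0
Lw = 0.2                  # track length, m

v = c0 / (mu_r * eps_r) ** 0.5   # propagation velocity
Tf = Lw / v                      # time of flight
f_limit = 1.0 / Tf               # frequency beyond which TL effects matter

print(v)              # ~1.5e8 m/s
print(Tf * 1e9)       # ~1.33 ns
print(f_limit / 1e6)  # ~750 MHz
```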

Transmission lines have a characteristic impedance, a property of the geometry of the wire and the surrounding insulation. There are closed forms for the characteristic impedances of different wire geometries; in practice, most transmission lines are designed to have a characteristic impedance of 50 Ω.

Characteristic impedance is important because it defines the reflection coefficients at the terminations of a transmission line. In general, if the impedance terminating one end of a TL is ZL, then the reflection coefficient at that end is

$$ \Gamma = \frac{{Z_{L} - Z_{0} }}{{Z_{L} + Z_{0} }} $$

The reflection coefficient describes a new phenomenon: even if the signal reaches the end of the wire, part of it may reflect back along the wire. Why is this dangerous? Reflections can cause the signal to take much longer than the time of flight to settle at the receiver, thus increasing the effective delay.

Figure 13.23 shows a transmission line connecting two chips. The originating chip is called the transmitter; the chip at the far end is called the receiver. The impedance seen at the transmitter is called the source resistance. To understand the impact of reflections, we must consider specific termination cases. Both the receiver load resistance and the transmitter source resistance matter, because each defines the reflection coefficient at its end. Thus the reflection coefficients at the transmitter (tx) and receiver (rx) are

Fig. 13.23
figure 23

Transmission line between transmitting chip and receiving chip. Zs is purely resistive, thus Zs = Rs

$$ \begin{aligned}\Gamma _{\rm tx} & = \frac{{Z_{s} - Z_{0} }}{{Z_{s} + Z_{0} }} \\\Gamma _{\rm rx} & = \frac{{Z_{L} - Z_{0} }}{{Z_{L} + Z_{0} }} \\ \end{aligned} $$

At the load end of the transmission line, the signal is the sum of the incident and reflected waves, thus

$$ V_{\rm load} = V_{\rm incident} +\Gamma _{\rm rx} V_{\rm incident} = \left( {1 +\Gamma _{\rm rx} } \right)V_{\rm incident} $$

At the source, the initial wave is a voltage divider between the TL characteristic impedance and the source resistance:

$$ V_{\rm initial} = V_{s} .\frac{{Z_{0} }}{{Z_{s} + Z_{0} }} $$

Afterward, each wave arriving back at the source adds to the source voltage a contribution of

$$ \Delta V_{\rm source} = \left( {1 +\Gamma _{\rm tx} } \right)V_{\rm incident} $$

There are many scenarios for what happens to the signal based on the values of the source and load resistance. We are discussing this in the context of CMOS chips, so the “receiver” presents CMOS gates at its input. The input to a CMOS gate is purely capacitive. Thus, for all practical scenarios we will consider the load resistance to be infinite; variation comes only from the value of the source resistance.

Case 1: Zs = Zo and ZL = infinite:

In this case, the reflection factor at the load is 1 and at the source is 0.

Figure 13.24 shows how the waveform progresses at the source and the load for a matched source termination. Initially a signal is injected from the source into the transmission line according to the voltage divider between the source and the transmission line:

Fig. 13.24
figure 24

Source and load voltages for matched source, capacitive load

$$ V_{\rm source} \left( 0 \right) = V_{s} .\frac{{Z_{0} }}{{Z_{0} + Z_{0} }} = \frac{{V_{s} }}{2} $$

This signal reaches the load after one time of flight (Tf); the total signal at the load is the summation of the incident and reflected waves:

$$ V_{\rm load} \left( {{\text{T}}_{f} } \right) = \left( {1 +\Gamma _{\rm rx} } \right)V_{\rm incident} = \frac{{\left( {1 + 1} \right)V_{s} }}{2} = V_{s} $$

The reflected component is Vs/2; this reaches the source at 2Tf, for a total signal at the source of

$$ V_{\rm source} \left( {2{\text{T}}_{f} } \right) = V_{\rm source} \left( 0 \right) + V_{\rm incident} = \frac{{V_{s} }}{2} + \frac{{V_{s} }}{2} = V_{s} $$

Since the reflection coefficient at the source is zero, there is no reflection at the source and the signal settles on both ends at Vs. This shows that if the load is capacitive, then the best thing we can do is match source impedance to the transmission line. This kills reflection at the source and allows the signal to settle after a single time of flight.

Case 2: Zs = 3Zo and ZL = infinite:

This also applies to all cases where the source resistance is larger than the characteristic impedance. The source reflection coefficient is 1/2, and the load coefficient is still 1. Initially the injected signal at the source is

$$ V_{\rm source} \left( 0 \right) = V_{s} .\frac{{Z_{0} }}{{3Z_{0} + Z_{0} }} = \frac{{V_{s} }}{4} $$

After a single time of flight, the signal at the load is

$$ V_{\rm load} \left( {{\text{T}}_{f} } \right) = \left( {1 +\Gamma _{\rm rx} } \right)V_{\rm incident} = \frac{{\left( {1 + 1} \right)V_{s} }}{4} = \frac{{V_{s} }}{2} $$

The reflected component reaches the source after another time of flight:

$$ V_{\rm source} \left( {2{\text{T}}_{f} } \right) = V_{\rm source} \left( 0 \right) + V_{\rm incident} \left( {1 +\Gamma _{\rm tx} } \right) = \frac{{V_{s} }}{4} + \frac{3}{2}.\frac{{V_{s} }}{4} = 0.625V_{s} $$

At 3Tf, the signal at the load becomes

$$ V_{\rm load} \left( {3{\text{T}}_{f} } \right) = V_{\rm load} \left( {{\text{T}}_{f} } \right) + \left( {1 +\Gamma _{\rm rx} } \right)V_{\rm incident} = 0.5V_{s} + \left( {1 + 1} \right)*0.125V_{s} = 0.75V_{s} $$

The pattern expressed by the equations and Fig. 13.25 shows the signal approaching Vs asymptotically at both the source and the load. Evidently the “delay” for the signal to reach the load is much higher than a single time of flight, taking many multiples of Tf to settle. The speed of settling is dominated by the source termination rather than by the resistance and capacitance of the line.

Fig. 13.25
figure 25

Waveforms with a source load higher than characteristic impedance

Case 3: Zs = Zo/3 and ZL = infinite:

This also applies in all cases where the source resistance is lower than the line impedance. The source reflection coefficient is –1/2 and the load reflection coefficient is 1. The initial injected signal at the source is

$$ V_{\rm source} \left( 0 \right) = V_{s} .\frac{{Z_{0} }}{{Z_{0} /3 + Z_{0} }} = \frac{{3V_{s} }}{4} $$

After a single time of flight, the signal at the load is

$$ V_{\rm load} \left( {{\text{T}}_{f} } \right) = \left( {1 +\Gamma _{\rm rx} } \right)V_{\rm incident} = \frac{{\left( {1 + 1} \right)3V_{s} }}{4} = 1.5V_{s} $$

The reflected component reaches the source after another time of flight:

$$ V_{\rm source} \left( {2{\text{T}}_{f} } \right) = V_{\rm source} \left( 0 \right) + V_{\rm incident} \left( {1 +\Gamma _{\rm tx} } \right) = 0.75V_{s} + \left( {1 - 0.5} \right)*0.75V_{s} = 1.125V_{s} $$

The reflected component from the source is

$$ - 0.375V_{s} $$

At 3Tf, the signal at the load becomes

$$ V_{\rm load} \left( {3{\text{T}}_{f} } \right) = V_{\rm load} \left( {{\text{T}}_{f} } \right) + \left( {1 +\Gamma _{\rm rx} } \right)V_{\rm incident} = 1.5V_{s} + \left( {1 + 1} \right)*\left( { - 0.375} \right)V_{s} = 0.75V_{s} $$

From Fig. 13.26 and the equations, we observe ringing at both the source and the load. The signal overshoots on both ends, then rings continuously, asymptotically approaching the final value of Vs.

Fig. 13.26
figure 26

Waveforms for source impedance lower than characteristic impedance

We can easily conclude that when the destination is capacitive, source matching is the best-case scenario, reducing the delay to a single time of flight. When the source mismatches the line, there is either ringing or a slow, creeping approach to the final value. If the source is not matched and we apply a new signal before the old signal settles at the load, inter-symbol interference occurs, leading to loss of information in both symbols.
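The three cases can be checked with a small bounce-diagram calculation. This is a sketch under stated assumptions: an ideal lossless line driven by a step of amplitude Vs, with the load characterized only by its reflection coefficient gamma_rx (1 for an open, capacitive input); the helper name `load_voltage` is ours, not from the text:

```python
# Bounce-diagram simulation of a lossless transmission line.
# Returns the load voltage at t = Tf, 3Tf, 5Tf, ... for a step Vs at t = 0.
def load_voltage(Vs, Zs, Z0, gamma_rx, n_bounces=20):
    gamma_tx = (Zs - Z0) / (Zs + Z0)
    wave = Vs * Z0 / (Zs + Z0)        # initial wave launched into the line
    v_load, history = 0.0, []
    for _ in range(n_bounces):
        v_load += (1 + gamma_rx) * wave   # incident + reflected at the load
        history.append(v_load)
        wave *= gamma_rx * gamma_tx       # round trip back to the load
    return history

Vs, Z0 = 1.0, 50.0
case1 = load_voltage(Vs, Zs=Z0,     Z0=Z0, gamma_rx=1.0)  # matched source
case2 = load_voltage(Vs, Zs=3 * Z0, Z0=Z0, gamma_rx=1.0)  # slow creep
case3 = load_voltage(Vs, Zs=Z0 / 3, Z0=Z0, gamma_rx=1.0)  # ringing
print(case1[0])    # 1.0 : settles in one time of flight
print(case2[:2])   # [0.5, 0.75] : creeps toward Vs
print(case3[:2])   # ~[1.5, 0.75] : overshoots, then rings
```

Setting `gamma_rx=0.0` (a matched load) with `Zs=Z0/3` reproduces the load-matched result discussed below: the load voltage jumps to 0.75Vs in one flight and stays there.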

Source termination should always be the first choice in chip-to-chip communication. Chips naturally have capacitive inputs, which leave us no option but matching at the source. Assuming we do have the option to match at the load, would that be better than source matching? The answer is no, for two reasons: load matching dissipates power, and it does not allow the whole input signal to reach the load.

A resistive load on a transmission line means that DC current flows. Because transmission line impedance is typically low, this current can be large, leading to significant power dissipation in the source, line, and load.

But more critically, assume the source is unmatched and is equal to Zs = Zo/3, but the load is matched ZL = Zo. The initial injected signal at the source is

$$ V_{\rm source} \left( 0 \right) = V_{s} .\frac{{Z_{0} }}{{Z_{0} /3 + Z_{0} }} = \frac{{3V_{s} }}{4} $$

The reflection coefficient at the load is 0. Thus after Tf, the signal at the load is

$$ V_{\rm load} \left( {{\text{T}}_{f} } \right) = \left( {1 +\Gamma _{\rm rx} } \right)V_{\rm incident} = \frac{{\left( {1 + 0} \right)3V_{s} }}{4} = 0.75V_{s} $$

There is no reflection from the load. Thus, the load and source voltages both settle at 0.75Vs, never reaching the full swing value Vs.

13.6 Supply and Ground

  1. Recall supply/ground distribution schemes

  2. Understand the problem with resistive drops in supply lines

  3. Summarize sources of inductance in supply/ground distribution

  4. Realize Ldi/dt drops in supply can be a major problem

  5. Use a decoupling capacitor to ease ground and supply bounces

  6. Trace current and impedance behavior in the presence of a decoupling capacitor

Supply and ground run through wires, similar to signals. However, supply and ground are distinguished by the fact that they must be directly provided to all parts of the chip. With the exception of some transmission gate circuits, all combinational and sequential circuits have to be provided with supply and ground. All the supplies and grounds through the chip originate from at most a few ground and supply pins. Thus supply and ground routing inside a chip is constrained by the fact that they have single origins outside the chip. This can be seen in Figs. 8.28 through 8.30.

The main problem with supply and ground distribution is the resistance of wires. Delay is not a major concern because neither ground nor supply carries a signal. However, resistance is a major issue because it causes the supply to drop and the ground to rise for gates far away from the pins. In Sect. 8.3, we discussed several layout techniques used to reduce these drops. These included wires that grow wider the closer they are to the pin, and distributing ground and supply from both left and right, or even from all four sides of the chip. This helps confine the maximum drops to the center of the chip.

The ultimate solution to this problem is to dedicate an entire metal layer for ground and another for supply. This is particularly popular in technologies where a large number of metal layers are available. In this case an entire higher level layer becomes either a ground or supply. Because the higher level metal layers are also thicker, this allows grounds and supplies to be provided to most locations with minimal drops. Local access to ground and supply is through vias, so special attention has to be given to via resistance and to local distribution of ground and supply.

The problem with resistive supply drop is that it reduces the on-current available to the gate, which increases delay. If this occurs in a path with small slack (Chap. 6), it can cause unexpected setup-time violations. These problems depend on the particular sequence of signals that cause them, and can thus escape traditional fault detection tests (Chap. 14).

A very similar complication occurs due to parasitic inductance on ground and supply lines. We consider not only the parasitic inductance of the wires but also inductance due to the bonding pad, the pin, and even off-chip PCB tracks. Figure 13.27 illustrates these sources of inductance. The largest contributions come from off-die components, simply because their physical dimensions are much larger.

Fig. 13.27
figure 27

Sources of inductance around the ground and supply pins. The pin itself and the PCB track contribute most of the inductance

Figure 13.28 shows the parasitic inductances lumped into a single inductance in series with each of the supply and ground pins. The true supply, marked Vddt, and the true ground, marked 0 V, lie on the far side of the inductors. The ground seen by the chip is marked GND; the supply seen by the chip is marked Vdd.

Fig. 13.28
figure 28

Inductive model for ground and supply pins

With all signals steady, Vdd = Vddt and GND = 0 V. However, signals are rarely steady. In normal operation, tens of millions of CMOS gates are switching at any particular instant. A switching CMOS gate draws current from the supply or sources current into ground in order to charge or discharge its capacitances (Chap. 3).

This switching current leads to dynamic power dissipation. It is a peak current, normally the saturation current of the NMOS or PMOS depending on the direction of switching. It flows for a limited duration; it is not a steady-state current.

However, because a huge number of gates may be switching at a time, their total peak current can be large. This total peak current flows through the supply/ground distribution network and ultimately ends up flowing through the ground and/or supply pins.

Resistance in the wires can cause drops due to these currents; however, inductance can be significantly more dangerous. The inductance current–voltage equation is

$$ V = L.\frac{di}{dt} $$

Thus, the voltage drop across an inductor is not a function of the magnitude of the current, but of the rate at which the current changes. While switching, gates cause large peak currents to flow. If properly designed, gate delays are short, so these currents flow for a very short time. This leads to a very large total di/dt.

The large current slew combined with the large pin inductance leads to a large voltage drop. Thus GND does not remain at 0 V; instead it bounces up to Ldi/dt, while Vdd drops to Vddt − Ldi/dt, where di/dt is the slew rate of the total current drawn from or sourced to the chip.
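To get a feel for the magnitudes involved, consider an illustrative estimate. The inductance and current values below are assumptions, not figures from the text:

```python
# Illustrative L*di/dt estimate (all values are assumptions):
L_pin = 5e-9   # assumed ~5 nH of pin + bond wire + PCB track inductance
di = 1.0       # assumed 1 A total peak switching current
dt = 1e-9      # assumed the current ramps in ~1 ns

v_bounce = L_pin * di / dt
print(v_bounce)   # ~5 V of ground bounce, far beyond any noise margin
```

Even these modest assumed numbers produce a bounce comparable to the entire supply voltage, which is why the phenomenon is taken so seriously.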

This phenomenon is called ground bounce. It is very dangerous and complicated to estimate, since it cannot be predicted from the chip design alone: most of the inductance comes from off-die sources. Thus engineers at all design stages must be cognizant of ground bounce.

The problem with ground bounce is that it is highly variable and has a complex relation to the behavior of the circuit. When there is little switching activity in the circuit, we might not even notice it. When there is large switching activity, it might cause circuit failure. As shown in Fig. 13.29, the bounces repeat, can happen on both lines equally, both lines differently, or on either line alone. It is very difficult to predict in advance.

Fig. 13.29
figure 29

Ground and supply bounce graph. Both ground and supply see transient variations due to inductive drops. The bounces and drops can be in either direction, leading to substantial risk of latch-up

One particular case where ground and supply bounces are noticeable is when the chip is switched on. During this period, abnormally large transient currents are drawn to charge internal ground and supply planes/lines. The bounce resulting from this transient current can easily trigger a latch-up state in the chip (Sect. 7.7), which is a severe failure.

Ground bounce is usually addressed by including a decoupling capacitance, Fig. 13.30. This is a very large capacitor connected between the supply and ground rails. The decoupling capacitor can be manufactured on-chip or assembled off-chip on the PCB. If the capacitor is off-chip, it will not decouple any bounce due to inductance from on-chip sources. If it is on-chip, then the value of the capacitance is limited by area.

Fig. 13.30
figure 30

Decoupling capacitance outside a chip. In this case, C can only decouple Ldi/dt noise due to off-chip inductance

Figure 13.31 shows the action of the decoupling capacitor. We assume only ground bounce occurs. But the same discussion applies to supply bounce or to simultaneous bounce.

Fig. 13.31
figure 31

Model for capacitive decoupling. When the high frequency slew occurs, most charge is drawn into C. This can then leak through L at a low frequency

C is very large, so in steady state it forms a near-perfect open circuit, having no impact on circuit operation. During transients, when large currents are sourced, di/dt is large. This is due to both the large di and the small time dt in which these peaks occur. The small dt indicates these currents are mostly high frequency.

At high frequency, the impedance of C becomes extremely small, especially because C is large. On the other hand, the impedance of L is very high at high frequency, which is why we had significant bounces in the first place.

As shown in Fig. 13.31, i is not only sourced to true ground through L but can also flow into C. Thus i = iL + iC. The majority of the transient current will flow into the capacitor instead of the inductor because the capacitive impedance is much lower during fast current switching. This significantly reduces di/dt observed in the inductor and thus ground bounce.
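The current division above can be sketched with a magnitude-only impedance divider. This ignores phase and resonance, and the component values are hypothetical:

```python
import math

def split_transient(i_total, f_hz, l_henry, c_farad):
    """Divide a transient current between parallel L and C branches.
    Current divides inversely with impedance magnitude."""
    z_l = 2 * math.pi * f_hz * l_henry        # |Z_L| = 2*pi*f*L
    z_c = 1 / (2 * math.pi * f_hz * c_farad)  # |Z_C| = 1/(2*pi*f*C)
    i_c = i_total * z_l / (z_l + z_c)
    i_l = i_total * z_c / (z_l + z_c)
    return i_c, i_l

# Hypothetical: a 1 GHz transient, 5 nH pin inductance, 100 nF decap.
i_c, i_l = split_transient(1.0, 1e9, 5e-9, 100e-9)
# i_c is nearly the entire current; i_l is a tiny fraction.
```

At 1 GHz the inductor presents tens of ohms while the large capacitor presents milliohms, so essentially the whole transient is absorbed by C.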

The current that flows into C leads to charge accumulation on the lower plate. If charge keeps accumulating on C due to repeated current sinking, the capacitor may become overloaded leading to insulator breakdown and failure.

However, the charge stored on the capacitor does not remain there. Figure 13.32 shows the current sourced from the pin, the current into C, and the current into L. When the pin sources a large current in a small time, the majority of the current goes to C. When i dies down, the frequency content drops toward DC, and C is left with a charge, and thus a DC voltage, at its lower plate. At DC, the impedance of C shoots up, and the impedance of L drops to zero. Thus, all the charge on C starts escaping through L. However, this charge seeps out slowly, at low frequency, leading to a negligible Ldi/dt and thus negligible bounce.

Fig. 13.32
figure 32

Currents in C (ic), L (iL), and pin (i) with time

When C stores the extra charge, would it not also cause a ground bounce? Recall that I = CdV/dt. Thus, the flow of current into a capacitor causes its voltage to rise. This is the main reason we use a very large C, because the current that flows into it causes a much smaller voltage rise than the bounce that would be caused by the inductor. The bounce on C is related to the amount of sustained current while the bounce observed on L is dependent on the rate di/dt.
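Comparing the two bounce mechanisms numerically makes the point. The values below are hypothetical; the comparison, not the numbers, is what matters:

```python
def decap_rise(i_amp, dt_sec, c_farad):
    """Voltage rise on the decoupling capacitor: dV = i*dt/C (from I = C dV/dt)."""
    return i_amp * dt_sec / c_farad

def inductor_bounce(l_henry, di_amp, dt_sec):
    """Bounce the same transient would cause on the inductor: V = L*di/dt."""
    return l_henry * di_amp / dt_sec

# Hypothetical transient: 100 mA sustained for 1 ns.
dv_c = decap_rise(100e-3, 1e-9, 100e-9)     # 1 mV on a 100 nF decap
dv_l = inductor_bounce(5e-9, 100e-3, 1e-9)  # 0.5 V on a 5 nH pin
```

The capacitor absorbs the same transient with a voltage disturbance hundreds of times smaller, which is exactly why C must be large.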

To reiterate, decoupling works as follows (Fig. 13.33): when current changes suddenly, the impedance of C drops and L rises. Current flows into C causing a limited increase in voltage due to the size of C. When the current transient dies down, the voltage on C becomes DC, and the impedance of the inductor drops to near null. The stored charge on the capacitor leaks through the inductor to ground at low frequency, leading to little bounce.

Fig. 13.33
figure 33

L and C during phases of bounce resolution. 1 Most of the current goes to C, causing a limited bounce. 2 The charge on the capacitor leaks through the inductor over a longer period

Thus, the role of the decoupling capacitor is to absorb current transients while minimizing their voltage impact, and then to let these transients escape through the inductor over a much longer duration, where they do not cause bounce. Because the same amount of charge escapes over a longer period, the bounce is reduced. The rise in voltage on the capacitor is much smaller than would have been observed on the inductor because capacitor voltage is proportional to charge rather than to the rate of change of current.

13.7 Clock Networks

  1. 1.

    Understand the peculiarities of the clock signal

  2. 2.

    Distinguish skew from jitter

  3. 3.

    List sources of skew and jitter

  4. 4.

    Understand the impact of skew and jitter on setup and hold violations

  5. 5.

    Compare grids and trees for clock distribution

  6. 6.

    Understand how hybrid clock distribution methods can combine the advantages of multiple clock distribution methods.

In Chap. 6, we discussed synchronous circuits as the most important class of digital circuits. This entails a single clock distributed to a very large number of registers throughout a large die. This is not an easy task and practical designs have to think of clocks as being synchronous only in local islands.

The clock signal is distributed all over the chip. Its load is enormous. The reference clock is provided by an external pin. This pin is used as an input to a “clock driver” or a set of clock drivers. Clock drivers have the task of distributing the clock signal to the large load of registers on the chip while minimizing delay. Clock drivers can be designed as optimal inverter chains with logical effort methods used to minimize total delay (Sect. 4.2).

However, regardless of the strength of clock drivers, delays in wires create unavoidable differences between local clocks. Figure 13.34 shows a situation where the wire leading the clock to register R1 is much shorter than that for register R2. Because the clock is distributed throughout the die, differences in wire lengths can be substantial. Due to this effect, clk1 and clk2 are at different phases regardless of the design of the buffer.

Fig. 13.34
figure 34

Differential delay in clock wires. Because the clock travels longer to R2, there is an inevitable skew between clk1 and clk2. This is not a problem that an optimal driver can solve

Clk1 and clk2 will necessarily have the same frequency because they come from the same source. However, they will have a small but important relative delay to each other. This delay is called the “clock skew” and is at least partly determined by the positions of the two registers relative to the driver.
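A rough sketch of how position becomes skew, using the Elmore delay of a distributed RC line, t ≈ r·c·L²/2. The per-unit resistance and capacitance values are hypothetical:

```python
def wire_delay(r_per_um, c_per_um, length_um):
    """Elmore delay of a distributed RC wire: t ~ (R_total * C_total) / 2."""
    return 0.5 * r_per_um * c_per_um * length_um ** 2

# Hypothetical metal wire: 0.1 ohm/um and 0.2 fF/um.
t1 = wire_delay(0.1, 0.2e-15, 100)    # short run to R1: 0.1 ps
t2 = wire_delay(0.1, 0.2e-15, 2000)   # long run to R2: 40 ps
skew = t2 - t1                        # ~40 ps of skew from wire length alone
```

Because delay grows quadratically with length, a 20x longer wire contributes 400x the delay, so skew from wire mismatch accumulates quickly across a large die.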

Thus, a clock distribution network has to be carefully designed. The main objectives of a clock network are

  • Clocks should reach all registers with the same phase, or at least with a well-characterized phase shift (skew)

  • The clock has to be low on phase noise

  • Clock distribution has to be independent of process and temperature variations

  • Preserve a well-known duty cycle

  • Power and area minimization. Clock networks use substantial metal wire resources, leading to power dissipation that could run amok

  • The network should be independent of the design. In other words, a clock network that fulfills the above but is a custom fit for a single layout will not support any late-stage changes to the underlying design

So we are trying to distribute clocks with minimal distortions to all registers. Meanwhile we have to maintain simplicity, openness to modification, and low power operation. This task is made difficult by the many causes of clock distribution imperfections. While Fig. 13.34 suggests the main source of imperfection is variations in wire length, the list of possible sources of imperfection is much longer:

  • Clock generation. The clock comes from an external source. This source is often a crystal oscillator. The oscillator contains phase noise. Thus, the clock not only is imperfect at the source but it also suffers from more phase noise as it travels from the oscillator to the die core over PCB tracks, pins, and pads

  • Buffer mismatches. As we will shortly see, distributing the clock requires multiple levels of buffers. If buffers at the same level are mismatched, this could lead to differential drives and thus relative skew. Matching buffers perfectly is nearly impossible

  • Wire mismatches. This is the classical source we think of when we mention skew. However, in practice, it is only a single component of skew

  • Load mismatches. This is a major source of skew. Because different registers cannot be perfectly matched, they offer a different load to clock distribution branches. Thus even if the driving buffers are matched, and the wire traces to the register are identical, the load will still introduce skew

  • Temperature variations. Because different parts of the die have different activity, they will also have different temperature profiles. The different temperatures cause variations in signal behavior that introduce skew. They also introduce noise that translates into phase noise

  • Noise—particularly coupling noise. This leads to variations in the clock from cycle to cycle, thus phase noise

  • Power supply variations. The level of power and ground that reaches every device varies both by position and time. Because there are drops over power and ground lines (Sects. 8.3 and 13.6), the spatial variations cause variable delay from buffers and thus skew. But variations are also dependent on the momentary current being drawn, which introduces a temporal effect in the form of phase noise

We can see two major types of clock “errors”. There is a spatial phenomenon as shown in Fig. 13.34 that we call “skew”. But there is also a temporal component which represents uncertainty in the phase of the clock from cycle to cycle. We have termed this component “phase noise” so far. Phase noise is also commonly known as jitter, Fig. 13.35. We can summarize the impact of each source of clock imperfection as shown in Table 13.3. Notice that a single source of clock imperfection may manifest in the form of both or either skew and jitter.

Fig. 13.35
figure 35

Jitter in a clock signal

Table 13.3 Sources of jitter and skew

But why do we care if there is skew or jitter in a clock? Because they have a direct impact on the available clock period. This can create setup-time violations that an analysis of gate and interconnect delays alone would never reveal. Even more critically, skew and jitter can tighten the already tight conditions on hold-time. This can create new contamination paths with hold-time violations that can be very difficult to detect.

In Sect. 6.6, we deduced that the condition determining the clock period is

$$ T > t_{\rm CQ} + t_{\rm pd} + t_{\rm su} $$

However, as seen in Fig. 13.36, the available time for CLB propagation, register setup, and register propagation is changed by skew. The second edge of clk2 does not come T seconds after the first edge of clk1. Instead it arrives T + Ts seconds later.

Fig. 13.36
figure 36

Two successive registers with skew. The available period for the pipeline stage to finish calculating is changed by the skew

The period available in the presence of skew is

$$ \begin{array}{*{20}c} {T + T_{s} > t_{\rm CQ} + t_{\rm pd} + t_{\rm su} } \\ {T > t_{\rm CQ} + t_{\rm pd} + t_{\rm su} - T_{s} } \\ \end{array} $$

Thus, positive skew reduces the required clock period. Surprisingly, skew can improve performance! Does this mean that skew never causes new setup-time violations? Notice that the above result holds when data and clock flow in the same direction, which leads to clk2 being delayed relative to clk1. If the clock and data are driven in opposite directions, then clk2 is ahead of clk1. This translates into a negative Ts, which increases the required clock period and leads to new setup-time violations if skew is not considered while calculating the critical path.
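The period condition with skew can be checked mechanically. The timing numbers below are hypothetical; the sign convention follows the text (Ts positive when clk2 lags clk1):

```python
def min_period(t_cq, t_pd, t_su, t_skew):
    """Minimum clock period with skew: T > t_cq + t_pd + t_su - Ts."""
    return t_cq + t_pd + t_su - t_skew

# Hypothetical stage: 100 ps clk-to-Q, 800 ps logic, 50 ps setup.
same_dir = min_period(100e-12, 800e-12, 50e-12, +30e-12)  # 920 ps
opp_dir  = min_period(100e-12, 800e-12, 50e-12, -30e-12)  # 980 ps
```

With clock and data flowing in the same direction, 30 ps of skew buys back 30 ps of period; against the data flow, the same skew costs 30 ps.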

Still, why not always distribute clock and data in the same direction? First, we often do not have the option. Notice that routing data is a very complicated problem (Sect. 8.7). Clock distribution is also a complicated problem in its own right. Thus we cannot always choose the direction in which clocks are driven relative to data.

Also, we have to examine what happens to conditions on hold-time (Sect. 6.5):

$$ t_{\rm hold} < \hbox{min} \left\{ {t_{\rm pd} } \right\} + t_{\rm CQ} $$

Following the same approach in determining clock period, skew impacts hold-time condition in the following way:

$$ t_{\rm hold} + T_{s} < \hbox{min} \left\{ {t_{\rm pd} } \right\} + t_{\rm CQ} $$

Thus, if clock and data are distributed in the same direction, skew would tighten the hold-time condition. If they are distributed in opposite directions, skew would relax the hold-time condition. There are conflicting requirements about clock-data distribution directions. If we relax conditions on hold-time violation, we tighten conditions on setup-time violations and vice versa. As a result, most clock distribution strategies focus on reducing the magnitude of skew, not on managing the direction of clock distribution relative to data.
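The same bookkeeping applies to the hold-time condition. A sketch with hypothetical numbers:

```python
def hold_met(t_hold, t_skew, t_pd_min, t_cq):
    """Hold condition with skew: t_hold + Ts < min(t_pd) + t_cq."""
    return t_hold + t_skew < t_pd_min + t_cq

# Hypothetical: 60 ps hold, 20 ps shortest logic path, 100 ps clk-to-Q.
ok  = hold_met(60e-12, 30e-12, 20e-12, 100e-12)  # True:  90 ps < 120 ps
bad = hold_met(60e-12, 70e-12, 20e-12, 100e-12)  # False: 130 ps > 120 ps
```

The same positive skew that relaxed the setup condition eats directly into the hold margin, illustrating the conflict described above.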

Jitter has a similar impact on clock period, as shown in Fig. 13.37. Here we are assuming positive jitter on the first clock edge and a negative jitter on the second clock edge. This reduces the available period between registers R1 and R2 by two entire jitters, thus

Fig. 13.37
figure 37

Effect of jitter on available period. Because jitter is a stochastic phenomenon we have to assume worst-case edges when calculating both period and hold-time

$$ \begin{array}{*{20}c} {T - 2T_{j} > t_{\rm CQ} + t_{\rm pd} + t_{\rm su} } \\ {T > t_{\rm CQ} + t_{\rm pd} + t_{\rm su} + 2T_{j} } \\ \end{array} $$

Unlike skew, there is no “best case” for jitter. In some successive cycles, jitter might arrive so that conditions are relaxed for the clock cycle. However, this is a stochastic event, and we know for a fact that the worst-case scenario will happen at some point, and thus we have to design for it. This is as opposed to skew, where we can design so that the best case sometimes occurs. This is due to the fact that skew is a spatial deterministic phenomenon while jitter is a temporal stochastic phenomenon.

Because jitter is a random process, there is no single value of Tj. So unlike Ts, we have to use a statistical value for Tj. Using the mean value of Tj is meaningless since jitter is zero-mean. Thus Tj is usually chosen as a few standard deviations of the phase noise distribution. A value between one and two standard deviations is normally enough.
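A sketch of the jitter-aware period budget, taking Tj as k standard deviations of a zero-mean jitter distribution. All numbers are hypothetical:

```python
def min_period_with_jitter(t_cq, t_pd, t_su, sigma_jitter, k=2):
    """T > t_cq + t_pd + t_su + 2*Tj, with Tj = k * sigma of the jitter."""
    t_j = k * sigma_jitter
    return t_cq + t_pd + t_su + 2 * t_j

# Hypothetical: 10 ps rms jitter budgeted at two standard deviations.
t_min = min_period_with_jitter(100e-12, 800e-12, 50e-12, 10e-12, k=2)  # 990 ps
```

Note the factor of 2: worst-case edges on both ends of the cycle must be assumed, so the jitter budget is paid twice.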

Note that jitter does not have a direct impact on hold-time violations. The hold-time condition occurs entirely following a single active edge of the clock. It does not occur due to the budget of available time between two edges. In other words, everything that has to do with hold-time occurs and concludes shortly after a single edge. If jitter moves the active edge, it moves it for all registers simultaneously. Thus regardless of the position of the edge, no new hold-time violations can occur. Skew had the potential to create new hold-time violations because it caused different registers to observe different clocks. Thus, it forces R2 to hold on clk2 but allows R1 to generate outputs on clk1, allowing data to potentially race and create a violation.

At first glance, we might conclude that jitter principally comes from clock generation while skew principally comes from wire delay mismatches. In reality, the majority of jitter is caused by power supply variations, while the majority of skew is caused by load variations. Temperature variations also impact skew significantly, but are too slow to show up as jitter.

Clock distribution networks aim to either minimize skew or to minimize its impact. The two objectives (reducing absolute skew or reducing its impact) have their own advantages and disadvantages and lead to very different topologies.

The most intuitive clock distribution method is the grid. This is shown in Fig. 13.38. This approach has some analogies to the plane power–ground distribution method (Sect. 8.3). In this case we distribute the clock using a grid of metal lines.

Fig. 13.38
figure 38

Grid distribution of clock. An intermediate metal layer is used for routing. The mesh pattern aims to reduce effective resistance and net skew

The clock is distributed in a mesh of redundant horizontal and vertical paths. The highest metal layers are reserved for supply and ground, and thus an intermediate layer is used. The many parallel paths reduce the net resistance from the buffer chain on the top to any point in the grid.

Thus the grid method aims to minimize the absolute skew from the driver to any point on the grid. The maximum skew in Fig. 13.38 occurs at the bottom, which is farthest away from the driver.

Figure 13.39 shows a grid driven from all four directions. This minimizes the skew further by reducing the distance from any point in the grid to the nearest driver. The maximum skew now occurs in a single point in the center of the grid which is equidistant from all four drivers.

Fig. 13.39
figure 39

Clock grid driven from four sides

The size of the grid is obviously a factor in the maximum skew observed, with larger grids suffering larger worst-case skew. Thus, in very large floorplans, the layout is divided into multiple grids as shown in Fig. 13.40. If each sub-grid is driven from all four directions, the maximum skew observed anywhere is put under very tight control.

Fig. 13.40
figure 40

Multiple clock grids on a single die

The grid method for clock distribution has many advantages. It achieves low skew in most locations. It is also independent from the underlying design. The grid can distribute the clock with nearly the same skew to any functional network. Load variations will lead to differences in the achieved skew, however, no changes in the clock grid are possible or necessary if the underlying design is modified. This is particularly valuable in the late stages of the design flow.

However, grids are a brute-force approach to clock distribution. We simply throw a lot of metal at the problem in the hope of solving it. The large metal resources occupy a lot of area, practically monopolizing an entire metal layer. But more critically power dissipation can be very high in the metal resistance.

The opposite approach to the grid is the tree approach. As shown in Fig. 13.41, the tree tries to match delays at different levels of the design. Thus skew at A and B is nearly matched, but is very different from skew at C, D, E, and F. The tree approach relies on the fact that it is not the absolute value of skew that matters, but rather the relative skew between two consecutive registers in a pipeline.

Fig. 13.41
figure 41

Tree clock distribution

Thus if there are two registers at A and B, it is unimportant if they have significant skew relative to registers at C and D, as long as A and B only communicate with each other and not with C or D. Thus, the aim of the clock tree is not to minimize delay from the clock generator to every point on the floorplan, but rather to minimize the relative skew between nearby registers. This is the opposite of the grid approach.

Figure 13.42 shows a popular tree structure for clock distribution called the H tree. In the H tree, progressively smaller H shapes are used to distribute the clock. The aim is to equalize RC delays to points at the same level in the hierarchy.

Fig. 13.42
figure 42

H tree. A fractal of ever smaller H shapes distribute clock to specific locations on the die
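The equal-delay property of the H tree can be illustrated by construction. The sketch below recursively places leaf points and accumulates the wire length from the root to each branch; by symmetry, every leaf sits at the same wire length (the dimensions are arbitrary):

```python
def h_tree(x, y, half, depth):
    """Return the H-tree leaf points and the root-to-leaf wire length.
    Each level draws one H: a horizontal bar of span 2*half and two
    vertical bars of span half. All four branches at a level are
    symmetric, so the wire length is the same for every leaf."""
    if depth == 0:
        return [(x, y)], 0.0
    leaves, length = [], 0.0
    for dx in (-half, half):              # ends of the horizontal bar
        for dy in (-half / 2, half / 2):  # ends of the vertical bars
            sub, sub_len = h_tree(x + dx, y + dy, half / 2, depth - 1)
            leaves += sub
            length = abs(dx) + abs(dy) + sub_len  # wire length to this leaf
    return leaves, length

leaves, path_len = h_tree(0.0, 0.0, 8.0, 3)  # 3 levels -> 4^3 = 64 leaves
```

Matched wire lengths yield matched RC delays only if widths and loads also match, which is precisely what the skew sources listed earlier threaten.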

The tree approach is very efficient. It uses only as much metal as necessary. Thus its area and power budget are not nearly as bad as the grid approach. On the other hand, trees are very susceptible to mismatches. Note that the primary aim is to equalize delay in branches. This aim can and will be sabotaged by many sources of skew unless considered in design. Primary among the saboteurs is load mismatch.

The grid approach was independent of these mismatches because it used brute force to minimize absolute skew rather than equalize relative skew. The other disadvantage of trees is that they are tightly tied to the design. The locations of local drivers and H shapes depend on the final placement of cells and modules in the layout. Thus, if any change is made to the design, a corresponding modification has to be made to the clock network.

Practical clock distribution is usually a hybrid, combining different approaches at different levels. The ultimate aim is to reduce the impact of skew whether by removing it, or by equalizing it, or a combination of both.

Figure 13.43 shows a hybrid clock distribution network. The clock from the pin is fed to a central spine through a cascade of buffers. The central spine is a specialized tree structure. At every level in the spine, a thick metal wire is used to short all the buffer outputs, further reducing the relative skew at that level. In the final level of the spine, the clock is distributed along the vertical direction with minimum skew.

Fig. 13.43
figure 43

Hybrid clock distribution. The spine distributes the clock vertically. H trees create minimal skews to clock domains. Each clock domain is covered by a small clock grid

To distribute the clock with minimal skew along the horizontal direction would require an arbitrarily large number of horizontal spines. Thus, instead an H tree is used to distribute the clock horizontally. The tree carries the clock to local domains. The relative skew between these domains is minimized if the H tree is carefully designed. At each local clock domain, a grid is used to minimize the absolute skew within the domain.

This approach combines the advantages of grid and tree approaches. Efficiency is maintained by using an H tree to distribute the clock to large areas or domains on the chip. In each domain, the grid makes the clock network independent from the design. This allows changes to the design to be independent from the clock network, just as long as such changes are limited to within a domain.

13.8 Metastability

  1. 1.

    Recognize the need for systems with multiple clocks

  2. 2.

    Understand that setup and hold-time violations are inevitable when passing data between clock domains

  3. 3.

    Understand why metastability is a failure

  4. 4.

    Calculate the probability to enter metastability and the probability to stay in it

  5. 5.

    Calculate the MTBF for metastable failures as a function of time

Modern integrated circuits contain billions of transistors. They are also normally systems on a chip, with multiple functional subsystems coexisting on the same piece of silicon. We have so far assumed that all registers in a circuit use the same clock, skew notwithstanding. However, this is neither possible, nor is it useful.

When large systems are designed, subsystems can have wildly different critical paths. Forcing all subsystems to work synchronously is inefficient because it forces all subsystems to work at the speed of the slowest path in the entire chip. Given the fact that signals move mostly within subsystems, and occasionally between them, we are slowing down all our processing for the sake of the occasional intersystem communications.

Additionally, some blocks have to work at different clocks simply because they receive inputs or are expected to provide outputs at different rates. This is particularly common in communication systems where the transceiver moves from symbol to bit domain, requiring very different clock speeds.

Thus, we often end up using multiple clocks on the same chip. By multiple we mean a few. On-chip and especially on-FPGA resources rarely allow anything above nine independent clocks. Notice that independent clocks mean independent clock distribution networks (Sect. 13.7), which imposes severe conditions on the metal layer and on the place and route tool.

Notice also that clocks we call “different” are normally clocks with totally independent frequencies. Clocks whose frequencies are multiples of another clock are usually much easier to handle, and are often not considered “different”.

So what is the problem with using multiple clocks? The issue occurs when data tries to cross from a block using clk1 to a block using clk2. This is the “occasional” inter-subsystem communication we discussed above. However occasional this communication is, it happens, and it must happen reliably (Fig. 13.44).

Fig. 13.44
figure 44

System on a chip. Subsystems are best allowed to operate at their own speed. Inter-subsystem communication is much less frequent than intra-subsystem communication

What happens if we simply pass the data from block 1 using clk1 to block 2 using clk2? As shown in Fig. 13.45, most of the time this would work fine. We might not get a word registered (received) by block 2 in every cycle, but this is to be expected and should not be considered a failure because block 2 understands that there might not be a new sample in every cycle.

Fig. 13.45
figure 45

The problem with passing data between clocks: setup and hold-time violations will inevitably happen

The main problem with crossing clock domains is the creation of unexpected setup and hold-time violations. This is marked clearly in Fig. 13.45. In the cycle marked "setup-time violation", data from block 1 arrives so close to the edge of clk2 that it causes a setup-time violation. This happens when data from block 1 arrives less than tsu2 before the active edge of clk2. A similar problem occurs if the data arrives within thold2 after the active edge of clk2. Thus, there is a window around the active edge of clk2 that is problematic.

The above event will occur with predictable regularity as long as the frequencies of the two clocks are independent. Next, we have to understand why this setup/hold violation is problematic, how it manifests at the output of a register, and how likely it is to occur.

To understand why setup-time violations are harmful, consider Fig. 13.46, recreated from Chap. 6. Setup time is instituted to allow data to pass through and settle at the outputs of inverters I2 and I3. “Settling” in this context means that the inverters are in the low-gain range of their VTC, producing either a DC low output or a DC high output.

Fig. 13.46
figure 46

Internal view of a register, showing setup-time violation. This is reproduced from Chap. 6

If the positive edge of the clock arrives under this condition, the outputs of I2 and I3 will be complements. The positive feedback will close through T2 and the state would be preserved correctly in the master latch.

A setup-time violation in the context explained in Fig. 13.46 entails that not enough time is given for the output to “settle” at the output of either or both of I2 and I3. We can focus on I2 for now. A setup-time violation means that the active edge catches the output of I2, Qi, at a transitional point. Thus, as shown in Fig. 13.47, the inverter is in neither low-gain region. It is in the high-gain transitional region caught in the middle while switching between two low-gain regions.

Fig. 13.47
figure 47

Metastable inverter VTC. If a setup-time violation occurs, the input to master latch inverters is in the high-gain “metastable” region. This is a failure, since proper operation necessitates both latch inverters to be in the low-gain stable regions

An inverter caught in the high-gain region is often called “metastable”. This is strictly a misnomer, but is so commonly used that its use is expected. Strictly speaking, the only “metastable” point in the VTC in Fig. 13.47 is the point (Vin = Vm, Vout = Vm). According to the regenerative property in Chap. 3, the inverter is unstable in the metastable region and will exit it in a few inverter stages. The only stable regions for an inverter are the low-gain regions.

However, the exact midpoint (Vin = Vm, Vout = Vm) is an equilibrium. If the input is exactly at Vm, the output will also be at Vm, and no matter how many inverter stages are cascaded, the outputs will all remain at Vm. However, observing this stability requires the input to sit exactly at the inverter logic threshold. Any deviation will cause the inverter chain to diverge and exit the high-gain region. The exact equations that describe this will be derived shortly. Because we can never ensure any signal holds an exact value, due to noise, interference, and mismatches, it is inevitable that the inverter will leave the logic threshold point. Thus the point is metastable rather than stable.
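The divergence can be sketched with the standard small-signal model of a regenerative circuit: a perturbation v0 away from the metastable point grows as v(t) = v0·e^(t/τ), so the time to reach a valid logic level depends logarithmically on v0. The time constant and voltage levels below are hypothetical:

```python
import math

def resolution_time(tau_sec, v0_volt, v_exit_volt):
    """Time for a perturbation v0 to regenerate to v_exit:
    v(t) = v0 * exp(t/tau)  ->  t = tau * ln(v_exit / v0)."""
    return tau_sec * math.log(v_exit_volt / v0_volt)

# Hypothetical regeneration constant tau = 50 ps, exit level 0.5 V.
t_mv = resolution_time(50e-12, 1e-3, 0.5)  # ~311 ps for a 1 mV start
t_uv = resolution_time(50e-12, 1e-6, 0.5)  # ~656 ps for a 1 uV start
```

The closer the circuit starts to the threshold, the longer it takes to escape; since there is no lower bound on the starting perturbation, there is no upper bound on the resolution time.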

In a less formal manner, we call an inverter caught with an input anywhere in the transition region metastable. If I2 is metastable and the active edge occurs, then the outputs of I2 and I3 will be in the metastable region. An equally bad situation occurs if I2 manages to settle, but I3 is still metastable.

In Sect. 3.2, we showed that inverters in the metastable region will “quickly” regenerate values to full logic after only a few inverter stages. When T2 closes the feedback, I2 and I3 form a loop. This loop should allow the output Qi to settle at a full logic value “quickly”, so why is metastability even a problem? For two reasons:

  • When metastability occurs, there is no guarantee how the inverters will exit it. We know that eventually I2 and I3 should end up in a stable state. However, there is no guarantee that Qi will settle at the correct value of D. What values I2 and I3 resolve to depends on the initial conditions when metastability occurred, as well as device variations in I2 and I3, and noise. This behavior is nearly impossible to predict. We can only assume that we cannot guarantee a resolution to the correct value of D

  • It can take a very long time for I2 and I3 to resolve. Although the high gain of the transitional region suggests a quick resolution of metastability, the time to resolve is also a very strong function of the initial voltage difference between Qi and the output of inverter I3. In all cases, the additional time to resolve metastability is an unbudgeted overhead after the active edge of the clock. Thus, the output Q will not be a valid copy of D a time Tcq after the active edge. This represents a departure from the correct behavior of a register, and thus a failure

Metastability does not mean that we observe electrical values between logic “1” and logic “0” at the output of the register. In some cases such intermediate values can be observed at the output of the master latch Qi. However, Qi is an internal signal that the designer does not get to see.

The output of the register is two inverters removed from the node Qi. These two inverters I4 and I5 in Fig. 13.46 have high-gain transitional regions. Thus, even if Qi is an intermediate value, Q appears as either “1” or “0” according to Sect. 3.3.

However, as shown in Fig. 13.48, due to metastability, point Qi can resolve in many different ways. It can overshoot then settle. It can also resolve out of metastability asymptotically. Depending on the logic thresholds of I4 and I5 and the behavior at Qi, Q can glitch, or it can make a single transition. In all cases, however, metastability is a failure for the following reasons:

Fig. 13.48
figure 48

Behavior at points Qi and Q. Qi can ring (2), or asymptotically approach either value (1 and 3). Q is never an intermediate value. It can glitch and settle down (2), rise to “1” (1), or remain at “0” (3)

  • If Q glitches, that means it settles to a logic value at a time more than Tcq after the clock edge

  • If Q does not glitch, it will settle to a value a time more than Tcq after the clock edge

  • In both the above cases, there is no guarantee that the output Q will settle to the correct value of D

  • Even if Q settles to the “correct” value of D, the fact that it settles more than Tcq after the edge is as good as resolving to the wrong value

Back to the two clock domains in Fig. 13.45. If data passes from clk1 domain to clk2 domain, then a setup-time violation will inevitably occur with a certain frequency. When this violation occurs, block 2 is basically unable to tell if the data coming from block 1 belongs to the upcoming cycle or to the cycle that just ended. This causes a metastable behavior, and the data at the output of the first register in block 2 will appear later than Tcq after the active edge of clk2.

Because all registers downstream in block 2 have clock cycles calculated based on data exiting registers after Tcq, this leads to the wrong values being registered and propagated in the receiver. These wrong values are again not intermediate electrical values; they are either “0” or “1”, but are certainly incorrect. The wrong bit is propagated downstream in the receiver subsystem leading to overall system failure.

Metastability is inevitable when crossing clock domains, and it causes unmitigated failure. To address metastability, we have to find ways to reduce its impact. To do this, we have to figure out how commonly it occurs. After all, if metastability is a rare or infrequent event, why not just ignore it?

Figure 13.49 shows the time budget around the active edge of clk2. This is the budget that determines if we see metastability. The cycle has a period T2. There is a window around the clock edge where an arriving signal from block 1 is metastable. This window is Tsu + Thold, with Tsu seconds before the edge and Thold seconds after the edge. Let Tm = Thold + Tsu.

Fig. 13.49
figure 49

Window for metastability

To find the probability that metastability occurs, we assume that data from block 1 is uniformly distributed over period T2. This assumption is acceptable when the two clock frequencies are fully independent. In this case, the probability that metastability occurs is

$$ P\left( {\rm metastability} \right) = \frac{{T_{m} }}{{T_{2} }} $$

and data arrives from block 1 every T1, thus the rate at which metastability occurs is

$$ R\left( {\rm metastability} \right) = \frac{{T_{m} }}{{T_{2} }}f_{1} = \frac{{T_{m} }}{{T_{1} T_{2} }} $$

Metastability is a failure that happens as the circuit is operating, and the rate R above is its failure rate. This makes metastability a matter of reliability. From Sect. 14.8, we see that mean time between failures (MTBF) is the most useful measure of reliability. MTBF answers the question: if a failure occurs, how long will it take for the same failure to repeat? If MTBF is high for any kind of failure, then we do not need to solve it. This is simply because if the failure occurs, all we have to do is reset the system, and the same failure will not repeat for a long time, represented by MTBF. MTBF is the reciprocal of the failure rate.

Assume typical values for the clocks:

$$ \begin{array}{*{20}c} {T_{m} = 10\,{\rm ps}} \\ {T_{1} = 2\,{\rm ns}} \\ {T_{2} = 1\,{\rm ns}} \\ \end{array} $$

These numbers assume a very tight Tm, thus favoring a lower failure rate. However, we still find a high rate R:

$$ R = \frac{{10\times10^{ - 12} }}{{2\times10^{ - 9} \times 1\times 10^{ - 9} }} = 5\times10^{6} \,{\rm s}^{ - 1} = 5/\upmu {\rm s} $$

This translates into MTBF of

$$ MTBF = 0.2\,\upmu s $$

Thus if a failure due to metastability occurs and we fix it or ignore it, then the next metastability would occur only 0.2 μs later. So obviously metastability is a major issue. It fundamentally interferes with the operation of circuits that pass data between clock domains, rendering them useless. And it happens so frequently that simply ignoring it is unviable. Similar results would be obtained for a very large range of Tm, T1, and T2. Thus none of these parameters can be used to manage MTBF. We need a more fundamental solution.
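These figures are quick to verify numerically. The short sketch below recomputes the rate and MTBF from the values above:

```python
# Values from the text (seconds)
T_m = 10e-12   # metastability window, Tsu + Thold
T1 = 2e-9      # transmitter clock period
T2 = 1e-9      # receiver clock period

rate = T_m / (T1 * T2)   # metastability events per second
mtbf_us = 1e6 / rate     # MTBF in microseconds

print(round(rate))          # 5000000 -> 5 events per microsecond
print(f"{mtbf_us:.1f}")     # 0.2
```

Note that the rate scales only linearly with Tm, T1, and T2, which is why no realistic choice of these parameters can push the MTBF into an acceptable range.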

To find a solution for metastability, we have to better understand how latches exit it. Figure 13.50 shows a model for inverters I2 and I3 and their output nodes X and Qi. The “output” of this circuit is actually the differential voltage VqiVx. We call this differential voltage Vd. When the absolute value of Vd reaches Vdd, the two inverters are considered to have exited metastability.

Fig. 13.50
figure 50

Model for metastable latch. Each amplifier represents one of the inverters I2 and I3 in the transition region. C is the parasitic capacitance at the output of either inverter. We assume everything is matched for simplicity

The two inverters I2 and I3 are in the transitional region, thus their transistors are saturated, and they have to be modeled as amplifiers. The model is shown in Fig. 13.50. The two transistors in each of the inverters are simultaneously in the saturation region, leading to amplifier-like behavior. A deeper discussion of inverters acting as amplifiers is presented in Sect. 12.6.

The transconductance of either inverter is the parallel transconductance of its NMOS and PMOS:

$$ g_{m} = g_{\rm mn} + g_{\rm mp} $$

We will assume inverters I2 and I3 are matched and thus have the same transconductance. Mismatches between inverters are inevitable; however, the general conclusions we draw here are not fundamentally affected by such mismatches.

We also assume the capacitive loading at the outputs of both inverters is equal. This assumption is unlikely to hold exactly, because the complex load offered at the output of I2 will not, in general, equal the loading at the output of I3.

With the matching assumptions in mind, we can perform KCL at X and Qi, the current exiting the inverter amplifiers equals the current entering the capacitors:

$$ \begin{array}{*{20}c} { - g_{m} V_{x} = C.\frac{{{\rm d}V_{\rm Qi} }}{{{\rm d}t}}} \\ { - g_{m} V_{\rm Qi} = C.\frac{{{\rm d}V_{x} }}{{{\rm d}t}}} \\ \end{array} $$

Note the negative signs, which reflect the inverting nature of the amplifiers. Subtracting the second equation from the first:

$$ \begin{array}{*{20}c} {g_{m} \left( {V_{\rm Qi} - V_{x} } \right) = C.\frac{{{\rm d}(V_{\rm Qi} - V_{x} )}}{{{\rm d}t}}} \\ {g_{m} V_{d} = C.\frac{{{\rm d}V_{d} }}{{{\rm d}t}}} \\ \end{array} $$

The time-constant at points X and Qi is

$$ \tau = \frac{C}{{g_{m} }} $$

The differential equation has an exponential solution:

$$ V_{d} = V_{d} \left( 0 \right)e^{t/\tau } $$

where Vd(0) is the initial voltage difference when clk2 catches I2 and I3 in a metastable state; in short, how far apart Vx and VQi are, and thus how far both are from the true metastable point. We can assume that once Vd equals Vdd, the inverters have exited metastability. In fact, it is enough for the two nodes to just enter the low-gain region of the VTC; we do not need to wait for them to reach the full rail values. According to Fig. 13.47, Vd can be as low as Vih − Vil. Thus, to find the time to exit metastability, we use Vd = Vdd, knowing that in reality a slightly lower value would do. This allows us to calculate Texit, the time to exit metastability:

$$ T_{\rm exit} = \tau \ln (V_{\rm DD} /V_{d} \left( 0 \right)) $$
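A small numerical sketch shows how weakly Texit depends on the supply and how strongly it depends on the initial imbalance. The τ and Vdd values below are assumed for illustration, not taken from the text:

```python
import math

tau = 20e-12   # assumed time constant C/gm: 20 ps (illustrative)
Vdd = 1.0      # assumed supply, volts

# Texit grows only logarithmically as the initial imbalance Vd(0) shrinks,
# but it is unbounded: a vanishingly small Vd(0) gives an arbitrarily long exit
for vd0 in (0.5, 1e-3, 1e-6, 1e-9):
    t_exit = tau * math.log(Vdd / vd0)
    print(f"Vd(0) = {vd0:g} V -> Texit = {t_exit * 1e12:.1f} ps")
```

Each factor of 1000 reduction in Vd(0) adds the same fixed increment (τ ln 1000 ≈ 138 ps here) to the exit time.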

Thus, the time to get out of metastability is a function of the time-constant at the output of the inverters. But more critically, it is a strong function of the ratio of the final and initial voltages. Assume Vd(0) is uniformly distributed between 0 V and Vdd. We are still metastable at time t if the growing difference $V_{d}(0)e^{t/\tau}$ has not yet reached Vdd, i.e., if $V_{d}(0) < V_{\rm DD}e^{-t/\tau}$. For a uniform distribution, this probability is simply $e^{-t/\tau}$: it is 1 at t = 0 and drops exponentially toward 0 as t approaches infinity. Thus, the probability that we are still in metastability after a time t is

$$ P\left( {metastable,t} \right) = e^{ - t/\tau } $$

This is the conditional probability that we are still metastable at time t given metastability has already occurred. The probability that we both enter metastability and have not exited at time t is

$$ \begin{array}{*{20}c} {P\left( {still\,metastable,t} \right) = P\left( {metastability} \right).P\left( {metastable,t} \right)} \\ {P\left( {still\,metastable,t} \right) = \frac{{T_{m} }}{{T_{2} }}e^{ - t/\tau } } \\ \end{array} $$

So if metastability occurs, what is the probability that the metastability has not resolved a full cycle later:

$$ P\left( {still\,metastable,T_{2} } \right) = \frac{{T_{m} }}{{T_{2} }}e^{{ - T_{2} /\tau }} $$

And the rate of this event is

$$ R\left( {still\,metastable,T_{2} } \right) = \frac{{T_{m} }}{{T_{1} T_{2} }}e^{{ - T_{2} /\tau }} $$

Leading to an MTBF of

$$ MTBF = \frac{{T_{1} T_{2} }}{{T_{m} }}.e^{{T_{2} /\tau }} $$

Substituting for the typical values:

$$ \begin{array}{*{20}c} {T_{m} = 10\,{\rm ps}} \\ {T_{1} = 2\,{\rm ns}} \\ {T_{2} = 1\,{\rm ns}} \\ {\tau = 10\,{\rm ps}} \\ \end{array} $$

Gives an MTBF of

$$ \begin{array}{*{20}c} {MTBF = \frac{{T_{1} T_{2} }}{{T_{m} }}e^{{T_{2} /\tau }} = \frac{{2000\times1000}}{10}\times e^{{1000/10}} \,{\rm ps}} \\ {MTBF = 1.7\times10^{20} \,{\rm billion}\,{\rm years}} \\ \end{array} $$

Thus, if we observe the output after a full cycle has passed and a failure occurs due to metastability, the same failure would recur only after billions of billions of years, which we can safely say is as good as “never”.
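The arithmetic is easy to check directly. The sketch below reproduces the result using τ = 10 ps, the value implied by the exponent e^(1000/10) in the computation above:

```python
import math

T_m, T1, T2 = 10e-12, 2e-9, 1e-9   # seconds, values from the text
tau = 10e-12                        # 10 ps

mtbf_s = (T1 * T2 / T_m) * math.exp(T2 / tau)          # MTBF in seconds
billion_years = mtbf_s / (3600 * 24 * 365.25 * 1e9)    # convert to billions of years

print(f"{billion_years:.1e}")   # 1.7e+20
```

The exponential factor e^(T2/τ) is what transforms the MTBF from fractions of a microsecond to cosmological timescales; the prefactor T1T2/Tm barely matters.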

Thus, the simplest solution to metastability is not to sample inputs incoming from clock domain 1 at the edge of clk2, but to delay such sampling a whole period of clk2. This might sound wasteful, but it has a pipeline-like nature that means we do not actually “waste” the extra period.

However, finding a systematic way to communicate using this result is difficult. We will do so in the next section. And along the way, we will discover that using this approach to synchronize the two clock domains is only viable for very occasional communications. Burst communication requires different hardware.

13.9 Synchronization

  1. 1.

    Use a two register synchronizer

  2. 2.

    Understand scenarios for the two register synchronizer

  3. 3.

    Trace handshaking using synchronizers

  4. 4.

    Design an asynchronous FIFO

  5. 5.

    Understand the conservative nature of synchronizers in asynchronous FIFOs

  6. 6.

    Understand when asynchronous FIFOs are useful.

Limiting metastability by introducing a cycle of delay is best done in practice by passing data coming from clk1 through two cascaded registers using clk2. Because the output Q2 in Fig. 13.51 samples Q1 a whole cycle later, the MTBF of metastability at Q2 is huge as derived in Sect. 13.8.

Fig. 13.51
figure 51

Two register synchronizer. D is synchronous with clk1, Q2 is synchronous with clk2

The two register approach in Fig. 13.51 is a popular approach to synchronization. We can show how a bit entering D1 in a metastable window will not show up as metastable on Q2. This can be done by tracing all the different ways Q1 can change and observing how this appears on Q2. However, two critical notes help in understanding the behavior at Q2:

  • The value at Q1 due to metastable latching in the master of the first register is never an intermediate value. Q1 will be a full logic value, albeit not necessarily the correct one. This is because there are four inverters in series from D to Q1, helping regenerate Q1 into a full logic value (Sect. 13.8)

  • If the first register is metastable, this can manifest on Q1 in one of three ways: Q1 can glitch, it can register the correct value after more than Tcq, or it can register the wrong value after more than Tcq

Note that if D is maintained for another clk2 cycle, the probability that on the next active edge Q1 has not resolved into the correct value is negligible according to the MTBF calculations in Sect. 13.8. If D makes the transition from 0 to 1, we can distinguish the following scenarios about the behavior of Q1 and Q2. These scenarios are shown in Fig. 13.52 and summarized in Table 13.4:

Fig. 13.52
figure 52

Scenarios 1, 2, and 3 from Table 13.4

Table 13.4 Behavior of two register synchronizer over three cycles
  1. 1.

    The first register can catch the correct value of the transition at D. This can be either because metastability has not occurred, or because metastability has occurred but Q1 has resolved to the correct value of D. There is an important difference between the two cases: if metastability has not occurred, then D appears on Q1 after Tcq; if metastability has occurred, D appears on Q1 more than Tcq after the edge. In both cases, register 2 will see the correct value on Q1 with no setup-time violations on the next clock edge. Thus, the correct value of D appears on Q2 in cycle 2. But is there not a chance that the metastable case takes too long to resolve, causing a setup-time violation at the input of R2? Yes, but as shown in the calculation at the end of the last section, the MTBF of such an event is astronomically large

  2. 2.

    The first register can completely miss the transition in D. Thus Q1 does not sample the transition on D at all, instead remaining at its old value. This can be because D changes after the window of clk2, or because metastability occurs and resolves to the wrong value. In the next cycle, Q1 will certainly register the correct value of D. This correct value will appear on Q2 on the third cycle

  3. 3.

    The first register can glitch, going to the correct value, and back to the wrong value. Q2 will sample Q1 in the first cycle before the glitching happens, because metastability manifests after Tcq. By the time the second cycle comes along, Q1 will sample D without any setup-time violations. This sampling is then handed over to Q2 on the third edge of clk2

Notice what the three scenarios have in common: when a transition occurs on Q2, that transition is correct. It does not glitch, it is not a resolution to a wrong value, and it is free from metastability.

However, there is uncertainty about when exactly the signal appears on Q2. In some scenarios it appears on the second edge, in some others it appears on the third edge. When the signal appears on the second edge, there is no extra delay in the process, it is simply a pipeline. If it appears on the third cycle, then a whole period was “wasted”. However, this waste is not truly waste, it is the inevitable overhead of sometimes having to wait for metastability to resolve itself.
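The scenarios can be condensed into a tiny check. In the sketch below, each scenario is reduced to the Q1 value that register 2 samples on clk2 edges 2 and 3, an illustrative abstraction of the waveforms in Fig. 13.52; in every case Q2 is monotonic and correct by edge 3:

```python
# Q1 as sampled by register 2 on clk2 edges 2 and 3, after D rises 0 -> 1
scenarios = {
    "caught correctly": [1, 1],  # R1 latched the new D on edge 1
    "missed entirely":  [0, 1],  # R1 kept the old value, catches D on edge 2
    "glitched":         [0, 1],  # the glitch settled back before edge 2 sampled Q1
}

for name, q1_samples in scenarios.items():
    q2 = list(q1_samples)           # Q2 simply mirrors Q1 one edge later
    assert q2 == sorted(q2), name   # Q2 never glitches (monotonic)
    assert q2[-1] == 1, name        # correct value by edge 3 at the latest
    print(f"{name}: Q2 over edges 2 and 3 = {q2}")
```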

Signals that pass between clock domains are usually buses rather than single bit signals. If we use the double register synchronizer to try to synchronize a data bus across two clock domains as shown in Fig. 13.53, we face a fundamental problem. The one cycle uncertainty about when data is registered in the receiver makes reading the bus impossible. Some bits in the bus are synchronized on the second cycle, some are synchronized on the third. While we can always read on the third cycle, this reduces the efficiency of the method.

Fig. 13.53
figure 53

Using synchronizers to pass a bus of data. Because Q0 through Q3 can independently transition on the second or third cycle, this method is not viable

Another issue is how long the data should be held at the output of the transmitter. The transmitter needs to keep its data unchanged until the receiver has correctly registered the data. So there has to be some feedback information from the receiver to the transmitter telling the latter that it is safe to change data.

Synchronizers are systematically used to manage clock domain transitions by synchronizing single bit control signals. These signals are handshake controls that manage the exchange of data between the transmitter and the receiver.

The setup is shown in Fig. 13.54. Data is readied on the data bus synchronous with the transmitter clock clk1. Simultaneously, the req flag is raised, also on clk1, indicating that the transmitter is requesting a transfer to the receiver.

Fig. 13.54
figure 54

Two register synchronizer used to transfer bus data

The req flag is synchronized to the receiver clock clk2 using a two register synchronizer. When the receiver senses req, it will register data. It will then generate an acknowledge flag, ack, on the edge of clk2. Ack is synchronized to the transmitter clock, and when read indicates data has been received properly.

The steps of handshaking are

  1. 1.

    A data word is prepared on the bus synchronous with clk1. Simultaneously req is raised on clk1

  2. 2.

    Req is synchronized to the receiver clock clk2 using a two register synchronizer. The receiver will either read the req on the second or third edge of the clock

  3. 3.

    Because req and data were generated on the same clock, when req is synchronized at the receiver, it is safe to read the entire data bus. Thus the receiver registers the data bus

  4. 4.

    The receiver raises the “ack” flag on clk2, indicating the word has been read

  5. 5.

    The ack flag is synchronized to the transmitter clock using a two register synchronizer. The transmitter reads the ack flag on the second or third edge

  6. 6.

    When the transmitter reads ack, it lowers the req signal, indicating it now understands the receiver has registered the data and it is now ready to send a new word

  7. 7.

    The falling req signal is synchronized to the receiver, the receiver then lowers the ack signal, indicating it understands the transmitter might send a new word

  8. 8.

    The falling ack signal is synchronized to the transmitter, indicating the receiver is ready to receive a new word. The cycle can now restart and a new word can be transmitted.
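The eight steps above can be traced with a small cycle-level model. The sketch below assumes, purely for illustration, that clk1 and clk2 tick together and models each two-register synchronizer as a two-stage delay line; all names are hypothetical:

```python
from collections import deque

def run_handshake(word):
    """Trace the four-phase req/ack handshake for one word.
    Illustrative assumptions: both domains step on the same loop iteration,
    and each two-register synchronizer is a two-stage delay pipe."""
    req, ack = 0, 0
    req_pipe = deque([0, 0], maxlen=2)  # req synchronized into the clk2 domain
    ack_pipe = deque([0, 0], maxlen=2)  # ack synchronized into the clk1 domain
    bus, received = None, None
    tx, rx = "raise_req", "wait_req"
    for cycle in range(1, 40):
        req_pipe.append(req); req_seen = req_pipe[0]  # receiver's view of req
        ack_pipe.append(ack); ack_seen = ack_pipe[0]  # transmitter's view of ack
        # Receiver: register data when req arrives, then track its falling edge
        if rx == "wait_req" and req_seen:
            received = bus           # safe: data was stable before req rose
            ack = 1
            rx = "wait_req_low"
        elif rx == "wait_req_low" and not req_seen:
            ack = 0
            rx = "wait_req"
        # Transmitter: raise req with data, drop it on ack, wait for ack to fall
        if tx == "raise_req":
            bus, req = word, 1
            tx = "wait_ack"
        elif tx == "wait_ack" and ack_seen:
            req = 0
            tx = "wait_ack_low"
        elif tx == "wait_ack_low" and not ack_seen:
            return received, cycle   # full four-phase cycle complete
    return received, None

print(run_handshake(0xAB))  # (171, 9)
```

Even with both domains at the same rate and each crossing costing a fixed two cycles, a single word consumes nine cycles end to end, which previews the overhead calculation below.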

This handshaking approach is very safe. The MTBF of failures due to metastability is as shown in Sect. 13.8, indicating metastability is as good as solved. However, the large number of steps and the cycles consumed in every step suggest this method has a significant delay overhead.

To calculate the overhead for sending a single word on the bus, we calculate the number of cycles in every step. Steps where synchronization takes place between the two clock domains consume 1 or 2 cycles, so we will assume that on average they consume 1.5 cycles. Steps where data is simply received by one side take only one cycle. Thus, the total cycle budget (written in order) is

$$ {\text{Cycles}}\,{\text{to}}\,{\text{send}}\,{\text{and}}\,{\text{receive}}\,{\text{one}}\,{\text{sample}} = 1 + 1.5 + 1 + 1 + 1.5 + 1 + 1.5 + 1.5 = 10 $$

This budget assumes very efficient data management on both sides, where any two actions that can be done in the same cycle are. Note also that it is not strictly correct to add the cycles as we did above because some of the cycles are from clk1 while others are from clk2. But in the best case, it takes at least 10 of the shorter of the two clock cycles to get a single data word across, which is very inefficient.
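Summing the listed per-step costs directly (simple addition, ignoring the clk1/clk2 mix noted above):

```python
SYNC, LOCAL = 1.5, 1.0   # average synchronized step vs single-domain step
# Steps 1 through 8 of the handshake, in order
steps = [LOCAL, SYNC, LOCAL, LOCAL, SYNC, LOCAL, SYNC, SYNC]
print(sum(steps))  # 10.0
```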

More efficient handshaking approaches can be found. And some clever solutions, like sending multiple words in parallel on the data bus in Fig. 13.54, can reduce the overhead. But any method that uses handshaking for synchronizing has fundamentally high overhead. Thus handshaking is only viable for communication channels where words are transmitted only occasionally. It is also useful when one of clk1 and clk2 is many multiples of the other.

However, when we need to communicate long bursts of data between two clock domains with unrelated but close clock frequencies, another approach has to be followed. This approach is the asynchronous FIFO.

A FIFO is at its core a RAM. But as shown in Fig. 13.55, it has some additional circuitry to indicate the status of the queue. The FIFO has two ports. One port is used exclusively for writing, the other for reading. There are two pointers in the FIFO, the read pointer and the write pointer. The read pointer indicates the next position we should read from. Thus, if we have just read from position 2, the read pointer should be 3. When the next read is performed, we read from position 3. The write pointer is very similar, it indicates the next position to write to.

Fig. 13.55
figure 55

Asynchronous FIFO schematic

There are two flags derived from the read and write pointers. These flags are critical for correct operation of the FIFO. Notice that the write port must not write to a position that has not been read yet. Only a clear address can be written, and an address is “clear” only if it has already been read.

But what if there are no positions that can be written to? In this case, the entire FIFO is yet to be read, and we must wait until a read is performed before we can write. In this condition, we say the FIFO is full. A FIFO is full when all its locations are unread, Fig. 13.56.

Fig. 13.56
figure 56

Full FIFO condition

Note the read pointer indicates the next location to be read and the write pointer indicates the next position to be written. Note also that addresses in a FIFO are always incremented. Thus, if we write a word and increment the write pointer, and then find that the write pointer is equal to the read pointer, then the FIFO is full. This is because the next position to be written is also the next position to be read, which indicates such position is still unread. There can be no other randomly placed clear positions in the FIFO because both reading and writing take place sequentially by incrementing the address in order.

The other flag is the empty flag, Fig. 13.57. When a FIFO is empty, it has no locations carrying unread information. Thus, we have to wait for a new word to be written to the queue before we try to read. So, if we read a word, increment the read pointer, and then find that the read pointer is equal to the write pointer, the FIFO is empty. This is because the next location to be read is also the next location to be written, which means that location has already been read and has not had new data written to it. There can be no other locations carrying unread data because the FIFO reads and writes in order.

Fig. 13.57
figure 57

Empty FIFO condition

Note that the condition for empty and full is the same: the read and write pointers are equal. The only distinguishing feature between the two is the action that leads to this equality. If we increment the read pointer and find equality, the FIFO is empty. If we increment the write pointer and find equality, the FIFO is full. Thus, the last action before equality decides the state.
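The pointer rules can be captured in a toy single-clock model (the class and names are illustrative, not from the text):

```python
class Fifo:
    """Toy FIFO illustrating the pointer-equality full/empty rule.
    The flags are decided by the last action, as described in the text."""
    def __init__(self, depth):
        self.mem = [None] * depth
        self.depth = depth
        self.rd = self.wr = 0
        self.empty, self.full = True, False

    def write(self, word):
        assert not self.full, "write to full FIFO"
        self.mem[self.wr] = word
        self.wr = (self.wr + 1) % self.depth
        self.empty = False
        self.full = (self.wr == self.rd)   # equality after a write -> full

    def read(self):
        assert not self.empty, "read from empty FIFO"
        word = self.mem[self.rd]
        self.rd = (self.rd + 1) % self.depth
        self.full = False
        self.empty = (self.rd == self.wr)  # equality after a read -> empty
        return word

f = Fifo(4)
for w in range(4):
    f.write(w)
print(f.full)    # True: the write made the pointers equal
print(f.read())  # 0
print(f.full)    # False again after one read
```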

An asynchronous FIFO is a FIFO where the read and write ports use different clocks. Figure 13.58 shows an asynchronous FIFO used to get data across two clock domains. The transmitter handles the write port. The receiver controls the read port. Thus, the transmitter dumps data into the queue while the receiver reads it out.

Fig. 13.58
figure 58

Asynchronous FIFO used to communicate between clock domains. The two rectangles below the empty and full flag calculating blocks are two register synchronizers

The read pointer is generated by the receiver on clk2. The write pointer is generated by the transmitter on clk1. Calculating the empty and full flags requires a comparison (subtraction) of the read and write pointers. However, these two pointers are generated in different clock domains.

As shown in Fig. 13.58, to calculate the full flag, the read pointer is synchronized to clk1 by a two register synchronizer. This is then subtracted from the write pointer, to produce the flag. Similarly, to calculate the empty flag, the write pointer is synchronized to clk2 by a two register synchronizer before the receiver calculates the empty flag.

The synchronization allows the empty and full flags to be calculated reliably without being affected by metastability. However, synchronization randomly takes between 1 and 2 cycles to produce a result; we have to check that such a random delay would not cause a problem. A “problem” in a FIFO occurs if the receiver tries to read a word it has already read, i.e., if it reads from a clear position. It is also problematic if the transmitter tries to write to a location that has not been read, causing such data to be overwritten and lost.

Both problems described above are a result of missing an empty or full flag. So we have to check if this would happen due to the single cycle uncertainty in synchronizers.

The empty flag is raised when a read happens and the read and write pointers become equal. The empty flag is calculated at the receiver. The read pointer is generated by the receiver. The write pointer comes from the transmitter and has to be synchronized to the receiver. Thus if there is synchronization delay while calculating the empty flag, it is due to the write pointer rather than the read pointer.

If we increment the read pointer and find equality, the empty flag is raised. But what if the write pointer has updated and is still being delayed in the synchronizer? In that case, the write pointer we compare against is lower than the actual write pointer. Due to the synchronization delay, we see an equality when we should not, and we raise the empty flag when we should not. So we cause the receiver to skip reading for a cycle when it could have safely read. This can be called conservative or wasteful, but it is certainly not a failure. A failure would be to read when we should not have read.

Similarly, the full flag is calculated at the transmitter. It is triggered by an increment in the write pointer. The write pointer is generated on clk1. We synchronize the read pointer to clk1 to perform the comparison.

Due to the synchronization uncertainty, we could declare a full condition when we did not have to. This would cause the transmitter to stop writing when it could have continued to. Thus, again this is conservative, rather than a failure.
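This conservative behavior can be demonstrated with a stale-pointer sketch. Here the two-register synchronizer is modeled as a two-cycle delay on the write pointer, and pointer wrap-around is ignored for simplicity:

```python
from collections import deque

written = 0                        # true write pointer (transmitter domain)
wr_pipe = deque([0, 0], maxlen=2)  # write pointer crossing into the receiver
read = 0
stalls = 0

for cycle in range(20):
    wr_pipe.append(written)
    wr_seen = wr_pipe[0]           # receiver's view: stale by two cycles
    if cycle % 2 == 0:
        written += 1               # transmitter writes every other cycle
    if read != wr_seen:            # "not empty" judged from the stale pointer
        assert read < written      # never reads a location not yet written
        read += 1                  # safe read
    else:
        stalls += 1                # conservative stall: FIFO merely *looked* empty

print(read, stalls)
```

The assertion never fires: because the synchronized pointer can only lag reality, the receiver occasionally stalls but never reads unwritten data.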

The asynchronous FIFO is extremely efficient at transmitting long bursts of data. Its only delay overhead occurs when the empty or full flags are raised. In conditions where the receiver is much faster than the transmitter, the empty flag is raised often. In conditions where the transmitter is much faster than the receiver, the full flag is raised often. The latter case can be mitigated by increasing the size of the FIFO. In conditions where the two clocks are close to each other in frequency, the communication can be very efficient, with the empty or full flags raised rarely.

The read and write pointers are both multi-bit words. Yet we still pass them through two register synchronizers. We discussed earlier in this section that buses should never be passed through synchronizers because the one cycle uncertainty of synchronization affects different bits differently.

However, the read and write pointers are always Gray encoded. This means that they do not increment in binary order, but rather in a Gray-code order where exactly one bit changes per increment. Thus, the synchronizers are effectively only ever synchronizing a single changing bit, and they can be safely used.
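The single-bit-per-increment property is easy to verify. A minimal sketch using the standard binary-reflected Gray code (the conversion function is the usual one, not specific to this text):

```python
def gray(n: int) -> int:
    """Standard binary-reflected Gray encoding."""
    return n ^ (n >> 1)

# 4-bit pointer sequence, including the wrap from 15 back to 0
codes = [gray(i % 16) for i in range(17)]
for a, b in zip(codes, codes[1:]):
    assert bin(a ^ b).count("1") == 1   # exactly one bit flips per increment

print([format(c, "04b") for c in codes[:4]])  # ['0000', '0001', '0011', '0010']
```

Note that the wrap-around also changes only one bit, which is why the pointer depth must be a power of two for this encoding to remain safe across the wrap.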