A wise man once said in a conference presentation that if you put SQL on your resume and do not know analytic functions, you are lying. I can only agree. It would be similar to stating you know Windows and have never worked with a newer windows version than Windows 95.

I use analytic functions almost daily when developing. There are so many cases where they either are necessary to create a SQL solution at all (the alternative being a slow procedural solution instead) or at the very least make the SQL much more performant than not using analytic functions (often cases of many self-joins leading to multiple lookups of the same data).

The fantastic bit about analytic functions is that you can retrieve or reference values across rows – you are not restricted to values in the row itself when doing calculations. You can use different subclauses of analytic functions in different combinations to achieve this.

The basics of these subclauses, and how they work together, are shown in this chapter. The rest of Part 2 contains different use cases of analytic functions solving tasks that often would be hard without.

Sums of quantities

To showcase the different subclauses of an analytic function call, I’ll be using the orderlines table shown in Figure 11-1.

Figure 11-1
figure 1

Orderlines table of how much of each product is ordered by customers

The orderlines table contains how much is in order from customers for each of the beers in the products table. In the example queries of this chapter, I’ll join the two tables just to make it easier to spot the two different beers whose data I show in Listing 11-1.

Listing 11-1 Content of orderlines table for two beers

SQL> select   2     ol.product_id as p_id   3   , p.name        as product_name   4   , ol.order_id   as o_id   5   , ol.qty   6  from orderlines ol   7  join products p   8     on p.id = ol.product_id   9  where ol.product_id in (4280, 6600)  10  order by ol.product_id, ol.qty; P_ID  PRODUCT_NAME     O_ID  QTY 4280  Hoppy Crude Oil  423   60 4280  Hoppy Crude Oil  427   60 4280  Hoppy Crude Oil  422   80 4280  Hoppy Crude Oil  429   80 4280  Hoppy Crude Oil  428   90 4280  Hoppy Crude Oil  421   110 6600  Hazy Pink Cloud  424   16 6600  Hazy Pink Cloud  426   16 6600  Hazy Pink Cloud  425   24

I’ll make a lot of different sums of the qty column. With the basic ideas you can apply to most of the analytic functions, sum is just a handy example.

Analytic syntax

I’m sure you have seen Figure 11-2 in the SQL Reference Manual, showing that all analytic functions use the keyword over followed by parentheses surrounding an analytic clause.

Figure 11-2
figure 2

Basic analytic function syntax diagram

Many functions are aggregate functions when used without over and become analytic when you add over. The interesting bits happen within the analytic clause shown in Figure 11-3.

Figure 11-3
figure 3

The three parts that make up the analytic clause

The analytic clause has three parts:

  • query_partition_clause to split the data into partitions and apply the function separately to each partition

  • order_by_clause to apply the function in a specific order and/or provide the ordering that the windowing_clause depends upon

  • windowing_clause to specify a certain window (fixed or moving) of the ordered data in the partition

But you’ll notice that the three parts are all optional in the syntax diagram, so the analytic clause itself is allowed to be empty. Listing 11-2 shows what happens then.

Listing 11-2 The simplest analytic function call is a grand total

SQL> select   2     ol.product_id as p_id   3   , p.name        as product_name   4   , ol.order_id   as o_id   5   , ol.qty   6   , sum(ol.qty) over () as t_qty   7  from orderlines ol   8  join products p   9     on p.id = ol.product_id  10  where ol.product_id in (4280, 6600)  11  order by ol.product_id, ol.qty;

I’ve just taken Listing 11-1 and added line 6: a sum of the qty column as analytic function (recognizable by the over keyword) with an empty analytic clause. The output becomes:

P_ID  PRODUCT_NAME     O_ID  QTY  T_QTY 4280  Hoppy Crude Oil  423   60   536 4280  Hoppy Crude Oil  427   60   536 4280  Hoppy Crude Oil  422   80   536 4280  Hoppy Crude Oil  429   80   536 4280  Hoppy Crude Oil  428   90   536 4280  Hoppy Crude Oil  421   110  536 6600  Hazy Pink Cloud  424   16   536 6600  Hazy Pink Cloud  426   16   536 6600  Hazy Pink Cloud  425   24   536

The t_qty column simply contains the sum of all the qty values – not of the entire table, but of those rows that satisfy the where clause.

When executing a SQL statement, evaluation of analytic functions happens after the rows have been found (where clause evaluation) and also after any group by aggregation that may be in the statement. Therefore, analytic functions cannot be used in the where, group by, and having clauses. But they can be used in the order by clause, if you need to.

The empty analytic clause means that no partitioning has been defined, so there is just a single partition containing all the rows. Also no ordering and windowing have been defined, so the entire partition is the window on which the sum function is applied. Therefore it becomes the grand total.

Often, though, I’d like to apply the analytic function on smaller subsets, which I’ll show next.

Partitions

There are two ways to split the rows into smaller subsets for analytic functions, each serving different purposes. The first is partitioning with the query_partition_clause shown in Figure 11-4.

Figure 11-4
figure 4

Syntax diagram for the query_partition_clause

You can use one or more expressions to do the partitioning, where there will be created a partition for each distinct value in the expression(s). Each partition is completely separated, and the analytic function evaluated in one partition cannot see data in any other partition.

Note

You’ll see that Listing 11-3 is the same as Listing 11-2, only changed in the analytic function call. This goes for most of the examples in the chapter – if nothing else is indicated, they are copies of Listing 11-2 with just the changed function call shown.

I show a simple example of using partition by in Listing 11-3.

Listing 11-3 Creating subtotals by product with partitioning

...   6   , sum(ol.qty) over (   7        partition by ol.product_id   8     ) as p_qty ...

The analytic clause is no longer empty; I have added line 7 to create a partition for each beer, and the grand totals now apply within each partition only. This way p_qty is a grand total per product:

P_ID  PRODUCT_NAME     O_ID  QTY  P_QTY 4280  Hoppy Crude Oil  423   60   480 4280  Hoppy Crude Oil  427   60   480 4280  Hoppy Crude Oil  422   80   480 4280  Hoppy Crude Oil  429   80   480 4280  Hoppy Crude Oil  428   90   480 4280  Hoppy Crude Oil  421   110  480 6600  Hazy Pink Cloud  424   16   56 6600  Hazy Pink Cloud  426   16   56 6600  Hazy Pink Cloud  425   24   56

That’s nice, but I can be much more creative with the second form of splitting the data into subsets – windowing with the order_by_clause and windowing_clause.

Ordering and windows

For the order_by_clause syntax shown in Figure 11-5, the authors of the SQL Reference Manual have copied the syntax for the regular order by in a query.

Figure 11-5
figure 5

Syntax diagram for the order_by_clause

But it isn’t quite the truth. When you read the following description in the manual, it is explained that keyword siblings cannot be used, and you also cannot use position and c_alias for an analytic order by.

For some analytic functions, query_partition_clause and order_by_clause are all there are – the third subclause is unavailable. But for many, you also have the windowing_clause (Figure 11-6) available. To use windowing, you must have filled the order_by_clause.

Figure 11-6
figure 6

Syntax diagram for the windowing_clause

I’ll do a running total in Listing 11-4 by using both ordering and windowing.

Listing 11-4 Creating a running sum with ordering and windowing

...   6   , sum(ol.qty) over (   7        order by ol.qty   8        rows between unbounded preceding   9                 and current row  10     ) as r_qty ...  15  order by ol.qty;

Line 7 contains my order by and lines 8–9 my window specification. I specify that when the analytic sum is to be evaluated on a given row, the sum should be applied to a rolling window of all the preceding rows up to and including the current row. To see easily what happens, I change the order by in line 15 to match the order by in line 7, giving me an output with r_qty being a running sum of qty:

P_ID  PRODUCT_NAME     O_ID  QTY  R_QTY 6600  Hazy Pink Cloud  426   16   16 6600  Hazy Pink Cloud  424   16   32 6600  Hazy Pink Cloud  425   24   56 4280  Hoppy Crude Oil  427   60   116 4280  Hoppy Crude Oil  423   60   176 4280  Hoppy Crude Oil  422   80   256 4280  Hoppy Crude Oil  429   80   336 4280  Hoppy Crude Oil  428   90   426 4280  Hoppy Crude Oil  421   110  536

The qty of each row is added as the rows are processed in order, resulting in the running sum. When the ordering is not unique, whichever row the database happens to access first will be added first. The first two lines of the output might have shown o_id 424 before 426 instead, if the access plan had been such that 424 was accessed first.

I can change the order by in line 15 back to the same ordering as Listing 11-2 (and most other examples), ordering by product_id first, then qty:

...  15  order by ol.product_id, ol.qty;

Now my output is ordered differently, but the running sum is still calculated with the order by in the analytic sum, namely, qty alone. You’ll see, for example, that the two first lines of the previous output are now near the end, but o_id 426 still has a value of 16 in r_qty and o_id 424 a value of 32 and so on:

P_ID  PRODUCT_NAME     O_ID  QTY  R_QTY 4280  Hoppy Crude Oil  423   60   176 4280  Hoppy Crude Oil  427   60   116 4280  Hoppy Crude Oil  422   80   256 4280  Hoppy Crude Oil  429   80   336 4280  Hoppy Crude Oil  428   90   426 4280  Hoppy Crude Oil  421   110  536 6600  Hazy Pink Cloud  424   16   32 6600  Hazy Pink Cloud  426   16   16 6600  Hazy Pink Cloud  425   24   56

Having analytics applied in a different order than the output itself is a useful technique in a quite a few situations.

Tip

The lower half of Figure 11-6 shows the shortcut syntax. When you have a window that is rows between something and current row, you can simply use rows something, and it will default to using something as start row and current row as end row of the window. In Listing 11-4, I could have replaced lines 8–9 with a single line containing rows unbounded preceding. Personally I like to always use the between syntax, but you can use the shortcut if you like. It is only syntactical difference, and the result is identical.

Of course I can combine all three clauses in a single call, as I do it in Listing 11-5.

Listing 11-5 Combining partitioning, ordering, and windowing

...   6   , sum(ol.qty) over (   7        partition by ol.product_id   8        order by ol.qty   9        rows between unbounded preceding  10                 and current row  11     ) as p_qty ...

I partition in line 7 by product_id and order in line 8 by qty, so the window in lines 8–9 gives me a running sum for each beer, which the output shows nicely since I kept the usual query ordering of product_id, qty:

P_ID  PRODUCT_NAME     O_ID  QTY  P_QTY 4280  Hoppy Crude Oil  423   60   60 4280  Hoppy Crude Oil  427   60   120 4280  Hoppy Crude Oil  422   80   200 4280  Hoppy Crude Oil  429   80   280 4280  Hoppy Crude Oil  428   90   370 4280  Hoppy Crude Oil  421   110  480 6600  Hazy Pink Cloud  424   16   16 6600  Hazy Pink Cloud  426   16   32 6600  Hazy Pink Cloud  425   24   56

Windowing is very handy and often used for running totals, but the window can be much more flexible than that.

Flexibility of the window clause

The running totals in the previous two listings was up to and including current row, which is quite normal. But the window does not need to include the current row, as I show in Listing 11-6 that calculates running total of all previous rows.

Listing 11-6 Window with all previous rows

...   6   , sum(ol.qty) over (   7        partition by ol.product_id   8        order by ol.qty   9        rows between unbounded preceding  10                 and 1 preceding  11     ) as p_qty ...

In line 10, I replaced the current row with 1 preceding, meaning the window is all rows up to and including the row just before the current row:

P_ID  PRODUCT_NAME     O_ID  QTY  P_QTY 4280  Hoppy Crude Oil  423   60 4280  Hoppy Crude Oil  427   60   60 4280  Hoppy Crude Oil  422   80   120 4280  Hoppy Crude Oil  429   80   200 4280  Hoppy Crude Oil  428   90   280 4280  Hoppy Crude Oil  421   110  370 6600  Hazy Pink Cloud  424   16 6600  Hazy Pink Cloud  426   16   16 6600  Hazy Pink Cloud  425   24   32

You’ll notice that means that p_qty is null on the first row of each partition, since there are no preceding rows at that point.

Windows can also look ahead in the data rather than just look at the preceding rows. I can change the window specification of Listing 11-6 to a window starting at the current row and including all the following rows in the partition:

...   9        rows between current row  10                 and unbounded following ...

That gives me a reversed running total:

P_ID  PRODUCT_NAME     O_ID  QTY  P_QTY 4280  Hoppy Crude Oil  423   60   480 4280  Hoppy Crude Oil  427   60   420 4280  Hoppy Crude Oil  422   80   360 4280  Hoppy Crude Oil  429   80   280 4280  Hoppy Crude Oil  428   90   200 4280  Hoppy Crude Oil  421   110  110 6600  Hazy Pink Cloud  424   16   56 6600  Hazy Pink Cloud  426   16   40 6600  Hazy Pink Cloud  425   24   24

Again I do not only need to include the current row; I can also do a window of all rows yet to come:

...   9        rows between 1 following  10                 and unbounded following ...

The null value at the end of each partition indicates there are no rows following:

P_ID  PRODUCT_NAME     O_ID  QTY  P_QTY 4280  Hoppy Crude Oil  423   60   420 4280  Hoppy Crude Oil  427   60   360 4280  Hoppy Crude Oil  422   80   280 4280  Hoppy Crude Oil  429   80   200 4280  Hoppy Crude Oil  428   90   110 4280  Hoppy Crude Oil  421   110 6600  Hazy Pink Cloud  424   16   40 6600  Hazy Pink Cloud  426   16   24 6600  Hazy Pink Cloud  425   24

I can give the window bounds in both ends to sum, for example, the values from the previous row, the current row, and the following row:

...   9        rows between 1 preceding  10                 and 1 following ... P_ID  PRODUCT_NAME     O_ID  QTY  P_QTY 4280  Hoppy Crude Oil  423   60   120 4280  Hoppy Crude Oil  427   60   200 4280  Hoppy Crude Oil  422   80   220 4280  Hoppy Crude Oil  429   80   250 4280  Hoppy Crude Oil  428   90   280 4280  Hoppy Crude Oil  421   110  200 6600  Hazy Pink Cloud  424   16   32 6600  Hazy Pink Cloud  426   16   56 6600  Hazy Pink Cloud  425   24   40

Or I can make a window that is unbounded in both ends:

...   9        rows between unbounded preceding  10                 and unbounded following ...

But this makes little sense, as the totally unbounded window is the entire partition, which means that the order by clause actually does not make a difference to the output, which is the same as I got from Listing 11-3 that had no order by and no windowing clause:

P_ID  PRODUCT_NAME     O_ID  QTY  P_QTY 4280  Hoppy Crude Oil  423   60   480 4280  Hoppy Crude Oil  427   60   480 4280  Hoppy Crude Oil  422   80   480 4280  Hoppy Crude Oil  429   80   480 4280  Hoppy Crude Oil  428   90   480 4280  Hoppy Crude Oil  421   110  480 6600  Hazy Pink Cloud  424   16   56 6600  Hazy Pink Cloud  426   16   56 6600  Hazy Pink Cloud  425   24   56

So for the completely unbounded window, I recommend just skipping order by and windowing clause.

In the syntax diagram, you saw that a window could be specified using either rows between or range between. As I gave several examples of, a rows between window is determined by a number of rows before or after the current row. It is different with range between .

Windows on value ranges

If I want, I can specify a window not as “two rows before to two rows after the current row” but instead as “those rows where the value is from 20 less to 20 more than the value in the current row.” This I can do with range between like Listing 11-7 .

Listing 11-7 Range window based on qty value

...   6   , sum(ol.qty) over (   7        partition by ol.product_id   8        order by ol.qty   9        range between 20 preceding  10                  and 20 following  11     ) as p_qty ...

When I specify between 20 preceding and 20 following in lines 9–10, I ask that the window will contain those rows where the value is the same as the value in the current row plus/minus 20. But the value of what?

The value that range will use is the value of the column used in the order by in the analytic function. Therefore, in order to use range windows, the order by column must be a number or a date/timestamp.

The column I calculate the total of in the sum function does not have to be the same as the one I use for ordering and range, but in practice, it often is, giving me an output where you can see both third and fourth rows get a sum of 370, as it is the sum of all the rows in the partition with values between 80-20=60 and 80+20=100:

P_ID  PRODUCT_NAME     O_ID  QTY  P_QTY 4280  Hoppy Crude Oil  423   60   280 4280  Hoppy Crude Oil  427   60   280 4280  Hoppy Crude Oil  422   80   370 4280  Hoppy Crude Oil  429   80   370 4280  Hoppy Crude Oil  428   90   360 4280  Hoppy Crude Oil  421   110  200 6600  Hazy Pink Cloud  424   16   56 6600  Hazy Pink Cloud  426   16   56 6600  Hazy Pink Cloud  425   24   56

Even range windows do not have to include the current row value; I can also specify I want the window to contain those rows with a qty value between the current qty + 5 and the current qty + 25:

...   9        range between  5 following  10                  and 25 following ... P_ID  PRODUCT_NAME     O_ID  QTY  P_QTY 4280  Hoppy Crude Oil  423   60   160 4280  Hoppy Crude Oil  427   60   160 4280  Hoppy Crude Oil  422   80   90 4280  Hoppy Crude Oil  429   80   90 4280  Hoppy Crude Oil  428   90   110 4280  Hoppy Crude Oil  421   110 6600  Hazy Pink Cloud  424   16   24 6600  Hazy Pink Cloud  426   16   24 6600  Hazy Pink Cloud  425   24

Running totals can be performed with range windows as well:

...   9        range between unbounded preceding  10                  and current row ...

But notice how the running totals are identical for the rows that have same qty value:

P_ID  PRODUCT_NAME     O_ID  QTY  P_QTY 4280  Hoppy Crude Oil  423   60   120 4280  Hoppy Crude Oil  427   60   120 4280  Hoppy Crude Oil  422   80   280 4280  Hoppy Crude Oil  429   80   280 4280  Hoppy Crude Oil  428   90   370 4280  Hoppy Crude Oil  421   110  480 6600  Hazy Pink Cloud  424   16   32 6600  Hazy Pink Cloud  426   16   32 6600  Hazy Pink Cloud  425   24   56

Compare this output to the output of Listing 11-5, where the first two rows have values in p_qty of 60 and 120, respectively. Here they both have 120.

That is because of the nature of the range window, which gives a different meaning to the term current row. It no longer specifically means the current row, but rather the value of the current row. (In my opinion it would have been nice to use wording like current value for range windows, but that is unfortunately not supported syntax.)

So you see range windows using the current row can actually include following rows in case of value ties. This leads me to showing you a pitfall that is all too easy to fall into.

The danger of the default window

In Figure 11-3, you can see that it is possible to use order by without specifying a windowing clause. That leads to a default windowing clause, which might surprise you. In Listing 11-8, I show you the difference between the default, range between, and rows between.

Listing 11-8 Comparing running sum with default, range, and rows window

SQL> select   2     ol.product_id as p_id   3   , p.name        as product_name   4   , ol.order_id   as o_id   5   , ol.qty   6   , sum(ol.qty) over (   7        partition by ol.product_id   8        order by ol.qty   9        /* no window - rely on default */  10     ) as def_q  11   , sum(ol.qty) over (  12        partition by ol.product_id  13        order by ol.qty  14        range between unbounded preceding  15                  and current row  16     ) as range_q  17   , sum(ol.qty) over (  18        partition by ol.product_id  19        order by ol.qty  20        rows between unbounded preceding  21                 and current row  22     ) as rows_q  23  from orderlines ol  24  join products p  25     on p.id = ol.product_id  26  where ol.product_id in (4280, 6600)  27  order by ol.product_id, ol.qty;

I have three analytic function calls here:

  • Column def_q in lines 6–10 uses order by but leaves the windowing clause empty.

  • Column range_q in lines 11–16 uses the range between window for a running total.

  • Column rows_q in lines 17–22 uses the rows between window for a running total.

You see in the output that def_q and range_q are identical:

P_ID  PRODUCT_NAME     O_ID  QTY  DEF_Q  RANGE_Q  ROWS_Q 4280  Hoppy Crude Oil  423   60   120    120      60 4280  Hoppy Crude Oil  427   60   120    120      120 4280  Hoppy Crude Oil  422   80   280    280      200 4280  Hoppy Crude Oil  429   80   280    280      280 4280  Hoppy Crude Oil  428   90   370    370      370 4280  Hoppy Crude Oil  421   110  480    480      480 6600  Hazy Pink Cloud  424   16   32     32       16 6600  Hazy Pink Cloud  426   16   32     32       32 6600  Hazy Pink Cloud  425   24   56     56       56

Yes, if you have an order_by_clause, the default for the windowing_clause is range between unbounded preceding and current row.

I have seen many blog and forum posts showing a running total as something like sum(col1) over (order by col2) and leaving it at that. And when you test your code with this default window, often you get the result you expect, as the difference in output only occurs when there are duplicates in the values. So you might not spot the error until the code has gone into production.

Note

It is not just a problem when there are duplicate values. Even if your order by is unique, using default range between windows for running totals can potentially incur some overhead by evaluation of the analytic function, impacting performance. This is because rows between can be executed more optimally by the SQL engine, while range between requires the SQL engine to “look ahead” in the rows and see if possibly any following rows have the same value. For more detailed explanation of this, see a blog post I did a while back: www.kibeha.dk/2013/02/rows-versus-default-range-in-analytic.html.

In my opinion, the default ought to have been rows between, as in my experience, this is by far the most used window specification. It is very often I use rows between and only once in a rare while range between.

So my best practice rule of thumb is that whenever I have an order by clause, I always explicitly write the windowing clause, never relying on the default. Even for those rare cases where my window actually happens to be range between unbounded preceding and current row, I still write it explicitly. This tells the future me, or any developers maintaining my code in the future, that the range between is desired. If I see code where the windowing clause is absent, I always wonder if it is really meant to be range between or if it is simply a misunderstood copy-paste from a forum post.

This applies only to analytic functions that support the windowing clause, of course. And I also do not use it if my window is the entire partition, then I simply omit order by and windowing clause rather than write rows between unbounded preceding and unbounded following.

But even though Listing 11-5 adheres to this rule of thumb, there is another issue with it: the fact that it is possibly to get a different output from the same data in different executions of the code, because the rows with duplicate values might be in different order in the output depending on the access plan used by the optimizer.

This issue does not strictly influence the correctness of the solution, but users are liable to question the correctness when they observe different outputs (even if both outputs are correct). So I make it my best practice to make the combination of the columns used in partition by and order by unique in the analytic function (when using rows between, not applicable to range between). This makes the output deterministic, so the user can verify he gets the same result in each run.

Listing 11-9 represents both these best practices for doing running totals.

Listing 11-9 A best practice for a running sum

SQL> select   2     ol.product_id as p_id   3   , p.name        as product_name   4   , ol.order_id   as o_id   5   , ol.qty   6   , sum(ol.qty) over (   7        partition by ol.product_id   8        order by ol.qty, ol.order_id   9        rows between unbounded preceding  10                 and current row  11     ) as p_qty  12  from orderlines ol  13  join products p  14     on p.id = ol.product_id  15  where ol.product_id in (4280, 6600)  16  order by ol.product_id, ol.qty, ol.order_id;

In reality I am only interested in the qty ordering within each product_id partition (as in Listing 11-5), but the combination of those two columns is not unique, making the output nondeterministic. Therefore, I add order_id to both order by clauses (lines 8 and 16):

P_ID  PRODUCT_NAME     O_ID  QTY  P_QTY 4280  Hoppy Crude Oil  423   60   60 4280  Hoppy Crude Oil  427   60   120 4280  Hoppy Crude Oil  422   80   200 4280  Hoppy Crude Oil  429   80   280 4280  Hoppy Crude Oil  428   90   370 4280  Hoppy Crude Oil  421   110  480 6600  Hazy Pink Cloud  424   16   16 6600  Hazy Pink Cloud  426   16   32 6600  Hazy Pink Cloud  425   24   56

This ensures a deterministic output.

And in this case the statement can even execute using only a single sorting operation, since the columns in the analytic partition by followed by the columns in the analytic order by match the columns in the final order by in line 16. This enables the optimizer to skip the final ordering, as the analytic function evaluation has already ordered the data correctly.

Lessons learned

This chapter introduced the basic elements of the three subclauses of analytic functions. Although I’ve shown it specifically using the sum function, you can generalize to other analytic functions and use what you’ve learned about

  • Using partition by to split rows into parts where the analytic function is applied within each part separately.

  • Using the windowing clause in conjunction with order by to create moving windows of rows to calculate, for example, running totals.

  • Understanding that the default windowing clause is rarely a good match for your use case, so always using an explicit windowing clause is a good idea.

With a good understanding of these subclauses, you can make analytic functions solve many otherwise difficult tasks for you. The following chapters in this part of the book are dedicated to several such solutions.