Now that we are familiar with the basic anatomy of a Window Function, let’s explore the most commonly used one,

## ROW_NUMBER()

*ROW_NUMBER()* assigns a sequential integer number to each row of a table or a partition in case we are using *PARTITION BY* clause. The common syntax is,

ROW_NUMBER() OVER ([PARTITION BY clause] [ORDER BY clause])

Here is the sample data from the table *PRODUCTS, *it holds the data of a range of products available with the vehicle retailer.

`--query PRODUCTS table`

SELECT

*

FROM

PRODUCTS

LIMIT 10;

Let’s start basic,

`--assigns a row number to each row in a table`

SELECT

*,

ROW_NUMBER() OVER() AS ROW_NUM

FROM

PRODUCTS;

Here, *ROW_NUMBER()* has assigned a sequential integer number starting from 1 to each row of the table *PRODUCTS*. Let’s level it up a little by adding row numbers for each *PRODUCTCATEGORY, *for that we will have to use *PARTITION BY* clause.

`--row_number by productcategory`

SELECT

*,

ROW_NUMBER() OVER(PARTITION BY PRODUCTCATEGORY) AS ROW_NUM

FROM

PRODUCTS;

We have the 2 distinct *PRODUCTCATEGORY* available in the *PRODUCTS* table,

*ROW_NUMBER()* generated a sequential integer number for each row and *PARTITION BY* clause, broke down the resultset into buckets** **based on *PRODUCTCATEGORY*. So basically *ROW_NUMBER()* along with *OVER* and *PARTITION B*Y clause, generated a unique sequence of numbers for each *PRODUCTCATEGORY.*

Now, let’s utilize the *ORDER BY* clause as well. This was also one of the most asked interview questions at the entry/intermediate level. Say, we want to know the top 3 products from each *PRODUCTCATEGORY* with the highest quantity in stock available.

`--top 3 products with the highest quantity in each product category`

WITH PRODUCT_INVENTORY AS

(

SELECT

PRODUCTCATEGORY,

PRODUCTNAME,

QUANTITYINSTOCK,

ROW_NUMBER() OVER (PARTITION BY PRODUCTCATEGORY ORDER BY QUANTITYINSTOCK DESC) AS ROW_NUM

FROM

PRODUCTS

)SELECT

PRODUCTCATEGORY,

PRODUCTNAME,

QUANTITYINSTOCK,

ROW_NUM AS TOP_3_PRODUCTS

FROM

PRODUCT_INVENTORY

WHERE ROW_NUM <= 3;

Let’s first break down the entire query into two parts, as shown in the following image; first, we are creating a *PRODUCT_INVENTORY. *The table data will be divided into partitions/groups of each *PRODUCTCATEGORY*, ordered in descending order of the quantity available in stock for each product category. *ROW_NUMBER()* will then generate sequential integer numbers for each partition. The cool part here is, the row number for each row is sort of reset for each *PRODUCTCATEGORY*.

The above query will return the following result,

Now the second part of our query is pretty straightforward. It will utilize this result-set as an input and will pick up the top 3 products from each *PRODUCTCATEGORY* based on the condition *ROW_NUM ≤ 3*. The final result is as follows,

That leads us to the final result as,

## RANK()

As the name suggests, *RANK()* assigns a rank to each row of the table or each row in a partition. The general syntax is,

RANK() OVER ([PARTITION BY clause] [ORDER BY clause])

Continuing to our *PRODUCTS* table example, let us assign a rank to products based on the quantity available in stock in descending order, partitioned by *PRODUCTCATEGORY*.

`--generate rank for each product category`

SELECT

PRODUCTCATEGORY,

PRODUCTNAME,

QUANTITYINSTOCK,

RANK() OVER (PARTITION BY PRODUCTCATEGORY ORDER BY QUANTITYINSTOCK DESC) AS "RANK"

FROM

PRODUCTS;

I have limited the result-set for demo purpose. Now, Don’t get mixed up between *ROW_NUMBER() *and *RANK()*. The result-set for both of them may look alike; however, there is a difference. *ROW_NUMBER()* assigns a unique sequential integer number to each row of a table or in a partition; whilst, *RANK()* too generates a sequential integer number to each row of a table or in a partition but it assigns the same rank for rows with the same values.

Let’s understand this with an example, here is the sample data from the table *CUSTOMERS*,

`-- sample data from table customers`

SELECT

CUSTOMERID,

CUSTOMERNAME,

CREDITLIMIT

FROM

CUSTOMERS

LIMIT 10;

In the following demo, I have generated *ROW_NUMBER()* and *RANK()* for the table *CUSTOMERS*, ordered in descending order of their *CREDITLIMIT*.

`--row_number() and rank() comparison`

SELECT

CUSTOMERID,

CUSTOMERNAME,

CREDITLIMIT,

ROW_NUMBER() OVER (ORDER BY CREDITLIMIT DESC) AS CREDIT_ROW_NUM,

RANK() OVER (ORDER BY CREDITLIMIT DESC) AS CREDIT_RANK

FROM

CUSTOMERS;

I have limited the result-set for demo purpose. Highlighted in green is *ROW_NUMBER()* and highlighted in blue is *RANK().*

Now if you refer to 3 records highlighted in red, that’s where the result-set for both the function differs; *ROW_NUMBER()*generated a unique sequential integer number for all the rows. But on the other hand, *RANK()* assigned the same rank, ** 20**, for

*CUSTOMERID*

**and**

*239***since they have the same credit limit, which is**

*321***. Not just that, for the next record, that is**

*105000.00**CUSTOMERID*

**, it skipped rank**

*458***and assigned it rank**

*21***.**

*22*## DENSE_RANK()

Now if you are wondering, why do we need *DENSE_RANK()* if we are already equipped with *RANK()*? As we already witnessed in the previous example, *RANK()* generates the same rank for rows with the same values and then it skips the next consecutive rank* *(refer to the above image).

*DENSE_RANK()* is similar to the *RANK()* except for this one difference, it doesn’t skip any ranks when ranking the rows. The common syntax is,

DENSE_RANK() OVER ([PARTITION BY clause] [ORDER BY clause])

Going back to *CUSTOMERS* table, let’s compare result-set for both *RANK() *and *DENSE_RANK()*,

`--dense_rank() and rank() comparison`

SELECT

CUSTOMERID,

CUSTOMERNAME,

CREDITLIMIT,

RANK() OVER (ORDER BY CREDITLIMIT DESC) AS CREDIT_RANK,

DENSE_RANK() OVER (ORDER BY CREDITLIMIT DESC) AS CREDIT_DENSE_RANK

FROM

CUSTOMERS;

Similar to *RANK() *(highlighted in blue)*, DENSE_RANK() (*highlighted in green), generated the same rank for *CUSTOMERID* ** 239** and

*321**,*what

*DENSE_RANK()*did differently is instead of skipping the next consecutive number like

*RANK()*did, it maintained the sequence and assigned

**rank to**

*21**CUSTOMERID*

**.**

*458*## NTH_VALUE()

This is a bit different than what we have discussed so far. *NTH_VALUE()* will return the value of the *Nth* row from the expression in a specified window. The common syntax is,

NTH_VALUE(expression, N) OVER ([PARTITION BY clause] [ORDER BY clause] [ROW/RANGE clause])

*’N’* must be the positive integer value. If data doesn’t exist on *Nth* position, the function will return *NULL*. Here if you have noticed, we have an additional clause in the syntax which is the *ROW/RANGE *clause*.*

RAW/RANGE is part of the *Frame Clause* in Window Function which defines a subset within a window partition. *ROW/RANGE* defines the start and end points of this subset with respect to the current row, you take your current row’s location as a base point, and with that reference, you define your frame within the partition.

*ROWS*– This defines the beginning and end of the frame by specifying the number of rows that precede or follow the current row.*RANGE*– Contrary to the*ROWS*,*RANGE*specifies the range of values compared to the value of the current row to define a frame within the partition.

Generic syntax is,

{ROWS | RANGE} BETWEEN <frame_starting> AND <frame_ending>

Whenever you use *ORDER BY* clause, it sets the default frame as,

{ROWS/RANGE} BETWEENUNBOUNDED PRECEDINGANDCURRENT ROW

Without ORDER BY clause, the default frame as,

{ROWS/RANGE}BETWEENUNBOUNDED PRECEDINGANDUNBOUNDED FOLLOWING

It may all seem too much at the moment but no need to memorize the syntax and its meaning, just practice! You may read the detailed *Frame Clause* with a bunch of examples here,

Now let’s say, we need to find out *PRODUCTNAME *from each *PRODUCTCATEGORY *with 2nd highest buy-price,

`--productname for each productcategory with 2nd highest buy price`

SELECT

PRODUCTNAME,

PRODUCTCATEGORY,

BUYPRICE,

NTH_VALUE(PRODUCTNAME,2) OVER(PARTITION BY PRODUCTCATEGORY ORDER BY BUYPRICE DESC) AS SECOND_HIGHEST_BUYPRICE

FROM

PRODUCTS;

We have 2 more value functions similar to *NTH_VALUE()*; *FIRST_VALUE()* and *LAST_VALUE()*. As the name suggests, they return the highest(first) and lowest(last) values from an ordered list based on the user expression, respectively. The common syntax is,

FIRST_VALUE(expression) OVER ([PARTITION BY clause] [ORDER BY clause] [ROW/RANGE clause])

LAST_VALUE(expression) OVER ([PARTITION BY clause] [ORDER BY clause] [ROW/RANGE clause])

Similar to the above example, can you now find out the *PRODUCTNAME* with the highest and the lowest buy-price for each *PRODUCTCATEGORY *?

## NTILE()

Sometimes there are scenarios where you would want to arrange the rows within the partition to a certain number of groups or buckets. NTILE() is used for such purpose, it divides the ordered rows in the partition into a specific number of buckets. Each such bucket is assigned a group number starting from 1. It will try to create groups of equal size as possible. For each row, *NTILE()* function returns a group number representing the group to which that row belongs.

General syntax is,

NTILE(N) OVER ([PARTITION BY clause] [ORDER BY clause])

Where ’N’ a is a positive integer that defines the number of groups one wants to create.

Say, we want to segregate *PRODUCTCATEGORY -‘Cars’* such that we have a list of cars with high-range, mid-range, and low-range buy-price.

`--segregate the 'Cars' for high range, mid range and low range buy price`

SELECT

PRODUCTNAME,

BUYPRICE,

NTILE(3) OVER (ORDER BY BUYPRICE DESC) AS BUYPRICE_BUCKETS

FROM

PRODUCTS

WHERE

PRODUCTCATEGORY = 'Cars';

## LAG() & LEAD()

We often encounter scenarios where some sort of comparative analysis is required.* *e.g. comparing sales of the selected year with the previous year or the following year. Such comparisons are really useful when working with time series data and calculating differences across time.

*LAG() *pulls the data from the row that precedes the current row. If there is no preceding row then it returns *NULL*. The common syntax is,

LAG(expression, offset) OVER ([PARTITION BY clause] [ORDER BY clause])

*LEAD()* gets the data from the row that follows the current row. If there is no following row then it returns *NULL. *The common syntax is,

LEAD(expression, offset) OVER ([PARTITION BY clause] [ORDER BY clause])

Where *offset** *is optional and but when used its value must be either 0 or a positive integer,

- When specified as 0, then
*LAG()*and*LEAD()*evaluate the expression for the current row. - When omitted, 1 is considered a default value, which takes the immediately preceding or following row to the current row.

`--yearly total sales for each product category`

WITH YEARLY_SALES AS

(

SELECT

PROD.PRODUCTCATEGORY,

YEAR(ORDERDATE) AS SALES_YEAR,

SUM(ORDET.QUANTITYORDERED * ORDET.COSTPERUNIT) AS TOTAL_SALES

FROM

PRODUCTS PROD

INNER JOIN

ORDERDETAILS ORDET

ON PROD.PRODUCTID = ORDET.PRODUCTID

INNER JOIN

ORDERS ORD

ON ORDET.ORDERID = ORD.ORDERID

GROUP BY PRODUCTCATEGORY, SALES_YEAR

)SELECT

PRODUCTCATEGORY,

SALES_YEAR,

LAG(TOTAL_SALES) OVER (PARTITION BY PRODUCTCATEGORY ORDER BY SALES_YEAR) AS LAG_PREVIOUS_YEAR,

TOTAL_SALES,

LEAD(TOTAL_SALES) OVER (PARTITION BY PRODUCTCATEGORY ORDER BY SALES_YEAR) AS LEAD_FOLLOWING_YEAR

FROM YEARLY_SALES;

Here, we have first used *CTE (Common Table Expression)* to get the data of total sales for each *PRODUCTCATEGORY *year-wise. We then use this data with *LAG()* and *LEAD()* to fetch data of total sales partitioned by *PRODUCTCATEGORY* and ordered by *SALES_YEAR* for the previous and following calendar year respectively.