Window Functions: A Must Know for Data Engineers and Data Scientists | by Iffat Malik Gore | Jun, 2023


Now that we are familiar with the basic anatomy of a Window Function, let’s explore the most commonly used one,

ROW_NUMBER()

ROW_NUMBER() assigns a sequential integer number to each row of a table or a partition in case we are using PARTITION BY clause. The common syntax is,

ROW_NUMBER() OVER ([PARTITION BY clause] [ORDER BY clause])

Here is the sample data from the table PRODUCTS, it holds the data of a range of products available with the vehicle retailer.

--query PRODUCTS table
SELECT
*
FROM
PRODUCTS
LIMIT 10;
Image by author

Let’s start basic,

--assigns a row number to each row in a table
SELECT
*,
ROW_NUMBER() OVER() AS ROW_NUM
FROM
PRODUCTS;
Image by author

Here, ROW_NUMBER() has assigned a sequential integer number starting from 1 to each row of the table PRODUCTS. Let’s level it up a little by adding row numbers for each PRODUCTCATEGORY, for that we will have to use PARTITION BY clause.

--row_number by productcategory
SELECT
*,
ROW_NUMBER() OVER(PARTITION BY PRODUCTCATEGORY) AS ROW_NUM
FROM
PRODUCTS;
GIF by author

We have the 2 distinct PRODUCTCATEGORY available in the PRODUCTS table,

Image by author

ROW_NUMBER() generated a sequential integer number for each row and PARTITION BY clause, broke down the resultset into buckets based on PRODUCTCATEGORY. So basically ROW_NUMBER() along with OVER and PARTITION BY clause, generated a unique sequence of numbers for each PRODUCTCATEGORY.

Now, let’s utilize the ORDER BY clause as well. This was also one of the most asked interview questions at the entry/intermediate level. Say, we want to know the top 3 products from each PRODUCTCATEGORY with the highest quantity in stock available.

--top 3 products with the highest quantity in each product category
WITH PRODUCT_INVENTORY AS
(
SELECT
PRODUCTCATEGORY,
PRODUCTNAME,
QUANTITYINSTOCK,
ROW_NUMBER() OVER (PARTITION BY PRODUCTCATEGORY ORDER BY QUANTITYINSTOCK DESC) AS ROW_NUM
FROM
PRODUCTS
)

SELECT
PRODUCTCATEGORY,
PRODUCTNAME,
QUANTITYINSTOCK,
ROW_NUM AS TOP_3_PRODUCTS
FROM
PRODUCT_INVENTORY
WHERE ROW_NUM <= 3;

Image by author

Let’s first break down the entire query into two parts, as shown in the following image; first, we are creating a PRODUCT_INVENTORY. The table data will be divided into partitions/groups of each PRODUCTCATEGORY, ordered in descending order of the quantity available in stock for each product category. ROW_NUMBER() will then generate sequential integer numbers for each partition. The cool part here is, the row number for each row is sort of reset for each PRODUCTCATEGORY.

Image by author

The above query will return the following result,

Image by author

Now the second part of our query is pretty straightforward. It will utilize this result-set as an input and will pick up the top 3 products from each PRODUCTCATEGORY based on the condition ROW_NUM ≤ 3. The final result is as follows,

Image by author

That leads us to the final result as,

Image by author

RANK()

As the name suggests, RANK() assigns a rank to each row of the table or each row in a partition. The general syntax is,

RANK() OVER ([PARTITION BY clause] [ORDER BY clause])

Continuing to our PRODUCTS table example, let us assign a rank to products based on the quantity available in stock in descending order, partitioned by PRODUCTCATEGORY.

--generate rank for each product category
SELECT
PRODUCTCATEGORY,
PRODUCTNAME,
QUANTITYINSTOCK,
RANK() OVER (PARTITION BY PRODUCTCATEGORY ORDER BY QUANTITYINSTOCK DESC) AS "RANK"
FROM
PRODUCTS;
Image by author

I have limited the result-set for demo purpose. Now, Don’t get mixed up between ROW_NUMBER() and RANK(). The result-set for both of them may look alike; however, there is a difference. ROW_NUMBER() assigns a unique sequential integer number to each row of a table or in a partition; whilst, RANK() too generates a sequential integer number to each row of a table or in a partition but it assigns the same rank for rows with the same values.

Let’s understand this with an example, here is the sample data from the table CUSTOMERS,

-- sample data from table customers
SELECT
CUSTOMERID,
CUSTOMERNAME,
CREDITLIMIT
FROM
CUSTOMERS
LIMIT 10;
Image by author

In the following demo, I have generated ROW_NUMBER() and RANK() for the table CUSTOMERS, ordered in descending order of their CREDITLIMIT.

--row_number() and rank() comparison
SELECT
CUSTOMERID,
CUSTOMERNAME,
CREDITLIMIT,
ROW_NUMBER() OVER (ORDER BY CREDITLIMIT DESC) AS CREDIT_ROW_NUM,
RANK() OVER (ORDER BY CREDITLIMIT DESC) AS CREDIT_RANK
FROM
CUSTOMERS;
Image by author

I have limited the result-set for demo purpose. Highlighted in green is ROW_NUMBER() and highlighted in blue is RANK().

Now if you refer to 3 records highlighted in red, that’s where the result-set for both the function differs; ROW_NUMBER()generated a unique sequential integer number for all the rows. But on the other hand, RANK() assigned the same rank, 20, for CUSTOMERID 239 and 321 since they have the same credit limit, which is 105000.00. Not just that, for the next record, that is CUSTOMERID 458, it skipped rank 21 and assigned it rank 22.

DENSE_RANK()

Now if you are wondering, why do we need DENSE_RANK() if we are already equipped with RANK()? As we already witnessed in the previous example, RANK() generates the same rank for rows with the same values and then it skips the next consecutive rank (refer to the above image).

DENSE_RANK() is similar to the RANK() except for this one difference, it doesn’t skip any ranks when ranking the rows. The common syntax is,

DENSE_RANK() OVER ([PARTITION BY clause] [ORDER BY clause])

Going back to CUSTOMERS table, let’s compare result-set for both RANK() and DENSE_RANK(),

--dense_rank() and rank() comparison
SELECT
CUSTOMERID,
CUSTOMERNAME,
CREDITLIMIT,
RANK() OVER (ORDER BY CREDITLIMIT DESC) AS CREDIT_RANK,
DENSE_RANK() OVER (ORDER BY CREDITLIMIT DESC) AS CREDIT_DENSE_RANK
FROM
CUSTOMERS;
Image by author

Similar to RANK() (highlighted in blue), DENSE_RANK() (highlighted in green), generated the same rank for CUSTOMERID 239 and 321, what DENSE_RANK() did differently is instead of skipping the next consecutive number like RANK() did, it maintained the sequence and assigned 21 rank to CUSTOMERID 458.

NTH_VALUE()

This is a bit different than what we have discussed so far. NTH_VALUE() will return the value of the Nth row from the expression in a specified window. The common syntax is,

NTH_VALUE(expression, N) OVER ([PARTITION BY clause] [ORDER BY clause] [ROW/RANGE clause])

’N’ must be the positive integer value. If data doesn’t exist on Nth position, the function will return NULL. Here if you have noticed, we have an additional clause in the syntax which is the ROW/RANGE clause.

RAW/RANGE is part of the Frame Clause in Window Function which defines a subset within a window partition. ROW/RANGE defines the start and end points of this subset with respect to the current row, you take your current row’s location as a base point, and with that reference, you define your frame within the partition.

  • ROWS – This defines the beginning and end of the frame by specifying the number of rows that precede or follow the current row.
  • RANGE – Contrary to the ROWS, RANGE specifies the range of values compared to the value of the current row to define a frame within the partition.

Generic syntax is,

{ROWS | RANGE} BETWEEN <frame_starting> AND <frame_ending>

Image by author

Whenever you use ORDER BY clause, it sets the default frame as,

{ROWS/RANGE} BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

Without ORDER BY clause, the default frame as,

{ROWS/RANGE} BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

It may all seem too much at the moment but no need to memorize the syntax and its meaning, just practice! You may read the detailed Frame Clause with a bunch of examples here,

Now let’s say, we need to find out PRODUCTNAME from each PRODUCTCATEGORY with 2nd highest buy-price,

--productname for each productcategory with 2nd highest buy price
SELECT
PRODUCTNAME,
PRODUCTCATEGORY,
BUYPRICE,
NTH_VALUE(PRODUCTNAME,2) OVER(PARTITION BY PRODUCTCATEGORY ORDER BY BUYPRICE DESC) AS SECOND_HIGHEST_BUYPRICE
FROM
PRODUCTS;
Image by author

We have 2 more value functions similar to NTH_VALUE(); FIRST_VALUE() and LAST_VALUE(). As the name suggests, they return the highest(first) and lowest(last) values from an ordered list based on the user expression, respectively. The common syntax is,

FIRST_VALUE(expression) OVER ([PARTITION BY clause] [ORDER BY clause] [ROW/RANGE clause])

LAST_VALUE(expression) OVER ([PARTITION BY clause] [ORDER BY clause] [ROW/RANGE clause])

Similar to the above example, can you now find out the PRODUCTNAME with the highest and the lowest buy-price for each PRODUCTCATEGORY ?

NTILE()

Sometimes there are scenarios where you would want to arrange the rows within the partition to a certain number of groups or buckets. NTILE() is used for such purpose, it divides the ordered rows in the partition into a specific number of buckets. Each such bucket is assigned a group number starting from 1. It will try to create groups of equal size as possible. For each row, NTILE() function returns a group number representing the group to which that row belongs.

General syntax is,

NTILE(N) OVER ([PARTITION BY clause] [ORDER BY clause])

Where ’N’ a is a positive integer that defines the number of groups one wants to create.

Say, we want to segregate PRODUCTCATEGORY -‘Cars’ such that we have a list of cars with high-range, mid-range, and low-range buy-price.

--segregate the 'Cars' for high range, mid range and low range buy price
SELECT
PRODUCTNAME,
BUYPRICE,
NTILE(3) OVER (ORDER BY BUYPRICE DESC) AS BUYPRICE_BUCKETS
FROM
PRODUCTS
WHERE
PRODUCTCATEGORY = 'Cars';
Image by author

LAG() & LEAD()

We often encounter scenarios where some sort of comparative analysis is required. e.g. comparing sales of the selected year with the previous year or the following year. Such comparisons are really useful when working with time series data and calculating differences across time.

LAG() pulls the data from the row that precedes the current row. If there is no preceding row then it returns NULL. The common syntax is,

LAG(expression, offset) OVER ([PARTITION BY clause] [ORDER BY clause])

LEAD() gets the data from the row that follows the current row. If there is no following row then it returns NULL. The common syntax is,

LEAD(expression, offset) OVER ([PARTITION BY clause] [ORDER BY clause])

Where offset is optional and but when used its value must be either 0 or a positive integer,

  • When specified as 0, then LAG() and LEAD() evaluate the expression for the current row.
  • When omitted, 1 is considered a default value, which takes the immediately preceding or following row to the current row.
--yearly total sales for each product category
WITH YEARLY_SALES AS
(
SELECT
PROD.PRODUCTCATEGORY,
YEAR(ORDERDATE) AS SALES_YEAR,
SUM(ORDET.QUANTITYORDERED * ORDET.COSTPERUNIT) AS TOTAL_SALES
FROM
PRODUCTS PROD
INNER JOIN
ORDERDETAILS ORDET
ON PROD.PRODUCTID = ORDET.PRODUCTID
INNER JOIN
ORDERS ORD
ON ORDET.ORDERID = ORD.ORDERID
GROUP BY PRODUCTCATEGORY, SALES_YEAR
)

SELECT
PRODUCTCATEGORY,
SALES_YEAR,
LAG(TOTAL_SALES) OVER (PARTITION BY PRODUCTCATEGORY ORDER BY SALES_YEAR) AS LAG_PREVIOUS_YEAR,
TOTAL_SALES,
LEAD(TOTAL_SALES) OVER (PARTITION BY PRODUCTCATEGORY ORDER BY SALES_YEAR) AS LEAD_FOLLOWING_YEAR
FROM YEARLY_SALES;

Here, we have first used CTE (Common Table Expression) to get the data of total sales for each PRODUCTCATEGORY year-wise. We then use this data with LAG() and LEAD() to fetch data of total sales partitioned by PRODUCTCATEGORY and ordered by SALES_YEAR for the previous and following calendar year respectively.

Image by author



Source link

Leave a Comment