How To Find Duplicate Values In A Table Using SQL – 2 Best Ways


Before directly jumping into the queries, you need to define criteria to find duplicate records in a table. There can be scenarios when certain values in a single column are duplicated or the entire record i.e. values in all the columns in a specific row are duplicated in the table.

You will explore both the possibilities and the ways to deal with such duplicated records in this quick read.

The easiest way to identify the duplicated records is to simply count how many times each record appears in the table. And the record which appears more than once is duplicated.

The function GROUP BY is widely used in SQL for data aggregation. It means you can group the records based on values in one or multiple columns and get aggregated values such as count, or sum of other columns.

Keeping this in mind, let’s explore how you can find out duplicated values in a single column.

Find duplicate values in one column

There can be situations when duplicate values are present only in one column. The reason for such duplicate records can be as simple as human error while making the data entry or updating the database.

Let’s take an example from the orders table and find out which OrderID are duplicated. As you need to count — how many times each OrderID appeared in the table — you should group the records by OrderID as shown below.

SELECT OrderID
, COUNT(*) as occurrences
FROM orders
GROUP BY OrderID
Records that appeared more than once in the table | Image by Author

The highlighted records (OrderIDs) occurred in the dataset more than once i.e. those are duplicated.

However, you don’t need to create separate columns as seen in the above picture. You can directly get the duplicated OrderIDs using the HAVING clause after GROUP BY, as shown below.

SELECT OrderID
FROM orders
GROUP BY OrderID
HAVING COUNT(*) > 1;
Duplicated records | Image by Author

So, you get only the duplicated OrderIDs which are the same as the highlighted ones in the above table.

Similarly, there can be situations when the values in the multiple columns for a row are duplicated in the table.

Find Duplicate Values In Multiple Columns

Although the entire row is duplicated across the table, the logic remains the same, only the columns you mention in the GROUP BY clause change.

Instead of grouping the records by a single column, here you need to group the records by multiple columns.

Let me show you how.

Suppose you would like to see the records where a combination of OrderID, Quantity, and Product_Category appeared multiple times in the table.

SELECT OrderID
, Quantity
, Product_Category
, COUNT(*) as occurrences
FROM orders
GROUP BY OrderID
, Quantity
, Product_Category
Find duplicated records in multiple columns | Image by Author

In this way, you can see that the highlighted combination of the values in the columns OrderID, Quantity, and Product_Category occurred in the table more than once.

Again you need to simply add HAVING COUNT(*) > 1 at the end of the query to retrieve these duplicated records.

As the process to find out duplicates depends on the count of the number of times a record appears in the table, you can use the window function ROW_NUMBER as well.

The window function ROW_NUMBER() assigns a unique sequential number to each record in the window defined using the PARTITION_BY clause.

So, you can actually define the window using the same columns, where you expect to have duplicated values. So, if a record appears multiple times, a row number of more than 1 will be assigned to the duplicated records.

Let’s continue with the same example.

To get the records where a combination of OrderID, Quantity, and Product_Category appeared multiple times in the table, you need to define a window using these columns in the PARTITION_BY clause as shown below.

SELECT OrderID
, Quantity
, Product_Category
, ROW_NUMBER() OVER (PARTITION BY OrderID, Quantity, Product_Category ORDER BY OrderID) AS row_num
FROM orders
Find duplicates in the table using ROW_NUMBER() in SQL | Image by Author

This is how you’ll get all the records and the corresponding row numbers partitioned by a given set of columns. So, all the highlighted records where row number is 2 are duplicated records.

You can pass the above entire query as a sub-query to the outer SELECT statement below to get only the duplicated records.

SELECT OrderID
, Quantity
, Product_Category
FROM (
SELECT OrderID
, Quantity
, Product_Category
, ROW_NUMBER() OVER (PARTITION BY OrderID, Quantity, Product_Category ORDER BY OrderID) AS row_num
FROM orders
) AS subquery
WHERE row_num > 1;
Get duplicated records in SQL | Image by Author

Alternatively, if you don’t want to use the sub-query, you can create a CTE and get the data from that CTE using another query as shown below.

WITH temp_orders AS
(
SELECT OrderID
, Quantity
, Product_Category
, ROW_NUMBER() OVER (PARTITION BY OrderID, Quantity, Product_Category ORDER BY OrderID) AS row_num
FROM orders
)

SELECT OrderID
, Quantity
, Product_Category
FROM temp_orders
WHERE row_num > 1;

This query will also return exactly the same output. So the choice is yours.

To learn more about the ROW_NUMBER(), CTE, and GROUP BY, don’t forget to check out interesting resources at the end of this read!



Source link

Leave a Comment