Monday, May 22, 2023

3 Ways to Remove Duplicates from a table in SQL - Query Example

There are a couple of ways to remove duplicate rows from a table in SQL e.g. you can use temp tables or a window function like row_number() to generate artificial ranking and remove the duplicates. By using a temp table, you can first copy all unique records into a temp table and then delete all data from the original table and then copy unique records again to the original table. This way, all duplicate rows will be removed, but with large tables, this solution will require additional space of the same magnitude as the original table. The second approach doesn't require extra space as it removes duplicate rows directly from the table. It uses a ranking function like row_number() to assign a row number to each row.

By using partition by clause you can reset the row numbers on a particular column. In this approach, all unique rows will have row number = 1 and duplicate rows will have row_number > 1, which gives you an easy option to remove those duplicate rows. You can do that by using a common table expression (see T-SQL Fundamentals) or without it on Microsoft SQL Server.

No doubt that SQL queries are an integral part of any programming job interview which requires database and SQL knowledge. The queries are also very interesting to check the candidate's logical reasoning ability.

Earlier, I have shared a list of frequently asked SQL queries from interviews and this article is an extension of that. I have shared a lot of good SQL-based problems on that article and users have also shared some excellent problems in the comments, which you should look at.

Btw, this is the follow-up question of another popular SQL interview question, how do you find duplicate records in a table, which we have discussed earlier. This is an interesting question because many candidates confuse themselves easily.

Some candidate says that they will find duplicate by using group by and printing name which has counted more than 1, but when it comes to deleting this approach doesn't work, because if you delete using this logic both duplicate and unique row will get deleted.

This little bit of extra detail like row_number makes this problem challenging for many programmers who don't use SQL on a daily basis. Now, let's see our solution to delete duplicate rows from a table in SQL Server.

By the way, if you are new to Microsoft SQL Server and T-SQL then I also suggest you join a comprehensive course to learn SQL Server fundamentals and how to work with T-SQL. If you need a recommendation then I suggest you go through the Microsoft SQL for Beginners online course by Brewster Knowlton on Udemy. It's a great course to start with T-SQL and SQL queries in SQL Server.





3 Ways to Remove duplicate values from a table using SQL Query

Before exploring a solution, let's first create the table and populate it with test data to understand both problems and solutions better. I am using a temp table to avoid leaving test data into the database once we are done. Since temp tables are cleaned up once you close the connection to the database, they are best suited for testing.

In our table, I have just one column for simplicity, if you have multiple columns then the definition of duplicate depends on whether all columns should be equal or some key columns e.g. name and city can be the same for two unique persons. In such cases, you need to extend the solution by using those columns on key places e.g. on a distinct clause in the first solution and on the partition by in the second solution.

Anyway, here is our temp table with test data, it is carefully constructed to have duplicates, you can see that C++ is repeated thrice while Java is repeated twice in the table.

-- create a temp table for testing
create table #programming (name varchar(10));

-- insert data with duplicate, C++ is repeated 3 times, while Java 2 times
insert into #programming values ('Java');
insert into #programming values ('C++');
insert into #programming values ('JavaScript');
insert into #programming values ('Python');
insert into #programming values ('C++');
insert into #programming values ('Java');
insert into #programming values ('C++');

-- cleanup
drop table #programming




1. How to remove duplicate in SQL using temp table  - Example

Yes, this is the most simple but logical way to remove duplicate elements from a table and it will work across databases like MySQL, Oracle, or SQL Server. The idea is to copy unique rows into a temp table. You can find unique rows by using a distinct clause. 

Once unique rows are copied, delete everything from the original table and then copy unique rows again. This way, all the duplicate rows have been removed as shown below.

-- removing duplicate using copy, delete and copy
select distinct name into #unique from #programming
delete from #programming;
insert into #programming select * from #unique

-- check after
select * from #programming

name
Java
C++
JavaScript
Python

You can see the duplicate occurrences of Java and C++ have been removed from the #programming temp table. 




2. Delete Duplicates using row_number() and derived table - Example

The row_number() is one of several ranking functions provided by SQL Server, It also exists in the Oracle database. You can use this function to provide ranking to rows. You can further use partition to tell SQL server that what would be the window. 

This way row number will restart as soon as a different name comes up but for the same name, all rows will get sequential numbers e.g. 1, 2, 3, etc. Now, it's easy to spot the duplicates in the derived table as shown in the following example:

select * from (select *, row_number() 
OVER ( partition by name order by name) as rn
 from #programming) dups 
name rn
C++ 1
C++ 2
C++ 3
Java 1
Java 2
JavaScript 1
Python 1

Now, you can remove all the duplicates which are nothing but rows with rn > 1, as done by following SQL query:

delete dups 
from (select *, row_number() 
over ( partition by name order by name) as rn
from #programming) 
dups 
WHERE rn > 1

(3 row(s) affected)

now, if you check the #programming table again there won't be any duplicates.

select * from #programming
name
Java
C++
JavaScript
Python

This is by far the simplest solution and also quite easy to understand but it doesn't come to your mind without practicing. I suggest solving some SQL puzzles from Joe Celko's classic book, SQL Puzzles, and Answers, Second Edition to develop your SQL sense. It's a great practice book to learn and master SQL logic.




3. How to remove duplicates using CTE (Common Table Expression) - Example

The CTE stands for common table expression, which is similar to a derived table and used to the temporary result set that is defined within the execution scope of a single SELECT, INSERT, UPDATE, DELETE, or CREATE VIEW statement. Similar to a derived table, CTE is also not stored as an object and lasts only for the duration of the query. You can rewrite the previous solution using CTE as shown below:

;with cte
as (select row_number() 
over (partition by name order by(select 0)) rn
from #programming)
delete from cte where rn > 1

The logic is exactly similar to the previous example and I am using select 0 because it's arbitrary which rows to preserve in the event of a tie as both contents the same data. If you are new to CTE then I suggest reading T-SQL Fundamentals, one of the best books to learn SQL Server fundamentals.

Here is a nice summary of all three ways to remove duplicates from a  table using SQL:

How to remove duplicate rows of  table in SQL



That's all about how to remove duplicate rows from a table in SQL. As I said, this is one of the frequently asked SQL queries, so be prepared for that when you go for your programming job interview. 

I have tested the query in SQL Server 2008 and they work fine and you might need to tweak them a little bit depending upon the database you are going to use like MySQL, Oracle, or PostgreSQL. Feel free to post, if you face any issue while removing duplicates in Oracle, MySQL, or any other database.


Other Frequently asked SQL queries from Interviews
  • How to find the 2nd highest salary of an employee in SQL? (answer)
  • How to join three tables in one SQL query? (solution)
  • How do find all table names in a database? (query)
  • What is the difference between View and Materialized View in Database? (answer)
  • How do you create a backup of the table or copy of the table using SQL? (answer)
  • How do you find all customers who have never ordered? (solution)
  • Can you write a pagination query for Oracle using row_number? (query)
  • How do you find Nth highest salary of an employee using the correlated query? (solution)
  • 10 Frequently asked SQL Query interview questions (solution)
  • Write a SQL query to find all table names on a database in MySQL (solution)
  • Top 5 Websites to learn SQL online for FREE? (resource)
  • 5 Courses to learn Oracle and Microsoft SQL Server database (courses)
  • 4 ways to find the Nth highest salary in SQL (solution)
  • Difference between Self and Equi Join in SQL? (answer)
  • 5 Free Courses to learn Oracle and SQL Server? (courses)
  • Top 5 Courses to learn MySQL Database for Beginners (Courses)
  • Difference between clustered and non-clustered indexes in SQL? (answer)
  • Write a SQL query to copy or backup a table in MySQL (solution)
  • Difference between Primary and Candidate key in table? (answer)
  • 5 Free Courses to learn T-SQL and SQL Server for Beginners (Courses)
  • Difference between Unique and Primary key in table? (answer)
  • What is the difference between UNION and UNION ALL in SQL? (answer)

Thanks for reading this article so far. If you like this SQL tutorial to remove duplicates then please share it with your friends and colleagues If you have any questions or feedback then please drop a note. 

P.S. - If you are new to the SQL world and looking for free SQL and database courses to learn SQL fundamentals then you can also check out my list of free Udemy courses to learn SQL. These are really great SQL courses that are available for free on Udemy and Coursera and you can use them to build your SQL skills. 


3 comments: