How to find duplicate records in a table is a popular SQL interview questions which have been asked as many times as difference
between truncate and delete in SQL or finding second highest salary of
employee. Both of these SQL queries are must know for any one who is appearing
on any programming an interview where some questions on database and SQL are expected. In order
to find duplicate records in the database
table you need to confirm the definition of duplicates, for example in below contact table
which is suppose to store name and phone number of the contact,
a record is considered to be duplicate if both name and phone number is the same
but unique if either of them varies.
Problems of duplicates in databases arise when you don't have a primary key or unique key on the database and that's why it's recommended to have a key column in table. Anyway its easy to find duplicate records in the table by using group by clause of ANSI SQL.
Group by clause is used to group data based upon any column or a number of columns. Here in order to locate duplicate records, we need to use group by clause on both name and phone as shown in the second SQL SELECT query example.
You can see in the first query that it listed Ruby as a duplicate record even though both Ruby has a different phone number because we only performed group by on name. Once you have grouped data you can filter out duplicates by using having clause.
Problems of duplicates in databases arise when you don't have a primary key or unique key on the database and that's why it's recommended to have a key column in table. Anyway its easy to find duplicate records in the table by using group by clause of ANSI SQL.
Group by clause is used to group data based upon any column or a number of columns. Here in order to locate duplicate records, we need to use group by clause on both name and phone as shown in the second SQL SELECT query example.
You can see in the first query that it listed Ruby as a duplicate record even though both Ruby has a different phone number because we only performed group by on name. Once you have grouped data you can filter out duplicates by using having clause.
Having clause is the counterpart of where
clause for aggregation queries. Just remember to provide a temporary name to count() data in
order to use them in having clause.
SQL Query to find duplicate records in a table in MySQL
In this section we will see SQL query which can be used to locate
duplicate records in table. As explained in previous section, definition of
duplicate depends upon business rules which must be used in group by clause. In
following query we have used SELECT
query to select all records from Contacts table. Here
James, Johnny, Harry and Ron are duplicated four times.
mysql> select * from Contacts;
+-------+----------+
| name |
phone |
+-------+----------+
| James | 80983243 |
| Johnny | 67543212 |
| Harry | 12341234 |
| Ron
| 44446666 |
| James | 80983243 |
| Johnny | 67543212 |
| Harry | 12341234 |
| Ron
| 44446666 |
| James | 80983243 |
| Johnny | 67543212 |
| Harry | 12341234 |
| Ron |
44446666 |
| James | 80983243 |
| Johnny | 67543212 |
| Harry | 12341234 |
| Ron |
44446666 |
| Ruby | 8965342 |
| Ruby | 6888342 |
+-------+----------+
18 rows in set (0.00 sec)
Following SELECT query will only find duplicates records based on the name
which might not be correct if two contacts of same but different numbers are
stored, as in the following result set Ruby is shown as duplicate which is
incorrect.
mysql> select name, count(name) from contacts group by name;
+-------+-------------+
| name | count(name) |
+-------+-------------+
| Harry | 4 |
| James | 4 |
| Johnny | 4 |
| Ron | 4 |
| Ruby | 2 |
+-------+-------------+
5 rows in set (0.00 sec)
This is the correct way of finding duplicate contacts at it look both
name and phone number and only print duplicate if both name and phone is
same.
mysql> select name, count(name) from contacts group by name, phone;
+-------+-------------+
| name | count(name) |
+-------+-------------+
| Harry | 4 |
| James | 4 |
| Johnny | 4 |
| Ron | 4 |
| Ruby | 1 |
| Ruby | 1 |
+-------+-------------+
having
clause in SQL query will filter duplicate records from unique records. As in the following query, it prints all duplicate records and how many
times they are duplicated in the table.
mysql> select name, count(name) as times
from contacts
group by name, phone having times>1;
+-------+-------+
| name | times |
+-------+-------+
| Harry | 4 |
| James | 4 |
| Johnny | 4 |
| Ron | 4 |
+-------+-------+
4 rows in set (0.00 sec)
That's all on how to find duplicate records in a table. These SQL queries
will work on all database like MySQL,
Oracle,
SQL Server and Sybase as it only uses ANSI SQL and doesn't use any database specific
feature. Another interesting SQL query interview question is "How to
delete duplicate records from the table" which we will see in another
post.
Other SQL and database articles you may find useful
Hi Javin.. Can you please write something about DB performance tuning as well as its being repetitvely asked to me and am unable to convince a great deal as I havent really worked on such scenarios
ReplyDeleteThere are plenty of other, more useful duplicate searches one can do for this dataset:
ReplyDeleteThe example: Duplicate by Name
Given example:
SELECT NAME
FROM CONTACTS
GROUP BY NAME
HAVING COUNT(1) > 1
Better option
SELECT *
FROM CONTACTS
WHERE NAME IN
(SELECT NAME
FROM CONTACTS
GROUP BY NAME
HAVING COUNT(1) > 1)
Any Column Duplicate:
SELECT C.NAME, NAME_DUPS.NAME_DUP,
C.PHONE, PHONE_DUPS.PHONE_DUP
FROM CONTACTS C
INNER JOIN
(SELECT NAME, COUNT(1) AS PHONE_DUP
FROM CONTACTS
GROUP BY NAME) PHONE_DUPS
ON C.NAME = PHONE_DUPS.NAME
INNER JOIN
(SELECT PHONE, COUNT(1) AS NAME_DUP
FROM CONTACTS
GROUP BY PHONE) NAME_DUPS
ON C.PHONE = NAME_DUPS.PHONE
WHERE NAME_DUPS.NAME_DUP > 1
OR PHONE_DUPS.PHONE_DUP > 1
Any Row Duplicate:
SELECT C.NAME,
C.PHONE,
COUNT(1) AS ROW_DUP
FROM CONTACTS
GROUP BY C.NAME,
C.PHONE
HAVING COUNT(1) > 1
If you just want unique data, there's two ways to write it:
1:
SELECT DISTINCT C.NAME,
C.PHONE
FROM CONTACTS
2:
SELECT C.NAME,
C.PHONE
FROM CONTACTS
GROUP BY C.NAME,
C.PHONE
Option two is easier on the server, and better for large queries, although people usually opt for the first option due to the simplicity of writing the code.
I really do feel like the post should be more complete (and documented) about the types and details of hunting duplicates.
I was on the fence about if I wanted to start my own blog about SQL, but this post has confirmed my ambitions to make more informed users.
/end SQL code rant
/drops mic
By Sam Hetchler
Nice post.
ReplyDeleteI have a question:
How to know if ALL values in a column are repeated
Hi,
ReplyDeleteHow can I delete the duplicate records and left with single record for each entry?
well But i have table with columns empid,emp_sal. it contains duplicates. i need only duplicate records whose salory > 5000.
ReplyDeletecan any one help me
To Rahul,
ReplyDeleteGreat question! If you'd rather never have duplicate data, you can go for distinct columns or a PRIMARY KEY.
But coming to your question, I do not know the exact query but you can store the one of the duplicates from one group into a view, delete all and go back to push the query from the view. There is something known as PL/SQL which even I need to learn, which I believe using loops can help you out in your scenario.
Anonymous, delete from employee where emp_sal>5000;
ReplyDeleteDuh!
To know whether all values in a column are repeated you can go for something like..
ReplyDeleteselect exists( select count(col) c from table where c = (select count(col) from table group by col));
@Anonymous, can you explain bit more about your solution? how does it will find if all values in a column are repeated?
ReplyDeletewhat about java programe that do the same
ReplyDelete