Tuesday, August 4, 2009

Deleting Duplicate Records - Single Query

I had a requirement where I wanted to delete all the duplicate email ids from my table.

My table structure is :

* EMAIL_ID uniqueidentifier PRIMARY KEY
* EMAIL varchar (128) Stores the email address


I write the following query to find out the duplicate records :

SELECT EMAIL, COUNT(*) DUPLICATES
FROM EMAILS
GROUP BY EMAIL
HAVING COUNT(*) > 1
ORDER BY DUPLICATES


Here's the delete Query:

Here's the delete Query:

DELETE FROM EMAILS
WHERE EMAIL_ID IN
(
SELECT EMAIL_ID FROM
(
SELECT EMAIL, EMAIL_ID
, RANK() OVER (PARTITION BY EMAIL ORDER BY EMAIL_ID) AS EMAIL_RANK
FROM EMAILS
WHERE EMAIL IN
(
SELECT EMAIL
FROM EMAILS
GROUP BY EMAIL
HAVING COUNT(*) > 1
)
) A
WHERE A.EMAIL_RANK != 1
)

There were lot of other approaches which I found, most of them suggested using cursors or using Set Row_Count 1 and then running the delete query iteratively. But I wanted to do it in one shot.

It took me less than a second to delete almost 1000 duplicate rows from around 10,000 rows in my table.

I am sure there would be better approaches, surely to avoid multiple IN clauses. Just in case you have one put in the comments.

I got to figure out one for Oracle as well.

No comments: