Saturday, February 25, 2012

Removing duplicate data

I have a table that I need to remove duplicates from and the table includes
an identity column. The table contains 12 fields but our business rules are
that only 3 of the fields can make the record a duplicate. We would like to
keep the record with the min identity column. I have done this in the past,
but it was about 5 years ago and I did not insert the dups into another
table, I did it strictly with a query. Can anybody please help me out with
some code to take care of this or any suggestions on a better way of doing
this? So something along the lines of grouping the data by the fields that
we are interested in and then deleting the records where the identity column
is greater than the min identity column.
Thanks.This is the duplicates along with the
minimum id per group
SELECT MIN(id),Col1,Col2,Col3
FROM sometable
GROUP BY Col1,Col2,Col3
HAVING COUNT(*)>1
Join this to the original table
to give the ids of the duplicates
excluding the minimum id.
SELECT a.id
FROM sometable a
INNER JOIN (
SELECT MIN(id),Col1,Col2,Col3
FROM sometable
GROUP BY Col1,Col2,Col3
HAVING COUNT(*)>1) b(id,Col1,Col2,Col3) ON b.id<>a.id
AND b.Col1=a.Col1
AND b.Col2=a.Col2
AND b.Col3=a.Col3
Put it all together to delete them
DELETE
FROM sometable
WHERE id IN (
SELECT a.id
FROM sometable a
INNER JOIN (
SELECT MIN(id),Col1,Col2,Col3
FROM sometable
GROUP BY Col1,Col2,Col3
HAVING COUNT(*)>1) b(id,Col1,Col2,Col3) ON b.id<>a.id
AND b.Col1=a.Col1
AND b.Col2=a.Col2
AND b.Col3=a.Col3
)|||Do you need to export the removed duplicates to another table?
Try something like this:
INSERT INTO RemDup (...) SELECT ... FROM MyTab AS t1 WHERE EXISTS (SELECT
NULL FROM MyTab AS t2 WHERE t1.Col1=t2.Col1 AND t1.Col2=t2.Col2 AND
t1.Col2=t2.Col2 AND t1.ID>t2.ID)
DELETE FROM MyTab AS t1 WHERE EXISTS (SELECT NULL FROM MyTab AS t2 WHERE
t1.Col1=t2.Col1 AND t1.Col2=t2.Col2 AND t1.Col2=t2.Col2 AND t1.ID>t2.ID)
HTH,
Axel Dahmen
"Andy" <Andy@.discussions.microsoft.com> wrote in message
news:32ACE9AA-67CE-454C-9711-956595CB7A8D@.microsoft.com...
> I have a table that I need to remove duplicates from and the table
includes
> an identity column. The table contains 12 fields but our business rules
are
> that only 3 of the fields can make the record a duplicate. We would like
to
> keep the record with the min identity column. I have done this in the
past,
> but it was about 5 years ago and I did not insert the dups into another
> table, I did it strictly with a query. Can anybody please help me out
with
> some code to take care of this or any suggestions on a better way of doing
> this? So something along the lines of grouping the data by the fields
that
> we are interested in and then deleting the records where the identity
column
> is greater than the min identity column.
> Thanks.|||Here's an example. This will delete all but 1 of the dupes. You will need to
change table / col name accordingly
DELETE FROM _Holding_table
WHERE EXISTS(SELECT NULL FROM _Holding_table s1
WHERE s1.PhoneNumber= _Holding_table.PhoneNumber
and s1.PK_Holding_TableID > _Holding_table.PK_Holding_TableID)
HTH. Ryan
"Andy" <Andy@.discussions.microsoft.com> wrote in message
news:32ACE9AA-67CE-454C-9711-956595CB7A8D@.microsoft.com...
>I have a table that I need to remove duplicates from and the table includes
> an identity column. The table contains 12 fields but our business rules
> are
> that only 3 of the fields can make the record a duplicate. We would like
> to
> keep the record with the min identity column. I have done this in the
> past,
> but it was about 5 years ago and I did not insert the dups into another
> table, I did it strictly with a query. Can anybody please help me out
> with
> some code to take care of this or any suggestions on a better way of doing
> this? So something along the lines of grouping the data by the fields
> that
> we are interested in and then deleting the records where the identity
> column
> is greater than the min identity column.
> Thanks.|||This is untested, perhaps you wrap this in a transaction first:
DELETE FROM SomeTable S
WHERE Idcolumn NOT IN
(
SELECT MIN(idcolumn)
FROM SOMETABLE S2
GROUP BY FirstColumn1,FirstColumn2,FirstColumn3
)
HTH, Jens Suessmeyer.

No comments:

Post a Comment