Removing a server: Removing duplicates

I have inherited a web app with the following table structure, and need to
produce a table without any duplicates. Email seems like the best unique
identifier - so only one of each e-mail address should be in the table.

Following http://www.sqlteam.com/item.asp?ItemID=3331 I have been able to
get a duplicate count working:

select Email, count(*) as UserCount
from dbo.Members
group by Email
having count(*) > 1
order by UserCount desc

But the methods for create a new table without duplicates fail. My code for
the 2nd method is:

sp_rename 'Members', 'temp_Members'

select distinct *
into Members
from temp_Members

Table...

CREATE TABLE [dbo].[Members] (
[MemberID] [int] IDENTITY (1, 1) NOT NULL ,
[Username] [varchar] (10) COLLATE Latin1_General_CI_AS NOT NULL ,
[Password] [varchar] (10) COLLATE Latin1_General_CI_AS NOT NULL ,
[Email] [varchar] (50) COLLATE Latin1_General_CI_AS NOT NULL ,
[Title] [varchar] (10) COLLATE Latin1_General_CI_AS NOT NULL ,
[FirstName] [varchar] (50) COLLATE Latin1_General_CI_AS NOT NULL ,
[Surname] [varchar] (50) COLLATE Latin1_General_CI_AS NOT NULL ,
[Address1] [varchar] (35) COLLATE Latin1_General_CI_AS NOT NULL ,
[Address2] [varchar] (35) COLLATE Latin1_General_CI_AS NOT NULL ,
[City] [varchar] (25) COLLATE Latin1_General_CI_AS NOT NULL ,
[Country] [varchar] (25) COLLATE Latin1_General_CI_AS NOT NULL ,
[Profession] [varchar] (50) COLLATE Latin1_General_CI_AS NOT NULL ,
[Publication] [varchar] (40) COLLATE Latin1_General_CI_AS NOT NULL ,
[DateAdded] [smalldatetime] NOT NULL ,
[SendMail] [smallint] NOT NULL
) ON [PRIMARY]
GO

Thanks B.Can I assume that memberid is unique? If so, try this:

INSERT INTO NewTable
(username, password, email, title, firstname, surname, address1,
address2, city, country, profession, publication, dateadded, sendmail)
SELECT username, password, email, title, firstname, surname, address1,
address2, city, country, profession, publication, dateadded, sendmail
FROM Members AS M1
JOIN
(SELECT MIN(memberid) AS memberid
FROM Members
GROUP BY email) AS M2
ON M1.memberid = M2.memberid

--
David Portas
----
Please reply only to the newsgroup
--|||Microsoft has a great article on this at
http://support.microsoft.com/defaul...4&Product=sql2k

I use a code block for doing this that I load as required, I only run
it when I have to. Once you've cleaned up your data, put a unique
constraint to prevent dupes from getting back in.

This is the code that I use for cleaning up dupes, you'll have to
modify the table and field names obviously. This config assumes you
are allowing Select/Into's.

/*
--
--WARNING! EXECUTE THIS SCRIPT WITH EXTREME CAUTION!,
--AND THEN ONLY ONE STEP AT A TIME!
--
--In the TotalTransit (DDS) database, this needs to run against
--ddsTrip and ttVoucher_Trip. Since the two tables have different
--layouts, the table that holds dupe rows must be dropped and
--recreated every time this process is run.
--
--There are now two seperate scripts, one for each table.
--
--Run the process one step at a time by highlighting each step.
--
--Wayne West, 10/13/03
-- WW 10/17/03 -- split into two procs, added more comments
--
------------------------
--
--this is here to help find if there are any dupes
select tripid, count(*)
from ddsTrip
group by tripid
having count(*) > 1

select tripid, count(*)
from ttVoucher_Trip
group by tripid
having count(*) > 1
--
------------------------
*/
/*
--Step 1 -- Verify there are dupes in the table
select tripid, count(*)
from ddsTrip
group by tripid
having count(*) > 1

--Step 2a -- Drop table that will hold dupe key values
drop table zzzholdkey

--Step 2b -- Collect key values, save in zzzHoldKey
select RecKey = tripid, KeyCount = count(*)
into zzzHoldKey
from ddsTrip
group by tripid
having count(*) > 1

--Step 3a -- Drop table that will hold one instance of duplicate rows
drop table zzzholddupes

--Step 3b -- Collect one instance of duplicate rows
select DISTINCT t.*
into zzzHoldDupes
from ddsTrip t, zzzholdkey hld
where t.tripid = hld.reckey

--Step 4 -- See if more than one field in addition to TripID is
duplicated
--This will indicate additional steps must be taken to clean up the
table.
select count(*), tripid
from zzzholddupes
group by tripid
having count(*) > 1

--Step 5 -- Delete ALL rows based on key value of duplicate records
delete ddsTrip
from ddsTrip t, zzzholdkey hld
where t.tripid = hld.reckey

--Step 6 -- Reinsert the row captured in Step 3b
insert ddsTrip
select *
from zzzholddupes

--You probably ought to rerun Step 1 to make sure the file is clean.
*/

Wednesday, March 7, 2012

Removing duplicates

No comments:

Post a Comment

Removing a server

Blog Archive

About Me