ORACLE删除重复数据的几种方法
这几天整理数据库中记录的图片路径信息,为了方便操作,将40多张表的DOC,VOL,EFILE三个字段的信息整合到一张表中,由于部分表之间这三个字段数据相同,这样就产生了一些重复的数据,(之前图片存储在文件系统上,数据库只存了图片的路径,整理之后要将图片存在数据库里),我整理了一下关于删除重复数据的方法。
第一种方法也是网上很常见的方法,如果数据量稍大一些,速度超慢,而且这中方法在ORACLE11g中通常会引发direct path read/write temp等待事件。
delete from test t where (t.doc,t.vol,t.efile) in (select doc,vol,efile from test group by doc,vol,efile having count(*)>1) and rowid not in(select min(rowi d) from test group by doc,vol,efile having count(*)>1);
这种方法的SQL执行计划COST通常都超大(数据量超大的话),以下是我在实验环境中摘取的SQL执行计划(以下所有测试都使用同一张TEST表,3764878条记录,大小为280M,在DOC,VOL,EFILE三个字段上建立了复合索引,EFILE字段有空值,DOC和VOL字段没有NOT NULL约束)。由于这个SQL跑了好几个小时都没跑完,这是中断后在AWR中摘取的没跑完执行计划,COST超大。
-------------------------------------------------------------------------- |Id| Operation | Name | Rows | Bytes |TempSpc| Cost(%CPU)| Time | -------------------------------------------------------------------------- | 0| DELETE STATEMENT| | | | | 149M(100) | | | 1| DELETE | TEST | | | | | | | 2| FILTER | | | | | | | | 3|HASH JOIN RIGHT SEMI| | 285K | 50M | 26M | 3861 (1)|00:00:47| | 4| VIEW |VW_NSO_1| 1 | 285K | 23M | 532 (4) |00:00:07| |*5| FILTER | | | | | | | | 6| SORT GROUP BY | | 285K| 23M| | 532 (4) |00:00:07| |*7|TABLE ACCESS FULL| TEST | 285K| 23M| | 515 (1) |00:00:07| | 8|TABLE ACCESS FULL| TEST | 285K| 26M| | 516 (1) |00:00:07| | 9| FILTER | | | | | | | |10| HASH GROUP BY | | 285K | 26M | | 532 (4)|00:00:07| |11|TABLE ACCESS FULL| TEST | 285K | 26M | | 515 (1)|00:00:07| ------------------------------------------------------------------------
第二种方法是第一种方法的简化版,但是需要执行N-1次(N是同一条记录的相同个数)
delete from test where rowid in(select min (rowid) from test group by doc,vol,efile having count(*)>1);
这种方法的执行事件缩短到了1分钟,和上一条SQL对比来说,可以说是飞快。但是这中方法的缺点是,可能需要执行多次。比如同一条记录有3条一样的,就需要执行N-1次也就是2次,下面是这条SQL的执行计划。
-------------------------------------------------------------------------- |Id| Operation | Name | Rows | Bytes |TempSpc| Cost(%CPU)| Time | -------------------------------------------------------------------------- | 0| DELETE STATEMENT| | 8833 | 577K| | 40632 (1)|00:08:08| | 1| DELETE | TEST | | | | | | | 2| NESTED LOOPS | | 8833 | 577K| | 40632 (1)|00:08:08| | 3| VIEW |VW_NSO_1| 176K| 2070K| | 30981 (2)|00:06:12| | 4| SORT UNIQUE | | 442| 9488K| | 30981 (2)|00:06:12| |*5| FILTER | | | | | | | | 6| SORT GROUP BY | | 442| 9488K| 271M| 30981 (2)|00:06:12| | 7|TABLE ACCESS FULL| TEST | 4168K| 218M| | 9508 (1)|00:01:55| | 8|TABLE ACCESS BY | TEST | 1| 55 | | 1 (0)|00:00:01| | USER ROWID| --------------------------------------------------------------------------
第三种方法的执行时间提升到了50秒,而且只需执行一次即可。
delete FROM TEST WHERE ROWID IN (SELECT A.ROWID FROM Test A,(SELECT doc,vol ,efile,MIN(ROWID) R_ID FROM Test GROUP BY doc,vol,efile HAVING COUNT(*) > 1 ) b WHERE A.doc = B.doc AND A.vol = B.vol and a.efile = b.efile AND A.ROWID > R_ID);
这个SQL是相同环境下我在执行计划里唯一看到INDEX的SQL,下面是执行计划:
-------------------------------------------------------------------------- |Id| Operation | Name | Rows | Bytes |TempSpc| Cost(%CPU)| Time | -------------------------------------------------------------------------- | 0| DELETE STATEMENT| | 1 | 70 | | 50781 (1)|00:10:10| | 1| DELETE | TEST | | | | | | | 2| NESTED LOOPS | | 1 | 70 | | 50781 (1)|00:10:10| | 3| VIEW |VW_NSO_1| 8833 | 103K| | 50779 (1)|00:10:10| | 4| SORT UNIQUE | | 1 | 1345K| | | | |*5| HASH JOIN | | 8833 | 1345K| 18M| 50779 (1)|00:10:10| | 6| VIEW | | 176K| 16M| | 28700 (2)|00:05:45| |*7| FILTER | | | | | | | | 8| SORT GROUP BY | | 176K| 9M| 244M| 28700 (2)|00:05:45| | 9|TABLE ACCESS FULL| TEST | 3533K| 195M| | 9500 (1)|00:01:55| |10|INDEX FAST |TEST_INDEX_1| 3533K| 195M| | 9415 (1)|00:01:53| |FULL SCAN | |11|TABLE ACCESS BY | TEST | 1 | 58 | | 1 (0)|00:00:01| | USER ROWID | --------------------------------------------------------------------------
第四种方法虽然也只需要执行一次,但是时间用了1分零2秒。
DELETE FROM Test A WHERE ROWID IN (SELECT ROWID FROM (SELECT DOC,VOL,EFILE, ROW_NUMBER() OVER (PARTITION BY DOC,VOL,EFILE ORDER BY ROWID) RN FROM TES T) B WHERE RN > 1);
这个SQL和上面的SQL相比,COST大了一些。
-------------------------------------------------------------------------- |Id| Operation | Name | Rows | Bytes |TempSpc| Cost(%CPU)| Time | -------------------------------------------------------------------------- | 0| DELETE STATEMENT| | 1 | 70 | | 75391 (1)|00:15:05| | 1| DELETE | TEST | | | | | | | 2| NESTED LOOPS | | 1 | 70 | | 75391 (1)|00:15:05| | 3| VIEW |VW_NSO_1| 3533K| 40M| | 59133 (1)|00:11:50| | 4| SORT UNIQUE | | 1 | 84M| | | | |*5| VIEW | | 3533K| 84M| | 59133 (1)|00:11:50| | 6| WINDOW SORT | | 3533K| 195M| 244M| 59133 (1)|00:11:50| | 7|TABLE ACCESS FULL| TEST | 3533K| 195M| | 9500 (1)|00:01:55| | 8|TABLE ACCESS | TEST | 1 | 58 | | 1 (0)|00:00:01| |BY USER ROWID | --------------------------------------------------------------------------
几乎以上操作都没有用到索引,这可能是因为这张表的这三个字段都不是NOT NULL的缘故,而且我在这3个字段上建立了复合索引,索引和表一样都是280M。
第五种方法和以上四种方法都不一样,可以用create table的方法实现。
SQL> create table tttt as select distinct * from test; 表已创建。 已用时间: 00: 01: 00.17 SQL> alter table test rename to test_2011-11-02.bak; 表已删除。 已用时间: 00: 00: 01.04 SQL> alter table tttt rename to test; 表已更改。 已用时间: 00: 00: 00.01
此种方法虽然看着挺繁琐,需要执行3条SQL,但是每条SQL都很简单,对于像本人一样对SQL不是很熟悉的人来说,利用此种方法可以很容易的实现删除重复数据的需求,此种方法正式环境不推荐使用。