GreenPlum数据加载之外部表及gpfdist工具
之前介绍过使用copy工具向GreenPlum数据库中加载数据,现在介绍下另一种数据加载方式外部表,GP的外部表和ORACLE的外部表一样,都是数据 存储在数据库之外的表。GP的外部表除了可以加载本地的数据,还可以通过gpfdist工具并行加载数据。下面先测试下本地方式的外部表加载,本地方式的加载由于效率相对低下,基本已经被gpfdist的加载方式给取代,很少用了。先创建普通堆表和外部表。
dbdream=> create table med_ord_pgm_d(id integer,ord_id integer,brd_pgm_schd_id integer,prd_id integer,unt_prd_id integer,ord_acp_dtm timestamp without time zone,ord_acp_crr integer,cust_id integer) Distributed by (id); CREATE TABLE
必须拥有support权限的用户才能创建外部表,否则会遇到下面的错误。
dbdream=> create external table med_ord_pgm_d_ext dbdream-> (id integer,ord_id integer,brd_pgm_schd_id integer,prd_id integer,unt_prd_id integer,ord_acp_dtm timestamp without time zone,ord_acp_crr integer,cust_id integer) dbdream-> location('file://10.9.15.20:40000/home/gpadmin/MED_ORD_PGM_D.txt') dbdream-> format 'TEXT' (DELIMITER as ',' null as ''ESCAPE as 'OFF') dbdream-> ENCODING 'GB18030' LOG ERRORS INTO med_ord_pgm_d_err SEGMENT REJECT LIMIT 10; NOTICE: Error table "med_ord_pgm_d_err" does not exist. Auto generating an error table with the same name ERROR: must be superuser to create an external table with a file protocol
本地文件方式加载,file后面必须使用主机名,使用IP创建外部表可以成功,但是查询时会报错。下面简单分解下上面的SQL。
create external table med_ord_pgm_d_ext:这部分声明创建的是外部表及表的名字。
(id integer,ord_id integer,brd_pgm_schd_id integer,prd_id integer,unt_prd_id integer,ord_acp_dtm timestamp without time zone,ord_acp_crr integer,cust_id integer):这部分是声明表的字段结构。
location(‘file://10.9.15.20:40000/home/gpadmin/MED_ORD_PGM_D.txt’):这部分声明外部表的文件地址。
format ‘TEXT’ (DELIMITER as ‘,’ null as ”ESCAPE as ‘OFF’):这部分声明的是文件的类型,分隔符 以及空值的转义等信息。
ENCODING ‘GB18030’ LOG ERRORS INTO med_ord_pgm_d_err SEGMENT REJECT LIMIT 10:这部分声明的是文件的字符编码,错误日志表及最大允许的错误数据数,这个日志表如果不存在,会自动创建,并不需要认为创建。
dbdream=# create external table med_ord_pgm_d_ext dbdream-# (id integer,ord_id integer,brd_pgm_schd_id integer,prd_id integer,unt_prd_id integer,ord_acp_dtm timestamp without time zone,ord_acp_crr integer,cust_id integer) dbdream-# location('file://10.9.15.24:4000/home/gpadmin/MED_ORD_PGM_D.txt') dbdream-# format 'TEXT' (DELIMITER as ',' null as ''ESCAPE as 'OFF') dbdream-# ENCODING 'GB18030' LOG ERRORS INTO med_ord_pgm_d_err SEGMENT REJECT LIMIT 10; NOTICE: Error table "med_ord_pgm_d_err" does not exist. Auto generating an error table with the same name CREATE EXTERNAL TABLE dbdream=# select count(*) from med_ord_pgm_d_ext; ERROR: Could not assign a segment database for "file://10.9.15.24:4000/home/gpadmin/MED_ORD_PGM_D.txt". There isn't a valid primary segment database on host "10.9.15.24"
将IP改成主机名就不会遇到这个问题。
dbdream=# DROP EXTERNAL TABLE med_ord_pgm_d_EXT; DROP EXTERNAL TABLE dbdream=# create external table med_ord_pgm_d_ext dbdream-# (id integer,ord_id integer,brd_pgm_schd_id integer,prd_id integer,unt_prd_id integer,ord_acp_dtm timestamp without time zone,ord_acp_crr integer,cust_id integer) dbdream-# location('file://sdw1:40000/home/gpadmin/MED_ORD_PGM_D.txt') dbdream-# format 'TEXT' (DELIMITER as ',' null as ''ESCAPE as 'OFF') dbdream-# ENCODING 'GB18030' LOG ERRORS INTO med_ord_pgm_d_err SEGMENT REJECT LIMIT 10; CREATE EXTERNAL TABLE dbdream=# SELECT * FROM med_ord_pgm_d_EXT LIMIT 10; id | ord_id | brd_pgm_schd_id | prd_id | unt_prd_id | ord_acp_dtm | ord_acp_crr | cust_id ----+-----------+-----------------+--------+------------+---------------------+-------------+---------- 1 | 326517037 | 5368447 | 314340 | 77636 | 2014-12-17 14:08:04 | 75 | 5763803 2 | 326517038 | 5368447 | 314340 | 77636 | 2014-12-17 14:08:04 | 12644 | 8671242 3 | 326517038 | 5368447 | 350173 | 124756 | 2014-12-17 14:08:04 | 12644 | 8671242 4 | 326517038 | 5368447 | 349311 | 122596 | 2014-12-17 14:08:04 | 12644 | 8671242 5 | 326517039 | 5368447 | 314340 | 77636 | 2014-12-17 14:08:09 | 8675 | 3351629 6 | 326517040 | 5368447 | 314340 | 77636 | 2014-12-17 14:08:09 | 4012 | 10999173 7 | 326517041 | 5368447 | 314340 | 77636 | 2014-12-17 14:08:10 | 12644 | 5883748 8 | 326517042 | 5368447 | 314340 | 77636 | 2014-12-17 14:08:10 | 12644 | 11485331 9 | 326517042 | 5368447 | 350173 | 124756 | 2014-12-17 14:08:10 | 12644 | 11485331 10 | 326517042 | 5368447 | 349311 | 122596 | 2014-12-17 14:08:10 | 12644 | 11485331 (10 rows)
上面的测试,主机名后面跟的端口号,一直没弄清楚,按我的理解,既然这个文件是放到Segment节点上面,那么就应该是这个Segment节点的端口号,可是测试发现,这个端口号写成Master节点的端口号也行,不写也行,写成错的(随便蒙一个)也行。这样的话最合理的解释就是 这个端口号实际是从环境变量获取的,而不是从这个建表的命令获取的,可是经过查看,无论是Master节点还是Segment节点,都没有设置端口号相关的环境变量。那么就应该是从配置文件中获取的端口号信息。实验如下:
使用Master的端口号5432:
dbdream=# DROP EXTERNAL TABLE med_ord_pgm_d_EXT; DROP EXTERNAL TABLE dbdream=# create external table med_ord_pgm_d_ext dbdream-# (id integer,ord_id integer,brd_pgm_schd_id integer,prd_id integer,unt_prd_id integer,ord_acp_dtm timestamp without time zone,ord_acp_crr integer,cust_id integer) dbdream-# location('file://sdw1:5432/home/gpadmin/MED_ORD_PGM_D.txt') dbdream-# format 'TEXT' (DELIMITER as ',' null as ''ESCAPE as 'OFF') dbdream-# ENCODING 'GB18030' LOG ERRORS INTO med_ord_pgm_d_err SEGMENT REJECT LIMIT 10; CREATE EXTERNAL TABLE dbdream=# SELECT * FROM med_ord_pgm_d_EXT LIMIT 10; 1 | 326517037 | 5368447 | 314340 | 77636 | 2014-12-17 14:08:04 | 75 | 5763803 2 | 326517038 | 5368447 | 314340 | 77636 | 2014-12-17 14:08:04 | 12644 | 8671242 3 | 326517038 | 5368447 | 350173 | 124756 | 2014-12-17 14:08:04 | 12644 | 8671242 4 | 326517038 | 5368447 | 349311 | 122596 | 2014-12-17 14:08:04 | 12644 | 8671242 5 | 326517039 | 5368447 | 314340 | 77636 | 2014-12-17 14:08:09 | 8675 | 3351629 6 | 326517040 | 5368447 | 314340 | 77636 | 2014-12-17 14:08:09 | 4012 | 10999173 7 | 326517041 | 5368447 | 314340 | 77636 | 2014-12-17 14:08:10 | 12644 | 5883748 8 | 326517042 | 5368447 | 314340 | 77636 | 2014-12-17 14:08:10 | 12644 | 11485331 9 | 326517042 | 5368447 | 350173 | 124756 | 2014-12-17 14:08:10 | 12644 | 11485331 10 | 326517042 | 5368447 | 349311 | 122596 | 2014-12-17 14:08:10 | 12644 | 11485331 (10 rows)
不写端口号信息:
dbdream=# DROP EXTERNAL TABLE med_ord_pgm_d_EXT; DROP EXTERNAL TABLE dbdream=# create external table med_ord_pgm_d_ext dbdream-# (id integer,ord_id integer,brd_pgm_schd_id integer,prd_id integer,unt_prd_id integer,ord_acp_dtm timestamp without time zone,ord_acp_crr integer,cust_id integer) dbdream-# location('file://sdw1/home/gpadmin/MED_ORD_PGM_D.txt') dbdream-# format 'TEXT' (DELIMITER as ',' null as ''ESCAPE as 'OFF') dbdream-# ENCODING 'GB18030' LOG ERRORS INTO med_ord_pgm_d_err SEGMENT REJECT LIMIT 10; CREATE EXTERNAL TABLE dbdream=# SELECT * FROM med_ord_pgm_d_EXT LIMIT 10; 1 | 326517037 | 5368447 | 314340 | 77636 | 2014-12-17 14:08:04 | 75 | 5763803 2 | 326517038 | 5368447 | 314340 | 77636 | 2014-12-17 14:08:04 | 12644 | 8671242 3 | 326517038 | 5368447 | 350173 | 124756 | 2014-12-17 14:08:04 | 12644 | 8671242 4 | 326517038 | 5368447 | 349311 | 122596 | 2014-12-17 14:08:04 | 12644 | 8671242 5 | 326517039 | 5368447 | 314340 | 77636 | 2014-12-17 14:08:09 | 8675 | 3351629 6 | 326517040 | 5368447 | 314340 | 77636 | 2014-12-17 14:08:09 | 4012 | 10999173 7 | 326517041 | 5368447 | 314340 | 77636 | 2014-12-17 14:08:10 | 12644 | 5883748 8 | 326517042 | 5368447 | 314340 | 77636 | 2014-12-17 14:08:10 | 12644 | 11485331 9 | 326517042 | 5368447 | 350173 | 124756 | 2014-12-17 14:08:10 | 12644 | 11485331 10 | 326517042 | 5368447 | 349311 | 122596 | 2014-12-17 14:08:10 | 12644 | 11485331 (10 rows)
使用错误的端口号信息:
dbdream=# DROP EXTERNAL TABLE med_ord_pgm_d_EXT; DROP EXTERNAL TABLE dbdream=# create external table med_ord_pgm_d_ext dbdream-# (id integer,ord_id integer,brd_pgm_schd_id integer,prd_id integer,unt_prd_id integer,ord_acp_dtm timestamp without time zone,ord_acp_crr integer,cust_id integer) dbdream-# location('file://sdw1:50000/home/gpadmin/MED_ORD_PGM_D.txt') dbdream-# format 'TEXT' (DELIMITER as ',' null as ''ESCAPE as 'OFF') dbdream-# ENCODING 'GB18030' LOG ERRORS INTO med_ord_pgm_d_err SEGMENT REJECT LIMIT 10; CREATE EXTERNAL TABLE dbdream=# SELECT * FROM med_ord_pgm_d_EXT LIMIT 10; 1 | 326517037 | 5368447 | 314340 | 77636 | 2014-12-17 14:08:04 | 75 | 5763803 2 | 326517038 | 5368447 | 314340 | 77636 | 2014-12-17 14:08:04 | 12644 | 8671242 3 | 326517038 | 5368447 | 350173 | 124756 | 2014-12-17 14:08:04 | 12644 | 8671242 4 | 326517038 | 5368447 | 349311 | 122596 | 2014-12-17 14:08:04 | 12644 | 8671242 5 | 326517039 | 5368447 | 314340 | 77636 | 2014-12-17 14:08:09 | 8675 | 3351629 6 | 326517040 | 5368447 | 314340 | 77636 | 2014-12-17 14:08:09 | 4012 | 10999173 7 | 326517041 | 5368447 | 314340 | 77636 | 2014-12-17 14:08:10 | 12644 | 5883748 8 | 326517042 | 5368447 | 314340 | 77636 | 2014-12-17 14:08:10 | 12644 | 11485331 9 | 326517042 | 5368447 | 350173 | 124756 | 2014-12-17 14:08:10 | 12644 | 11485331 10 | 326517042 | 5368447 | 349311 | 122596 | 2014-12-17 14:08:10 | 12644 | 11485331 (10 rows)
下面测试下加载,打开时间记录器,看看需要多长时间可以加载完成。
dbdream=# truncate table med_ord_pgm_d; TRUNCATE TABLE dbdream=# \timing on Timing is on. dbdream=# insert into med_ord_pgm_d select * from med_ord_pgm_d_ext; INSERT 0 1000000 Time: 4040.802 ms
下面看下数据分布情况:
dbdream=# select gp_segment_id,count(*) from med_ord_pgm_d group by 1; gp_segment_id | count ---------------+-------- 2 | 333385 1 | 333259 0 | 333356 (3 rows)
数据分布相对来讲非常均匀,GP数据库是自动提交,不需要触发commit命令进行提交(想要回滚相对来讲就很麻烦,所以尽量不要在GP数据库做DML测试性的操作)。
在看下gpfdist工具,gpfdist工具可以实验并行加载,需要先启动gpfdist进程及监听端口,这个命令在Master和Segment节点的GPHOME/bin目录下,如果配置了GP的环境变量,可以直接使用,如果在没有安装GP的服务器上使用gpfdist工具,只需要将gpfdist命令的文件拷贝到相应的服务器上即可使用。
[gpadmin@mdw ~]$ nohup gpfdist –d /home/gpadmin –p 1234 >/tmp/gpfdist.log & [1] 8532
如上命令,启动gpfdist进程,扫描路径为/home/gpadmin,监听端口为1234,下面创建一张基于gpfdist工具的外部表。
dbdream=# DROP EXTERNAL TABLE med_ord_pgm_d_EXT; DROP EXTERNAL TABLE dbdream=# create external table med_ord_pgm_d_ext dbdream-# (id integer,ord_id integer,brd_pgm_schd_id integer,prd_id integer,unt_prd_id integer,ord_acp_dtm timestamp without time zone,ord_acp_crr integer,cust_id integer) dbdream-# location('gpfdist://10.9.15.24:1234/MED_ORD_PGM_D.txt') dbdream-# format 'TEXT' (DELIMITER as ',' null as ''ESCAPE as 'OFF') dbdream-# ENCODING 'GB18030' LOG ERRORS INTO med_ord_pgm_d_err SEGMENT REJECT LIMIT 10; CREATE EXTERNAL TABLE
gpfdist就没那么多限制,用IP地址也可以。
dbdream=# select count(*) from med_ord_pgm_d_ext; count --------- 1000000 (1 row)
下面测试加载速度,看看使用gpfdist工具加载同样的数据需要多长时间。
dbdream=# insert into med_ord_pgm_d select * from med_ord_pgm_d_ext; INSERT 0 1000000 Time: 2998.982 ms
通过测试可以看到,使用本地文件的方式加载,需要4040.802 ms,而使用gpfdist工具加载数据,需要2998.982 ms,如果测试数据更多,加载的文件更大,速度还会更加明显。
下面看看数据分布情况。
dbdream=# select gp_segment_id,count(*) from med_ord_pgm_d group by 1; gp_segment_id | count ---------------+-------- 2 | 333385 1 | 333259 0 | 333356 (3 rows)
数据分布情况和本地文件方式加载的分布情况一样。再看看查询10条数据情况。
dbdream=# select * from med_ord_pgm_d limit 10; id | ord_id | brd_pgm_schd_id | prd_id | unt_prd_id | ord_acp_dtm | ord_acp_crr | cust_id -------+-----------+-----------------+--------+------------+---------------------+-------------+---------- 15796 | 323420157 | 5245462 | 318671 | 82720 | 2014-07-18 10:57:50 | 2877 | 4921801 15798 | 323420158 | 5245462 | 318671 | 82720 | 2014-07-18 10:57:50 | 12644 | 4958159 15805 | 323420593 | 5245462 | 318671 | 82720 | 2014-07-18 11:15:58 | 1439 | 4195890 15807 | 323420602 | 0 | 313812 | 77000 | 2014-07-18 11:16:47 | 40356 | 11715315 15812 | 323420606 | 5245462 | 316211 | 79754 | 2014-07-18 11:17:11 | 11625 | 8827608 15813 | 327372795 | 5389503 | 346931 | 121116 | 2015-01-19 19:06:39 | 5539 | 13444588 15814 | 327372795 | 5389503 | 350173 | 124756 | 2015-01-19 19:06:39 | 5539 | 13444588 15821 | 327372801 | 0 | 308537 | 69934 | 2015-01-19 19:06:45 | 44052 | 13238129 15823 | 327372802 | 5389459 | 349311 | 122596 | 2015-01-19 19:06:45 | 12644 | 355978 15824 | 327372803 | 5389459 | 319813 | 83797 | 2015-01-19 19:06:46 | 45562 | 2563974 (10 rows)
GP数据库查询数据,先扫描到的数据会直接返回,也就是多次查询的结果可能是不一样的,但是使用gpfdist工具加载,查询结果基本不会是像本地文件加载那样直接从第一天开始有序返回,因为本地加载没有使用并行,在加载的时候数据是从第一条开始有序插入的,而gpfdist工具加载数据是并行加载的,最先插入到数据库的数据并不一定是从第一条数据开始的。