GreenPlum数据库创建表及表的数据分布
GreenPlum是一个分布式数据库,表的数据当然是分布到所有Segment节点上,那么怎么控制表中数据的分布呢?下面就来介绍下GreenPlum数据库的建表和表的数据分布。
GreenPlum数据库支持hash分布(DISTRIBUTED BY)和随机分布(DISTRIBUTED RANDOMLY)两种分布策略,hash分布会计算分布键的hash值,相似的hash值的数据会放到同一个Segment节点上,和hash分区算法有些类似,这样,同样的分布键的数据肯定会分布到同一个Segment节点,这样在多表关联的时候,如果关联字段都是分布键,就可以在每个Segment节点关联后,Segment节点把结果发送到Master节点,再由Master节点汇总,将最终的结果返还客户端。而随机分布则不能保证同样分布键的数据分布在同一个Segment节点上,这样在表关联的时候,就需要将数据发送到所有Segment节点去做运算,这样网络传输和大量数据运算都需要较长的时间,性能非常低下,GreenPlum数据库不建议需要多表关联的表使用随机分布,也不推荐使用随机分布。还有一点,因为每个Segment节点都是独立的PostgreSql数据库,只能保证在单个Segment几点的数据唯一性,而随机分布不能保证整体数据的唯一性,而hash分布,相同分布键的数据会分配到同一个Segment节点,这样在单个Segment节点保证数据唯一性,也就保证了整体数据的唯一性。hash分布也是GreenPlum数据库的默认分布方式。下面请看实验演示。
如果在建表时不指定数据的分布策略,默认使用hash分布。
dbdream=> create table t_hash (name varchar(10),id int); NOTICE: Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'name' as the Greenplum Database data distribution key for this table. HINT: The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew. CREATE TABLE
从上面的信息就可以看出,建表时如果不指定分布键和分布策略,默认使用第一个字段作为分布键,并且使用hash分布策略,也可以通过\d命令来查看表的分布信息。
dbdream=> \d t_hash Table "public.t_hash" Column | Type | Modifiers --------+-----------------------+----------- name | character varying(10) | id | integer | Distributed by: (name) Tablespace: "tbs1"
如果建表时不指定分布键和分布策略,也并不是都使用第一个字段作为分布键,如果表中包含主键,则默认会选择主键为分布键。
dbdream=> create table t_hash_1 (name varchar(10),id int primary key); NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "t_hash_1_pkey" for table "t_hash_1" CREATE TABLE dbdream=> \d t_hash_1 Table "public.t_hash_1" Column | Type | Modifiers --------+-----------------------+----------- name | character varying(10) | id | integer | not null Indexes: "t_hash_1_pkey" PRIMARY KEY, btree (id), tablespace "tbs1" Distributed by: (id) Tablespace: "tbs1"
如果表上有唯一约束但没有主键,那么默认选择唯一字段作为分布键。
dbdream=> create table t_hash_2 (name varchar(10),id int unique); NOTICE: CREATE TABLE / UNIQUE will create implicit index "t_hash_2_id_key" for table "t_hash_2" CREATE TABLE dbdream=> \d t_hash_2 Table "public.t_hash_2" Column | Type | Modifiers --------+-----------------------+----------- name | character varying(10) | id | integer | Indexes: "t_hash_2_id_key" UNIQUE, btree (id), tablespace "tbs1" Distributed by: (id) Tablespace: "tbs1"
那么就会有人问,如果同时存在主键和唯一键会选择哪个字段作为分布键呢?GreenPlum数据库不支持同时存在两个唯一的字段(复合主键除外)。
dbdream=> create table t_hash_3 (name varchar(10) unique,id int primary key); ERROR:Greenplum Database does not allow having both PRIMARY KEY and UNIQUE constraints
如果建表时指定的分布键不是主键,那么GreenPlum会把表的分布键改为主键字段,而不是指定的字段,但是SQL语句中指定的分布键必须包含主键字段,否则SQL会报错。
dbdream=> create table t_hash_3 (name varchar(10),id int primary key) DISTRIBUTED BY(name); ERROR: PRIMARY KEY and DISTRIBUTED BY definitions incompatible HINT: When there is both a PRIMARY KEY, and a DISTRIBUTED BY clause, the DISTRIBUTED BY clause must be equal to or a left-subset of the PRIMARY KEY
如果建表时指定的分布键不是主键,那么SQL中的分布键必须包含主键字段,SQL才能成功运行,并且把分布键改成主键字段,而不是SQL指定的分布键字段。而且主键字段必须在SQL指定的分布键的第一列出现才可以,否则也会遇到错误导致SQL无法成功运行。
dbdream=> create table t_hash_3 (name varchar(10),id int primary key) DISTRIBUTED BY(name,id); ERROR: PRIMARY KEY and DISTRIBUTED BY definitions incompatible HINT: When there is both a PRIMARY KEY, and a DISTRIBUTED BY clause, the DISTRIBUTED BY clause must be equal to or a left-subset of the PRIMARY KEY
上述SQL将ID(主键列)放到前面,SQL即可成功运行。
dbdream=> create table t_hash_3 (name varchar(10),id int primary key) DISTRIBUTED BY(id,name); NOTICE: updating distribution policy to match new primary key NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "t_hash_3_pkey" for table "t_hash_3" CREATE TABLE dbdream=> \d t_hash_3 Table "public.t_hash_3" Column | Type | Modifiers --------+-----------------------+----------- name | character varying(10) | id | integer | not null Indexes: "t_hash_3_pkey" PRIMARY KEY, btree (id), tablespace "tbs1" Distributed by: (id) Tablespace: "tbs1"
GreenPlum数据库标准的建表SQL应该明确指定分布键,如下:
dbdream=> create table t_hash_4 (name varchar(10),id int primary key) DISTRIBUTED BY(id); NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "t_hash_4_pkey" for table "t_hash_4" CREATE TABLE dbdream=> \d t_hash_4 Table "public.t_hash_4" Column | Type | Modifiers --------+-----------------------+----------- name | character varying(10) | id | integer | not null Indexes: "t_hash_4_pkey" PRIMARY KEY, btree (id), tablespace "tbs1" Distributed by: (id) Tablespace: "tbs1"
下面再演示下随机分布,要想使用随机分布,必须明确指定分区模式,随机分布模式数据都是随机分布,所以,不需要也没有分布键。
dbdream=> create table t_randomly (id int,name varchar(10)) DISTRIBUTED RANDOMLY; CREATE TABLE dbdream=> \d t_randomly Table "public.t_randomly" Column | Type | Modifiers --------+-----------------------+----------- id | integer | name | character varying(10) | Distributed randomly Tablespace: "tbs1"
随机分布不支持主键和唯一键,因为随机分布保证不了整体数据的唯一性。
dbdream=> create table t_randomly_1 (id int unique,name varchar(10)) DISTRIBUTED RANDOMLY; ERROR: UNIQUE and DISTRIBUTED RANDOMLY are incompatible dbdream=> alter table t_randomly add constraint pk_id primary key (id); ERROR: PRIMARY KEY and DISTRIBUTED RANDOMLY are incompatible
这里顺便提一下GreenPlum数据库修改字段类型,这和ORACLE的语法完全不一样,下面演示将t_randomly表的name字段长度从10调整到20。
dbdream=> alter table t_randomly alter column name type varchar(20); ALTER TABLE dbdream=> \d t_randomly Table "public.t_randomly" Column | Type | Modifiers --------+-----------------------+----------- id | integer | name | character varying(20) | Distributed randomly Tablespace: "tbs1"
下面在介绍下其他建表的方法,GreenPlum数据库支持CTAS(create table as select)的方式根据结果集来创建表,如果不指定分布键,那么GreenPlum会根据结果集来自行选择,基本都和参照本的分布键相同。下面实验以t_hash表为参照表,先向t_hash表中插入几条数据,GreenPlum数据库支持MYSQL数据库那样一条SQL插入多条数据。
dbdream=> insert into t_hash values ('a',1),('a',2); INSERT 0 2
下面使用CTAS在不指定分布键的情况下创建一张表。
dbdream=> create table t_ctas as select * from t_hash; NOTICE: Table doesn't have 'DISTRIBUTED BY' clause -- Using column(s) named 'name' as the Greenplum Database data distribution key for this table. HINT: The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew. SELECT 2 dbdream=> \d t_ctas Table "public.t_ctas" Column | Type | Modifiers --------+-----------------------+----------- name | character varying(10) | id | integer | Distributed by: (name) Tablespace: "tbs1" dbdream=> select * from t_ctas_1; name | id ------+---- a | 1 a | 2 (2 rows)
当然也可以在建表时指定和参照表不同的分布键。
dbdream=> create table t_ctas_1 as select * from t_hash DISTRIBUTED BY(id); SELECT 2 dbdream=> \d t_ctas_1 Table "public.t_ctas_1" Column | Type | Modifiers --------+-----------------------+----------- name | character varying(10) | id | integer | Distributed by: (id) Tablespace: "tbs1"
GreenPlum数据库还支持select into的方式来建表,这种方式和CTAS一样都是根据结果集创建表,但是这种方式不能指定分布键。
dbdream=> select * into t_into from t_hash; NOTICE: Table doesn't have 'DISTRIBUTED BY' clause -- Using column(s) named 'name' as the Greenplum Database data distribution key for this table. HINT: The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew. SELECT 2 dbdream=> \d t_into Table "public. t_into " Column | Type | Modifiers --------+-----------------------+----------- name | character varying(10) | id | integer | Distributed by: (name) Tablespace: "tbs1"
GreenPlum数据库还可以通过like的方式来建表,但是这种方式只会根据参照表的结构创建表,并没有数据,也不会带过来参照表的唯一键、压缩、只追加等属性,只是会创建一张和参照表字段结构一模一样的表而已。如果不指定分布键,则使用和参照表一样的分布键。
dbdream=> create table t_like (like t_hash); NOTICE: Table doesn't have 'distributed by' clause, defaulting to distribution columns from LIKE table CREATE TABLE dbdream=> \d t_like Table "public.t_like" Column | Type | Modifiers --------+-----------------------+----------- name | character varying(10) | id | integer | Distributed by: (name) Tablespace: "tbs1" dbdream=> select * from t_like; name | id ------+---- (0 rows)
当然,也可以在建表时指定和参照表不一样的分布键。
dbdream=> create table t_like_1 (like t_hash) DISTRIBUTED BY(id); CREATE TABLE dbdream=> \d t_like_1 Table "public.t_like_1" Column | Type | Modifiers --------+-----------------------+----------- name | character varying(10) | id | integer | Distributed by: (id) Tablespace: "tbs1"