GreenPlum数据库创建表及表的数据分布

当前位置：首页 > GreenPlum > 正文

Jan202016

作者：dbdream 发布：2016-01-20 18:38 分类：GreenPlum 抢沙发

GreenPlum是一个分布式数据库，表的数据当然是分布到所有Segment节点上，那么怎么控制表中数据的分布呢？下面就来介绍下GreenPlum数据库的建表和表的数据分布。

GreenPlum数据库支持hash分布(DISTRIBUTED BY)和随机分布(DISTRIBUTED RANDOMLY)两种分布策略，hash分布会计算分布键的hash值，相似的hash值的数据会放到同一个Segment节点上，和hash分区算法有些类似，这样，同样的分布键的数据肯定会分布到同一个Segment节点，这样在多表关联的时候，如果关联字段都是分布键，就可以在每个Segment节点关联后，Segment节点把结果发送到Master节点，再由Master节点汇总，将最终的结果返还客户端。而随机分布则不能保证同样分布键的数据分布在同一个Segment节点上，这样在表关联的时候，就需要将数据发送到所有Segment节点去做运算，这样网络传输和大量数据运算都需要较长的时间，性能非常低下，GreenPlum数据库不建议需要多表关联的表使用随机分布，也不推荐使用随机分布。还有一点，因为每个Segment节点都是独立的PostgreSql数据库，只能保证在单个Segment几点的数据唯一性，而随机分布不能保证整体数据的唯一性，而hash分布，相同分布键的数据会分配到同一个Segment节点，这样在单个Segment节点保证数据唯一性，也就保证了整体数据的唯一性。hash分布也是GreenPlum数据库的默认分布方式。下面请看实验演示。

如果在建表时不指定数据的分布策略，默认使用hash分布。

dbdream=> create table t_hash (name varchar(10),id int);
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'name' as the Greenplum Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE

从上面的信息就可以看出，建表时如果不指定分布键和分布策略，默认使用第一个字段作为分布键，并且使用hash分布策略，也可以通过\d命令来查看表的分布信息。

dbdream=> \d t_hash
           Table "public.t_hash"
 Column |         Type          | Modifiers 
--------+-----------------------+-----------
 name   | character varying(10) | 
 id     | integer               | 
Distributed by: (name)
Tablespace: "tbs1"

如果建表时不指定分布键和分布策略，也并不是都使用第一个字段作为分布键，如果表中包含主键，则默认会选择主键为分布键。

dbdream=> create table t_hash_1 (name varchar(10),id int primary key);
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "t_hash_1_pkey" for table "t_hash_1"
CREATE TABLE

dbdream=> \d t_hash_1
          Table "public.t_hash_1"
 Column |         Type          | Modifiers 
--------+-----------------------+-----------
 name   | character varying(10) | 
 id     | integer               | not null
Indexes:
    "t_hash_1_pkey" PRIMARY KEY, btree (id), tablespace "tbs1"
Distributed by: (id)
Tablespace: "tbs1"

如果表上有唯一约束但没有主键，那么默认选择唯一字段作为分布键。

dbdream=> create table t_hash_2 (name varchar(10),id int unique);
NOTICE:  CREATE TABLE / UNIQUE will create implicit index "t_hash_2_id_key" for table "t_hash_2"
CREATE TABLE

dbdream=> \d t_hash_2
          Table "public.t_hash_2"
 Column |         Type          | Modifiers 
--------+-----------------------+-----------
 name   | character varying(10) | 
 id     | integer               | 
Indexes:
    "t_hash_2_id_key" UNIQUE, btree (id), tablespace "tbs1"
Distributed by: (id)
Tablespace: "tbs1"

那么就会有人问，如果同时存在主键和唯一键会选择哪个字段作为分布键呢？GreenPlum数据库不支持同时存在两个唯一的字段(复合主键除外)。

dbdream=> create table t_hash_3 (name varchar(10) unique,id int primary key);
ERROR：Greenplum Database does not allow having both PRIMARY KEY and UNIQUE constraints

如果建表时指定的分布键不是主键，那么GreenPlum会把表的分布键改为主键字段，而不是指定的字段，但是SQL语句中指定的分布键必须包含主键字段，否则SQL会报错。

dbdream=> create table t_hash_3 (name varchar(10),id int primary key) DISTRIBUTED BY(name);
ERROR:  PRIMARY KEY and DISTRIBUTED BY definitions incompatible
HINT:  When there is both a PRIMARY KEY, and a DISTRIBUTED BY clause, the DISTRIBUTED BY clause must be equal to or a left-subset of the PRIMARY KEY

如果建表时指定的分布键不是主键，那么SQL中的分布键必须包含主键字段，SQL才能成功运行，并且把分布键改成主键字段，而不是SQL指定的分布键字段。而且主键字段必须在SQL指定的分布键的第一列出现才可以，否则也会遇到错误导致SQL无法成功运行。

dbdream=> create table t_hash_3 (name varchar(10),id int primary key) DISTRIBUTED BY(name,id);
ERROR:  PRIMARY KEY and DISTRIBUTED BY definitions incompatible
HINT:  When there is both a PRIMARY KEY, and a DISTRIBUTED BY clause, the DISTRIBUTED BY clause must be equal to or a left-subset of the PRIMARY KEY

上述SQL将ID(主键列)放到前面，SQL即可成功运行。

dbdream=> create table t_hash_3 (name varchar(10),id int primary key) DISTRIBUTED BY(id,name);
NOTICE:  updating distribution policy to match new primary key
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "t_hash_3_pkey" for table "t_hash_3"
CREATE TABLE

dbdream=> \d t_hash_3
          Table "public.t_hash_3"
 Column |         Type          | Modifiers 
--------+-----------------------+-----------
 name   | character varying(10) | 
 id     | integer               | not null
Indexes:
    "t_hash_3_pkey" PRIMARY KEY, btree (id), tablespace "tbs1"
Distributed by: (id)
Tablespace: "tbs1"

GreenPlum数据库标准的建表SQL应该明确指定分布键，如下：

dbdream=> create table t_hash_4 (name varchar(10),id int primary key) DISTRIBUTED BY(id);
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "t_hash_4_pkey" for table "t_hash_4"
CREATE TABLE

dbdream=> \d t_hash_4
          Table "public.t_hash_4"
 Column |         Type          | Modifiers 
--------+-----------------------+-----------
 name   | character varying(10) | 
 id     | integer               | not null
Indexes:
    "t_hash_4_pkey" PRIMARY KEY, btree (id), tablespace "tbs1"
Distributed by: (id)
Tablespace: "tbs1"

下面再演示下随机分布，要想使用随机分布，必须明确指定分区模式，随机分布模式数据都是随机分布，所以，不需要也没有分布键。

dbdream=> create table t_randomly (id int,name varchar(10)) DISTRIBUTED RANDOMLY;
CREATE TABLE

dbdream=> \d t_randomly
         Table "public.t_randomly"
 Column |         Type          | Modifiers 
--------+-----------------------+-----------
 id     | integer               | 
 name   | character varying(10) | 
Distributed randomly
Tablespace: "tbs1"

随机分布不支持主键和唯一键，因为随机分布保证不了整体数据的唯一性。

dbdream=> create table t_randomly_1 (id int unique,name varchar(10)) DISTRIBUTED RANDOMLY;
ERROR:  UNIQUE and DISTRIBUTED RANDOMLY are incompatible

dbdream=> alter table t_randomly add constraint pk_id primary key (id);
ERROR:  PRIMARY KEY and DISTRIBUTED RANDOMLY are incompatible

这里顺便提一下GreenPlum数据库修改字段类型，这和ORACLE的语法完全不一样，下面演示将t_randomly表的name字段长度从10调整到20。

dbdream=> alter table t_randomly alter column name type varchar(20);
ALTER TABLE

dbdream=> \d t_randomly
         Table "public.t_randomly"
 Column |         Type          | Modifiers 
--------+-----------------------+-----------
 id     | integer               | 
 name   | character varying(20) | 
Distributed randomly
Tablespace: "tbs1"

下面在介绍下其他建表的方法，GreenPlum数据库支持CTAS(create table as select)的方式根据结果集来创建表，如果不指定分布键，那么GreenPlum会根据结果集来自行选择，基本都和参照本的分布键相同。下面实验以t_hash表为参照表，先向t_hash表中插入几条数据，GreenPlum数据库支持MYSQL数据库那样一条SQL插入多条数据。

dbdream=> insert into t_hash values ('a',1),('a',2);
INSERT 0 2

下面使用CTAS在不指定分布键的情况下创建一张表。

dbdream=> create table t_ctas as select * from t_hash;
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column(s) named 'name' as the Greenplum Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
SELECT 2

dbdream=> \d t_ctas
           Table "public.t_ctas"
 Column |         Type          | Modifiers 
--------+-----------------------+-----------
 name   | character varying(10) | 
 id     | integer               | 
Distributed by: (name)
Tablespace: "tbs1"

dbdream=> select * from t_ctas_1;
 name | id 
------+----
 a    |  1
 a    |  2
(2 rows)

当然也可以在建表时指定和参照表不同的分布键。

dbdream=> create table t_ctas_1 as select * from t_hash DISTRIBUTED BY(id);
SELECT 2

dbdream=> \d t_ctas_1
          Table "public.t_ctas_1"
 Column |         Type          | Modifiers 
--------+-----------------------+-----------
 name   | character varying(10) | 
 id     | integer               | 
Distributed by: (id)
Tablespace: "tbs1"

GreenPlum数据库还支持select into的方式来建表，这种方式和CTAS一样都是根据结果集创建表，但是这种方式不能指定分布键。

dbdream=> select * into t_into from t_hash;
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column(s) named 'name' as the Greenplum Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
SELECT 2

dbdream=> \d t_into
          Table "public. t_into "
 Column |         Type          | Modifiers 
--------+-----------------------+-----------
 name   | character varying(10) | 
 id     | integer               | 
Distributed by: (name)
Tablespace: "tbs1"

GreenPlum数据库还可以通过like的方式来建表，但是这种方式只会根据参照表的结构创建表，并没有数据，也不会带过来参照表的唯一键、压缩、只追加等属性，只是会创建一张和参照表字段结构一模一样的表而已。如果不指定分布键，则使用和参照表一样的分布键。

dbdream=> create table t_like (like t_hash);
NOTICE:  Table doesn't have 'distributed by' clause, defaulting to distribution columns from LIKE table
CREATE TABLE

dbdream=> \d t_like
           Table "public.t_like"
 Column |         Type          | Modifiers 
--------+-----------------------+-----------
 name   | character varying(10) | 
 id     | integer               | 
Distributed by: (name)
Tablespace: "tbs1"

dbdream=> select * from t_like;
 name | id 
------+----
(0 rows)

当然，也可以在建表时指定和参照表不一样的分布键。

dbdream=> create table t_like_1 (like t_hash) DISTRIBUTED BY(id);
CREATE TABLE

dbdream=> \d t_like_1
           Table "public.t_like_1"
 Column |         Type          | Modifiers 
--------+-----------------------+-----------
 name   | character varying(10) | 
 id     | integer               | 
Distributed by: (id)
Tablespace: "tbs1"

本文固定链接: https://www.dbdream.com.cn/2016/01/greenplum%e6%95%b0%e6%8d%ae%e5%ba%93%e5%88%9b%e5%bb%ba%e8%a1%a8%e5%8f%8a%e8%a1%a8%e7%9a%84%e6%95%b0%e6%8d%ae%e5%88%86%e5%b8%83/ | 信春哥，系统稳，闭眼上线不回滚！

该日志由 dbdream 于2016年01月20日发表在 GreenPlum 分类下，你可以发表评论，并在保留原文地址及作者的情况下引用到你的网站或博客。
原创文章转载请注明: GreenPlum数据库创建表及表的数据分布 | 信春哥，系统稳，闭眼上线不回滚！
关键字: greenplum

【上一篇】GreenPlum数据库创建用户、文件空间、表空间、数据库
【下一篇】使用数据泵加载数据时遇到ORA-31693、ORA-29913错误

GreenPlum数据库创建表及表的数据分布

您可能还会对这些文章感兴趣！

GreenPlum数据库创建表及表的数据分布：等您坐沙发呢！

发表评论

最新日志热评日志随机日志

最活跃的读者

标签云集

分类目录

文章归档

博客统计

用户登录