Pig的cogroup详解

250次阅读

没有评论

共计 5138 个字符，预计需要花费 13 分钟才能阅读完成。

从实例出发

%default file test.txt

A = load ‘$file’ as (date, web, name, food);

B = load ‘$file’ as (date, web, name, food);

C= cogroup A by $0, B by $1;

describe C;

illustrate C;

dump C;

cogroup 命令中 $0 和 $1，两个列的内容如果不一样，就是分别生成两个批次的 group，先按 A 值分组，在按 B 对应的值分组。按 A 的值分组时，B 对应的为空，则 group 中有一个空组{}；但如果内容一样，如 C = cogroup A by $1, B by $1; 就是生成一个批次的 group，其中包含 A 和 B 两个表中所有的等于该值的元组。

COGROUP 与 join 的区别：自己懒得写，摘自网络

Join 的操作结果是平面的（一组元组），而 COGROUP 的结果是有嵌套结构的。
运行以下命令：
r1 = cogroup r_student by classNo,r_teacher by classNo;
dump r1;
结果如下：
(C01,{(C01,N0103,65),(C01,N0102,59),(C01,N0101,82)},{(C01,Zhang)})
(C02,{(C02,N0203,79),(C02,N0202,82),(C02,N0201,81)},{(C02,Sun)})
(C03,{(C03,N0306,72),(C03,N0302,92),(C03,N0301,56)},{(C03,Wang)})
(C04,{},{(C04,Dong)})
由结果可以看出：
1）cogroup 和 join 操作类似。
2）生成的关系有 3 个字段。第一个字段为连接字段；第二个字段是一个包，值为关系 1 中的满足匹配关系的所有元组；第三个字段也是一个包，值为关系 2 中的满足匹配关系的所有元组。
3）类似于 Join 的外连接。比如结果中的第四个记录，第二个字段值为空包，因为关系 1 中没有满足条件的记录。实际上第一条语句和以下语句等同：
r1= cogroup r_student by classNo outer,r_teacher by classNo outer;
如果你希望关系 1 或 2 中没有匹配记录时不在结果中出现，则可以分别在关系中使用 inner 而关键字进行排除。
执行以下语句：
r1 = cogroup r_student by classNo inner,r_teacher byclassNo outer;
dump r1;
结果为：
(C01,{(C01,N0103,65),(C01,N0102,59),(C01,N0101,82)},{(C01,Zhang)})
(C02,{(C02,N0203,79),(C02,N0202,82),(C02,N0201,81)},{(C02,Sun)})
(C03,{(C03,N0306,72),(C03,N0302,92),(C03,N0301,56)},{(C03,Wang)})

flatten 执行命令：
r2 = foreach r1 generate flatten($1),flatten($2);
dump r2;
结果如下：
(C01,N0103,65,C01,Zhang)
(C01,N0102,59,C01,Zhang)
(C01,N0101,82,C01,Zhang)
(C02,N0203,79,C02,Sun)
(C02,N0202,82,C02,Sun)
(C02,N0201,81,C02,Sun)
(C03,N0306,72,C03,Wang)
(C03,N0302,92,C03,Wang)
(C03,N0301,56,C03,Wang)

可以看到，两个同时 flatten，会自动映射生成多列。

针对 cogroup，我测试了一下，核心代码如下：

industry_existed_Data = LOAD ‘$industryPath’ USING PigStorage(‘,’) AS (industryId:chararray,guid:chararray,sex:chararray,log_type:chararray);

sample_data = limit industry_existed_Data 20;
–STORE sample_data INTO ‘/user/wizad/tmp/industry_existed_Data’ USING PigStorage(‘,’);

–merge with history data
cogroupIndustryExistCurrentByGuid = COGROUP industry_existed_Data by guid, industry_current_data by guid;
mydata = sample cogroupIndustryExistCurrentByGuid 0.1;
dump mydata;
describe cogroupIndustryExistCurrentByGuid;
–dump cogroupIndustryExistCurrentByGuid;

–STORE mycogroupdata INTO ‘/user/wizad/tmp/cogroupIndustryExistCurrentByGuid’ USING PigStorage(‘,’);

look_for_cogroup = FOREACH cogroupIndustryExistCurrentByGuid GENERATE $0,$2;
describe look_for_cogroup;

IndustryStorageDataTmp = FOREACH cogroupIndustryExistCurrentByGuid GENERATE FLATTEN($2);
IndustryStorageData = DISTINCT IndustryStorageDataTmp;
describe IndustryStorageData;

显示结果：
三个数据的结构如下
cogroupIndustryExistCurrentByGuid:
{
group: chararray,
industry_existed_Data:{industryId: chararray,guid: chararray,sex: chararray,log_type: chararray},
industry_current_data: {joined_ad_campaign_data::industryId: chararray,joined_Orgin_sex_data::distinct_origin_historical_sex::guid: chararray,joined_Orgin_sex_data::social_sex::sex: chararray,joined_Orgin_sex_data::distinct_origin_historical_sex::log_type: chararray}
}

look_for_cogroup:
{
group: chararray,
industry_current_data: {joined_ad_campaign_data::industryId: chararray,joined_Orgin_sex_data::distinct_origin_historical_sex::guid: chararray,joined_Orgin_sex_data::social_sex::sex: chararray,joined_Orgin_sex_data::distinct_origin_historical_sex::log_type: chararray}
}

IndustryStorageData:
{
industry_current_data::joined_ad_campaign_data::industryId: chararray,
industry_current_data::joined_Orgin_sex_data::distinct_origin_historical_sex::guid: chararray,
industry_current_data::joined_Orgin_sex_data::social_sex::sex: chararray,
industry_current_data::joined_Orgin_sex_data::distinct_origin_historical_sex::log_type: chararray
}

可以看出三个数据的结构很复杂，因为前面做关联所以包含了对象名（或者叫域名），指明属于哪个对象。可以只看最后一列名字和格式。
第三个是 flatten（$2）的结果。

cogroup 有空集问题，就是对应 group 中的每个值（cogroup 用来关联的 key 的取值），两个集合各自按 key 值进行 group 后，某些 key 对应的集合为空。
上面的 pig 代码的实际数据如下，guid 作为关联 key，可以看出很多空集 {}，出现在某些 guid 的取值对应集合后。
所以取数据时要注意，只 flatten 某一列，会造成其他列数据丢失，因为对应着该 flatten 列的空集。

((-1,),{(74,9051235c-a391-4dae-ab22-f93d24a12636,-1,-1,),(75,053e9f48-03bf-4b39-9455-ff412a725a3c,-1,-1,),(74,21ca723c-ec2b-4242-8108-b95436f10e3e,-1,-1,),(74,fec1932a-b0e4-4bf0-b504-8ed8f3c159e7,-1,-1,),(74,d74374ec-8cf4-4c4a-b598-9631f6972cbb,-1,-1,),(74,6780962a-bf75-4c4c-a557-94a7de5a3e36,-1,-1,),(74,14517915-ee3f-4d34-943f-d6f1813afdef,-1,-1,),(74,c5547aca-3b8b-4108-93ba-bf365c106cdd,-1,-1,),(74,e9a986c1-6868-4f7f-baf6-69d8c302583e,-1,-1,),(74,9c1341cf-45b8-48c6-b699-33b1a4215c66,-1,-1,),(74,f16e6222-a84b-4758-ae71-0613c8f34b29,-1,-1,),(74,47cc25ef-05bc-47f4-a32b-3cddaf0ac22b,-1,-1,),(74,d5c1b6b0-38c3-464b-8cb9-70ced875be5f,-1,-1,),(74,6a4f782a-1f5c-45c0-bb3a-4df25c436be3,-1,-1,),(74,23bb2f0c-d629-479d-800e-b86fc3d6e45c,-1,-1,)})
((a50a17bde79ac018,),{(74,863010025134441,a50a17bde79ac018,863010025134441,)})
((a51779f736cd3f54,),{(74,862949029595753,a51779f736cd3f54,862949029595753,)})
((c7ae5867-3b77-4987-b082-ed3867b5c384,),{(74,353627055387065,c7ae5867-3b77-4987-b082-ed3867b5c384,353627055387065,)})

Pig 的安装与测试 http://www.linuxidc.com/Linux/2014-07/104039.htm

Pig 安装与配置教程 http://www.linuxidc.com/Linux/2013-04/82785.htm

Pig 安装部署及 MapReduce 模式下测试 http://www.linuxidc.com/Linux/2013-04/82786.htm

Pig 安装及本地模式测试, 体验 http://www.linuxidc.com/Linux/2013-04/82783.htm

Pig 的安装配置与基本使用 http://www.linuxidc.com/Linux/2013-02/79928.htm

Hadoop Pig 进阶语法 http://www.linuxidc.com/Linux/2013-02/79462.htm

正文完

星哥玩云-微信公众号