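This page is an excerpt and archive of the corresponding thread on MITBBS (未名空间). Posts less than a week old show at most 50 characters; older posts show up to 500 characters.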
DataSciences board - [Pig Programming] Pig Latin join problem
c***z
Posts: 6348
1
Hi all,
Just wondering if any of you have had the same problem and if you know the cause.
I have a dataset of site-visitor pairs, which records daily visits to
websites.
When I filter the data by the domain "espn.com", I get 306 unique
visitors; when I join the data with the list of domain names, I get only
176 unique visitors to "espn.com".
This is weird, since conceptually both filtering and joining use hash tables
the same way.
PS: Pig didn't drop bags during the join, or at least it didn't tell me about
dropping bags.
Thanks a lot!
l*******m
Posts: 1096
2
There are several kinds of join: inner, left/right outer, and full outer. The
default join is the inner one, which means records with a null join key, or
with no match on the other side, are excluded. It is usually safer to use a
left outer join, with domain as the key on the left side.
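To illustrate the point about join types, here is a minimal Python sketch that simulates the semantics rather than running Pig. The rows, the `inner_join` and `left_outer_join` helpers, and the use of `None` for a Pig null are all invented for illustration; this is not the OP's data or script, and it does not claim to be the cause of the 306 vs. 176 discrepancy.

```python
# Toy simulation of inner vs. left outer join on a nullable key.
# All rows are made up; None stands in for a null join key.
visits = [
    ("espn.com", "v1"),
    ("espn.com", "v2"),
    (None, "v3"),        # null join key
    ("cnn.com", "v4"),   # key that is absent from the domain list
]
domains = ["espn.com"]


def inner_join(left, right_keys):
    """Keep only left rows whose key is non-null and present on the right."""
    keys = set(right_keys)
    return [(d, v, d) for d, v in left if d is not None and d in keys]


def left_outer_join(left, right_keys):
    """Keep every left row; the right-side field is None when nothing matches."""
    keys = set(right_keys)
    return [
        (d, v, d if d is not None and d in keys else None)
        for d, v in left
    ]


inner_rows = inner_join(visits, domains)
left_rows = left_outer_join(visits, domains)

print(len(inner_rows), "rows survive the inner join")
print(len(left_rows), "rows survive the left outer join")
```

Comparing the two row counts on a small sample like this is one way to see whether rows are being lost to null or unmatched keys.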

p****o
Posts: 1340
3
Nice thought...
Another possibility is that the data (especially the domain field) has not
been cleaned: something like upper case versus lower case could make a big
difference.
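A small Python sketch of the cleaning idea, under the assumption that the domain field may differ in letter case or carry stray whitespace. The sample rows and the `normalize_domain` helper are hypothetical and are not taken from the OP's dataset.

```python
# Toy example: exact matching on an uncleaned key misses variants
# that refer to the same domain. All rows are invented.
visits = [
    ("espn.com", "v1"),
    ("ESPN.com", "v2"),     # different letter case
    (" espn.com ", "v3"),   # stray whitespace
]


def normalize_domain(raw):
    """Lower-case and trim a domain string; pass None through unchanged."""
    if raw is None:
        return None
    return raw.strip().lower()


# Visitors matched by exact comparison on the raw field.
exact = {v for d, v in visits if d == "espn.com"}

# Visitors matched after normalizing the field first.
cleaned = {v for d, v in visits if normalize_domain(d) == "espn.com"}

print(sorted(exact))
print(sorted(cleaned))
```

If both sides of a join are normalized the same way before joining, variants like these end up with the same key.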

d****n
Posts: 12461
4
The OP said they filtered on espn.com, so at least that field has no null values.

c***z
Posts: 6348
5
Thank you all for your input! I did a FILTER on NULL before the JOIN, and the
problem is solved.