c***z posts: 6348 | 1 Hi all,
Just wondering if any of you had the same problem and if you know the cause.
I have a dataset of site-visitor pairs, which records daily visits to
websites.
When I filter the data on the domain "espn.com", I get 306 unique
visitors; when I join the data with the list of domain names, I get only
176 unique visitors to "espn.com".
This is weird, since conceptually both filtering and joining use hash tables
the same way.
PS: Pig didn't drop bags during the join, or at least it didn't tell me about
dropping any.
Thanks a lot! | l*******m posts: 1096 | 2 There are several kinds of join: inner, left/right, outer. The default join is the
inner one, which means records whose join key is null are excluded. It is usually safe to
use a left join, with domain as the key on the left side.
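
To make the difference concrete, here is a minimal Python sketch of the semantics described above. It is not Pig code, only a toy model; the sample rows and the helper names `inner_join` and `left_join` are made up for illustration.

```python
# Toy model of join semantics; all data and names here are hypothetical.
visits = [
    ("espn.com", "v1"),
    ("espn.com", "v2"),
    (None, "v3"),        # a visit whose domain field is null
    ("cnn.com", "v4"),
]
domains = ["espn.com", "cnn.com", None]


def inner_join(left, right_keys):
    """Keep left rows whose key matches a right key; a null key never matches."""
    keys = {k for k in right_keys if k is not None}
    return [(d, v) for d, v in left if d is not None and d in keys]


def left_join(left, right_keys):
    """Keep every left row; the right side is None when nothing matches."""
    keys = {k for k in right_keys if k is not None}
    return [(d, v, d if d in keys else None) for d, v in left]


inner_rows = inner_join(visits, domains)
left_rows = left_join(visits, domains)
print(len(inner_rows), len(left_rows))
```

In this toy data the inner join loses the row with the null domain, while the left join keeps it with an empty right side, so comparing the two row counts shows how many records fall out because of unmatched or null keys.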
| p****o posts: 1340 | 3 Nice thought...
Another possibility is that the data (especially the domain field)
has not been cleaned: upper case versus lower case, for example, could make a big difference.
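
A small Python sketch of that cleaning issue, again a toy model rather than Pig code. The sample rows and the `normalize` helper are hypothetical, and trimming whitespace is an extra assumption beyond the case mismatch mentioned above.

```python
# Hypothetical visit rows: same site, three different spellings of the domain.
visits = [
    ("espn.com", "v1"),
    ("ESPN.com", "v2"),    # differs only in case
    (" espn.com", "v3"),   # stray leading space (assumed, not from the thread)
]
domain_list = {"espn.com"}


def normalize(domain):
    """Trim whitespace and lower-case so equivalent domains compare equal."""
    return domain.strip().lower()


# Exact string matching, as an equality join on the raw field would do.
raw_matches = {v for d, v in visits if d in domain_list}

# Matching after cleaning the domain field.
clean_matches = {v for d, v in visits if normalize(d) in domain_list}

print(len(raw_matches), len(clean_matches))
```

If the filter step and the join step compare the domain differently (for instance one on cleaned values and the other on raw values), a gap in unique visitors like the one reported could appear; normalizing the key on both sides before joining removes that source of mismatch.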
| d****n posts: 12461 | 4 But the OP said the data was filtered on espn.com, so at least that field has no null values.
| c***z posts: 6348 | 5 Thank you all for your input! I did a FILTER on NULL and then the JOIN, and the
problem is solved.