big set intersection in pig - DataSciences board - Weiming (未名) archive

l*******m (posts: 1096)	1 What is the fastest method?
c***z (posts: 6348)	2 inner join
l*******m (posts: 1096)	3 It finished: 80M intersecting 120M took 100 minutes. Is that slow? I ran it with parallel set to 25. [c***z wrote: inner join]
c***z (posts: 6348)	4 Cluster size?
l*******m (posts: 1096)	5 100 [c***z wrote: cluster size?]
c***z (posts: 6348)	6 Then it is slow. Can you post your code here?
D**u (posts: 288)	7 Have you tried a sort-merge join? In my experience it is the fastest, at least with Hive. First sort both sets, then do: join A by $1, B by $1 using 'merge';
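
The sort-merge idea in post 7 can be sketched outside Pig as a single pass over two sorted inputs. This is a minimal single-machine Python illustration, not how Pig or Hive implement their merge join; the name `merge_intersect` is hypothetical, and the sketch assumes both inputs are already sorted in ascending order and contain no duplicates.

```python
from typing import Iterable, Iterator

# Sentinel marking the end of an input (hypothetical helper for this sketch).
_END = object()


def merge_intersect(a: Iterable, b: Iterable) -> Iterator:
    """Yield keys present in both inputs.

    Assumes both inputs are sorted ascending and duplicate-free.
    Each input is read once, front to back, so neither has to fit in memory.
    """
    it_a, it_b = iter(a), iter(b)
    x = next(it_a, _END)
    y = next(it_b, _END)
    while x is not _END and y is not _END:
        if x == y:
            # Key is in both sets: emit it and advance both sides.
            yield x
            x = next(it_a, _END)
            y = next(it_b, _END)
        elif x < y:
            # Smaller key cannot appear later on the other side: skip it.
            x = next(it_a, _END)
        else:
            y = next(it_b, _END)


common = list(merge_intersect(sorted([5, 1, 9, 3]), sorted([3, 4, 5, 10])))
```

The merge step itself is linear in the combined input size; the cost of sorting beforehand is separate and is only worth paying if the data is not already stored in key order.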
r*****d (posts: 346)	8 [l*******m wrote: What is the fastest method?]
B*A (posts: 83)	9 In a SQL database this would take only a few seconds. [l*******m wrote: It finished: 80M intersecting 120M took 100 minutes. Is that slow? I ran it with parallel set to 25.]
l*******m (posts: 1096)	10 That could be true if the server has enough RAM. In that case I would use a hash set directly, which is faster. [B*A wrote: In a SQL database this would take only a few seconds.]
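
The in-memory hash set approach mentioned in post 10 might look like the sketch below. It is an illustration in Python under stated assumptions, not the poster's code: the name `hash_intersect` is hypothetical, the smaller set is assumed to fit in RAM, and the larger input is assumed to contain no duplicates (otherwise a matching key would be emitted more than once).

```python
from typing import Hashable, Iterable, Iterator


def hash_intersect(small: Iterable[Hashable], large: Iterable[Hashable]) -> Iterator[Hashable]:
    """Yield keys of `large` that also occur in `small`.

    Only `small` is materialized in memory; `large` is streamed once.
    """
    seen = set(small)          # build the hash set from the smaller side
    for key in large:          # stream the larger side
        if key in seen:        # average O(1) membership test
            yield key


# Toy inputs standing in for the 80M and 120M key sets from the thread.
result = sorted(hash_intersect([1, 2, 3], [3, 4, 1]))
```

Building the set from the smaller side keeps peak memory proportional to the smaller input, which is the condition the post attaches to this approach ("if the server has enough RAM").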
B*A (posts: 83)	11 I just intersected a 730-million-record table (60 GB) with itself on my Oracle database. It took 54 seconds. Most of the time, Big Data is not a solution for better performance; it is a solution for less expensive software investment. [l*******m wrote: That could be true if the server has enough RAM. In that case I would use a hash set directly, which is faster.]
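
To put the two timings from the thread side by side, the arithmetic can be worked out as below. This is a rough sketch only: it counts each input record once (the 730M-record table is counted once even though it is intersected with itself), and the two runs used different hardware and software, so the ratio is not a like-for-like benchmark. The helper name `records_per_second` is hypothetical.

```python
def records_per_second(records: int, seconds: float) -> float:
    """Input records processed per second of wall-clock time."""
    return records / seconds


# Pig run from post 3: 80M and 120M record sets, 100 minutes.
pig_rate = records_per_second(80_000_000 + 120_000_000, 100 * 60)

# Oracle run from post 11: 730M-record table, 54 seconds.
oracle_rate = records_per_second(730_000_000, 54)

# How many times more input records per second the Oracle run handled.
ratio = oracle_rate / pig_rate

print(f"Pig:    {pig_rate:,.0f} records/s")
print(f"Oracle: {oracle_rate:,.0f} records/s")
print(f"Ratio:  {ratio:.0f}x")
```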
l*******m (posts: 1096)	12 In my case the original data sets have no duplicates. I am curious about Oracle performance... [B*A wrote: I just intersected a 730-million-record table (60 GB) with itself on my Oracle database. It took 54 seconds. Most of the time, Big Data is not a solution for better performance; it is a solution for less expensive software investment.]
B*A (posts: 83)	13 No duplicates here either. [l*******m wrote: In my case the original data sets have no duplicates. I am curious about Oracle performance...]
