简单map reduce mean median，傻逼回答 - JobHunting版 - 未名存档

本页内容为未名空间相应帖子的节选和存档，一周内的贴子最多显示50字，超过一周显示500字访问原贴

JobHunting版 - 简单map reduce mean median，傻逼回答

相关主题
● median of N^2 numbers across N machines	● #面试题#有100个database，每个存1 million data，如何求出median number of 这些数。
● 请教可以在线练习 map reduce 的地方？	● 电话面试一个design问题，看看怎么做
● 一道大数据题，求最优解。	● 请教MapReduce怎么找median
● 一道大数据题	● F家onsite面经
● 一个多线程的简单问题	● mapreduce 初级问题，请各位大牛指点
● 再问一道题	● map reduce word count
● 一道巨常见的题	● MapReduce的面试题
● 问道大数据的题	● Apple 数据科学家面经

相关话题的讨论汇总
话题: reducer话题: median话题: avg0话题: avg1话题: average

进入JobHunting版参与讨论

1

(共1页)

b**********5 发帖数: 7881	1 先问: map reduce mean 我说， mapper emit （number， 1），可以弄几个combiner， emit （partial sum ， partial N），然后最后一个reducer， add up sum divide by N 问：会有什么问题答： sum overflow，可以用 long，或者big integer？此处省略一千字原来是这样的：something called rolling average let's say avg0 is average for a0... aN0, avg1 is average for aN0 .... aN1.... so the total average is avg0(N0/N) + avg1(N1/N) + avg2*(N2/N).... so the combiner can emit (avg0, N0), (avg1, N1) ... pair and the reducer would calculate the total average ======================================================== mean: 先给了个one reducer的回答： numbers come to reducer already sorted, so just the middle element 然后给了bucket solution，可以用multiple reducer， partition based on the bucket，比如说 0-100 多少个， 100-200多少个， 200-1000 多少个，然后再求 median 问: 这样是不是exact？说：不是问：怎么样improve accuracy。。。答不出。给了一个明显hint后，答sampling！
g****v 发帖数: 971	2 第二题为什么bucket的方法不准确？第一遍mapreduce可以确定最多2个buckets，然后第二遍最多sort下不行么？能不能展开讲讲sampling? sum ... 【在 b*********5 的大作中提到】 : 先问: map reduce mean : 我说， mapper emit （number， 1），可以弄几个combiner， emit （partial sum : ， partial N），然后最后一个reducer， add up sum divide by N : 问：会有什么问题 : 答： sum overflow，可以用 long，或者big integer？ : 此处省略一千字 : 原来是这样的：something called rolling average : let's say avg0 is average for a0... aN0, avg1 is average for aN0 .... aN1.... : so the total average is avg0(N0/N) + avg1(N1/N) + avg2(N2/N).... : so the combiner can emit (avg0, N0), (avg1, N1) ... pair
b**********5 发帖数: 7881	3 数字很多，你去哪里sort？如果第二遍能把两个bucket都sort，和我只搞一个 reducer，数字全部去那里，没什么区别。数字太多，装不下， sort不了【在 g****v 的大作中提到】 : 第二题为什么bucket的方法不准确？ : 第一遍mapreduce可以确定最多2个buckets，然后第二遍最多sort下不行么？ : 能不能展开讲讲sampling? : : sum : ...
g****v 发帖数: 971	4 比如第一遍是100 billion的数字，我找到了2个bucket，这2个bucket只有1million个数字，第二遍我用mapreduce sort这1milion个数字，然后找到median不行么？你说的sampling是什么？【在 b**********5 的大作中提到】 : 数字很多，你去哪里sort？如果第二遍能把两个bucket都sort，和我只搞一个 : reducer，数字全部去那里，没什么区别。数字太多，装不下， sort不了
b**********5 发帖数: 7881	5 你找到了1个million里的median，有什么用？搞不懂。。。【在 g****v 的大作中提到】 : 比如第一遍是100 billion的数字，我找到了2个bucket，这2个bucket只有1million个 : 数字，第二遍我用mapreduce sort这1milion个数字，然后找到median不行么？ : 你说的sampling是什么？
g****v 发帖数: 971	6 那2个bucket已经有需要第几个是median的信息了，然后找到就行了。比如第一轮mapreduce告诉你说第2000个element in [bucket1,bucket2]是整个的 median，你就把[bucket1,bucket2]在第二遍mapreduce中排序，找到第2000个element which is the final median for the whole number array. 【在 b**********5 的大作中提到】 : 你找到了1个million里的median，有什么用？搞不懂。。。
b**********5 发帖数: 7881	7 第一轮mapreduce告诉你说第2000个element in [bucket1,bucket2]是整个的median 我觉得either you or me are totally confused on this。。。 how do u know that the 2000th element in those two buckets are the entire median? element 【在 g****v 的大作中提到】 : 那2个bucket已经有需要第几个是median的信息了，然后找到就行了。 : 比如第一轮mapreduce告诉你说第2000个element in [bucket1,bucket2]是整个的 : median，你就把[bucket1,bucket2]在第二遍mapreduce中排序，找到第2000个element : which is the final median for the whole number array.
b**********5 发帖数: 7881	8 我觉得你是不是在用multiple rounds of partition。。。我在面试过，提过这方法，被否认。这种multiple rounds，你估计要用很多很多rounds，我一提，就被否决了 element 【在 g****v 的大作中提到】 : 那2个bucket已经有需要第几个是median的信息了，然后找到就行了。 : 比如第一轮mapreduce告诉你说第2000个element in [bucket1,bucket2]是整个的 : median，你就把[bucket1,bucket2]在第二遍mapreduce中排序，找到第2000个element : which is the final median for the whole number array.

1

(共1页)

进入JobHunting版参与讨论

相关主题
● Apple 数据科学家面经	● 一个多线程的简单问题
● 问个大数据处理的面试题	● 再问一道题
● 要去google onsite的同学们	● 一道巨常见的题
● 问一道题目	● 问道大数据的题
● median of N^2 numbers across N machines	● #面试题#有100个database，每个存1 million data，如何求出median number of 这些数。
● 请教可以在线练习 map reduce 的地方？	● 电话面试一个design问题，看看怎么做
● 一道大数据题，求最优解。	● 请教MapReduce怎么找median
● 一道大数据题	● F家onsite面经

相关话题的讨论汇总
话题: reducer话题: median话题: avg0话题: avg1话题: average

未名新帖统计// 7月16日

#	版面	帖数(主题数)
-	全站	4871 (796)
1	Military	3777 (569)
2	Stock	341 (51)
3	Joke	117 (17)
4	History	116 (3)
5	Automobile	100 (9)
6	USANews	55 (9)
7	Midlife	45 (1)
8	Headline	41 (41)
9	Dreamer	33 (13)
10	FleaMarket	32 (20)
11	Living	30 (7)

* 这里只显示发帖超过25的版面，努力灌水吧:-)