A MapReduce question from Google - JobHunting board


d********i
Posts: 582

Question: MapReduce (given a collection of documents, output the words which occur more than 5000 times)
I have never learned MapReduce and don't know where to start writing this code. Can any expert help?

c*****a
Posts: 808

It's just a variation of word count: in the reduce step, check whether the iterator size > 5000 and that should do it.
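
A minimal sketch of that word-count variation in plain Python, with no Hadoop dependency. The function names (`map_phase`, `shuffle`, `reduce_phase`, `frequent_words`) and the in-memory shuffle are assumptions made for illustration; a real framework would spread these steps across machines.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, values, threshold):
    """Reduce: emit the word only if it occurred more than `threshold` times."""
    if len(values) > threshold:  # every value is 1, so the length is the count
        yield word

def frequent_words(documents, threshold=5000):
    """Run map, shuffle and reduce in one process (illustration only)."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    groups = shuffle(pairs)
    return sorted(w for word, values in groups.items()
                  for w in reduce_phase(word, values, threshold))
```

With a small threshold, `frequent_words(["a b a", "a c b"], threshold=1)` returns `['a', 'b']`, since `a` occurs three times, `b` twice and `c` once.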

d********i
Posts: 582

I can't even write word count. The Google paper is too theoretical... I can't turn it into Java code.

s******c
Posts: 1920

See Hadoop's MapReduce:
https://developer.yahoo.com/hadoop/tutorial/module4.html

[d********i wrote:]

: I can't even write word count. The Google paper is too theoretical... I can't turn it into Java code.

d********i
Posts: 582

Is there any code that doesn't use the Hadoop library? An interview probably wouldn't go that deep into the Hadoop lib anyway...

[s******c wrote:]

: See Hadoop's MapReduce:
: https://developer.yahoo.com/hadoop/tutorial/module4.html

f******n
Posts: 279

mark

c*****a
Posts: 808

Here's a Spark version:
// `spark` is assumed to be a SparkContext
val file = spark.textFile("hdfs://documents")
val words = file.flatMap(l => l.split(" ")).map(w => (w, 1)).groupByKey(10000)
  .filter(p => p._2.size > 5000).map(_._1)

s******c
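Posts: 1920

In practice there is little difference.
Hadoop MR is basically a clone of Google's MR.

[d********i wrote:]

: Is there any code that doesn't use the Hadoop library? An interview probably wouldn't go that deep into the Hadoop lib anyway...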

s******t
Posts: 229

First generate key-value pairs with key = each word and value = 1, then combine the values that share the same key, and output every key whose sum > 5000.
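
The combine step described above can be sketched in plain Python as follows. Each mapper pre-sums its own counts (a combiner), so the reducer receives partial sums rather than a stream of 1s and has to add the values instead of counting them. The names `map_with_combiner`, `reduce_sum` and `count_frequent` are hypothetical, and the single-process loop stands in for a distributed shuffle.

```python
from collections import Counter, defaultdict

def map_with_combiner(document):
    """Map + combine: return one (word, local_count) pair per distinct word."""
    return list(Counter(document.split()).items())

def reduce_sum(partials_by_word, threshold=5000):
    """Reduce: sum the partial counts, keep words whose total exceeds threshold."""
    result = {}
    for word, partials in partials_by_word.items():
        total = sum(partials)  # sum, not len: values are no longer all 1
        if total > threshold:
            result[word] = total
    return result

def count_frequent(documents, threshold=5000):
    """Treat each document as the input of one mapper (illustration only)."""
    partials_by_word = defaultdict(list)
    for doc in documents:
        for word, local_count in map_with_combiner(doc):
            partials_by_word[word].append(local_count)  # stand-in for shuffle
    return reduce_sum(partials_by_word, threshold)
```

Note that once a combiner is involved, checking the size of the reducer's iterator is no longer correct: a word that appears three times in a single document arrives as one value of 3.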

f******n
Posts: 279

mark


m*********y
Posts: 111

ding

o*****n
Posts: 189

I looked at MR a while ago but don't remember any of it. Here's a rough attempt:
# MapReduce (filter a collection of documents, the words which occur more than 5000 times)
from collections import Counter

n = 5000
dic = Counter()  # word -> number of occurrences
with open(r'.\MapReduce_filter_repeating_words.txt', 'r') as f:
    for line in f:
        dic.update(line.split())  # split on whitespace and count every word
for k, count in dic.items():
    if count > n:  # "more than" 5000, so strictly greater
        print('-', k, '-', 'show up', count, 'times')

Z**0
Posts: 1119

They are asking you for the MapReduce idea:
map, reduce, filter/emit.
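
One way to make those three steps concrete is the style used by Hadoop Streaming, where the mapper writes tab-separated `word<TAB>1` records, the framework sorts them by key, and the reducer scans the sorted records. The sketch below assumes that record format and uses `sorted` as a stand-in for the framework's sort; the function names `mapper`, `reducer` and `run` are hypothetical.

```python
from itertools import groupby

def mapper(lines):
    """Map: turn raw text lines into 'word<TAB>1' records."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_records, threshold=5000):
    """Reduce + filter/emit: scan key-sorted records, emit words over threshold."""
    parsed = (record.split("\t") for record in sorted_records)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        if total > threshold:
            yield f"{word}\t{total}"

def run(lines, threshold=5000):
    """Chain the steps in one process; sorted() replaces the framework's sort."""
    records = sorted(mapper(lines))
    return list(reducer(records, threshold))
```

Because the reducer only ever sees one key's records at a time, it needs constant memory per word, which is the property the sort-by-key step buys.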

d********t
Posts: 9628

Does Google let you use Python?

[o*****n wrote:]

: I looked at MR a while ago but don't remember any of it. Here's a rough attempt:
: #MapReduce(filter a collection of documents, the words which occur more than 5000 times)
: n=5000
: dic=dict()
: with open('.\MapReduce_filter_repeating_words.txt', 'r') as f:
: for line in f:
: A=line.split()
: for a in A:
: if a in dic.keys():

m*****l
Posts: 95

I was asked this in an interview two years ago and passed with plain pseudo-Java code... Chapter 1 of Hadoop in Action has a template for it.

s*****B
Posts: 32

mark
