Roundup of discussions about mahout - Topic Queen (话题女王)

a**********0
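Posts: 422

I'm reading a really poor tutorial, Mahout in Action.
It gives an example of implementing a recommender system in a Hadoop environment.
Specifically, you have to implement several mappers and reducers yourself; the only Mahout piece used is
the RecommenderJob class, which serves as the top-level driver to glue the mappers and reducers together.
I have a doubt similar to yours: if I'm using Mahout, why do I still implement the mappers and reducers myself?!
My understanding is that the common usage is to write non-MapReduce-style code ourselves, i.e. a driver. This
driver defines the input and output and calls the APIs of Mahout's algorithms, feeding in the input and output.
Mahout is just a Java library; in our own code (the driver) we call Mahout's Java classes,
such as a recommender or classifier, then package everything into a jar and run it with the hadoop command. I'm not sure whether that's right.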
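
To make the "driver glues mappers and reducers" idea concrete, here is a minimal sketch in plain Python. It is not Mahout code: it only simulates an item co-occurrence recommender with one map step, one reduce step, and a driver function. The function names and the toy preference data are assumptions made up for illustration, and the real RecommenderJob is understood to chain more stages than this.

```python
from collections import defaultdict
from itertools import permutations

def map_cooccurrence(user_items):
    """Mapper: for one user's item list, emit ((item_a, item_b), 1) pairs."""
    for a, b in permutations(sorted(set(user_items)), 2):
        yield (a, b), 1

def reduce_sum(pairs):
    """Reducer: sum the counts emitted for each key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

def recommend(prefs, user, top_n=2):
    """Driver: wire the map and reduce steps together, then score unseen items."""
    emitted = (kv for items in prefs.values() for kv in map_cooccurrence(items))
    cooc = reduce_sum(emitted)
    seen = set(prefs[user])
    scores = defaultdict(int)
    for (a, b), count in cooc.items():
        if a in seen and b not in seen:
            scores[b] += count
    # Highest score first; ties broken by item name for determinism.
    ranked = sorted(scores, key=lambda item: (-scores[item], item))
    return ranked[:top_n]

# Hypothetical toy preference data: user -> items the user liked.
PREFS = {
    "u1": ["A", "B", "C"],
    "u2": ["A", "B"],
    "u3": ["B", "C", "D"],
    "u4": ["A"],
}
```

Calling `recommend(PREFS, "u4")` ranks items that co-occur with what `u4` already liked. With a library such as Mahout, the map and reduce functions would come from the library and only the driver part would be yours.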

a**********0
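Posts: 422

From thread: JobHunting board - Some questions about mahout

I looked at some Mahout examples and tried running them myself, and found that I still need to implement the mappers and
reducers myself; Mahout's role is only to provide a driver to glue all the map and reduce steps together.
I'm not sure whether my understanding is correct.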

j*******t
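Posts: 223

From thread: JobHunting board - Some questions about mahout

Mahout is updated constantly; the clustering API was refactored recently. So stop reading the book... For this
kind of open-source project, books are almost always out of date... You can join the user mailing list to ask questions.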

n*****3
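Posts: 1584

From thread: Programming board - Does anyone still use mahout?

I think for small data, just use Python/R;
for big data, go straight to Spark, which has MLlib.
Mahout is neither fast nor does it scale up.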

S******y
Posts: 1123

From thread: Statistics board - Has anybody used Mahout?

Has anybody used Mahout for data mining?
Does it require expertise in Java programming?
Just curious...
Thanks!

l*********a
Posts: 42

From thread: Statistics board - Who has "Mahout in Action" book?

Mahout in Action
Sean Owen (Author), Robin Anil (Author), Ted Dunning (Author) et al
Thanks!

l*********a
Posts: 42

From thread: CS board - what does "@Override" mean in the Java code below? Thanks!

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.DataModelBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
import org.apache.mahout.cf.taste.impl.eval.AverageAbsoluteDifferenceRecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.model.GenericBooleanPrefDataModel;
import org.apache.mahout.cf.taste.impl.mo...

S******y
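Posts: 1123

From thread: Statistics board - Sharing: from SAS to Python and R

Thanks, everyone, for the encouragement.
----------------------------------------
Python is a great helper for analysts and data scientists; the earlier you learn it, the sooner you benefit!
----------------------------------------
I looked around, and Mahout and Python now have an integration as well -
https://cwiki.apache.org/confluence/display/MAHOUT/Using+Mahout+with+Python+via+JPype
However, I have only called Mahout from the command line. This is something to try later.
----------------------------------------
About Spark: a Bay Area company has recently built R package(s) to call the Spark engine. I watched
their demo. Quite good.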

f********x
Posts: 99

From thread: DataSciences board - Help! how to run python programs on a hadoop cluster

It's best to run on existing open-source projects rather than implementing everything from scratch. For example,
1. Mahout http://mahout.apache.org/
e.g. https://github.com/tdunning/knn/blob/master/src/main/java/org/apache/mahout/knn/
2. GraphLab (www.graphlab.org)
e.g. http://docs.graphlab.org/clustering.html
3. Other projects (such as Facebook Giraph, Intel Graphbuilder and so on)

s****h
Posts: 3979

From thread: DataSciences board - questions about SVD and ALSWR for collaborative filtering

two questions:
1.
For recommendation engine based on collaborative filtering, the result of
ALSWR in Mahout would be very similar to result of SVD in MLlib of spark,
right?
As the SVD with spark + MLlib performance is very good, can we forget about
ALSWR in Mahout?
2.
How to evaluate SVD?
My understanding: for a known user/item matrix M, we remove some of the
known user/item pairs and get a new matrix M1, then do the SVD for M1 and get
the reconstructed matrix M2. Comparing removed user/item pairs ...
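
The hold-out evaluation described in question 2 can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: it factorizes with plain SGD instead of Spark MLlib's SVD or Mahout's ALS-WR, the ratings are made-up toy data, and the names `factorize` and `rmse` are hypothetical.

```python
import math
import random

def factorize(ratings, n_users, n_items, rank=2, epochs=200, lr=0.02, reg=0.05, seed=0):
    """Fit user and item factors to known (user, item, rating) triples with SGD."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(pu * qi for pu, qi in zip(P[u], Q[i]))
            for f in range(rank):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

def rmse(P, Q, ratings):
    """Root mean squared error of reconstructed ratings on the given triples."""
    se = sum((r - sum(pu * qi for pu, qi in zip(P[u], Q[i]))) ** 2 for u, i, r in ratings)
    return math.sqrt(se / len(ratings))

# M: known ratings (toy data). Hold out some pairs to form M1, fit on M1,
# then compare the reconstruction (M2) with the held-out values.
M = [(0, 0, 5), (0, 1, 4), (0, 2, 1), (1, 0, 4), (1, 1, 5), (1, 3, 1),
     (2, 2, 5), (2, 3, 4), (3, 2, 4), (3, 3, 5), (2, 0, 1), (3, 1, 1)]
rng = random.Random(42)
shuffled = M[:]
rng.shuffle(shuffled)
held_out, M1 = shuffled[:3], shuffled[3:]

P, Q = factorize(M1, n_users=4, n_items=4)
train_rmse = rmse(P, Q, M1)      # error on the pairs the model saw
test_rmse = rmse(P, Q, held_out) # error on the removed pairs
```

The number of interest is `test_rmse`: it measures how well the reconstruction predicts the removed pairs. A large gap between `train_rmse` and `test_rmse` hints at overfitting.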

b***i
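Posts: 10018

From thread: JobHunting board - Hiring for big data positions at a San Francisco pre-IPO company

There are two positions, Data Engineer and Data Research Engineer. H-1B and green card sponsorship is provided. Junior candidates are also
welcome.
If interested, please send your resume to t*******[email protected]. Thanks.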
Senior Data Engineer at Tapjoy in San Francisco, CA
Sr. Data Engineer
About Us:
Tapjoy is a mobile value exchange platform, driving personalized app
discovery for consumers, customer acquisition and engagement for app and
brand advertisers, and rich monetization for innovative developers. The
Tapjoy network spans over 20,000 apps and 800 million global consumers on
iOS, Android and Windows Phone. ...

b***i
Posts: 10018

From thread: JobHunting board - Hiring for big data positions at a San Francisco pre-IPO company

Just added the job description for Data Research Scientist:
Research Scientist - Machine Learning at Tapjoy in San Francisco, CA
About Tapjoy
A SF-based Private company, Tapjoy (www.tapjoy.com) is the leader in
discovery, engagement, and monetization services for mobile applications.
The company's turnkey in-app advertising platform helps developers, agencies
and brands acquire cost-effective, high-value new users, drive engagement
within their applications, and create incremental income by providing an ad-fu...

a*****e
Posts: 911

From thread: JobHunting board - Google (狗狗) series 2

Q&A:
Q: Suggestion for projects
A: Project: depends on your passion. For example, if you are interested in big
data and machine learning, which are trending, you can check Apache open source projects
like hadoop, hbase, or mahout. Mahout has the smallest code base and a list of
usages. I think you can start from there, do some research on how industry
uses AI and what problems they are solving, and then follow with an
implementation of a similar problem but with a much smaller data set. You can
load your program to...

l*n
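Posts: 529

From thread: JobHunting board - Seeking a referral (C++, numerical methods, machine learning)

If you do ML, basically nobody writes the lower-level numerical stuff themselves.
http://acs.lbl.gov/software/colt/
http://sourceforge.net/projects/parallelcolt/
https://github.com/apache/mahout/tree/trunk/math
Take Mahout, for example: much of its math is built on colt.
As for R, matlab and the like, everyone knows they can directly use lapack, linpack, etc., which matured n years ago.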

A*********t
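Posts: 64

From thread: JobHunting board - What is Hama all about?

Hama is an open-source Pregel: it does graph partitioning on top of HDFS, then does message passing followed by
local computation, over and over, until the answer is computed. It abandons MapReduce, and it even claims to have
advantages over MapReduce in some respects.
So,
what advantages does it have over MapReduce? They boast that it runs k-means much faster than Mahout. Is that really true?
Why does that project always seem odd? Its jira basically has one person committing (!). Is there
some problem behind it? It basically goes:
1. I find a problem.
2. I provide a patch.
3. I commit.
Why are there no interactions?
And why is Mahout so popular? People keep posting to the mailing list and people keep committing.
Could anyone who knows the inside story explain?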

d********w
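Posts: 363

From thread: JobHunting board - Which high-tech startups are hottest in Silicon Valley in 2015?

Which high-tech startups are the hottest in Silicon Valley?
In Silicon Valley everyone talks enthusiastically about startups and opportunities. Through my own observation and experience, I have also seen quite a few
hot startups that have emerged in recent years. Here is a list: the Wall Street Journal website's ranking of startups worldwide
by funding (http://graphics.wsj.com/billion-dollar-club/). Its original title is billion startup club. I also shared it in a talk in China last year; in less than a year, as of January 17, 2015, the rankings and sizes have changed a lot. First, seven companies are now valued at 10 billion, whereas a year ago there were none. Second, number one is Xiaomi, a household name in China. Third, the vast majority of the top 20 (80%) are in the US, in California, in Silicon Valley, in San Francisco, for example Uber, Airbnb, Dropbox, Pinterest. Fourth, quite a few succeeded with similar models: Flipkart is the Taobao of the Indian market, and Uber and Airbnb both belong to the sharing economy. So you can still look for big opportunities in mobile (Uber), big data (Palantir), consumer internet, communications (Snapchat), payments (Square), and O2O apps. Many of these companies I have personally...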

e********2
Posts: 495

From thread: JobHunting board - Which has better prospects, the full stack track or the backend track?

mahout algorithms:
https://mahout.apache.org/users/basics/algorithms.html

w***g
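Posts: 5958

From thread: Programming board - Two examples of doing algorithms in Java, offered up for criticism

Mahout, oh Mahout. If you give examples, give ones that C++ can't handle. At scale, it buries you by piling on machines.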

d****i
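Posts: 4809

From thread: Programming board - Which is more worth learning, Hadoop or Python's data analysis packages?

Hadoop doesn't really involve any math, does it? You probably mean Mahout. Even with Mahout, the little bit of
math involved is very simple and superficial.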

B*****g
Posts: 34098

From thread: Programming board - Which is more worth learning, Hadoop or Python's data analysis packages?

Come on, really? Mahout moved to Spark long ago.
http://mahout.apache.org/

S******y
Posts: 1123

From thread: Statistics board - big data analysis in Revolution R

Interesting topic :-)
Many people think that there would be such a thing coming that user could
simply plug in R or SAS and make all existing functions/packages/procedures
to run on Hadoop-scaled data and "solve" the ultimate data size problem.
Unfortunately, there is no such thing. To achieve that, somebody has to
virtually rewrite every R package or every SAS/STAT procedure since most of
their underlying code/algorithms are simply not map-reduce compatible.
That is industry-scaled development...

n*****3
Posts: 1584

From thread: Statistics board - big data analysis in Revolution R

Nice, thanks for sharing with us.
May I ask: what if you want some other algorithms which
are NOT part of mahout? Do you write the algorithm from scratch? Will that be easy
in the mahout environment?

S******y
Posts: 1123

From thread: Statistics board - [Old post reposted] Python and R study guide

Quite a few people have written to ask which textbooks to read.
Here is a guide I wrote earlier this year
FYI -
==============================================
Python and R study guide (a good list of resources I have compiled)
==============================================
Python
You can first decide whether to go with V2.7 or V3, then stick to it (this saves trouble later :-)
http://www.python.org/
Download and installation
http://www.amazon.com/Beginning-Python-Professional-Experts-Pro
A very good beginner-to-intermediate textbook; the author is a CS professor in Europe (covers CS concepts)
http://developers.google.com/edu/python/
A very good tutorial. The part on Regular Expression...

dy
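Posts: 12

From thread: History board - Re: Can African elephants be tamed?

There were two kinds of African elephants in antiquity. "Forest elephants" (Loxodonta africana
cyclotis) came from the forests around the Red Sea and the Atlas Mountains region; they were
smaller than Indian elephants and could be tamed, and this kind is now almost extinct. The
other kind, the African elephants we see today, the so-called "bush elephants" (Loxodonta
africana), were not known in antiquity. The Ptolemaic Kingdom and Carthage,
being far from where Indian elephants lived, mainly used African elephants in their armies. In terms of
use, an Indian elephant could carry a driver (mahout) and a howdah,
in which archers, javelin throwers and the like could sit, whereas African elephants, being
smaller, could not carry a howdah (this point is still disputed).
Sources:
"elephants" entry, "Oxford Classics Dictionary", Oxford
University Press 1996
"Roman Warfare", by Adrian Goldsworthy, Cassell 2000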

z*******3
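Posts: 13709

From thread: Military board - Why doesn't China have tycoons supporting open-source software?

These are Apache's top projects.
If you can understand them, you'll know what open source is doing.
But if you work on things like OSes, you probably won't understand what these are for.
These communities are all very active, and their version numbers keep getting updated.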
Abdera Accumulo ActiveMQ Ant Aries Apache HTTP Server APR Avro Axis
Bloodhound Buildr Camel Cassandra Cayenne Chemistry Click CloudStack Cocoon
Continuum Cordova CouchDB cTAKES CXF Deltacloud Derby Directory Empire-db
Felix Flex Forrest Geronimo Gora Gump Hadoop Hama Hive HBase Isis Jackrabbit
James JMeter Kafka Lenya Mahout Marmotta Maven MINA mod_perl MyFaces ODE
OFBiz OpenEJB OpenJPA OpenNLP OpenOffice POI Pivot...

l*******e
Posts: 55

From thread: Classified board - Looking for Data Scientist -- NYC only

We are a consulting company looking for part-time Data Scientists based in
NYC. Working remotely is fine but we do need to meet in person once in a while
for project discussion.
We are looking for part-time staff or interns. Ideal candidates should have strong
knowledge of Machine Learning and Data Mining, hands-on experience with
popular machine learning tools such as R, Matlab, Weka, Mallet, Hadoop, and
Mahout, and strong programming skills (c/c++, java, python, php).
If interested, please send an email abo...

r*******i
Posts: 14

From thread: Classified board - Hiring Sr Data Scientist

If interested, please contact me through the site's messaging.
-----------------------------
Expedia has a very exciting and challenging opening for a Sr. Data Scientist.
We work on the hotel sort algorithm for one of the world’s largest travel
sites and this is an opportunity to be an anchor member of the core team.
We are looking for someone with advanced training in machine learning and
computer science. You should be the type of person that attends (or wants
to attend) SIGKDD every year and may have competed on a top team in the KDD
cup. You sh...

r*******i
Posts: 14

From thread: Classified board - Expedia hiring Sr Data Scientist

r*******3
Posts: 35

From thread: JobHunting board - Job opening - JAVA Engineer, Software Engineer, etc.

Ask.com, Oakland, CA
Java Engineer
Responsibilities
Hands-on end-to-end development to create new technical solutions and evolve
our current question and answer site.
Provide architectural, design and engineering leadership to influence Java
and client-side solutions
Work closely with other engineering teams to define and develop solutions to
real-world problems
Perform research and development to evaluate new technologies, ideas and
communicate value for company
Required Experience
Java - 5 ye...

b*****o
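Posts: 715

From thread: JobHunting board - What are the prospects for RockMelt as a company?

A recruiter contacted me recently about a role there, related to scalable systems, distributed computing, machine
learning (Hadoop Mahout), and recommendation systems.
I have never used a browser like rockmelt. Does this kind of startup that lives off social networks have
any future?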

n****t
Posts: 241

From thread: JobHunting board - Adobe is hiring; interested friends can send me a resume

Email: qiruian@gmail
Basic position information:
Position: Computer Scientist
Business Unit: Digital Media
Location: San Jose, CA
Req ID: 16588
Requirements:
Position Summary
Adobe is looking for a self-motivated development engineer to join the
globalization team. To expand international markets, the globalization team
explores emerging technologies and delivers cross-language solutions for
Adobe products. In particular, we offer internationalization and linguistic
web services for our cloud product offerings. The successful ca...

n****t
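Posts: 241

From thread: JobHunting board - Adobe internal referral opportunity

Posting once more; I hope this time it isn't a waste of time.
Right now the best candidate is an Indian applicant (machine learning background), whom I have already held off,
and my boss is now willing to collect resumes for one more week. If you are interested and think your background fits, send me a message.
Since this is hiring for my own team, I can't guarantee that everyone who writes to me will be referred; I can only pick two
with matching backgrounds to recommend within the team.
Personal email: q*****[email protected]
The position is in San Jose (Bay Area).
The job description is: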
Position Summary
Adobe is looking for a self-motivated development engineer to join the
globalization team. To expand international markets, the globalization team
explores emerging technologies and delivers cross-language solutions for
Adobe products. In particular, we offer internationalizati...

s****r
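Posts: 24

From thread: JobHunting board - Data scientist job opportunity near Chicago

We are still interviewing. Many applicants would need next October's H-1B before they could work, which feels like a pity. Abroad
it's a visa status problem, at home it's the hukou (household registration); it feels like being at the mercy of circumstances.
The JD says a PhD is required, but much of that is negotiable. Stronger programming is preferable, though; really you only build prototypes or
do research, so the code need not be very polished, and you should have independent project experience. But because this office is newly opened, as
a DS you may not have engineering support, and a lot of work such as data cleaning has to be coded yourself, in Java etc. on
hadoop. You can then implement algorithms yourself; for big data you may have to recast them in map/reduce
form or use mahout, and if the final aggregated data is not large, you can just use R. The work environment
is fairly relaxed. You get exposure to many of the company's current products and then look for opportunities where data analysis and mining
can generate profit, and so on. At present our satellite office is fairly well supported by the technology department; the
downside is that it is a satellite office, far from headquarters.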

C***o
Posts: 68

From thread: JobHunting board - Career Path to G, F, A

I am not CS background but I am curious to know if my skill sets will fit
any type of roles in Google or other IT companies?
My background:
- Math/Stat background and strong analytic skill
- Data modeling/mining on Web based data for customer analytics
- Intermediate programming skill mostly VBA and SAS so far
- Advanced SQL
- Experience with Hadoop/Hive, MapReduce, Pig and Mahout
If you can think of any type of roles that need above skill set + some other
skills, please let me know what else sh...

w********p
Posts: 948

From thread: JobHunting board - Data Scientist/machine learning (Min. $140K)

I just came across this and am reposting it. If anyone gets this offer, don't forget to come back and hand out baozi.
Craig Quintal, PHR has sent you a message.
Date: 1/29/2013
Subject: Referral Request - Data Scientist - machine learning
Hi!
I wanted to check in and see if you may know someone within your own
personal network who is looking for a new opportunity. I am looking for a
data or research scientist with a good foundation in data mining and who
wants to be part of a very profitable and growing start-up opportunity in
the financial district of SF (150+ employees). M...

j*******t
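Posts: 223

From thread: JobHunting board - How to learn Big Data / Hadoop (career change from Marketing, Business Management)

PAC learnability, VC dimension and the like are too theoretical, and the assumptions of PAC learnability are too strong and too far
from practice. SVM is quite useful, though. But if you don't need to go very deep, you can just use Mahout's implementation.
To learn Hadoop you need at least some Java basics and the basic concepts of MapReduce. Of course, without enough
data to practice on, it's all just armchair theorizing.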

r******g
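Posts: 13

From thread: JobHunting board - Looking for tech blogs on recommendation systems in machine learning

RAIL, thanks a lot; I thought nobody would reply to the post. I once heard an expert say that Amazon's approach is outdated and that new techniques came out in 2006,
which were probably prompted by the Netflix Prize. While I'm at it, here is what I found myself: Greg Linden's Amazon
recommendation report, which is fairly old, and also Mahout, which seems more practical than Weka and focuses on
scalability. One more question: are there any great papers on the ranking problem? Google keeps its ranking technology secret.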

b*****t
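Posts: 296

From thread: JobHunting board - A few of the questions I was asked in recent interviews

1.
Naive method:
If the two matrices differ greatly in size, put the smaller one in the DistributedCache, then:
map stage:
input
output
reduce stage: (can only use 1 reducer)
output:
If the two matrices are about the same size, Mahout has some source code for matrix multiplication, or refer to Hama; both are
Apache open-source projects. I haven't read them, but it shouldn't be hard.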
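
The input and output details of the naive method did not survive in the post, so the following is only a guess at what it might look like, written as a plain-Python simulation rather than Hadoop code. It assumes the large matrix A arrives as sparse `(row, col, value)` records, the small matrix B sits in memory the way a DistributedCache file would, and the reduce step sums partial products per output cell. All names here are hypothetical.

```python
from collections import defaultdict

def map_multiply(a_entry, cached_b):
    """Mapper: one A entry (i, k, value) is joined against the cached row k of B."""
    i, k, a_ik = a_entry
    for j, b_kj in cached_b.get(k, {}).items():
        yield (i, j), a_ik * b_kj

def reduce_sum(emitted):
    """Reducer: add up the partial products that share the same output cell (i, j)."""
    cells = defaultdict(float)
    for key, value in emitted:
        cells[key] += value
    return dict(cells)

def multiply(a_entries, cached_b):
    """Driver: run the map step over every A entry, then reduce."""
    emitted = (kv for entry in a_entries for kv in map_multiply(entry, cached_b))
    return reduce_sum(emitted)

# A is the large matrix, streamed as sparse (row, col, value) records.
A = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0), (1, 1, 4.0)]
# B is the small matrix, held in memory as {row: {col: value}}.
B = {0: {0: 5.0, 1: 6.0}, 1: {0: 7.0, 1: 8.0}}
C = multiply(A, B)
```

Because each mapper only needs its own A record plus the cached B, the map step parallelizes freely; the summation per `(i, j)` key is the only place where records must be brought together.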

N******t
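Posts: 43

From thread: JobHunting board - Sharing some experience and lessons learned

Thanks, OP, this really makes sense.
May I ask how you usually choose an open source project: ones related to your own work, or
ones that are in high demand right now, such as the Apache ones, Hadoop, Mahout, and so on?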

c***z
Posts: 6348

From thread: JobHunting board - What do I need to learn to do big data?

unix + java + hadoop + hive + mahout + pig

w*********m
Posts: 4740

From thread: JobHunting board - Open-source projects related to data analytics

R, mahout, etc.

w*********m
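Posts: 4740

From thread: JobHunting board - May I ask what San Ye (三爷) is learning these days?

In the past, ML data volumes were small, or the data was large but could be sampled before use.
Now the dimensionality is too high, often hundreds of thousands of dimensions, so people want to train on large volumes of data.
Mahout is ML implemented on top of hadoop.
But because of shortcomings in hadoop's design, namely the lack of communication between machines, it cannot support ML very well.
So some new things have appeared to solve this problem, such as spark and graphlab.
The definition of the term data mining is vague. Some people think counting is data mining. Others think only ML and optimization
count as data mining.
Counting, computing variance/mean, finding the median, and even matrix computation can all be implemented with hadoop.
But many ML algorithms iterate many times until they converge, and also have to load a
huge intermediate model into the distributed cache. Moreover, machines cannot easily communicate, so global information is hard to obtain (optimization means
finding the optimum over all the data). As a result you can only trade off by using stochastic methods, and the
communication cost and the problems it causes are enormous.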
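
The iteration problem described above can be illustrated with k-means, where every pass over the data corresponds to one full map/reduce round. This is a minimal single-process sketch, not Hadoop or Mahout code; the function names and sample points are assumptions for illustration. In a real cluster, the `centroids` list plays the role of the intermediate model that has to be shipped to every mapper on every round.

```python
from collections import defaultdict

def assign(point, centroids):
    """Map step: emit (index of nearest centroid, point)."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, centroid)) for centroid in centroids]
    return dists.index(min(dists)), point

def recompute(assignments, old_centroids):
    """Reduce step: average the points assigned to each centroid."""
    groups = defaultdict(list)
    for idx, point in assignments:
        groups[idx].append(point)
    new = []
    for idx, old in enumerate(old_centroids):
        pts = groups.get(idx)
        if not pts:
            new.append(old)  # keep an empty cluster's centroid unchanged
        else:
            new.append(tuple(sum(dim) / len(pts) for dim in zip(*pts)))
    return new

def kmeans(points, centroids, max_rounds=20):
    """Driver: each loop pass stands in for one full MapReduce job."""
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        updated = recompute((assign(p, centroids) for p in points), centroids)
        if updated == centroids:  # converged: the model stopped changing
            break
        centroids = updated
    return centroids, rounds

POINTS = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
final_centroids, rounds_used = kmeans(POINTS, [(0.0, 0.0), (10.0, 10.0)])
```

Even on this tiny input the driver needs a second round just to detect convergence. On Hadoop each round would mean a separate job launch plus a reload of the model, which is the overhead that in-memory systems such as Spark try to avoid.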

c***z
Posts: 6348

From thread: JobHunting board - How much of a future does machine learning have?

I don't know about Pattern.
People generally use mahout, I think.

j*******t
Posts: 223

From thread: JobHunting board - Some questions about mahout

Which part are you using? Many algorithms are already implemented.

j*******t
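Posts: 223

From thread: JobHunting board - What is Hama all about?

Hama is based on the BSP computing framework (Pregel and its corresponding open-source version Giraph are also BSP-based). The BSP
framework was proposed in the 1980s by Leslie Valiant and others (Valiant is the 2010 Turing Award winner). Compared with MapReduce, BSP
is better suited to iterative computation.
A typical BSP-based program is divided into multiple iterations, each consisting of Local
computation, Communication, and Synchronization phases (for details, see the
relevant websites).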
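
The three phases of one BSP iteration can be sketched with a toy example: propagating the maximum value through a small graph. This is a single-process simulation written for illustration, not Hama or Giraph code; the function name `bsp_max`, the message format, and the sample graph are all assumptions.

```python
def bsp_max(graph, values, max_supersteps=50):
    """Propagate the maximum value through a graph in BSP supersteps.

    graph: {vertex: [neighbors]}, values: {vertex: number}.
    Returns (final values, number of supersteps executed).
    """
    values = dict(values)
    # Superstep 0 input: every vertex "receives" its own value so it is active.
    inbox = {v: [values[v]] for v in graph}
    supersteps = 0
    while inbox and supersteps < max_supersteps:
        outbox = {}
        for v, messages in inbox.items():
            # 1. Local computation: combine incoming messages with local state.
            best = max(messages)
            changed = best > values[v] or supersteps == 0
            values[v] = max(values[v], best)
            # 2. Communication: queue messages for neighbors if state changed.
            if changed:
                for n in graph[v]:
                    outbox.setdefault(n, []).append(values[v])
        # 3. Synchronization: the barrier; queued messages become visible
        #    only in the next superstep.
        inbox = outbox
        supersteps += 1
    return values, supersteps

# Hypothetical chain graph a - b - c - d.
GRAPH = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
final_values, steps = bsp_max(GRAPH, {"a": 3, "b": 1, "c": 4, "d": 2})
```

A vertex never reads another vertex's state directly; it only sees messages delivered at the barrier. The computation halts on its own once no vertex has anything new to send, without restarting a job for each iteration.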
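Compared with Google's Pregel and the other open-source version Giraph, which specifically target graph computation, Hama is a more
general computing framework: it has a Graph API, and it also lets you write more general iterative algorithms, such as
KMeans, EM, PageRank, etc. In addition, to further improve computational efficiency, Hama is currently considering adding
GPU-assisted computation.
Another very similar framework is Spark: if the data (RDD) is loaded into memory (cached), Spark is also very efficient at
iterative computation.
Hama's community is still small at present, so it looks rather quiet. Mahout's community is much larger, and it is currently considering adding
Spark-based algorithms, so ...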

e*****s
发帖数: 121

来自主题: JobHunting版 - Hadoop Spark 学习小结[2014版]

还有mahout, 太难用了。
