w**z posts: 8232 | 1 On Monday, IBM announced it will invest about $300 million over the next few
years and assign 3,500 people to help develop an up-and-coming technology
known as Spark.
IBM called Spark "the most significant open source project of the next
decade."
This was very good news for a two-year-old startup called Databricks,
founded by the people who invented Spark, which today officially
launched its commercial version of Spark.
Spark is a free and open source software program managed by the Apache
Software Foundation, the organization that runs many open source projects.
In the past year or so, it has become a phenom in the world of big data
computing, where companies collect huge amounts of information, store it on
low-cost commodity computers, and use a variety of free software to work
with the data.
Spark has gained attention because it crunches through vast amounts of data
super-fast. It won a contest in 2014 known as the Gray Sort Benchmark, which
measures how fast a system can sort 100 TB of data, or 1 trillion records.
Spark took 23 minutes, smashing the previous record of 72 minutes. It is
also popular because Spark can be used with other big-data technologies,
especially the popular method of storing lots of data, Hadoop.
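The benchmark figures quoted above can be checked with a little arithmetic. The sketch below uses only the numbers in the text and assumes decimal terabytes (1 TB = 10^12 bytes); the function names are illustrative.

```python
# Figures quoted in the article: 100 TB (1 trillion records) sorted in
# 23 minutes, versus the previous record of 72 minutes.
TOTAL_BYTES = 100 * 10**12   # 100 TB, assuming decimal terabytes
TOTAL_RECORDS = 10**12       # 1 trillion records
SPARK_MINUTES = 23
PREVIOUS_MINUTES = 72


def bytes_per_record(total_bytes: int, records: int) -> float:
    """Average record size implied by the two totals."""
    return total_bytes / records


def throughput_tb_per_minute(total_bytes: int, minutes: float) -> float:
    """Sort throughput in decimal terabytes per minute."""
    return total_bytes / 10**12 / minutes


def speedup(old_minutes: float, new_minutes: float) -> float:
    """How many times faster the new run is than the old one."""
    return old_minutes / new_minutes


if __name__ == "__main__":
    print(f"bytes per record: {bytes_per_record(TOTAL_BYTES, TOTAL_RECORDS):.0f}")
    print(f"throughput: {throughput_tb_per_minute(TOTAL_BYTES, SPARK_MINUTES):.2f} TB/min")
    print(f"speedup: {speedup(PREVIOUS_MINUTES, SPARK_MINUTES):.2f}x")
```

On these numbers each record is 100 bytes, and the 23-minute run is roughly three times faster than the 72-minute one.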
Spark basically replaces MapReduce, an older method, invented by Google, of
working with data stored in Hadoop. Spark is not the only free
and open source project that replaces MapReduce. Apache Storm is another.
[Photo: Beth Smith, General Manager of Analytics Platform at IBM, leads
the investment into Spark.]
But, as IBM's huge commitment on Monday shows, Spark is becoming the rising
star in this world. That's because it crunches through data almost the
instant that data is collected. And it makes it easy for developers to write
apps that take advantage of real-time big data analysis.
Why IBM is jumping in with both feet
Spark was started in 2009 by Matei Zaharia, now CTO of Databricks, as part
of his PhD and was developed at UC Berkeley's AMPLab. Databricks'
CEO is Ion Stoica, a UC Berkeley professor and co-director of AMPLab.
AMPLab is a research center for computer algorithms and machine learning.
IBM is one of the four founding members of AMPLab, although a long list of
other big names in computer science contribute to the lab, too. (The other
founders are Amazon Web Services, Google, and SAP).
Interestingly, Spark won its famous speed contest by using Amazon Web
Services, IBM's nemesis cloud competitor. And, on Monday, Databricks
announced that its commercial version of Spark was officially open for
business, also on Amazon's cloud.
Clearly, IBM is not willing to let AWS run away with Spark.
On the contrary: IBM will be offering Spark on its cloud service, IBM Bluemix.
It will also be baking Spark into IBM's Watson Health Cloud and
developing a special machine-learning version of Spark known as IBM SystemML.
IBM will release SystemML as a free-and-open language, too, working with
Databricks to do so.
On top of that, IBM says it will train more than a million data scientists
and data engineers on Spark through extensive partnerships with AMPLab and
other online education outlets. And IBM will open a Spark Technology Center
in San Francisco to support Spark/machine learning projects.
While IBM's full-throttle commitment to Spark is enormous, it's not the only
big company doing this type of thing. In February, Intel announced a
partnership with Databricks to get Spark to work better on machines that run
on Intel processors.
Besides winning a contest, Spark has already been adopted by some big names,
including Airbnb, eBay, Groupon, MyFitnessPal, OpenTable, Pinterest,
Independence Blue Cross of Philadelphia, NASA, and the SETI Institute,
to name a few.
Since the free version of the project first launched in 2009, more than 400
developers have contributed to Spark, and this week, the Spark community is
holding a conference in San Francisco expected to attract 2,000 people,
nearly double the number from last year.
All of this has helped Databricks land an all-star line-up of backers and
board members. It raised $47 million in two rounds led by Andreessen
Horowitz (Ben Horowitz), with participation from New Enterprise Associates.
Its board also includes UC Berkeley professor Scott Shenker, known as the
co-founder and former CEO of Nicira, which sold for $1.26 billion to VMware in
2012. | N*****m posts: 42603 | 2 Is Databricks still hiring? Heh.
[In reply to w**z]
| d******e posts: 2265 | 3 That's the right move. With Spark expanding, the prospects look great.
[In reply to w**z]
| p*****2 posts: 21240 | 4 Last time they even invited me over for a steak dinner.
[In reply to w**z]
| L****8 posts: 3938 | 5 "Spark has an advanced DAG execution engine that supports cyclic data flow
and in-memory computing"
Damn. Where else would you compute if not in memory, in thin air?
[In reply to w**z]
| c******o posts: 1277 | 6 A $300M budget
and a promise to train more than 1M data scientists,
so $300 per data scientist. | T*******x posts: 8565 | 7 That's a bit low.
[In reply to c******o]
| c*********e posts: 16335 | 8 Didn't IBM say a while back that it was going to lay people off? What happened, no more layoffs?
[In reply to w**z]
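| z*******3 posts: 13709 | 9
Cloud and big data are where IBM is headed.
It lays off engineers in other areas, such as hardware, and frees up the headcount for people in these fields.
Didn't it just pay someone to take its chip business off its hands?
[In reply to c*********e]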
| B*****g posts: 34098 | 10 Start a thread saying Flink beats Spark and he'll show up, hehe | N*****m posts: 42603 | 11 Not anymore, he's off counting his money.
Flink is far behind now.
[In reply to B*****g]
| d*******r posts: 3299 | 12 So Flink isn't even worth rebutting anymore?
zhaoce, what do you think?
[In reply to N*****m]
| p*u posts: 2454 | 13 genius, Spark has all the data in memory, instead of reading from/writing to
disks.
[In reply to L****8]
| L****8 posts: 3938 | 14 Damn, on Linux just use tmpfs to put the data in a directory that lives in memory, then compute on it.
The crude approach is just as fast.
On Windows you can use a ramdisk.
[In reply to p*u]
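A minimal single-machine sketch of the distinction being argued here. It does not use Spark or tmpfs; it only contrasts re-reading and re-parsing a file on every pass (which still happens when the file sits in a memory-backed directory) with parsing once and reusing the in-memory objects, which is roughly what Spark's `cache()`/`persist()` offer for repeated passes. The file layout and function names are assumptions made for illustration, and the sketch says nothing about the distributed or fault-tolerance side of the argument.

```python
import json
import os
import tempfile


def write_records(path: str, n: int) -> None:
    """Write n JSON records, one per line, standing in for a dataset on disk or tmpfs."""
    with open(path, "w", encoding="utf-8") as f:
        for i in range(n):
            f.write(json.dumps({"id": i, "value": i % 7}) + "\n")


def load_records(path: str) -> list:
    """Read and parse every record; this cost is paid on each uncached pass."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]


def passes_rereading(path: str, passes: int) -> tuple:
    """Re-read and re-parse the file on every pass. Returns (result, number of loads)."""
    loads, total = 0, 0
    for _ in range(passes):
        records = load_records(path)
        loads += 1
        total = sum(r["value"] for r in records)
    return total, loads


def passes_cached(path: str, passes: int) -> tuple:
    """Parse once, keep the objects in memory, reuse them. Returns (result, number of loads)."""
    records = load_records(path)  # the single load
    total = 0
    for _ in range(passes):
        total = sum(r["value"] for r in records)
    return total, 1


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "records.jsonl")
        write_records(path, 1000)
        print(passes_rereading(path, 5))
        print(passes_cached(path, 5))
```

Both variants compute the same result; they differ only in how many times the data is read and parsed, which matters most for iterative jobs that make many passes over the same dataset.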
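| z*******3 posts: 13709 | 15 How could it not be worth rebutting?
Diversity is good, and software products need diversity most of all.
One dominant player is bad for everyone.
At this stage Flink has not been officially released; it feels a bit like how Spark looked to us back when we were working on Storm.
If you want to contribute, though, now is a very good time to get involved in Flink.
Spark is overcrowded. Join now and at most you'll be a user; they don't need your contributions anyway.
Spark has its own problems. Its streaming, for example, is not great and has design flaws.
RDDs are a good thing, but turning everything into an RDD is another matter,
just as single-threadedness is easy, but making everything single-threaded
is another matter. Flink's core is streaming. If you have a feel for Scala and Java,
you should be able to sense that streaming looks like the future. Streaming all the way through feels wonderful,
completely unobstructed. vert.x and Flink both emphasize streaming, and so does all that Scala stuff.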
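The post does not say which design flaw it has in mind. One commonly cited difference, assumed here, is that Spark Streaming groups incoming records into small batches (micro-batches), while a streaming-first engine such as Flink handles records one at a time. A toy sketch of the two styles in plain Python, with made-up function names and count-based batches for simplicity:

```python
from typing import Callable, Iterable, Iterator, List


def record_at_a_time(events: Iterable[int], handle: Callable[[int], int]) -> Iterator[int]:
    """Handle each event as soon as it arrives (the streaming-first style)."""
    for event in events:
        yield handle(event)


def micro_batches(events: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Group events into small batches; nothing is emitted until a batch fills up
    or the input ends. Real systems usually cut batches by time, not by count."""
    batch: List[int] = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch


def micro_batched(events: Iterable[int], handle: Callable[[int], int],
                  batch_size: int) -> Iterator[List[int]]:
    """Process each batch as one small job and emit its results together."""
    for batch in micro_batches(events, batch_size):
        yield [handle(event) for event in batch]


if __name__ == "__main__":
    def double(x: int) -> int:
        return x * 2

    print(list(record_at_a_time(range(5), double)))
    print(list(micro_batched(range(5), double, batch_size=2)))
```

In the micro-batch version the first event's result is not available until its whole batch has been collected, which is the latency cost usually raised against that design.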
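Compared with Flink, vert.x has the bigger opportunity.
vert.x replacing Akka looks like the general trend; Akka becomes painful for anything real-time that is even slightly complex.
Chaining several actors together for a workflow is fine, but doing gaming with it is painful.
Also, compared with data, I increasingly think visualization has the brighter future.
Did you see that Tableau referral thread? Tableau's offer packages beat same-level FLG.
80% of human perception comes from vision, hearing takes another 10%, and the rest, touch and so on,
is not much use apart from jerking off. And both vision and hearing are about artistic effect.
Data processing is math, and turning math into productivity is hard. Software guys generally know nothing but math,
so math is all they do, but most of what actually makes money has something to do with art.
Applied math mostly just covers necessities, and now that there is excess productive capacity, art objects are gradually taking over the market.
You've surely heard the toilet-seat story: why do Chinese people all go to Japan to buy toilet seats?
Because Japanese toilet seats, unlike Chinese ones, are not merely necessities but more like works of art.
Beyond doing the job of a necessity, they show attention to quality and even aesthetics. OK, I admit toilet-seat aesthetics sounds funny.
I have a Japanese-made fruit knife that I've used for almost ten years, and it is still razor sharp; I take it everywhere.
Someday I plan to go to Japan and buy a katana.
Take Apple's products: Android has pretty much all the same features, yet Apple is the one that sells.
Why? Because beyond the functions of a necessity, Apple's products have the value of an art object.
At the very least they look good, right? Girls like them at first sight; as for what's inside or how much memory there is,
they don't care, and they pay up and buy. The software guys slave away at math and physics, and what good does it do?
The latest news says 43% of Android users in China reportedly plan to switch to Apple.
vert.x together with Swift and the like can handle all kinds of visually demanding applications effectively.
I also have a vague feeling that streaming may become mainstream in this area.
The details all seem fine when I think them through, but I lack hands-on practice.
I just don't know how Netflix uses RxJava; it's somewhat similar.
Never mind, let me get the single-machine case working first; networking can wait.
In any case, there's no doubt vert.x has the broader prospects on this front.
You can also contribute to vert.x; from what I see they still have a lot to do.
vert.x 3 is interesting to look at; they make good use of various scripting languages.
For instance, JS shows up more as soon as the web is involved, and Python is used more in the testing parts.
I'll stop here, I keep rambling. That's the gist anyway; think it over yourself.
This area is so new that there are plenty of places to contribute. Learning more and getting involved will also do a lot for your own growth.
[In reply to d*******r]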
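| z*******3 posts: 13709 | 16 For a comparison of Spark's streaming, see these slides:
http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streami
Flink isn't out yet, but judging from its design it shouldn't have similar problems.
I feel the demand for streaming has been growing stronger lately.
What's needed is something that can do streaming for both the front end and the back end.
vert.x is a very good choice, but it still seems short on tooling for NoSQL stores such as C*.
Also, libraries like MLlib can currently only be hosted on Spark, Flink, and the like; vert.x still lacks similar libs.
vert.x is more general-purpose after all, but if you think it through it's not that hard.
It all comes down to the same thing: the MapReduce-style APIs overlap heavily with RxJava.
You could reimplement them with RxJava; the main work is the algorithms, the MLlib part: clustering, SVM, etc.
As for the API, Rx already has flatMap, streaming, and so on; once vert.x matures there is a lot it could do.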
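To make the claimed overlap concrete, here is a small sketch of the classic MapReduce word count written as a flatMap, map, reduce pipeline over a stream of lines. It uses plain Python iterators rather than RxJava or Spark; `flat_map` and `word_count` are illustrative names, not any library's API.

```python
from collections import Counter
from functools import reduce
from itertools import chain
from typing import Callable, Iterable, Iterator, Tuple


def flat_map(fn: Callable, items: Iterable) -> Iterator:
    """Apply fn to each item and flatten the resulting iterables into one stream."""
    return chain.from_iterable(map(fn, items))


def word_count(lines: Iterable[str]) -> Counter:
    """Word count as flatMap -> map -> reduce over a stream of lines."""
    words = flat_map(str.split, lines)             # flatMap: line -> words
    pairs = map(lambda w: (w.lower(), 1), words)   # map: word -> (word, 1)

    def add_pair(acc: Counter, pair: Tuple[str, int]) -> Counter:
        word, n = pair
        acc[word] += n
        return acc

    return reduce(add_pair, pairs, Counter())      # reduce: sum counts per word


if __name__ == "__main__":
    print(word_count(["spark flink storm", "Spark streaming", "flink"]))
```

The same three operators appear, under similar names, in MapReduce-style batch APIs and in reactive stream libraries, which is the overlap the post points to.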
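vert.x, RxJava, and Flink are gradually maturing, and the process is worth studying.
Of course, things like Spark that are already hugely successful are even more worth studying, referencing, and copying.
Knowing how to copy is what makes you more successful. The worst is the idiot who refuses to copy and insists on rolling his own; that kind of fool is asking for trouble.
Streaming doesn't have much to do with HDFS and such; it has more to do with things like Kafka.
The question converngence asked recently is a streaming problem.
Using Redis or the like works, but it's slower because persistence adds an extra layer of I/O. | d*******r posts: 3299 | 17 This post of yours really wanders all over the place :D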
[In reply to z*******3]
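| z*******3 posts: 13709 | 18
It doesn't really wander. In-memory computing on the server has always been a big area,
whether it's used for MLlib or for game servers.
Persistence may not count, but things like MLlib are in essence in-memory computation.
Spark stresses in-memory computation rather than storage, which puts it close to what vert.x typically does.
You can see this from the history of databases: at first, servers were seen as just a mapping of the DB.
Later EJB and the like came along and changed that, and the same is happening with NoSQL now:
at first it was all seen as a mapping of HDFS and such, and now it is gradually shedding that dependency.
[In reply to d*******r]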
| z*******3 posts: 13709 | | p*u posts: 2454 | 20 u r gonna do this on ten thousand distributed servers? and who's gonna fix
it when something goes wrong?
[In reply to L****8]