Learning Pig Latin - Statistics board

r*****d
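Posts: 346

A question for everyone: are there any good methods or materials for learning Pig Latin? I am a complete
beginner with Pig Latin. I searched on Amazon and did not find any book with particularly good reviews.
Thanks!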

D**u
Posts: 288

I am currently learning Pig too; it looks more flexible than Hive.
The best book I have found so far is Programming Pig.
Try the online version, it is pretty helpful.
http://chimera.labs.oreilly.com/books/1234000001811/index.html
The author also provides a cheat sheet, which is very nice too.
http://mortar-public-site-content.s3-website-us-east-1.amazonaw

D**u
Posts: 288

The whole Pig chapter from "Hadoop in Action" is freely available too, and it is very nice.
http://www.manning.com/lam/SampleCh10.pdf

D**u
Posts: 288

That's more than enough I think.

r*****d
Posts: 346

Morale lifted! Thank you Dinu!

c***z
Posts: 6348

You need the following things:
1. An editor. I use Sublime Text 2; the Cloudera package uses gedit.
2. A cluster with Pig installed on the edge nodes. You can use the VM in the
Cloudera package.
3. A file transfer tool to move Pig code from your local drive (if you edit
locally) to the edge node. I use WinSCP; the Cloudera package uses Hue.
4. A way to run the code on the edge node. I use PuTTY; the Cloudera package
uses Hue.
My workflow: write Pig code locally in Sublime Text 2, upload it to the edge
node with WinSCP, and run it on the edge node through PuTTY.
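
A minimal sketch of the upload-and-run steps above, for anyone who prefers a script to GUI tools. It swaps in the OpenSSH command-line tools `scp` and `ssh` for WinSCP and PuTTY, which is a substitution, not what the poster uses. The host name, file name, and remote directory are hypothetical placeholders.

```python
import os
import shlex

# Hypothetical values for illustration only; none of these come from the thread.
EDGE_NODE = "user@edge-node.example.com"
LOCAL_SCRIPT = "wordcount.pig"
REMOTE_DIR = "/home/user/pig"


def upload_command(local_script, edge_node, remote_dir):
    """Build the scp command that copies a local Pig script to the edge node."""
    return ["scp", local_script, "{}:{}/".format(edge_node, remote_dir)]


def run_command(local_script, edge_node, remote_dir):
    """Build the ssh command that runs the uploaded script with Pig on the edge node."""
    remote_script = "{}/{}".format(remote_dir, os.path.basename(local_script))
    # "pig -f <file>" runs a script file; quoting guards against odd file names.
    return ["ssh", edge_node, "pig -f {}".format(shlex.quote(remote_script))]


if __name__ == "__main__":
    for cmd in (upload_command(LOCAL_SCRIPT, EDGE_NODE, REMOTE_DIR),
                run_command(LOCAL_SCRIPT, EDGE_NODE, REMOTE_DIR)):
        # Print rather than execute, so the sketch is safe to run anywhere.
        print(" ".join(cmd))
```

To actually run the steps, each list could be passed to `subprocess.check_call`, assuming `scp` and `ssh` are installed and the edge node accepts your login.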

r*****d
Posts: 346

Many thanks. I knew you would reply :)

c***z
Posts: 6348

Fishing post? :P
Thanks for the baozi.
Additionally, a job tracker might help. The Cloudera package uses Hue, and
I use my company's setup.

c***z
Posts: 6348

Also, my workflow using Scala Scalding (i.e. Scala on Hadoop):
1. Edit and compile in IntelliJ or another IDE, or edit in any text editor and
compile in a terminal (CMD on Windows).
Setting up IntelliJ is complicated and beyond what I can cover here.
2. Upload the jar file to the edge node; I use WinSCP.
Optionally, upload the code to GitHub for version control.
3. Run the jar file on the edge node through PuTTY, with a specific input path,
output path and other arguments. I save these to an .sh file for reuse (you
can also save them to a .txt file and copy and paste).
The command to run the jar is something like
hadoop jar myjar.jar packagename.functionname --input "myinputpath/part*" --output "myoutputpath" --hdfs
4. To get the output into a text file, use something like
hadoop fs -cat "myoutputpath/part*" > myresult.tsv
(I prefer .tsv over .csv because commas can appear in numbers like 133,010
and mess things up.)
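
A small sketch of the .tsv versus .csv point above. The record is made up for illustration. It shows a naive comma split breaking a number like 133,010, a tab split surviving it, and, as an alternative the post does not mention, Python's `csv` module quoting the field so that commas are still safe.

```python
import csv
import io


def naive_split(line, sep):
    """Split a line on a separator with no quoting rules at all."""
    return line.split(sep)


def csv_roundtrip(row):
    """Write a row with the csv module (which quotes as needed) and read it back."""
    buf = io.StringIO()
    csv.writer(buf).writerow(row)
    return buf.getvalue(), next(csv.reader(io.StringIO(buf.getvalue())))


if __name__ == "__main__":
    row = ["widget", "133,010"]  # made-up record; second field is a formatted number

    # Comma-joined: the number is cut in two when the line is split again.
    print(naive_split(",".join(row), ","))

    # Tab-joined: the row survives, as long as no field contains a tab.
    print(naive_split("\t".join(row), "\t"))

    # Properly quoted CSV also survives.
    print(csv_roundtrip(row))
```

Tab-separated output is the simpler choice when the data comes straight out of `hadoop fs -cat`, since nothing in that pipeline adds CSV quoting.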

c***z
Posts: 6348

The difference between the Pig method and the Scala method is that Scala is
not installed on the edge node, so we compile a jar and run that on the edge node.
http://nosql.mypopescu.com/post/18004413595/an-introduction-to-

s*********e
Posts: 1051

Pig can also be run locally on a PC.
See https://cwiki.apache.org/confluence/display/PIG/PigTools
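
A minimal sketch of what a local run might look like, assuming Pig is installed on the PC and `pig` is on the PATH. Pig's local execution mode is selected with `-x local`. The word-count script, the file names `input.txt` and `wordcount_out`, and the helper names are illustrative assumptions, not something from the linked page.

```python
import os
import tempfile

# A small word-count script in Pig Latin; file names are hypothetical.
WORDCOUNT_PIG = """\
lines = LOAD 'input.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group, COUNT(words);
STORE counts INTO 'wordcount_out';
"""


def write_script(path, text=WORDCOUNT_PIG):
    """Write the Pig Latin script to disk and return its path."""
    with open(path, "w") as handle:
        handle.write(text)
    return path


def local_command(script_path):
    """Build the command that runs a script in Pig's local mode (no cluster)."""
    return ["pig", "-x", "local", script_path]


if __name__ == "__main__":
    script = write_script(os.path.join(tempfile.mkdtemp(), "wordcount.pig"))
    # Print the command instead of running it, since Pig may not be installed.
    print(" ".join(local_command(script)))
```

Local mode reads and writes the local file system, so it is handy for trying out a script on a small sample before uploading it to an edge node.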

g**********l
Posts: 214

May I ask: what is the "real" advantage of Scalding over Pig?
I am looking for motivation to learn Scalding, as an analyst/data scientist.
One "real" advantage I see of Pig over Scalding is its use of HCatalog.
If a company uses Hive heavily for data warehousing, Pig can access the
metadata of all those tables directly, with no need to parse anything. This is
especially helpful when you have hundreds or even thousands of columns.
Pig can also write to Hive via HCatalog, so coworkers who mostly use Hive
can use my results easily.
Another advantage is Pig's UDF libraries, like DataFu, as well as its support
for writing UDFs in scripting languages such as Python.
If the experts here could give some real examples from work that show the
advantage of Scalding over Pig, that would be even better.

c***z
Posts: 6348

I am still learning Scala.
I have heard the advantage is in UDFs.

D**u
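Posts: 288

Nice, shared workflows are the most useful thing.
While I am at it: I am currently learning to write Python Pig UDFs in Notepad++
and then compile them with Jython on Linux through PuTTY.
Another very popular approach is writing Java Pig UDFs with Maven + Eclipse.
I have studied it a little but have not grasped the essentials yet.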
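
A minimal sketch of what a Python (Jython) Pig UDF can look like. The file name `udfs.py`, the function `normalize`, and the alias `myudfs` are hypothetical. As far as I know, Pig makes the `outputSchema` decorator available when a file is registered with `USING jython`, and it loads the file directly, so a separate compile step may not be needed. The fallback definition below is an assumption-labeled convenience so the same file can also be tested with plain Python.

```python
# udfs.py -- hypothetical file name for illustration.

try:
    outputSchema  # expected to be provided by Pig when registered with "USING jython"
except NameError:
    # Fallback (an assumption for local testing): record the schema on the function.
    def outputSchema(schema):
        def decorator(func):
            func.outputSchema = schema
            return func
        return decorator


@outputSchema("word:chararray")
def normalize(word):
    """Lower-case a token and strip surrounding punctuation; None passes through."""
    if word is None:
        return None
    return word.strip(".,;:!?\"'()").lower()


# Usage from a Pig script (relation and field names are hypothetical):
#   REGISTER 'udfs.py' USING jython AS myudfs;
#   cleaned = FOREACH words GENERATE myudfs.normalize(word);
```

Keeping the function free of Pig-specific imports means it can be checked on a laptop first, for example `normalize("Hello,")`, before it is uploaded to the edge node.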

c***z
Posts: 6348

I learned something, many thanks!

h***x
Posts: 586

Notepad++ is also a good editor for SAS programming. I like to use it to edit
SAS code and run it in batch mode on a Windows PC, similar to how it is done on Unix.

r*****d
Posts: 346

"""
This is the best book so far, Programming Pig.
Try this online version, pretty helpful.
http://chimera.labs.oreilly.com/books/1234000001811/index.html
"""
I am reading it now, and it really is a good book!

s******0
Posts: 1269

Great thread, leaving my name here for future reference.
