问个效率问题 SQL vs data step，大数据量 - Statistics版

本页内容为未名空间相应帖子的节选和存档，一周内的贴子最多显示50字，超过一周显示500字访问原贴

Statistics版 - 问个效率问题 SQL vs data step，大数据量

相关主题
● Dataset merge的一个问题	● SAS Code 求助，如何把在另一个dataset的id找出来
● 求一段SAS code	● 珍惜生命，远离hsbc。
● which route in SAS is faster?	● 求救：SAS programmer面试题！
● 为什么合并这两个数据不能用sas merge 而要用SAS SQL	● Please help me with SAS! Thank you!
● 在SAS中不用proc sql的情况下实现inequality join	● 请教一个SAS SQL的问题
● 一个实际应用的SAS问题	● PROC SQL join data help
● 请教一个用SAS作DATA MERGE的问题	● 完全不懂SAS，急请教一个问题
● 贡献SAS Programmer 面试问题并求答案	● 关于SAS interview

相关话题的讨论汇总
话题: sql话题: rw话题: proc话题: data话题: sheepsheep

进入Statistics版参与讨论

(共1页)

f****r
发帖数: 1140

数据量比较大。所以得考虑效率。考虑清楚之前还不太想都试一下。实在不行就得试了。
要从数据库里面pull好多变量。还要赋值新变量。是一个巨长无比的sql语句快呢？（
包括select, 好几个case when语句，from好几个table, where不少condition）
还是先proc sql select诸多变量，然后用data step来赋值新变量快？前提是我必须用
proc sql要select变量。
问问有经验的大侠。不知道这种情况考虑到效率问题，是不是该避免proc sql? 不过
proc sql写起来容易些。多谢了。

R*********i
发帖数: 7643

My impression is that proc sql is more efficient than data steps. Good luck!

s*******d
发帖数: 3786

如果你的数据库是relational database,用SQL. 否则用data step 和hash更快。

s******r
发帖数: 1524

why not try it.
it depends on data source (data or database), whether it is index/sorted.
How much CPU and memory assigned.
It really depends.

了。

【在 f****r 的大作中提到】

: 数据量比较大。所以得考虑效率。考虑清楚之前还不太想都试一下。实在不行就得试了。
: 要从数据库里面pull好多变量。还要赋值新变量。是一个巨长无比的sql语句快呢？（
: 包括select, 好几个case when语句，from好几个table, where不少condition）
: 还是先proc sql select诸多变量，然后用data step来赋值新变量快？前提是我必须用
: proc sql要select变量。
: 问问有经验的大侠。不知道这种情况考虑到效率问题，是不是该避免proc sql? 不过
: proc sql写起来容易些。多谢了。

p********a
发帖数: 5352

赞同这个-
如果是读SQL DATABASE，用PROC SQL很快。
当然，如果是大的DATABASE JOIN的话，MERGE还是比PROC SQL JOIN快。

【在 s*******d 的大作中提到】

: 如果你的数据库是relational database,用SQL. 否则用data step 和hash更快。

d*******o
发帖数: 493

硬盘大用data step,内存大用proc sql.

s******r
发帖数: 1524

sort could use more memory than proc sql; can not guarantee right, from my
experience.

【在 d*******o 的大作中提到】

: 硬盘大用data step,内存大用proc sql.

s********p
发帖数: 637

Almost all SAS operations are disk-based, including sort(no hash) and proc
sql. from my experience.

【在 s******r 的大作中提到】

: sort could use more memory than proc sql; can not guarantee right, from my
: experience.

d*******o
发帖数: 493

貌似楼主是generate new variable 吧，不是merge，不用sort吧

my

【在 s******r 的大作中提到】

: sort could use more memory than proc sql; can not guarantee right, from my
: experience.

d*******o
发帖数: 493

Proc sql operation costs much less disk, since it does not create temporary
datasets.

proc

【在 s********p 的大作中提到】

: Almost all SAS operations are disk-based, including sort(no hash) and proc
: sql. from my experience.

相关主题
● 一个实际应用的SAS问题	● SAS Code 求助，如何把在另一个dataset的id找出来
● 请教一个用SAS作DATA MERGE的问题	● 珍惜生命，远离hsbc。
● 贡献SAS Programmer 面试问题并求答案	● 求救：SAS programmer面试题！
进入Statistics版参与讨论

x***x
发帖数: 3401

you need to consider Hash object when dealing with large dataset

s********p
发帖数: 637

Are you sure? no temporary datasets? Where is Cartesian Product stored? In memory?

temporary

【在 d*******o 的大作中提到】

: Proc sql operation costs much less disk, since it does not create temporary
: datasets.
:
: proc

P****D
发帖数: 11146

Personal experience: once I wrote a buggy PROC SQL and created a huge
Cartesian product dataset, I got some sort of "out of memory" error - cannot
remember the exact words of the error message.
So I guess it's stored in memory.

memory?

【在 s********p 的大作中提到】

: Are you sure? no temporary datasets? Where is Cartesian Product stored? In memory?
:
: temporary

s********p
发帖数: 637

Are you using PC SAS?
I do not remember clearly when I saw dapangmao's post.
I did a test using "proc sql" to left join two large dataset in UNIX.
Besides the new file I want to create, I also find a new temporary file
during the process in /saswork which is the directory to store temporary
files by SAS. That file increased its size gradually and I think it is the
Cartesian product file.
Even we assume proc sql use memory for cartesian product, for very large dataset operations in PC, Windows will use virtual memory when out of memory, that is also disk-operation

cannot

【在 P****D 的大作中提到】

: Personal experience: once I wrote a buggy PROC SQL and created a huge
: Cartesian product dataset, I got some sort of "out of memory" error - cannot
: remember the exact words of the error message.
: So I guess it's stored in memory.
:
: memory?

p********a
发帖数: 5352

Temp datasets are stored in memory (2G, 4G?)? What if the temp data is >=2G
4G? "Out of memory" will come out? That doesn't seem to be what happens.

cannot

【在 P****D 的大作中提到】

s********p
发帖数: 637

check this, proc sql use less memory than merge
http://www2.sas.com/proceedings/sugi28/096-28.pdf
SQL Hash is the best!

cannot

【在 P****D 的大作中提到】

l***a
发帖数: 12410

sql hash? how to apply that? I only know how to do data step hash

【在 s********p 的大作中提到】

: check this, proc sql use less memory than merge
: http://www2.sas.com/proceedings/sugi28/096-28.pdf
: SQL Hash is the best!
:
: cannot

P****D
发帖数: 11146

Yes, I was talking about PC SAS. Never got the opportunity to use UNIX SAS.
I agree with you that the job becomes disk operation when operating system
starts using virtual memory.

dataset
is

【在 s********p 的大作中提到】

: Are you using PC SAS?
: I do not remember clearly when I saw dapangmao's post.
: I did a test using "proc sql" to left join two large dataset in UNIX.
: Besides the new file I want to create, I also find a new temporary file
: during the process in /saswork which is the directory to store temporary
: files by SAS. That file increased its size gradually and I think it is the
: Cartesian product file.
: Even we assume proc sql use memory for cartesian product, for very large dataset operations in PC, Windows will use virtual memory when out of memory, that is also disk-operation
:
: cannot

P****D
发帖数: 11146

Cannot remember what happened exactly. I was too busy correcting my program
to pay attention to what was going on with the error... As sheepsheep just
pointed out, "out of memory" should NOT come out because the operating
system should start using virtual memory after it is out of physical memory.
Perhaps that error was not caused by SAS storing temporary datasets in
memory at all...

2G

【在 p********a 的大作中提到】

: Temp datasets are stored in memory (2G, 4G?)? What if the temp data is >=2G
: 4G? "Out of memory" will come out? That doesn't seem to be what happens.
:
: cannot

P****D
发帖数: 11146

Thanks for the paper. I myself am a PROC SQL fan.
What does that paper mean by "CPU Merge" versus "Memory Merge"?

【在 s********p 的大作中提到】

: check this, proc sql use less memory than merge
: http://www2.sas.com/proceedings/sugi28/096-28.pdf
: SQL Hash is the best!
:
: cannot

相关主题
● Please help me with SAS! Thank you!	● 完全不懂SAS，急请教一个问题
● 请教一个SAS SQL的问题	● 关于SAS interview
● PROC SQL join data help	● SAS Question
进入Statistics版参与讨论

s********p
发帖数: 637

I think it would be CPU/Memory cost using merge method

【在 P****D 的大作中提到】

: Thanks for the paper. I myself am a PROC SQL fan.
: What does that paper mean by "CPU Merge" versus "Memory Merge"?

s********p
发帖数: 637

The conclusion is from that paper.
I have not used it either, you can google it!

【在 l***a 的大作中提到】

: sql hash? how to apply that? I only know how to do data step hash

d*******o
发帖数: 493

If the question is 'which method is the best in merging efficiently ',
everybody knows: Hash > Array= Proc format>Proc SQL >sort-sort-merge.
No doubt about it.
However, flyerr的问题是" data step v.s. Proc sql in creating new
variables".
Such as:
/××××××××××××××××××××××××××××××××××××××××/
PROC SQL;
CREATE table one AS
SELECT
CASE WHEN age=11 THEN 'pig'
WHEN sex='F' THEN 'monkey'
ELSE 'cat'
END
AS animals
FROM sashelp.class
;
QUIT;
/××××××××××××××××××××××××××××××××××××××××/
flyerr当然要做很多try-and-error，才能拿到正确的code。
Proc SQL如果不用create table as的话，output可以直接看。dataset不会读写硬盘，里面
都是内存操作，不信把system option(fullstimer)打开，可以看到里面用了大量的内存，很容
易out of memory。
如果用data step的话，用内存极少，但是每run一次都要create new dataset. 如果是过
G的数据，disk I/O消耗的时间可想而知。但是每一步的改变都可以保存下来，她可以回头看。
所以内存大用proc SQL，硬盘大用data step.

s********p
发帖数: 637

I doubt it. :-)
I admit Hash is more efficient. For the left methods, really task-oriented
and what kind of data you are using. someone mentioned this before.

".

【在 d*******o 的大作中提到】

: If the question is 'which method is the best in merging efficiently ',
: everybody knows: Hash > Array= Proc format>Proc SQL >sort-sort-merge.
: No doubt about it.
: However, flyerr的问题是" data step v.s. Proc sql in creating new
: variables".
: Such as:
: /××××××××××××××××××××××××××××××××××××××××/
: PROC SQL;
: CREATE table one AS
: SELECT

P****D
发帖数: 11146

这ID起得真不吉利。

【在 d*******o 的大作中提到】

s********p
发帖数: 637

She is pulling data from large tables, in most case, she will create new
dataset containing pulled variable for further analysis.
I am not clear if temporary will be created when no new dataset needed. I
don't use system option(fullstimer), but just tried the following,
proc sql;
select p1.id from
data1 p1
inner join
data2 p2 on
p1.id=p2.id
;
quit;
and check if temporary files generated and found "#tf0005.sas7butl" created and size changed.
ll /saswork/SAS_workDD3300003446
total 7000
-rw-rw-r-- 1 sheepsheep data 57344 Oct 22 17:12 #tf0001.sas7butl
-rw-rw-r-- 1 sheepsheep data 1646592 Oct 22 17:13 #tf0005.sas7butl
-rw-rw-r-- 1 sheepsheep data 32768 Oct 22 17:12 regstry.sas7bitm
-rw-rw-r-- 1 sheepsheep data 0 Oct 22 17:12 sas.lck
-rw-rw-r-- 1 sheepsheep data 36864 Oct 22 17:12 sasmacr.sas7bcat
ll /saswork/SAS_workDD3300003446
total 7000
-rw-rw-r-- 1 sheepsheep data 57344 Oct 22 17:12 #tf0001.sas7butl
-rw-rw-r-- 1 sheepsheep data 3276800 Oct 22 17:13 #tf0005.sas7butl
-rw-rw-r-- 1 sheepsheep data 32768 Oct 22 17:12 regstry.sas7bitm
-rw-rw-r-- 1 sheepsheep data 0 Oct 22 17:12 sas.lck
-rw-rw-r-- 1 sheepsheep data 36864 Oct 22 17:12 sasmacr.sas7bcat
here is another test:
proc sql;
select avg(p1.salary) from
data1 p1
inner join
data2 p2 on
p1.id=p2.id
group by p1.class
;
quit;
ll /saswork/SAS_work66EB00005AC0
total 5784
-rw-rw-r-- 1 sheepsheep data 57344 Oct 22 17:49 #tf0001.sas7butl
-rw-rw-r-- 1 sheepsheep data 512 Oct 22 17:49 #tf0005.sas7butl
-rw-rw-r-- 1 sheepsheep data 1646592 Oct 22 17:49 #tf0006.sas7butl
-rw-rw-r-- 1 sheepsheep data 0 Oct 22 17:49 sas.lck
-rw-rw-r-- 1 sheepsheep data 36864 Oct 22 17:49 sasmacr.sas7bcat
ll /saswork/SAS_work66EB00005AC0
total 13432
-rw-rw-r-- 1 sheepsheep data 57344 Oct 22 17:49 #tf0001.sas7butl
-rw-rw-r-- 1 sheepsheep data 1417216 Oct 22 17:49 #tf0005.sas7butl
-rw-rw-r-- 1 sheepsheep data 3276800 Oct 22 17:49 #tf0006.sas7butl
-rw-rw-r-- 1 sheepsheep data 0 Oct 22 17:49 sas.lck
-rw-rw-r-- 1 sheepsheep data

××××/

【在 d*******o 的大作中提到】

d*******o
发帖数: 493

Sorry, I don't have experience in using UNIX SAS.
I guess that you think #tf0005.sas7but is the temporary dataset of
Cartesian Joins. Is that correct?
I checked a document from SAS
http://support.sas.com/techsup/technote/ts553.html
"
Cartesian Joins
In the general case, PROC SQL has no choice but to compare each row from
one table with all rows from the others. It must:
* read rows from one table into memory
* compare each row from the other table with those in memory to
decide
if the where clause is satisfied.
"
So I think that #tf0005.sas7but may be a cache file in the work
directory. UNIX didn't terminate this file after the joining.
Overall, Cartesian Join is still a memory operation.

s********p
发帖数: 637

I did not use create table as, so maybe not Cartesian product,
I just wanted to check if temporary files generated even only use "select"
in proc sql

Cartesian
one

【在 d*******o 的大作中提到】

: Sorry, I don't have experience in using UNIX SAS.
: I guess that you think #tf0005.sas7but is the temporary dataset of
: Cartesian Joins. Is that correct?
: I checked a document from SAS
: http://support.sas.com/techsup/technote/ts553.html
: "
: Cartesian Joins
: In the general case, PROC SQL has no choice but to compare each row from
: one table with all rows from the others. It must:
: * read rows from one table into memory

s********p
发帖数: 637

SAS has its smart way to deal with Cartesian product. But I don't believe Cartesian join is a completely memory operation. If so, why people bother to use hash for large datesets?
btw, you new 头像 is MUCH better though very fat!

【在 d*******o 的大作中提到】

d*******o
发帖数: 493

俺觉得基本是个算法效率的问题。
做join的话，data step hash, data step array, Proc format都是key-value pair
（key--hash function--bucket structure)。三者的区别就是用的hash function不同
而已. Hash table的最大优点是数据量大的情况下，速度非常快。
Proc sql用的Cartesian product算法效率很低，但优点是兼容性好。所以也只有proc
sql能做up to 16 tables join和many-to-many join.

相关主题
● SAS菜鸟请教如果使SAS的output的结果放到一个文件内？	● 求一段SAS code
● 求教 SAS base 123 Q 16	● which route in SAS is faster?
● Dataset merge的一个问题	● 为什么合并这两个数据不能用sas merge 而要用SAS SQL
进入Statistics版参与讨论

s********p
发帖数: 637

肥猫，你笑死俺了
：所以也只有proc
你先说说many-to-many有啥用吧？做up to 16 tables join，谁会做那个？16个join,就算每个dataset 10个obs, Cartesian product 的obs都有10^16了吧
你能解释下为啥Hash table操作大dataset速度快么？
还有，俺问你个问题。如果有5个dataset，每个有10millions obs，1000个变量,只用data step和sql,哪种方法最快?
1)proc sql
2)sort+merge d1 d2 d3 d4 d5;
3)sort+
data d; merge d1 d2;
data d; merge d d3;
data d; merge d d4;
data d; merge d d5;
pair
proc

p********a
发帖数: 5352

你这个头像真可爱，笑死我了

××××/

【在 d*******o 的大作中提到】

f****r
发帖数: 1140

没想到大家讨论这个问题这么热火朝天。多谢了。都有道理。我觉得我还是都试一下吧
：巨长无比的复杂proc sql,还有sql pull数据以后，多步data step.

d*******o
发帖数: 493

新头像，新气象。谢谢捧场，哈哈。
"你能解释下为啥Hash table操作大dataset速度快么？"
这个像Microsoft, Amazon首轮面试题。像linked list， array，tree, hash table之
间的比较，我也不知道,答出来估计要看看数据结构的书了。
"up to 16 tables join和many-to-many join"
如果front end是relational database，多表join还是很有用的。
many-to-many join就更重要了。比如healthcare行业，一个人有多个claim，一个
claim table里可能有million个claim。只有Proc sql能做 many-to-many join。
"如果有5个dataset，每个有10millions obs，1000个变量,只用data step和sql,哪种
方法最快?"
个人觉得，其实merge大数据，不管哪种方法，最耗资源的一步是sort. 就像找CS的工
作面试肯定要问bubble sort/selection sort/quick sort之类的。如果反复merge几个
数据，我的建议是还是用proc sql。但事先一定要对这几个数据create index with
common identifier，效果相当于sort。可以试试，我认为速度还是快于 data step
merge.

,就算每个dataset 10个obs, Cartesian product 的obs都有10^16了吧
用data step和sql,哪种方法最快?

【在 s********p 的大作中提到】

: 肥猫，你笑死俺了
: ：所以也只有proc
: 你先说说many-to-many有啥用吧？做up to 16 tables join，谁会做那个？16个join,就算每个dataset 10个obs, Cartesian product 的obs都有10^16了吧
: 你能解释下为啥Hash table操作大dataset速度快么？
: 还有，俺问你个问题。如果有5个dataset，每个有10millions obs，1000个变量,只用data step和sql,哪种方法最快?
: 1)proc sql
: 2)sort+merge d1 d2 d3 d4 d5;
: 3)sort+
: data d; merge d1 d2;
: data d; merge d d3;

P****D
发帖数: 11146

你不是在另一个帖里还说会做恶梦？

【在 p********a 的大作中提到】

: 你这个头像真可爱，笑死我了
:
: ××××/

o****o
发帖数: 8077

“只有Proc sql能做 many-to-many join”稍微武断了点

【在 d*******o 的大作中提到】

: 新头像，新气象。谢谢捧场，哈哈。
: "你能解释下为啥Hash table操作大dataset速度快么？"
: 这个像Microsoft, Amazon首轮面试题。像linked list， array，tree, hash table之
: 间的比较，我也不知道,答出来估计要看看数据结构的书了。
: "up to 16 tables join和many-to-many join"
: 如果front end是relational database，多表join还是很有用的。
: many-to-many join就更重要了。比如healthcare行业，一个人有多个claim，一个
: claim table里可能有million个claim。只有Proc sql能做 many-to-many join。
: "如果有5个dataset，每个有10millions obs，1000个变量,只用data step和sql,哪种
: 方法最快?"

p********a
发帖数: 5352

那是他换头像前那个鲨鱼牙齿的猫头

【在 P****D 的大作中提到】

: 你不是在另一个帖里还说会做恶梦？

D******n
发帖数: 2836

其实这个也挺恐怖的，真不知道她是喜欢猫还是讨厌猫的。。。。猫可爱的图片多的去
了。。。

【在 p********a 的大作中提到】

: 那是他换头像前那个鲨鱼牙齿的猫头

P****D
发帖数: 11146

这个写起程序来最方便。

【在 o****o 的大作中提到】

: “只有Proc sql能做 many-to-many join”稍微武断了点

R******d
发帖数: 1436

proc sql主要用cpu吧，据我观察

【在 d*******o 的大作中提到】

: 硬盘大用data step,内存大用proc sql.

(共1页)

进入Statistics版参与讨论

相关主题
● 关于SAS interview	● 在SAS中不用proc sql的情况下实现inequality join
● SAS Question	● 一个实际应用的SAS问题
● SAS菜鸟请教如果使SAS的output的结果放到一个文件内？	● 请教一个用SAS作DATA MERGE的问题
● 求教 SAS base 123 Q 16	● 贡献SAS Programmer 面试问题并求答案
● Dataset merge的一个问题	● SAS Code 求助，如何把在另一个dataset的id找出来
● 求一段SAS code	● 珍惜生命，远离hsbc。
● which route in SAS is faster?	● 求救：SAS programmer面试题！
● 为什么合并这两个数据不能用sas merge 而要用SAS SQL	● Please help me with SAS! Thank you!

相关话题的讨论汇总
话题: sql话题: rw话题: proc话题: data话题: sheepsheep

#	版面	帖数(主题数)
-	全站	4871 (796)
1	Military	3777 (569)
2	Stock	341 (51)
3	Joke	117 (17)
4	History	116 (3)
5	Automobile	100 (9)
6	USANews	55 (9)
7	Midlife	45 (1)
8	Headline	41 (41)
9	Dreamer	33 (13)
10	FleaMarket	32 (20)
11	Living	30 (7)

boards

未名新帖统计// 7月16日

历史上的今天