s***1 Posts: 343 | 1 Following up on my previous post, I have two more questions.
(4) I tried it in SAS, and the result is:
Obs country city state
1 china veverly hill california
2 us veverly hill california
3 russia veverly hill california
4 canada veverly hill california
5 italy veverly hill california
A and C can be ruled out completely. 紫气东来 gave D, but he also noted that he wasn't sure. Could everyone say how |
|
s*******2 Posts: 791 | 2 I have the following dataset Test
data Test;
input input $ outcome $ @@;
datalines;
A 0 A 0 A 0
A 1 A 1 A 1
A 2 A 2 A 2
B 0 B 0 B 0
B 1 B 1 B 1
B 2 B 2 B 2
;
How can I get the data below (with outcome in the order 0, 1, 2)? Thanks
Obs input outcome
1 A 0
2 A 1
3 A 2
4 A 0
5 |
|
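s*******2 Posts: 791 | 3 Thank you. I ran your code and the output is exactly what I wanted. But there is one problem. The
Test I gave happens to have 18 observations, so proc sort removed the duplicate rows, leaving just A 0
A 1 A 2 B 0 B 1 B 2, and stacking the dataset three times then gave the result I wanted. But what if
I supply a number of observations that is not a multiple of 3?
For example, 16 observations: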
data Test;
input input $ outcome $ @@;
datalines;
A 0 A 0
A 1 A 1 A 1
A 2 A 2 A 2
B 0 B 0 B 0
B 1 B 1
B 2 B 2 B 2
;
run;
The result should be
Obs input outcome
1 A 0
|
|
s*******f Posts: 148 | 4 Try this. It should work whether you have 18 or 16 obs.
DATA temp;
SET test;
BY input outcome;
IF FIRST.outcome THEN n=1;
ELSE n+1;
RUN;
PROC SORT DATA=temp OUT=sorted(DROP=n);
BY input n outcome;
RUN; |
|
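For readers more comfortable outside SAS, the same idea (number the repeats within each input/outcome pair, then sort by input, that counter, and outcome) can be sketched in Python. This is only an illustration; the function name `interleave` and the tuple layout are assumptions, not part of the original post.

```python
from collections import defaultdict

def interleave(rows):
    """Number repeats within each (input, outcome) pair, then sort by
    input, repeat number, outcome. Mirrors the FIRST.outcome counter n."""
    seen = defaultdict(int)
    numbered = []
    for inp, out in rows:
        seen[(inp, out)] += 1              # n=1 on first occurrence, then n+1
        numbered.append((inp, seen[(inp, out)], out))
    numbered.sort()                        # BY input n outcome
    return [(inp, out) for inp, _n, out in numbered]

# The 16-observation example from post 3.
rows = ([("A", "0")] * 2 + [("A", "1")] * 3 + [("A", "2")] * 3 +
        [("B", "0")] * 3 + [("B", "1")] * 2 + [("B", "2")] * 3)
result = interleave(rows)
for obs, (inp, out) in enumerate(result, start=1):
    print(obs, inp, out)
```

With uneven counts, the short pairs simply drop out of the later rounds, which is why the approach does not need the number of observations to be a multiple of 3.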
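p********a Posts: 5352 | 5 Nice~~~~~~~
Let me chime in with one too.
A few small takeaways from programming in HEALTHCARE
1) My job deals with COST; I'm always computing DOLLAR AMOUNTs and INPATIENT RATEs. I used to
get things wrong from time to time. My accuracy was actually already around 99%, but the boss would
still seize on that 1% and chew me out, and then the boss would get criticized by the boss's boss...
Nothing to be done; it's directly about money. So I put in the effort to write one big MACRO
containing a dozen or so small MACROs, covering Identification, Stratification, Cost, Utilization,
and Report. Whatever DATA comes in, I just run it through the MACRO and finish in no time, and it
has never gone wrong. Of course, there are also analyses that have nothing to do with the MACRO; to
avoid errors there, I usually read the CODE myself 5 times and the LOG once, and I wrote a SAS
program that searches the LOG for keywords that tend to signal problems without raising an error,
such as obs=0, warning, not initialized, etc...
That's all. OVER. |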
|
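The log-scanning idea in the post above is easy to prototype. The sketch below is in Python rather than SAS and is only an illustration: the function name `scan_log` is hypothetical, the keyword list is taken from the post, and the sample log lines are invented for the demo rather than copied from real SAS output.

```python
def scan_log(text, keywords=("obs=0", "warning", "not initialized")):
    """Return (line number, keyword, line) for every keyword hit.
    Matching is case-insensitive substring search."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        lowered = line.lower()
        for kw in keywords:
            if kw in lowered:
                hits.append((lineno, kw, line.strip()))
    return hits

# Invented log text, for illustration only.
sample_log = """NOTE: The data set WORK.A was read with obs=0.
WARNING: something suspicious happened.
NOTE: Variable x is not initialized.
NOTE: DATA statement finished."""

hits = scan_log(sample_log)
for lineno, kw, line in hits:
    print(lineno, kw, line)
```

In practice the keyword list would be extended with whatever messages have bitten you before; the point is that the scan is mechanical and never gets tired of reading logs.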
d*******1 Posts: 854 | 6 _Name_=cats('col_',put(_N_,$10.));
This only works when each var has exactly one obs; it's better to create a numeric
variable that numbers the vars. |
|
l*********s Posts: 5409 | 7 Say you have 10 obs and you observe 5:5. You know the true p is close to 0.5,
but how close? 0.4 and 0.6 are both intuitively reasonable. In other words,
your precision is low.
Now, suppose the true probability is 0.01; most probably you will observe 0 events.
You cannot conclude that the true p is therefore 0, but you can guess that it is
unlikely to be bigger than 0.1. Your estimate has a narrower range.
value? |
|
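The precision argument above can be made concrete with a confidence interval. The sketch below uses the Wilson score interval, which is my choice for illustration (the post names no method); `wilson_interval` is a hypothetical helper and z = 1.96 assumes a 95% level.

```python
import math

def wilson_interval(k, n, z=1.96):
    """Wilson score interval for a binomial proportion (k events in n obs)."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)

lo5, hi5 = wilson_interval(5, 10)   # 5 events out of 10
lo0, hi0 = wilson_interval(0, 10)   # 0 events out of 10
print("5/10:", round(lo5, 3), round(hi5, 3))
print("0/10:", round(lo0, 3), round(hi0, 3))
```

Under these assumptions the 0-out-of-10 interval is roughly half as wide as the 5-out-of-10 one, which matches the qualitative point; note that with only 10 obs its upper end is still around 0.28, so the bound quoted in the post should be read as a rough guess.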
A*******s Posts: 3942 | 8 Use a macro; alternatively you can use the FILE statement with the FILEVAR option, but that is a bit more trouble.
%macro abc;
%let fname=result;
%do i=1 %to 1000;
data &fname&i;
set result(firstobs=%eval((&i-1)*1599+1) obs=%eval(&i*1599));
run;
%end;
%mend;
%abc |
|
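n*****s Posts: 10232 | 9 -__-//... You were the one who mistyped at first and said the more, the smaller, which is why I
got confused. How come you edited your original post and then came back to blame me?
Setting that aside: although I've found that stepwise often selects more variables than lasso, that
doesn't necessarily mean it's overfitting.
In a situation like this, with few obs but many variables, I would use cross validation. Before that,
though, you should still clean up your data base and eliminate multi-col as far as possible. My main
point was really about the multi-col handling stage (before variable selection and cross validation):
how to use vif or the condition index to decide which variable to drop/keep at each step. |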
|
d******3 Posts: 93 | 10 e.g. 1000 observations, 10 variables x1, x2, x3, ... x10. x1 is continuous,
x2-x10 could be continuous or categorical (e.g. age, gender, race...)
Now I want to divide these 1000 obs into several groups (e.g. 3 groups) so as
to maximize the difference in x1 across these 3 groups while minimizing the
differences in x2-x10 across groups.
Thanks |
|
t**********r Posts: 182 | 11 Tried. Not workable, as all obs are identified. |
|
p********a Posts: 5352 | 12 ☆─────────────────────────────────────☆
missshinla (missshinla) wrote at (Thu Mar 18 03:40:55 2010, US Eastern):
I have a data file with 10000 observations in it.
I want to split it into 100 small files, each containing 100 observations:
observation 1 to 100 go to data1.txt
observation 101 to 200 go to data2.txt
...
observation 9901 to 10000 go to data100.txt
That is, put each consecutive block of 100 obs into a new .txt (or .dat) file.
Any advice? Should I write a macro or something else?
Many thanks
☆─────────────────────────────────────☆
tosi (夏虫语冰) wrote at (Thu Mar 18 09:49:24 2010, US Eastern):
In Data Step, use Do Loop and |
|
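As a language-neutral illustration of the splitting task (not the Data Step answer that is cut off above), here is a minimal Python sketch. The function name `split_file` and the temporary demo file are assumptions; only the `dataN.txt` naming comes from the question.

```python
import os
import tempfile

def split_file(path, out_dir, chunk_size=100):
    """Write consecutive chunks of chunk_size lines to data1.txt, data2.txt, ..."""
    written = []
    out = None
    with open(path) as src:
        for i, line in enumerate(src):
            if i % chunk_size == 0:        # start a new output file
                if out:
                    out.close()
                name = os.path.join(out_dir, "data%d.txt" % (i // chunk_size + 1))
                out = open(name, "w")
                written.append(name)
            out.write(line)
    if out:
        out.close()
    return written

# Demo on a throwaway file with 10000 lines.
tmp = tempfile.mkdtemp()
src_path = os.path.join(tmp, "all.txt")
with open(src_path, "w") as f:
    for i in range(1, 10001):
        f.write("obs %d\n" % i)
files = split_file(src_path, tmp)
print(len(files), os.path.basename(files[0]), os.path.basename(files[-1]))
```

Because the file is streamed line by line, the same sketch works when the input is too large to hold in memory, and a final short chunk is written as-is if the line count is not a multiple of `chunk_size`.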
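p********a Posts: 5352 | 13 ☆─────────────────────────────────────☆
footprint08 (just do it) wrote at (Sun Mar 21 02:44:22 2010, US Eastern):
id refid
1 NP_001003407/// NP_001003408 /// NP_002304 /// NP_006711
2 NP_001135417 /// NP_001604
3 NP_00499494
I want to generate multiple observations from one record, but the number of obs per record varies;
the special delimiter is '///'. How should I handle this?
My problem is that the printed result is only 8 characters long, but the length of each value is not fixed; once I write $12., the "/" gets read in too (for example, '/// NP_00100'). How should I change this code? Many thanks!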
data new;
infile "C:............" missover dlm="///" ;
input id $ refid $ |
|
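The reshaping being asked for (one output row per refid, splitting on the whole `///` string and trimming blanks) can be sketched in Python as follows. This only illustrates the target result, not a fix for the truncated SAS code; `explode` is a hypothetical name.

```python
def explode(records, sep="///"):
    """Turn (id, 'a /// b /// c') records into one (id, value) row per value."""
    rows = []
    for rec_id, refids in records:
        for part in refids.split(sep):     # split on the full 3-character string
            part = part.strip()            # values have variable length and stray blanks
            if part:
                rows.append((rec_id, part))
    return rows

# The three records from the post.
records = [
    ("1", "NP_001003407/// NP_001003408 /// NP_002304 /// NP_006711"),
    ("2", "NP_001135417 /// NP_001604"),
    ("3", "NP_00499494"),
]
rows = explode(records)
for rec_id, refid in rows:
    print(rec_id, refid)
```

The key design point is treating `///` as a single multi-character separator and reading each value at its full length, rather than using a fixed width such as 8 or 12 characters.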
A*******s Posts: 3942 | 14 Use a summary function + GROUP BY clause to generate some summary statistics with the order number. I can use the number option to show a column named "obs", but I want to change the name. |
|
R******d Posts: 1436 | 15 Could you give an example? I'd also like to know how to get row numbers with proc sql.
with the order number. i can use the number option to show a column named "obs", but I want to change the name. |
|
o****o Posts: 8077 | 16 group by _n_? What summarization do you want for an individual record?
Or do you want some ordered cumulative statistics?
Post some data step concepts first.
with the order number. i can use the number option to show a column named "obs", but I want to change the name. |
|
b********y Posts: 63 | 17 The following R code should be much faster. I am also curious to know
how long it takes to run through your data?
file.in = file("data_in.txt");
file.out = file("data_out.txt")
open(file.in, open = "r")
open(file.out, open = "wt")
# title
xtitle = scan(file.in, what = list(s1 = "", s2 = ""), nline = 1, quiet = T)
cat(file = file.out, c(xtitle$s1, xtitle$s2), sep = ", ", append = TRUE);
cat(file = file.out, "\n")
# first obs
xfmt = list(ID = 0, DD = 0) # readin format
x0 = scan(file.in, what = xf |
|
b**********e Posts: 531 | 18 data _NULL_;
if 0 then do;
set _temp_ nobs=nobs1;
end;
call symput ('num1', nobs1);
stop;
run;
_temp_ is a recent dataset SAS created; is "nobs" the total number of
obs in the data? |
|
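f*****u Posts: 17 | 19 The data are as follows
OBS X Y
1 2 3
2 4 .
3 5 .
4 6 .
That is roughly what it looks like. How can I use the retain statement to fill in the missing values of Y? The rule is
y_n = y_{n-1} + x_n
Thanks! |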
|
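The rule y_n = y_{n-1} + x_n is a running recurrence: each row needs the previous row's Y, which is what retain carries forward in a data step. A minimal Python sketch of the same logic, with the hypothetical helper name `fill_y`, looks like this:

```python
def fill_y(xs, y1):
    """Fill Y using y_n = y_{n-1} + x_n, starting from the known first value y1."""
    ys = [y1]
    for x in xs[1:]:
        ys.append(ys[-1] + x)   # ys[-1] plays the role of the retained value
    return ys

xs = [2, 4, 5, 6]
ys = fill_y(xs, 3)
print(ys)  # [3, 7, 12, 18]
```

Only the first Y needs to be supplied; every later value is built from the one just computed.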
f*****u Posts: 17 | 20 Continuing the data-processing discussion
OBS X
1 2
2 4
3 5
4 6
I want a new variable Y = x*lag(x), as a running product. How should the retain statement be used? |
|
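w*******n Posts: 469 | 22 proc sort data=one; by id; run;
/* loop over the observations of DatA, one at a time; &num must hold their count */
%macro search();
%do i=1 %to &num;
%searchone(&i);
%end;
%mend;
%macro searchone(index);
/* pick the single observation number &index */
data oneobs;
set DatA(firstobs=&index obs=&index);
run;
data one;
if 1=1 then delete;
run;
/* matched rows go to one; everything else stays in dataB */
data dataB one;
merge oneobs(in=inone) dataB(in=B);by id;
if inone & B then output one;
if inone & B then delete;
output dataB;
run;
data match;
set match one; run;
%mend; |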
|
A*******s Posts: 3942 | 23 I looked at the SAS help documentation for the BY statement, and it turns out that
The maximum number of observations in any BY group in the input data set is
two;
so the workaround I can think of is
data test;
input id a b c;
cards;
1 0 0 0
1 1 1 0
1 1 1 1
2 0 0 0
2 0 0 0
;
run;
data test;
set test;
row=_N_;
run;
proc transpose data=test out=out(drop=row rename=(_Name_=type col1=response));
by row id;
proc print;
run;
Obs id type response
|
|
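l******1 Posts: 86 | 24 My mind is completely blank right now; hoping you folks (xdjm) can help.
I have two data-entry data sets with a unique id; the variables are exactly the same. I need to find
the ids that are missing from either of the two.
Thanks |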
|
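A*******s Posts: 3942 | 25 My mind went blank too after reading your post.
How are we supposed to know which ids your data is missing? :)
What does "the ids missing from either of the two" mean? |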
|
|
d*******o Posts: 493 | 27 I'm raising my hand!!!
The OP can just use proc compare to compare the two data sets. |
|
D******n Posts: 2836 | 28 lz (the OP) means A-B and B-A, I guess (set subtraction);
can use sql except |
|
b*******r Posts: 152 | 29 proc sql, left/right join, then where ... id is null. Done. |
|
S******y Posts: 1123 | 30 # it is easy in Python
def unique(a):
    return list(set(a))
def intersect(a, b):
    return list(set(a) & set(b))
def union(a, b):
    return list(set(a) | set(b))
def difference(a, b):
    # elements of b that are not in a
    return list(set(b).difference(set(a)))
l1=[1,3,5,7,9]
l2=[2,4,6,8,10,1]
z=difference(l1,l2)
print(z) |
|
|
l******1 Posts: 86 | 32 This is exactly the result I want, but I've never used Python; SAS code would be even better. |
|
|
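l******1 Posts: 86 | 34 This program is nice, but I didn't express myself clearly. Since I don't know which of the two
entries is correct, I want to keep both. proc compare basically meets the requirement, but it can't tell me which ids are missing.
"Whoever isn't here, step forward!!!!" |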
|
l******1 Posts: 86 | 35 OK, looks like I'll just have to use both proc compare and sql except. |
|
l******1 Posts: 86 | 36 Thanks,
StatsGuy, Actuaries, budmiller, and everyone else I confused!!!!! |
|
l***a Posts: 12410 | 37 I think first a power analysis needs to be done to decide the minimum sample
size, I am sure you know it :) Then, I think if you pay real attention to
take care of the multicollinearity and the number of selected predictors, it
will give you a very good chance to avoid overfitting. But remember there
is a rule of thumb that on average one predictor should have at least 10 obs
. Although I don't practically keep this rule all the time, it's still good
to keep it in mind.
training |
|
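g******h Posts: 266 | 38 I want to use SAS to compute a likelihood formula, used for manually choosing the best Box-Cox lambda
(I'm intentionally not using the transreg procedure). lambda should be a loop variable.
Likelihood(lambda) = -n/2 [1/n Sum((X^lambda - mean(X^lambda))^2)] + (lambda-1) sum(lnX)
I'm not quite sure how to put the several computed summary statistics into macro variables and then
use them in a loop. I'm also not very familiar with macros. My data are not univariate: there are two
variables, X1 and X2, and I need to find the best lambda for each separately. I want to try lambda from -5 to 5 with step 0.05 to find the best lambda.
Could an expert who knows macros well give me some guidance? Many thanks.
Part of the data is as follows: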
Obs X1 X2
1 47.4 2.05
2 35.8 1.02
3 32.9 2.53
4 1508.5 1.23
5 1217.4 |
|
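The grid search described in the question can be sketched outside SAS as follows. This is a hedged illustration, not the poster's formula verbatim: it assumes the standard textbook Box-Cox profile log-likelihood, which takes the log of the variance term and uses the transform (x^lambda - 1)/lambda (ln x at lambda = 0). The function names `boxcox_loglik` and `best_lambda` are hypothetical, and only the five X1 values visible in the post are used.

```python
import math

def boxcox_loglik(xs, lam):
    """Profile log-likelihood for Box-Cox parameter lam (standard form, assumed)."""
    n = len(xs)
    if abs(lam) < 1e-12:
        ys = [math.log(x) for x in xs]
    else:
        ys = [(x ** lam - 1.0) / lam for x in xs]
    mean = sum(ys) / n
    var = sum((y - mean) ** 2 for y in ys) / n
    if var <= 0.0:
        return float("-inf")               # degenerate case: no spread left
    return -n / 2.0 * math.log(var) + (lam - 1.0) * sum(math.log(x) for x in xs)

def best_lambda(xs, lo=-5.0, hi=5.0, step=0.05):
    """Try every lambda on the grid lo, lo+step, ..., hi and keep the best."""
    steps = int(round((hi - lo) / step))
    grid = [round(lo + i * step, 10) for i in range(steps + 1)]
    return max(grid, key=lambda lam: boxcox_loglik(xs, lam))

x1 = [47.4, 35.8, 32.9, 1508.5, 1217.4]    # X1 values shown in the post
print("best lambda for X1:", best_lambda(x1))
```

To handle X2 as well, call `best_lambda` once per variable; in SAS the analogous structure would be an outer loop over variables and an inner loop over lambda, with each log-likelihood stored in a data set rather than in macro variables.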
D******n Posts: 2836 | 39 cool, it works.
what is open code?
actually what i want is
=========================
data a1;
input a $ cap;
datalines;
5.4 0.8
5.4 0.9
5.3 1.8
;
run;
data _null_;
set a1 end=eof;
if _n_=1 then do; call execute('data a2;'); end;
call execute('b = put('||cap||','||strip(a)||');output;');
if eof then call execute('run;');
run;
<----output------->
Obs b
1 |
|
S******y Posts: 1123 | 40 ########### Python ############
in_file = 'C:\\_original.txt' # oloolo's example data
f = open(in_file, 'r')
ls = []
next(f) # skip header
for line in f:
    obs, group_id, id1, ID = line.split()
    if id1 in ls and ID in ls: # if both already in
        pass
    else: # if one of them is new
        print(group_id, id1, ID)
        ls.append(id1)
        ls.append(ID)
        ls = list(set(ls)) # dedupe
###################### END ################### |
|
D******n Posts: 2836 | 41 At least you should give a sample desired output, dude.
Are you just checking 1 obs ahead? Why did you mention date? I don't see that it has
anything to do with the problem unless the date is not sorted. |
|
h**********e Posts: 44 | 42 First create a dataset:
data index;
input i;
datalines;
1
3
6
7
;
run;
Then
%do j=1 %to 4;
data a;
set index (obs=&j);
call symput('i',i);
run;
/*do whatever you want to do with &i here*/
...
%end;
This is how I usually do this kind of work. Remember that a macro in SAS is not as
easy as a function in C. SAS is dataset-oriented in most cases. |
|
l*********s Posts: 5409 | 43 ods option controls reading, not writing.
Use a where statement to subset the output. |
|
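p*****o Posts: 543 | 44 Is there a simple FUNCTION that does this directly?
All I can think of is generating a random number for each of the 5 OBS, sorting, and then picking the first 2. But I'd like to know whether
there is a FUNCTION that can directly RETURN 2 numbers out of 1, 2, 3, 4, 5? |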
|
g**a Posts: 2129 | 45 It's not a question of 5 numbers. This is about picking 5 obs. If it were just 5 numbers, a single function would do. |
|
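s*****n Posts: 2174 | 46 I still don't see what the difference is.
Isn't picking 5 obs from the 1st through the nth ob
equivalent to picking any 5 numbers from 1 to n to use as an index? |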
|
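The equivalence argued in the post above (sampling obs is the same as sampling row indices) can be shown in a few lines of Python. This is an illustration in a different language, not a SAS function; `sample_indices` is a hypothetical helper and the seed is only there to make the demo repeatable.

```python
import random

def sample_indices(n, k, seed=None):
    """Pick k distinct row numbers from 1..n (sampling without replacement)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, n + 1), k))

obs = ["a", "b", "c", "d", "e"]            # stand-in for 5 observations
picked = sample_indices(len(obs), 2, seed=1)
chosen = [obs[i - 1] for i in picked]      # use the numbers as 1-based indices
print(picked, chosen)
```

Once the indices are drawn, selecting the observations is just a lookup, which is why the "sort by a random number and keep the first k" trick and a direct index sampler give the same kind of result.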
o****o Posts: 8077 | 47 see my previous post
or maybe SAS has such a function and I just don't know it. I double-checked the manual,
and it looks like I am correct so far.
In fact, efficient sampling out of a sequence of obs for a given prob was and is a heated debate on SAS user forums. The one I gave is probably the one deemed best.
In SAS, our mindset is one-value-per-iteration......you know~~~
so I personally think many of the stream algorithms will find great support in the SAS community....in addition to the hard-core algorithm guys |
|
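s*****n Posts: 2174 | 48 I feel that tasks with a lot of logic would definitely be much more complicated in SAS, and the code would be long and verbose.
For example, say I have a csv file with 100,000 obs.
1. Read in the file and randomly select 10,000 rows.
2. If the selected 10,000 rows satisfy some property (e.g. the mean of some variable > 0),
run a simple linear regression; if they do not, run a GLM regression.
3. If it was the linear regression, extract the regression parameters as the parameters of some function A.
If it was the GLM regression, extract the regression parameters as the parameters of function B.
4. Select another 10,000 rows and apply function A or function B according to the condition in 3.
5. Run 2-4 1000 times as a simulation, and plot some distribution for each of the LM and GLM cases.
If you did this in SAS, wouldn't it be a DATA STEP here, a PROC there, then another DATA STEP,
then another PROC, and quite possibly a macro or two as well? |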
|
p*****o Posts: 543 | 49 The data are as follows:
VAR1
A
A
B
B
B
C
C
C
D
D
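.....
.....
What I need to do is give each OBS a number chosen at random from 1, 2, 3 (it is actually 1, 2, 3, ..., 100, but let's
discuss it with 3 for now) and call it VAR2. The constraints are:
Where VAR1 has the same value, the VAR2 values must differ.
Where VAR1 has different values, the VAR2 values may be the same. |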
|
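p*****o Posts: 543 | 50 That is, I want to give each OBS a number drawn at random from 1, 2, 3, ..., 100. But within a group
(e.g. the group with VAR1 = A), no two may get the same number (i.e. WITHOUT REPLACEMENT). |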
|
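The constraint described above amounts to sampling without replacement inside each VAR1 group, independently across groups. A minimal Python sketch, with the hypothetical name `assign_within_groups` and a fixed seed purely for repeatability, could look like this:

```python
import random
from collections import Counter

def assign_within_groups(var1, pool_size=100, seed=None):
    """Give each row a random number from 1..pool_size so that rows sharing
    a VAR1 value never share a number; different groups may overlap."""
    rng = random.Random(seed)
    counts = Counter(var1)
    # One without-replacement draw per group, sized to the group.
    draws = {g: rng.sample(range(1, pool_size + 1), c) for g, c in counts.items()}
    return [(g, draws[g].pop()) for g in var1]

var1 = ["A", "A", "B", "B", "B", "C", "C", "C", "D", "D"]
pairs = assign_within_groups(var1, pool_size=3, seed=0)
for g, v in pairs:
    print(g, v)
```

Note that `random.sample` raises `ValueError` if a group has more rows than `pool_size`, which is the correct behaviour here: with only 3 numbers, a group of 4 cannot be given distinct values.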