Page 10 - Roundup of discussions about obs - 话题女王

s***1
Posts: 343

Following up on my previous post, I have two more questions.
(4) I tried it in SAS, and the result was:
Obs country city state
1 china veverly hill california
2 us veverly hill california
3 russia veverly hill california
4 canada veverly hill california
5 italy veverly hill california
A and C can be ruled out completely. 紫气东来 gave D, but he also noted that he wasn't sure. Could everyone weigh in on how

s*******2
Posts: 791

From thread: Statistics board - [Question] How to sort this dataset?

I have the following dataset, Test:
data Test;
input input $ outcome $ @@;
datalines;
A 0 A 0 A 0
A 1 A 1 A 1
A 2 A 2 A 2
B 0 B 0 B 0
B 1 B 1 B 1
B 2 B 2 B 2
;
How can I get the data below (with outcome cycling in the order 0, 1, 2)? Thanks.
Obs input outcome
1 A 0
2 A 1
3 A 2
4 A 0
5

s*******2
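Posts: 791

From thread: Statistics board - [Question] How to sort this dataset?

Thank you. I ran your code and the output is exactly what I wanted, but there is one problem. The
Test data set I gave happens to have exactly 18 observations, so proc sort removed the duplicate rows, leaving only A 0,
A 1, A 2, B 0, B 1, B 2, and stacking that dataset three times then gave the result I wanted. But what if
the number of observations is not a multiple of 3?
For example, 16 observations: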
data Test;
input input $ outcome $ @@;
datalines;
A 0 A 0
A 1 A 1 A 1
A 2 A 2 A 2
B 0 B 0 B 0
B 1 B 1
B 2 B 2 B 2
;
run;
The result should be:
Obs input outcome
1 A 0

s*******f
Posts: 148

From thread: Statistics board - [Question] How to sort this dataset?

Try this. It should work whether you have 18 or 16 obs.
DATA temp;
SET test;
BY input outcome;
IF FIRST.outcome THEN n=1;
ELSE n+1;
RUN;
PROC SORT DATA=temp OUT=sorted(DROP=n);
BY input n outcome;
RUN;
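The same idea can be sketched outside SAS. This is a minimal Python sketch, not the poster's code; the function name `interleave` and the tuple layout are assumptions made for illustration. It numbers each repeat within an (input, outcome) pair, mirroring the counter n in the DATA step, and then sorts by (input, n, outcome).

```python
from collections import defaultdict


def interleave(rows):
    """Reorder (input, outcome) rows so outcomes cycle within each input.

    rows: iterable of (input, outcome) tuples; hypothetical layout.
    """
    seen = defaultdict(int)  # running count per (input, outcome) pair
    keyed = []
    for inp, outcome in rows:
        seen[(inp, outcome)] += 1
        # n plays the role of the counter built in the DATA step
        keyed.append((inp, seen[(inp, outcome)], outcome))
    keyed.sort()  # same ordering as BY input n outcome
    return [(inp, outcome) for inp, _, outcome in keyed]
```

With the 16-observation example, the A rows come out as 0, 1, 2, 0, 1, 2, 1, 2, because only two A 0 rows exist.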

p********a
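Posts: 5352

From thread: Statistics board - A few small lessons from programming in pharma

Great post!
Let me chime in with one of my own:
A few small lessons from programming in HEALTHCARE
1) My work deals with COST: I'm always computing DOLLAR AMOUNTs and INPATIENT RATEs. I used to
get things wrong now and then. My accuracy was already around 99%, but my boss would still catch the remaining 1% and chew me out, and then my boss
would get criticized by his own boss... Nothing to be done about it; this is directly tied to money. Later I put in the effort to write one big MACRO
containing a dozen or so smaller MACROs, covering Identification, Stratification, Cost, Utilization,
and Report. Whatever DATA comes in, I just run it through the MACRO and am done in no time, and it has never
gone wrong. Of course, there are also analyses unrelated to the MACRO. To avoid errors there, I usually read my CODE 5 times and the LOG
once, and I wrote a SAS program that searches the LOG for keywords that tend to signal problems without raising an error, such as obs=0,
warning, not initialized, etc.
That's all. OVER.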
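The log check described above can be sketched as follows. The poster's version was written in SAS and is not shown; this is a minimal Python sketch in which the function name `scan_log` is hypothetical and the default keywords are the ones quoted in the post. The exact wording of log messages may differ, so the keyword list should be adjusted to the logs at hand.

```python
def scan_log(lines, keywords=("obs=0", "warning", "not initialized")):
    """Return (line_number, text) pairs for log lines containing any keyword.

    Matching is case-insensitive; line numbers start at 1.
    """
    lowered = [k.lower() for k in keywords]
    hits = []
    for number, text in enumerate(lines, start=1):
        if any(k in text.lower() for k in lowered):
            hits.append((number, text.rstrip("\n")))
    return hits
```

A log file can be passed directly, for example `scan_log(open(path))`, where `path` is whatever log file is being checked.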

d*******1
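Posts: 854

From thread: Statistics board - How to rename variables in a SAS data set (without knowing the original variable names)

_Name_=cats('col_',put(_N_,$10.));
This only works when each var has exactly one obs; it would be better to create a numeric
variable that numbers the vars.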

l*********s
Posts: 5409

From thread: Statistics board - Newbie question about computing sample size

Say you have 10 obs and you observe 5:5. You know the true p is close to 0.5,
but how close? 0.4 and 0.6 are both intuitively reasonable. In other words,
your precision is low.
Now suppose the true probability is 0.01; most probably you will observe 0 events.
You cannot conclude that the true p is therefore 0, but you can guess that it is
unlikely to be bigger than 0.1. Your estimate has a narrower range.
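The argument above can be checked with a little arithmetic. A minimal Python sketch, assuming a binomial model with n = 10; the function names are hypothetical. It uses the usual standard error sqrt(p(1-p)/n) of a sample proportion and the probability (1-p)^n of seeing no events at all.

```python
import math


def standard_error(p, n):
    """Standard error of a sample proportion under a binomial model."""
    return math.sqrt(p * (1 - p) / n)


def prob_no_events(p, n):
    """Probability of observing zero events in n independent trials."""
    return (1 - p) ** n
```

For n = 10, `standard_error(0.5, 10)` is about 0.158 while `standard_error(0.01, 10)` is about 0.031, and `prob_no_events(0.01, 10)` is about 0.904, which matches the point that a rare event yields a narrower range and usually zero observed events.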


A*******s
Posts: 3942

From thread: Statistics board - [Help] Dividing a SAS data set

Use a macro; alternatively you can use the FILE statement with the FILEVAR= option, though that's a bit more work.
%macro abc;
%let fname=result;
%do i=1 %to 1000;
data &fname&i;
set result(firstobs=%eval((&i-1)*1599+1) obs=%eval(&i*1599));
run;
%end;
%mend;
%abc
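The boundary arithmetic in the macro, firstobs=(i-1)*1599+1 and obs=i*1599, can be sketched in Python. The function name `chunk_ranges` is hypothetical, and capping the last chunk at the total row count is an addition of this sketch rather than something the macro does.

```python
def chunk_ranges(n_obs, chunk_size):
    """Yield (firstobs, obs) pairs, 1-based and inclusive, covering n_obs rows."""
    for first in range(1, n_obs + 1, chunk_size):
        # the last chunk may be shorter than chunk_size
        yield first, min(first + chunk_size - 1, n_obs)
```

For example, `list(chunk_ranges(10, 4))` gives `[(1, 4), (5, 8), (9, 10)]`.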

n*****s
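Posts: 10232

From thread: Statistics board - Going crazy! Why are all the selected predictors so bad?

-__-//... You made a typo at first and said "the more, the smaller", which is why I got confused. How come you edited your original post
and then turned around to call me out?
Setting that aside: although I've found that stepwise often selects more variables than lasso, that doesn't necessarily
mean it is overfitting.
In a situation like this, with few obs but many variables, I would use cross validation, but before that you should
still clean up your data base and remove as much multicollinearity as possible first. My main point was really about the
multicollinearity stage (before variable selection and cross validation): how to decide, based on VIF or
condition index, which variable to drop or keep at each step.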

d******3
Posts: 93

From thread: Statistics board - A question, experts please come in

E.g. 1000 observations, 10 variables x1, x2, x3, ... x10. x1 is continuous;
x2-x10 could be continuous or categorical (e.g. age, gender, race...).
Now I want to divide these 1000 obs into several groups (e.g. 3 groups) so as
to maximize the difference in x1 across these 3 groups while minimizing the
differences in x2-x10 across groups.
Thanks

t**********r
Posts: 182

From thread: Statistics board - How to find a few words with SAS?

tried. not workable, as all obs are identified.

p********a
Posts: 5352

From thread: Statistics board - [Collection] A question about distributing SAS data

☆─────────────────────────────────────☆
missshinla (missshinla) wrote at (Thu Mar 18 03:40:55 2010, US Eastern):
I have a data file with 10000 observations
and want to split it into 100 small files, each containing 100 observations:
observation 1 to 100 go to data1.txt
observation 101 to 200 go to data2.txt
...
observation 9901 to 10000 go to data100.txt
That is, each consecutive block of 100 obs goes into a new .txt (or .dat) file.
Should I write a macro, or something else?
Many thanks.
☆─────────────────────────────────────☆
tosi (夏虫语冰) wrote at (Thu Mar 18 09:49:24 2010, US Eastern):
In Data Step, use Do Loop and

p********a
Posts: 5352

From thread: Statistics board - [Collection] A question about SAS data input

☆─────────────────────────────────────☆
footprint08 (just do it) wrote at (Sun Mar 21 02:44:22 2010, US Eastern):
id refid
1 NP_001003407/// NP_001003408 /// NP_002304 /// NP_006711
2 NP_001135417 /// NP_001604
3 NP_00499494
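I want to generate multiple observations from one record, but the number of obs per record varies; the special
separator is '///'. How should I handle this?
My problem is that the printed values are only 8 characters long, yet the values are not of fixed length, and once I write $12., the "/" gets read in as well (for example '/// NP_00100'). How should I change this code? Many thanks!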
data new;
infile "C:............" missover dlm="///" ;
input id $ refid $
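The record-splitting task described in the question (one output row per '///'-separated value) can be sketched in Python. This is an illustration of the intended result, not a fix for the truncated SAS code; the function name `split_record` is hypothetical, and it assumes the id is the first whitespace-delimited token on the line.

```python
def split_record(line, sep="///"):
    """Split 'id ref1 /// ref2 /// ...' into a list of (id, ref) pairs."""
    fields = line.split(None, 1)  # id, then the rest of the line
    rec_id = fields[0]
    rest = fields[1] if len(fields) > 1 else ""
    # strip the blanks around each value so lengths can vary freely
    return [(rec_id, part.strip()) for part in rest.split(sep) if part.strip()]
```

Splitting on the whole separator string and then trimming blanks avoids both the fixed-width truncation and the stray "/" characters mentioned in the question.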

A*******s
Posts: 3942

From thread: Statistics board - [Question] How to get the row number value with proc sql

Use a summary function + group by clause to generate some summary statistics with the order number. I can use the number option to show a column named "obs", but I want to change the name.

R******d
Posts: 1436

From thread: Statistics board - [Question] How to get the row number value with proc sql

Can you give an example? I'd also like to know how to get row numbers with proc sql.


o****o
Posts: 8077

From thread: Statistics board - [Question] How to get the row number value with proc sql

group by _n_? what summarization do you want for an individual record?
or do you want some ordered cumulative statistics?
post some data step concepts first


b********y
Posts: 63

From thread: Statistics board - Question about filling in missing values (SAS, R, any software)

The following R code should be much faster. I am also curious to know
how long it takes to run through your data?
file.in = file("data_in.txt");
file.out = file("data_out.txt")
open(file.in, open = "r")
open(file.out, open = "wt")
# title
xtitle = scan(file.in, what = list(s1 = "", s2 = ""), nline = 1, quiet = T)
cat(file = file.out, c(xtitle$s1, xtitle$s2), sep = ", ", append = TRUE);
cat(file = file.out, "\n")
# first obs
xfmt = list(ID = 0, DD = 0) # readin format
x0 = scan(file.in, what = xf

b**********e
Posts: 531

From thread: Statistics board - help me to look at this code

data _NULL_;
if 0 then do;
set _temp_ nobs=nobs1;
end;
call symput ('num1', nobs1);
stop;
run;
_temp_ is a recent data set that SAS created; is "nobs" the total number of
obs in the data?

f*****u
Posts: 17

From thread: Statistics board - SAS programming question: filling in data

The data is as follows:
OBS X Y
1 2 3
2 4 .
3 5 .
4 6 .
It looks roughly like this. How can I use the retain statement to fill in the missing values of Y? The rule is
y_n = y_{n-1} + x_n
Thanks!
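The recurrence in the question carries the previous Y forward, which is what retain does in a DATA step. A minimal Python sketch of the rule, with a hypothetical function name and `None` standing in for a SAS missing value:

```python
def fill_running(xs, ys):
    """Fill missing ys (None) using y_n = y_{n-1} + x_n."""
    filled = []
    prev = None  # last known y, carried forward like a retained variable
    for x, y in zip(xs, ys):
        if y is None:
            if prev is None:
                raise ValueError("the first y must not be missing")
            y = prev + x
        filled.append(y)
        prev = y
    return filled
```

With the posted data, `fill_running([2, 4, 5, 6], [3, None, None, None])` returns `[3, 7, 12, 18]`.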

f*****u
Posts: 17

From thread: Statistics board - SAS programming question: filling in data

Continuing the data-processing discussion
OBS X
1 2
2 4
3 5
4 6
I want a new variable Y = x*lag(x), multiplied cumulatively. How should the retain statement be used?
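Assuming the question asks for a cumulative product of X (the post could also be read as each value times the previous one, so this reading is an assumption), the computation is a running accumulation. A minimal Python sketch with a hypothetical function name:

```python
from itertools import accumulate
from operator import mul


def running_product(xs):
    """Cumulative product: y_1 = x_1, y_n = y_{n-1} * x_n."""
    return list(accumulate(xs, mul))
```

For the posted X values 2, 4, 5, 6 this gives 2, 8, 40, 240.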

w*******n
Posts: 469

From thread: Statistics board - SAS help

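proc sort data=one; by id; run;
/* loop over observations; assumes macro variable num is defined elsewhere */
%macro search();
%do i=1 %to &num;
%searchone(&i);
%end;
%mend;
%macro searchone(index);
/* pick out the single observation at position &index */
data oneobs;
set DatA(firstobs=&index obs=&index);
run;
/* start from an empty data set */
data one;
if 1=1 then delete;
run;
/* matched rows go to one; everything else stays in dataB */
data dataB one;
merge oneobs(in=inone) dataB(in=B);by id;
if inone & B then output one;
if inone & B then delete;
output dataB;
run;
/* append this round's matches */
data match;
set match one; run;
%mend;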

A*******s
Posts: 3942

From thread: Statistics board - SAS - proc transpose urgent question!

I looked at the SAS help document for the by statement, and it turns out that
The maximum number of observations in any BY group in the input data set is
two;
So the workaround I can think of is
data test;
input id a b c;
cards;
1 0 0 0
1 1 1 0
1 1 1 1
2 0 0 0
2 0 0 0
;
run;
data test;
set test;
row=_N_;
run;
proc transpose data=test out=out(drop=row rename=(_Name_=type col1=response));
by row id;
proc print;
run;
Obs id type response
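The reshaping that this workaround aims for, one (id, type, response) row per original cell, can be sketched in Python. The function name `wide_to_long` and the tuple layout are assumptions for illustration; the value names a, b, c follow the input statement in the posted code.

```python
def wide_to_long(rows, value_names=("a", "b", "c")):
    """Turn (id, a, b, c) rows into (id, type, response) rows."""
    long_rows = []
    for row in rows:
        rec_id, values = row[0], row[1:]
        # one output row per variable, in column order
        for name, value in zip(value_names, values):
            long_rows.append((rec_id, name, value))
    return long_rows
```

Each input row is handled on its own, which plays the same role as the extra row variable in the BY statement.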

l******1
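Posts: 86

From thread: Statistics board - Basic SAS question: how to find the obs missing from two data sets

My mind is a complete blank right now; hoping you folks can help.
I have two data-entry data sets with a unique id. The variables are exactly the same. I need to find the ids that are
missing from either of the two.
Thanks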

A*******s
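Posts: 3942

From thread: Statistics board - Basic SAS question: how to find the obs missing from two data sets

After reading your post, my mind went blank too.
How would we know which ids your data is missing? :)
What does "the ids missing from either of the two" mean?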

a***r
Posts: 420

From thread: Statistics board - Basic SAS question: how to find the obs missing from two data sets

Will those who are absent please raise their hands? @@

d*******o
Posts: 493

From thread: Statistics board - Basic SAS question: how to find the obs missing from two data sets

I'm raising my hand!!!
OP, just use proc compare to compare the two data sets.

D******n
Posts: 2836

From thread: Statistics board - Basic SAS question: how to find the obs missing from two data sets

I guess the OP means A-B and B-A (set subtraction);
you can use SQL EXCEPT.

b*******r
Posts: 152

From thread: Statistics board - Basic SAS question: how to find the obs missing from two data sets

proc sql, left/right join, then where ... id is null. Done.

S******y
Posts: 1123

From thread: Statistics board - Basic SAS question: how to find the obs missing from two data sets

# it is easy in Python
def unique(a):
    return list(set(a))

def intersect(a, b):
    return list(set(a) & set(b))

def union(a, b):
    return list(set(a) | set(b))

def difference(a, b):
    # elements of b that are not in a
    return list(set(b).difference(set(a)))

l1 = [1, 3, 5, 7, 9]
l2 = [2, 4, 6, 8, 10, 1]
z = difference(l1, l2)
print(z)
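The goal stated earlier in the thread, finding ids missing from either data set, amounts to taking the set difference in both directions (A-B and B-A). A minimal Python sketch with a hypothetical function name and made-up id lists:

```python
def missing_ids(ids_a, ids_b):
    """Return (ids only in A, ids only in B), each sorted."""
    a, b = set(ids_a), set(ids_b)
    return sorted(a - b), sorted(b - a)
```

For example, `missing_ids([1, 2, 3, 5], [2, 3, 4])` returns `([1, 5], [4])`: ids 1 and 5 are absent from the second set, and id 4 is absent from the first.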

d*******o
Posts: 493

From thread: Statistics board - Basic SAS question: how to find the obs missing from two data sets

Your avatar looks a lot like mine.

l******1
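Posts: 86

From thread: Statistics board - Basic SAS question: how to find the obs missing from two data sets

This is exactly the result I want, but I've never used Python; SAS code would be even better.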

l******1
Posts: 86

From thread: Statistics board - Basic SAS question: how to find the obs missing from two data sets

I'm still the cuter one.

l******1
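Posts: 86

From thread: Statistics board - Basic SAS question: how to find the obs missing from two data sets

This program is great, but I didn't express myself clearly. Since I don't know which of the two entries is correct, I want to
keep both. proc compare basically meets the requirement, but it can't tell me which ids are missing.
"Those who are absent, step forward!!!!"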

l******1
Posts: 86

From thread: Statistics board - Basic SAS question: how to find the obs missing from two data sets

Hmm, looks like I'll just have to use both proc compare and SQL EXCEPT.

l******1
Posts: 86

From thread: Statistics board - Basic SAS question: how to find the obs missing from two data sets

Thanks,
StatsGuy, Actuaries, budmiller, and everyone else I confused!!!!!

l***a
Posts: 12410

From thread: Statistics board - sample size vs. number of regressors

I think first a power analysis needs to be done to decide the minimum sample
size, I am sure you know it :) Then, I think if you pay real attention to
take care of the multicollinearity and the number of selected predictors, it
will give you a very good chance to avoid overfitting. But remember there
is a rule of thumb that on average one predictor should have at least 10 obs
. Although I don't practically keep this rule all the time, it's still good
to keep it in mind.


g******h
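Posts: 266

From thread: Statistics board - How to compute this formula with a SAS macro?

I want to use SAS to compute a likelihood formula, used for manually choosing the best Box-Cox lambda (intentionally
not using the transreg procedure). lambda should be a loop variable.
Likelihood(lambda)=-n/2[1/n Sum((X^lambda-mean(X^lambda))^2)]+(lambda-1)sum(lnX)
I'm not sure how to put the several computed statistics into macro variables and then call them
in a loop. I'm also not very familiar with macros. My data is not univariate: there are two variables, X1 and X2, and I need to find the best lambda
for each of them separately. I want to go from -5 to 5 with step 0.05 for trying out the best lambda.
Could anyone who knows macros well give me some guidance? Many thanks.
Part of the data is shown below: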
Obs X1 X2
1 47.4 2.05
2 35.8 1.02
3 32.9 2.53
4 1508.5 1.23
5 1217.4
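The grid search the question describes can be sketched in Python rather than with SAS macro variables. Note the assumptions: this sketch uses the standard Box-Cox profile log-likelihood, -(n/2) log(variance of the transformed data) + (lambda-1) sum(log x), with the scaled transform (x^lambda - 1)/lambda and log x at lambda = 0. That form includes a log and a scaling that the formula as typed in the post does not show, so it may not be exactly what the poster intended. The function names are hypothetical, and the data must be positive and not all equal.

```python
import math


def boxcox_loglik(xs, lam):
    """Standard Box-Cox profile log-likelihood for positive data xs."""
    n = len(xs)
    if lam == 0:
        ys = [math.log(x) for x in xs]
    else:
        ys = [(x ** lam - 1) / lam for x in xs]
    mean_y = sum(ys) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n  # MLE variance
    return -n / 2 * math.log(var_y) + (lam - 1) * sum(math.log(x) for x in xs)


def best_lambda(xs, lo=-5.0, hi=5.0, step=0.05):
    """Return the grid value of lambda with the largest log-likelihood."""
    count = round((hi - lo) / step)
    # rounding keeps grid points such as 0.0 free of floating-point drift
    grid = [round(lo + i * step, 10) for i in range(count + 1)]
    return max(grid, key=lambda lam: boxcox_loglik(xs, lam))
```

To handle the two variables separately, call `best_lambda` once on the X1 column and once on the X2 column.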

D******n
Posts: 2836

From thread: Statistics board - [SAS] call execute gives me error

cool, it works.
what is open code?
actually what i want is
=========================
data a1;
input a $ cap;
datalines;
5.4 0.8
5.4 0.9
5.3 1.8
;
run;
data _null_;
set a1 end=eof;
if _n_=1 then do; call execute('data a2;'); end;
call execute('b = put('||cap||','||strip(a)||');output;');
if eof then call execute('run;');
run;
<----output------->
Obs b
1

S******y
Posts: 1123

From thread: Statistics board - A fairly specific algorithm question

########### Python ############
in_file = 'C:\\_original.txt'  # oloolo's example data
ls = []
with open(in_file, 'r') as f:
    next(f)  # skip header
    for line in f:
        obs, group_id, id1, ID = line.split()
        if id1 in ls and ID in ls:  # both ids already seen
            pass
        else:  # at least one of them is new
            print(group_id, id1, ID)
            ls.append(id1)
            ls.append(ID)
            ls = list(set(ls))  # dedupe
###################### END ###################

D******n
Posts: 2836

From thread: Statistics board - Question: query about checking consistency (reposted)

at least u should give a sample desired output, dude.
are u to just check 1 obs ahead? why did u mention date? i don't see
anything to do with the problem unless the date is not sorted.

h**********e
Posts: 44

From thread: Statistics board - %do questions

First create a dataset:
data index;
input i;
datalines;
1
3
6
7
;
run;
Then
%do j=1 %to 4;
data a;
set index (obs=&j);
call symput('i',i);
run;
/*do whatever you want to do with &i here*/
...
%end;
This is how I usually do this kind of work. Remember that a macro in SAS is not as
easy as a function in C. SAS is dataset-oriented in most cases.

l*********s
Posts: 5409

From thread: Statistics board - [SAS] data set options (obs=) in output tables

The obs= option controls reading, not writing.
Use a where statement to subset the output.

p*****o
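Posts: 543

From thread: Statistics board - How to randomly pick 2 numbers from 1, 2, 3, 4, 5?

Is there a simple FUNCTION that does this directly?
All I can think of is generating a random number for each of the 5 OBS, sorting, and then taking the first 2. But I'd like to know whether there is a
FUNCTION that can directly RETURN 2 numbers from 1, 2, 3, 4, 5.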
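Both approaches in the question can be sketched in Python for comparison; this does not answer whether SAS has such a function. The function names are hypothetical. The first mirrors the "attach a random number, sort, take the first 2" idea, and the second uses the standard library's `random.sample`, which draws without replacement.

```python
import random


def pick_by_sorting(values, k, rng=random):
    """Give every value a random sort key, sort, and keep the first k."""
    shuffled = sorted(values, key=lambda _: rng.random())
    return shuffled[:k]


def pick_directly(values, k, rng=random):
    """Draw k distinct values in one call."""
    return rng.sample(list(values), k)
```

Either call, for example `pick_directly([1, 2, 3, 4, 5], 2)`, returns two distinct numbers from the list.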

g**a
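Posts: 2129

From thread: Statistics board - How to randomly pick 2 numbers from 1, 2, 3, 4, 5?

It's not a matter of 5 numbers. This is about drawing 5 obs. If it were just 5 numbers, a single function would do.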

s*****n
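Posts: 2174

From thread: Statistics board - How to randomly pick 2 numbers from 1, 2, 3, 4, 5?

I still don't see the difference.
Isn't drawing 5 obs from the 1st through the nth ob
the same as picking any 5 numbers from 1 to n to use as indices?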

o****o
Posts: 8077

From thread: Statistics board - How to randomly pick 2 numbers from 1, 2, 3, 4, 5?

see my previous post
or maybe SAS has such a function and I just don't know it. I double-checked the manual,
and it looks like I am correct so far
In fact, efficient sampling out of a sequence of obs for a given prob was and is a heated debate on SAS user forums. The one I gave is probably deemed the best one
In SAS, our mindset is one-value-per-iteration......you know~~~
so I personally think many of the stream algorithms will find great support in the SAS community....in addition to the hard-core algorithm guys
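The method the poster refers to is in an earlier post that is not shown here, so the following is only one well-known one-pass technique, sequential selection sampling, and not necessarily the one meant. It visits each obs once and decides on the spot whether to keep it, which fits the one-value-per-iteration style described above. The function name is hypothetical.

```python
import random


def selection_sample(n_total, k, rng=random):
    """Pick exactly k of the positions 1..n_total in a single pass."""
    if not 0 <= k <= n_total:
        raise ValueError("k must be between 0 and n_total")
    chosen = []
    for t in range(n_total):
        remaining = n_total - t   # positions not yet examined
        needed = k - len(chosen)  # picks still required
        # keep this position with probability needed / remaining
        if rng.random() * remaining < needed:
            chosen.append(t + 1)
    return chosen
```

Because the keep probability reaches 1 when the picks still needed equal the positions left, the result always has exactly k entries, in increasing order.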

s*****n
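Posts: 2174

From thread: Statistics board - How to randomly pick 2 numbers from 1, 2, 3, 4, 5?

It feels like tasks with a lot of logic are bound to be much more complicated in SAS, and the code gets long and verbose.
For example, say I have a csv file with 100,000 obs.
1. Read in the file and randomly select 10,000 rows.
2. If the selected 10,000 rows satisfy some property (e.g., the mean of some variable > 0),
fit a simple linear regression; if not, fit a GLM regression.
3. If it is a linear regression, extract the regression parameters as the parameters of some function A.
If it is a GLM regression, extract the regression parameters as the parameters of function B.
4. Select another 10,000 rows and apply function A or function B according to the condition in 3.
5. Run steps 2-4 1000 times as a simulation, and plot some distribution for each of the LM and GLM cases.
If you did this in SAS, wouldn't it be a DATA STEP one moment, this PROC the next, then another DATA STEP,
then that PROC, and maybe even a macro or something on top?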

p*****o
Posts: 543

From thread: Statistics board - Another SAS question

The data is as follows:
VAR1
A
A
B
B
B
C
C
C
D
D
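.....
.....
What I need to do is randomly pick a number from 1, 2, 3 for each OBS (actually from 1, 2, 3, ..., 100, but let's use 3
for the discussion) and call it VAR2. The constraints are:
Within the same value of VAR1, the VAR2 values must all be different.
Across different values of VAR1, the VAR2 values may be the same.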

p*****o
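Posts: 543

From thread: Statistics board - Another SAS question

That is, I need to randomly draw a number from 1, 2, 3, ..., 100 for each OBS. But within a group
(e.g., the group with VAR1 = A), they cannot take the same number (i.e., WITHOUT REPLACEMENT).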
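The constraint described here, distinct numbers within a VAR1 group but free reuse across groups, can be sketched in Python by sampling without replacement once per group. The function name and argument layout are hypothetical, and the sketch assumes no group is larger than the pool of numbers.

```python
import random
from collections import Counter


def assign_within_groups(var1, pool_size=100, rng=random):
    """Return one VAR2 value per row, distinct within each VAR1 group."""
    sizes = Counter(var1)
    # one draw without replacement per group, sized to that group
    draws = {group: rng.sample(range(1, pool_size + 1), size)
             for group, size in sizes.items()}
    # hand the drawn numbers out row by row
    return [draws[group].pop() for group in var1]
```

With the posted layout, `assign_within_groups(list("AABBBCCCDD"), pool_size=3)` gives each of the three B rows a different number from 1 to 3, while rows in different groups may share a number.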
