imputation question? thanks - Statistics board - 未名 archive

c**********5 (posts: 653) #1
Hi everyone,
I am new to this topic. Can anybody help me out?
In the pilot study the sample size was around 100, and almost half of the subjects have missing values. I would like to use multiple imputation to deal with the missing data problem.
The current models are:
Outcome1 (post-measurement1 - pre-measurement1) = pre-measurement1 + group
Outcome2 (post-measurement2 - pre-measurement2) = pre-measurement2 + group
…….
There are many outcomes we are interested in. I have the following questions:
1. How should I build the imputation model? Which variables should I include in it in my case (dependent variables, independent variables, and others)? Note: data are missing not only in the outcomes but also in the independent variables (a very small portion).
2. How many imputations do you recommend? (Usually 5-10; however, if the proportion of missing values is large, maybe we need more, e.g. 50?)
Thanks.

c**********5 (posts: 653) #2
Bump.

w******a (posts: 25) #3
Here is an R example that imputes one missing value in each record. Half of the code only generates the sample data, so you probably need just the second half, but including all of it here helps you understand what is going on.
The data will look like this: two columns, col1 and col2, one row per record, with the col2 value missing in some rows.

# library(Rlab)   # not required here: rnorm() and rbinom() come with base R (stats)
alp = 1                 # slope of Y2 on Y1
Prob_R1 = 0.5           # probability that Y2 is observed
Prob_R0 = 1 - Prob_R1   # probability that Y2 is missing
len_Y1 = 200
K_delta = 2

# Generate the sample data
Y1 = rnorm(len_Y1, mean=0, sd=1)
R1 = rbinom(n=len_Y1, size=1, prob=Prob_R1)   # 1 = observed, 0 = missing
Y2 = rnorm(n=len_Y1, mean=alp*Y1, sd=1)
Y2[R1==0] = NA
data = data.frame(cbind(Y1,Y2))

# Fit the regression on the observed records
reg = glm(Y2~Y1, family=gaussian, data)
sigma = sd(reg$residuals)
delta_grid = K_delta * (-2:2/2)   # grid from -K_delta to K_delta
delta = sigma * delta_grid        # shifts from -K_delta*sigma to K_delta*sigma

E_Y2 = NULL
for(i in 1:length(delta)) {
  Y2[R1==0] = NA
  Y2.pred = delta[i] + predict(reg, newdata=data)
  Y2[R1==0] = 0
  Y2.hat = Y2*R1 + Y2.pred*(1-R1)   # observed value if R1==1, shifted prediction otherwise
  par(mfrow=c(1,2))
  plot(Y1[R1==1], Y2[R1==1])
  points(Y1[R1==0], Y2.hat[R1==0], pch="+")
  hist(Y2.hat, xlim=c(-4,4))
  E_Y2[i] = mean(Y2.hat)
}
par(mfrow=c(1,1))
plot(delta, E_Y2)
# lm(formula = E_Y2 ~ delta) gave: E_Y2 = 0.06531 + 0.54500*delta

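The loop in the post above can also be wrapped in a small function. This is only a sketch of the same delta-adjustment idea, not the poster's code: the name delta_adjusted_mean() and its arguments are made up for illustration, and lm() is used in place of glm(..., family=gaussian), which fits the same model.

```r
# Sketch only: delta_adjusted_mean() and its argument names are illustrative.
delta_adjusted_mean <- function(y1, y2, delta) {
  observed <- !is.na(y2)
  fit <- lm(y2 ~ y1)                            # rows with missing y2 are dropped
  pred <- predict(fit, newdata = data.frame(y1 = y1))
  y2_hat <- ifelse(observed, y2, pred + delta)  # shift only the imputed values
  mean(y2_hat)
}

set.seed(1)
y1 <- rnorm(200)
y2 <- rnorm(200, mean = y1)
y2[rbinom(200, size = 1, prob = 0.5) == 0] <- NA   # about half missing

deltas <- c(-2, -1, 0, 1, 2)
means <- sapply(deltas, function(d) delta_adjusted_mean(y1, y2, d))
print(cbind(delta = deltas, E_Y2 = means))
```

Because only the imputed records are shifted, the estimated mean is linear in delta with a slope equal to the fraction of missing records. That is in line with the slope of about 0.545 reported in the post for a missing rate of about 0.5.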
w******a (posts: 25) #4
Here is an R example that imputes one or two missing values in each record.
The data will look like this: three columns, col1, col2 and col3, one row per record, with col3 or both col2 and col3 missing in some rows.

# library(Rlab)   # not required here: rnorm() and rbinom() come with base R (stats)
alp = 1
K_delta = 2
len_Y1 = 200

# Sample setting:
# Measurement   N_patient   Percent
# 1             12          0.18
# 1 2           4           0.05
# 1 2 3         22          0.78
# Convert the above info into missing rates:
# N_measurement   1                    2               3
# Occupy_rate     0.78+0.05+0.18       0.78+0.05       0.78
# Missing_rate    1-(0.78+0.05+0.18)   1-(0.78+0.05)   1-0.78

# Missing rate for each measurement at time points 1,2,3
Prob_R1 = 0
Prob_R2 = 1-0.78-0.05
Prob_R3 = 1-0.78

# Measurements at time points 1,2,3
Y1 = rnorm(n=len_Y1, mean=0, sd=1)
Y2 = rnorm(n=len_Y1, mean=alp*Y1, sd=1)   # mean(Y2)=-0.03, sum(Y1)/200=0.024
Y3 = rnorm(n=len_Y1, mean=alp*Y1, sd=1)

# R: response indicator, 1=observed; 0=missing
R1 = rep(1,len_Y1)
R2 = rbinom(n=len_Y1, size=1, prob=1-Prob_R2)
R3 = rbinom(n=len_Y1, size=1, prob=1-(Prob_R3-Prob_R2))
Y2[R2==0] = NA
R3[R2==0] = 0        # if Y2 is missing, Y3 is missing too
Y3[R3==0] = NA
data = data.frame(cbind(Y1,Y2,Y3,R1,R2,R3))

# Estimate Y2
reg = glm(Y2~Y1, family=gaussian, data)
sigma = sd(reg$residuals)
delta_grid = K_delta * (-2:2/2)   # grid from -K_delta to K_delta
delta = sigma * delta_grid        # shifts from -K_delta*sigma to K_delta*sigma
E_Y2 = NULL
par(mfrow=c(4,3))
for(i in 1:length(delta)) {
  Y2[R2==0] = NA
  Y2.pred = delta[i] + predict(reg, newdata=data)
  Y2[R2==0] = 0
  Y2.hat = Y2*R2 + Y2.pred*(1-R2)
  #par(mfrow=c(1,2))
  plot(Y1[R2==1], Y2[R2==1])
  points(Y1[R2==0], Y2.hat[R2==0], pch="+", col="red")
  hist(Y2.hat, xlim=c(-4,4))
  E_Y2[i] = mean(Y2.hat)
}
par(mfrow=c(1,1))
plot(delta, E_Y2)

# Estimate Y3 (uses Y2.hat from the last delta of the loop above)
data$Y2.hat = Y2.hat   # make the predictor explicit in the data frame
reg2 = glm(Y3~Y1+Y2.hat, family=gaussian, data)
sigma2 = sd(reg2$residuals)
delta_grid = K_delta * (-2:2/2)   # grid from -K_delta to K_delta
delta = sigma2 * delta_grid       # shifts from -K_delta*sigma2 to K_delta*sigma2
E_Y3 = NULL
par(mfrow=c(4,5))
# Y3 against Y1
for(i in 1:length(delta)) {
  Y3[R3==0] = NA
  Y3.pred = delta[i] + predict(reg2, newdata=data)
  Y3[R3==0] = 0
  Y3.hat = Y3*R3 + Y3.pred*(1-R3)
  plot(Y1[R3==1], Y3[R3==1])
  points(Y1[R3==0], Y3.hat[R3==0], pch="+", col="red")
}
# Y3 against Y2.hat
for(i in 1:length(delta)) {
  Y3[R3==0] = NA
  Y3.pred = delta[i] + predict(reg2, newdata=data)
  Y3[R3==0] = 0
  Y3.hat = Y3*R3 + Y3.pred*(1-R3)
  plot(Y2.hat[R3==1], Y3[R3==1])
  points(Y2.hat[R3==0], Y3.hat[R3==0], pch="+", col="red")
}
# Histogram and mean of Y3.hat for each delta
for(i in 1:length(delta)) {
  Y3[R3==0] = NA
  Y3.pred = delta[i] + predict(reg2, newdata=data)
  Y3[R3==0] = 0
  Y3.hat = Y3*R3 + Y3.pred*(1-R3)
  hist(Y3.hat, xlim=c(-4,4))
  E_Y3[i] = mean(Y3.hat)
}

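The response indicators in the post above follow a monotone dropout pattern: Y3 can only be observed when Y2 is. A minimal sketch of that step on its own is given below. The helper simulate_monotone() and its argument names are hypothetical, and it uses the conditional probability p_obs3 / p_obs2 for visit 3, which is an assumption that differs slightly from the post's 1-(Prob_R3-Prob_R2).

```r
# Sketch only: simulate_monotone() is a hypothetical helper, not part of the post.
# p_obs2 and p_obs3 are the marginal probabilities of observing visits 2 and 3.
simulate_monotone <- function(n, p_obs2, p_obs3) {
  stopifnot(p_obs2 > 0, p_obs2 <= 1, p_obs3 >= 0, p_obs3 <= p_obs2)
  r2 <- rbinom(n, size = 1, prob = p_obs2)
  # Visit 3 can only be observed if visit 2 was observed, so draw it
  # conditionally on r2 == 1 with probability p_obs3 / p_obs2.
  r3 <- r2 * rbinom(n, size = 1, prob = p_obs3 / p_obs2)
  data.frame(R1 = 1, R2 = r2, R3 = r3)
}

set.seed(2)
resp <- simulate_monotone(n = 100000, p_obs2 = 0.78 + 0.05, p_obs3 = 0.78)
print(colMeans(resp))   # empirical observation rates per visit
```

With a large n the empirical rates can be compared against the Occupy_rate row of the sample setting to check that the simulated pattern matches the intended one.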
c**********5 (posts: 653) #5
Hi, thanks a lot. I have not used R for a while, but I can probably pick it up again. I will study your code tonight. I am not authorized to install R on my workstation. I know how to write the SAS code using PROC MI (2 steps). I am still struggling with the questions above.

d******g (posts: 130) #6
Not sure if you have read the good post on UCLA's SAS page on this topic. Here is the link:
http://www.ats.ucla.edu/stat/sas/seminars/missing_data/part1.htm
Hope this helps.
[In reply to the original post by c**********5]

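c**********5 (posts: 653) #7
Hi, thanks. I have read it, and it is my favorite site. Still, thank you very much. I have never used this method before. After reading some material, my impression is that with an arbitrary missing-data pattern, when we build the imputation model we can put every variable related to the variables of interest into the model, whether it is a dependent variable or an independent variable. I am not sure whether my understanding is correct. Thanks.
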
H******r (posts: 2879) #8
Almost all existing imputation methods are based on the MAR (missing at random) assumption, so think about whether this assumption holds in your problem. The imputation model can be a "big" model that includes all "useful" predictors and even some "useless" ones. 10 multiply-imputed datasets should be enough. You may also check IVEware for MI: it works for non-normal models and lets you specify bounds as well.
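
To make the advice above concrete, here is a rough base-R sketch of multiple imputation with 10 imputed datasets, pooled with Rubin's rules. Everything in it is an assumption for illustration: the simulated data, the function names impute_once(), analyse() and pool_rubin(), a normal linear imputation model, and missingness in the post measurement only (the thread also mentions a little missingness in the covariates, which this sketch ignores).

```r
# Sketch only: data and all function names are illustrative, not from the thread.
set.seed(3)
n <- 100
group <- rbinom(n, size = 1, prob = 0.5)
pre   <- rnorm(n)
post  <- pre + 0.5 * group + rnorm(n)
post[rbinom(n, size = 1, prob = 0.5) == 0] <- NA   # about half missing, at random
dat <- data.frame(group, pre, post)

# One imputed dataset: draw regression parameters, then draw the missing values.
impute_once <- function(d) {
  mis <- is.na(d$post)
  fit <- lm(post ~ pre + group, data = d)          # uses the observed rows only
  sigma2 <- sum(resid(fit)^2) / rchisq(1, df = fit$df.residual)
  V <- summary(fit)$cov.unscaled                   # (X'X)^{-1}
  z <- rnorm(length(coef(fit)))
  beta <- coef(fit) + sqrt(sigma2) * drop(t(chol(V)) %*% z)
  X_mis <- cbind(1, d$pre[mis], d$group[mis])      # same column order as coef(fit)
  d$post[mis] <- drop(X_mis %*% beta) + rnorm(sum(mis), sd = sqrt(sigma2))
  d
}

# Analysis model from the original question: change score on baseline and group.
analyse <- function(d) {
  fit <- lm(I(post - pre) ~ pre + group, data = d)
  c(est = unname(coef(fit)["group"]), var = vcov(fit)["group", "group"])
}

# Rubin's rules for a scalar estimate.
pool_rubin <- function(est, u) {
  m <- length(est)
  ubar <- mean(u)                       # within-imputation variance
  b <- var(est)                         # between-imputation variance
  total <- ubar + (1 + 1 / m) * b
  df <- (m - 1) * (1 + ubar / ((1 + 1 / m) * b))^2
  list(estimate = mean(est), total_var = total, df = df)
}

m <- 10
res <- t(sapply(seq_len(m), function(i) analyse(impute_once(dat))))
pooled <- pool_rubin(res[, "est"], res[, "var"])
print(pooled)
```

A 95% interval for the group effect would then be pooled$estimate + c(-1, 1) * qt(0.975, pooled$df) * sqrt(pooled$total_var). In SAS the same two steps correspond to PROC MI followed by PROC MIANALYZE.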
