A question about multicollinearity - Statistics board

This page is an excerpt and archive of the corresponding thread on 未名空间.

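h*******n
Posts: 458

Here is the problem: there are n independent variables X1 through Xn, and Y is
the dependent variable. Before building the model I want to remove some of the
highly collinear variables. Specifically: if two variables X1 and X2 have a
correlation coefficient >= 0.5, drop whichever one is less correlated with Y,
so as to reduce the number of variables. Suppose n is fairly large, say 800, and
the data set is large too, say several million observations. How can this be
done without too much computation?
What I do now is the most naive approach: first compute the corr of each of X1,
X2, ... with Y and store it in a dataset; then compute the corr of each pair
(X1 and X2, X1 and X3, ...); if it is >= 0.5, look it up in the dataset and drop
the X that is less correlated with Y; then move on to X2/X3, X2/X4, .... The
program takes a very long time. Is there a better algorithm, or something off
the shelf that does this?
I hope I have stated the problem clearly. Any expert advice is appreciated.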

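f****s
Posts: 3078

stepwise selection

[In reply to h*******n:]

: Here is the problem: there are n independent variables X1 through Xn, and Y is
: the dependent variable. Before building the model I want to remove some of the
: highly collinear variables. Specifically: if two variables X1 and X2 have a
: correlation coefficient >= 0.5, drop whichever one is less correlated with Y,
: so as to reduce the number of variables. Suppose n is fairly large, say 800, and
: the data set is large too, say several million observations. How can this be
: done without too much computation?
: What I do now is the most naive approach: first compute the corr of each of X1,
: X2, ... with Y and store it in a dataset; then compute the corr of each pair
: (X1 and X2, X1 and X3, ...); if it is >= 0.5, look it up in the dataset and drop
: the X that is less correlated with Y; then move on to X2/X3, X2/X4, .... The
: program takes a very long time. Is there a better algorithm, or something off
: the shelf that does this?
: I hope I have stated the problem clearly. Any expert advice is appreciated.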

m***c
Posts: 118

Just curious: is collinearity the same thing as correlation?

s*******e
Posts: 1385

Run VARCLUS first.

[In reply to h*******n:]

d******9
Posts: 404

That is the right answer.
Google SAS Proc VarClus.

[In reply to s*******e:]

: Run VARCLUS first.

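h*******n
Posts: 458

Thanks, everyone. VARCLUS can split the n variables into several groups; what
should I do next? Within each group, pick a few variables based on their CORR
with Y?

[In reply to d******9:]

: That is the right answer.
: Google SAS Proc VarClus.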
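
The follow-up question above (pick variables within each group by their correlation with Y) can be sketched as below. This is only one possible rule, not something the thread confirms as the standard VARCLUS workflow. The cluster assignment is assumed to come from the PROC VARCLUS output, and the function name, the cluster ids and all numbers are made up for illustration.

```python
def pick_representatives(clusters, corr_with_y):
    """For each cluster, return the member with the largest |corr with Y|.

    clusters:     dict mapping cluster id -> list of variable names
                  (hypothetical; assumed to be read off the VARCLUS output)
    corr_with_y:  dict mapping variable name -> correlation with Y
                  (the table the original poster already stores in a dataset)
    """
    return {
        cluster_id: max(members, key=lambda name: abs(corr_with_y[name]))
        for cluster_id, members in clusters.items()
    }

# Illustrative, made-up inputs.
clusters = {1: ["x1", "x2"], 2: ["x3", "x4", "x5"]}
corr_with_y = {"x1": 0.62, "x2": -0.70, "x3": 0.10, "x4": 0.35, "x5": -0.20}

representatives = pick_representatives(clusters, corr_with_y)
```

Here x2 wins cluster 1 because the rule compares absolute values, so a strong negative correlation counts as much as a strong positive one. Keeping more than one variable per cluster would only need a sort instead of `max`.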

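h*******n
Posts: 458

Stepwise happens during modeling; what I want now is to reduce the linear
correlation among the X's before modeling. Besides, the result of stepwise does
not guarantee that the correlations among the selected X's are small.

[In reply to f****s:]

: stepwise selection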

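h*******n
Posts: 458

My understanding is that the former is correlation among multiple X's. I may be
wrong, though; I gave my statistics theory back to my teachers long ago.

[In reply to m***c:]

: Just curious: is collinearity the same thing as correlation?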

s*r
Posts: 2757

Why not try lasso first?

K***s
Posts: 2063

Do an SVD and drop the components with small eigenvalues,
or use ridge regression.

[In reply to h*******n:]
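
The two suggestions in this reply are closely related, which a short derivation makes visible. The notation is an assumption, not taken from the post: X is the centered n-by-p predictor matrix with thin SVD X = U D V^T, d_j are the singular values (so d_j^2 are the eigenvalues of X^T X), k is a hypothetical cutoff, and lambda >= 0 is the ridge tuning parameter.

```latex
\begin{aligned}
X &= U D V^{\top} = \sum_{j=1}^{p} d_j\, u_j v_j^{\top},
  \qquad d_1 \ge d_2 \ge \dots \ge d_p > 0, \\
\hat\beta_{\text{OLS}} &= \sum_{j=1}^{p} \frac{1}{d_j}\,(u_j^{\top} y)\, v_j, \\
\hat\beta_{\text{trunc}} &= \sum_{j=1}^{k} \frac{1}{d_j}\,(u_j^{\top} y)\, v_j, \\
\hat\beta_{\text{ridge}} &= (X^{\top} X + \lambda I)^{-1} X^{\top} y
  = \sum_{j=1}^{p} \frac{d_j}{d_j^{2} + \lambda}\,(u_j^{\top} y)\, v_j .
\end{aligned}
```

Near-collinearity shows up as very small d_j, which makes the 1/d_j terms in the least-squares solution blow up. Truncation removes those terms outright, while ridge damps them smoothly, since d_j/(d_j^2 + lambda) is close to 1/d_j for large d_j and close to 0 for small d_j. Neither approach drops any original variable, so this is coefficient stabilization rather than the variable reduction asked for in the first post.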

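w**********y
Posts: 1691

One of the major weaknesses of LASSO is that when several x's are strongly
correlated with each other, it will very likely select only one or two of them.
Group lasso is meant for this. Clustering with SAS, as someone suggested above,
should be similar.

[In reply to s*r:]

: Why not try lasso first?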

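s*r
Posts: 2757

Is that a weakness?

[In reply to w**********y:]

: One of the major weaknesses of LASSO is that when several x's are strongly
: correlated with each other, it will very likely select only one or two of them.
: Group lasso is meant for this. Clustering with SAS, as someone suggested above,
: should be similar.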

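w**********y
Posts: 1691

Suppose the true model is y ~ x1 + x2 + x3 + ...
x1 is strongly correlated with x2 and x3, and lasso just never selects x1. Is
that a weakness or not? What do you think?

[In reply to s*r:]

: Is that a weakness?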

s*r
Posts: 2757

If x1 and x2/x3 are highly correlated, I would consider them (different
measurements of) the same (latent) variable. It is obvious that if x2/x3 is
important, the highly correlated x1 is important as well.

[In reply to w**********y:]

: Suppose the true model is y ~ x1 + x2 + x3 + ...
: x1 is strongly correlated with x2 and x3, and lasso just never selects x1. Is
: that a weakness or not? What do you think?

o****o
Posts: 8077

Why drop them? You lose information. Just run a ridge regression and be done
with it.

[In reply to h*******n:]

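w**********y
Posts: 1691

"I would consider them (different measurements of) the same (latent)
variable": that assumption is not correct. A true variable and a nuisance
variable can certainly be fairly strongly correlated.
A statistical model has at least two main uses: explanation and prediction.
Skim some of the classic lasso papers: one of the key theoretical results for
lasso is the condition under which it selects the true explanatory variables
with probability tending to 1, and what to do when that condition does not hold.
The official term is selection consistency.
How to handle this is itself a major direction in high-dimensional variable
selection.
Throwing a pile of variables straight into lasso is convenient, but also
dangerous, especially with small samples.

[In reply to s*r:]

: If x1 and x2/x3 are highly correlated, I would consider them (different
: measurements of) the same (latent) variable. It is obvious that if x2/x3 is
: important, the highly correlated x1 is important as well.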

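A*******s
Posts: 3942

I can't help feeling that this theory is still too far from practice.
Take the loan default model, the most common one in commercial banks.
Economically, the most "true" variable is the difference between a customer's
monthly income and expenses,
but that variable cannot be observed directly.
All the usable variables are merely correlated with this true latent variable.
In that case, does it really matter whether the model has the oracle property?

[In reply to w**********y:]

: "I would consider them (different measurements of) the same (latent)
: variable": that assumption is not correct. A true variable and a nuisance
: variable can certainly be fairly strongly correlated.
: A statistical model has at least two main uses: explanation and prediction.
: Skim some of the classic lasso papers: one of the key theoretical results for
: lasso is the condition under which it selects the true explanatory variables
: with probability tending to 1, and what to do when that condition does not hold.
: The official term is selection consistency.
: How to handle this is itself a major direction in high-dimensional variable
: selection.
: Throwing a pile of variables straight into lasso is convenient, but also
: dangerous, especially with small samples.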

s*r
Posts: 2757

No, this is not an assumption; this is my interpretation of the results. All
views are wrong, but some are useful.
Thanks for the information. In my practice, the problem is usually that we do
not include the true signal in the design matrix; we just hope (and wishes do
not always come true) that certain columns in the design matrix are correlated
with the true signal(s).
Any operation on small samples is dangerous. With regularization, lasso is
much safer than the usual (stepwise) regression or the usual (unsupervised)
clustering approach.

[In reply to w**********y:]

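w**********y
Posts: 1691

The problem you describe is not the one LASSO is meant to solve.
In practice, you should at least do exploring/pre-screening + variable
transformation/combination + variable selection.
The issue you raise belongs mainly to the first two steps; lasso mainly targets
the third. It is precisely because of lasso's shortcomings that we later got
adaptive lasso, group lasso, and, in the last few years, Peter Bühlmann's work
on clustering and sparse estimation.
I happened to use Peter Bühlmann's method four or five years ago; the
out-of-sample performance was very good, especially when the data really do
have a group/cluster structure.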

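h*******n
Posts: 458

An update: the solution turned out to be quite simple. Instead of computing
CORR pair by pair, compute the correlations for all several hundred variables
at once with PROC CORR, then go through the resulting table by row and column,
looking at each value, and eliminate variables.
For SAS, computing several hundred at once takes about the same time as
computing two, but running it many times is a different story.
I haven't used LASSO, which several people above mentioned. I'll look into it
when I have time.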
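
The approach in this update (build the whole correlation table once, then scan it) can be sketched in plain Python as below. This is a minimal illustration, not the poster's SAS program. The greedy visiting order (strongest correlation with Y first) is an assumption: it is one way to apply the ">= 0.5, drop the one less correlated with Y" rule, and it may differ slightly from a pair-by-pair scan. The function names and toy data are made up. For hundreds of variables and millions of rows, a matrix routine such as PROC CORR would replace the explicit loops.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

def correlation_filter(columns, y, threshold=0.5):
    """columns: dict mapping variable name -> list of values.
    Returns the surviving names, strongest |corr with y| first."""
    names = list(columns)
    # Step 1: |corr(X_j, Y)| for every predictor, computed once.
    corr_y = {name: abs(pearson(columns[name], y)) for name in names}
    # Step 2: the full |corr(X_j, X_k)| table, computed once.
    corr_xx = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            corr_xx[a, b] = corr_xx[b, a] = abs(pearson(columns[a], columns[b]))
    # Step 3: visit predictors from most to least correlated with Y and keep
    # one only if it is below the threshold against everything already kept.
    kept = []
    for name in sorted(names, key=lambda k: corr_y[k], reverse=True):
        if all(corr_xx[name, other] < threshold for other in kept):
            kept.append(name)
    return kept

# Toy data, made up for illustration.
columns = {
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [2, 4, 6, 8, 10, 13],   # nearly a multiple of x1
    "x3": [1, -1, 1, -1, 1, -1],  # only weakly correlated with x1 and x2
}
y = [1.5, 1.5, 3.5, 3.5, 5.5, 5.5]
kept = correlation_filter(columns, y)
```

On this toy data `kept` is `["x1", "x3"]`: x2 is dropped because it is almost perfectly correlated with x1 and slightly less correlated with Y. The sketch does not guard against constant columns, which would cause a division by zero in `pearson`.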

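d******e
Posts: 7844

The assessment of Lasso is correct.
However, group lasso requires the group structure to be known in advance; in
other words, you need to supply the highly correlated variables, and they must
form a dijoint partition.
When the structure is unknown, multicolinearity is usually handled with elastic
net (ridge penalty + l1 penalty). In the end you still need the ridge penalty to
precodition the loss function.

[In reply to w**********y:]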
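
The contrast drawn in this exchange (lasso tends to keep only one of several strongly correlated predictors, while the ridge part of elastic net spreads the weight) can be seen in a small coordinate-descent sketch. It is an illustration under stated assumptions, not a production solver: the objective scaling (1/2n)||y - Xb||^2 + lam*(alpha*||b||_1 + (1-alpha)/2*||b||^2), the function names, and the toy data with two identical columns (the extreme case of correlation) are all chosen here for illustration.

```python
def soft_threshold(value, cutoff):
    """Shrink value toward zero by cutoff; exactly zero inside [-cutoff, cutoff]."""
    if value > cutoff:
        return value - cutoff
    if value < -cutoff:
        return value + cutoff
    return 0.0

def elastic_net_cd(X, y, lam, alpha, sweeps=500):
    """Cyclic coordinate descent for the elastic net objective stated above.
    X is a list of rows; alpha=1.0 gives the lasso, alpha<1 adds a ridge part."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    # Mean square of each column, needed in the update denominator.
    col_ms = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            # Covariance of column j with the residual that leaves j out.
            rho = 0.0
            for i in range(n):
                fit_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - fit_without_j)
            rho /= n
            beta[j] = soft_threshold(rho, lam * alpha) / (col_ms[j] + lam * (1 - alpha))
    return beta

x = [-1.0, 0.0, 1.0]
X = [[v, v] for v in x]      # two identical, perfectly correlated predictors
y = [2.0 * v for v in x]

lasso = elastic_net_cd(X, y, lam=0.1, alpha=1.0)  # l1 penalty only
enet = elastic_net_cd(X, y, lam=0.1, alpha=0.5)   # l1 + ridge penalty
```

With these settings the lasso fit puts essentially all the weight on the first column and leaves the second at zero (up to floating-point noise), whereas the elastic net fit gives the two columns equal coefficients. With identical columns the lasso solution is not unique, so which column receives the weight here is an artifact of the update order.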

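D******n
Posts: 2836

That typo....

function.

[In reply to d******e:]

: The assessment of Lasso is correct.
: However, group lasso requires the group structure to be known in advance; in
: other words, you need to supply the highly correlated variables, and they must
: form a dijoint partition.
: When the structure is unknown, multicolinearity is usually handled with elastic
: net (ridge penalty + l1 penalty). In the end you still need the ridge penalty to
: precodition the loss function.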

d******e
Posts: 7844

Apart from disjoint missing an s, what other typo is there?

[In reply to D******n:]

: That typo....
:
: function.
