A question about feature selection - DataSciences board

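d******4
Posts: 132

In an interview I was asked: if a data set has 50,000 features, how do you select among them?
I answered with the usual methods such as lasso and forward stepwise. The interviewer said
those won't work, because they are only meant for a small number of features.
What does everyone think?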

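d******e
Posts: 7844

The interviewer is clueless; 50,000 features is no big deal at all.
He probably meant to ask about the case where the number of features exceeds the available computing capacity.
In that case you need simpler methods: for regression, do marginal correlation
screening; for two-class classification, you can run a per-feature t-test.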
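
A minimal sketch of the marginal screening described above, assuming dense data held as a list of feature columns. The helper names (`pearson`, `welch_t`, `screen`), the toy data, and the choice of the Welch variant of the t-test are illustrative assumptions, not something the poster specified. Each feature is scored on its own against the response, so the cost grows linearly with the number of features.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def welch_t(a, b):
    """Welch two-sample t statistic for two groups of values."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((v - ma) ** 2 for v in a) / (na - 1)
    vb = sum((v - mb) ** 2 for v in b) / (nb - 1)
    se = math.sqrt(va / na + vb / nb)
    return 0.0 if se == 0 else (ma - mb) / se

def screen(columns, y, k, binary_target=False):
    """Score every feature column independently and return the indices of the top k."""
    scores = []
    for j, col in enumerate(columns):
        if binary_target:
            g1 = [v for v, t in zip(col, y) if t == 1]
            g0 = [v for v, t in zip(col, y) if t == 0]
            s = abs(welch_t(g1, g0))
        else:
            s = abs(pearson(col, y))
        scores.append((s, j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]

# Toy data (hypothetical): 500 noise features, two of which drive the response.
random.seed(0)
n, p = 200, 500
columns = [[random.gauss(0, 1) for _ in range(n)] for _ in range(p)]
y_reg = [3 * columns[7][i] - 2 * columns[42][i] + random.gauss(0, 1) for i in range(n)]
y_cls = [1 if columns[7][i] + random.gauss(0, 0.5) > 0 else 0 for i in range(n)]

top_reg = screen(columns, y_reg, k=10)                      # regression: |correlation|
top_cls = screen(columns, y_cls, k=10, binary_target=True)  # two classes: |t|
```

Because every feature is scored in isolation, the loop is trivially parallel, which is why this kind of screening stays feasible when a joint model does not.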

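[In reply to d******4:]

: In an interview I was asked: if a data set has 50,000 features, how do you select among them?
: I answered with the usual methods such as lasso and forward stepwise. The interviewer said
: those won't work, because they are only meant for a small number of features.
: What does everyone think?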

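d******4
Posts: 132

So you mean screening the features one at a time with a simple criterion?

[In reply to d******e:]

: The interviewer is clueless; 50,000 features is no big deal at all.
: He probably meant to ask about the case where the number of features exceeds the available computing capacity.
: In that case you need simpler methods: for regression, do marginal correlation
: screening; for two-class classification, you can run a per-feature t-test.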

d******e
Posts: 7844

Yes.

[In reply to d******4:]

: So you mean screening the features one at a time with a simple criterion?

T*****u
Posts: 7103

If there are too many, use a filter rather than a wrapper.

T*****u
Posts: 7103

What if they are all weak features?

n*****3
Posts: 1584

I'd also like to know about methods other than LASSO/elastic net...
Cluster the highly correlated variables first?
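
One way to read "cluster the highly correlated variables first" is a greedy grouping pass that keeps one representative per group. The sketch below is an assumption about what such a step could look like; the function name `group_correlated`, the 0.9 threshold, and the toy data are all hypothetical.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def group_correlated(columns, threshold=0.9):
    """Greedy grouping: a feature joins the first group whose representative it
    correlates with at or above the threshold, otherwise it starts a new group."""
    groups = []  # each group is a list of column indices; the first one is the representative
    for j, col in enumerate(columns):
        for g in groups:
            if abs(pearson(columns[g[0]], col)) >= threshold:
                g.append(j)
                break
        else:
            groups.append([j])
    return groups

# Toy data (hypothetical): three underlying signals, each observed three times with small noise.
random.seed(1)
n = 300
base = [[random.gauss(0, 1) for _ in range(n)] for _ in range(3)]
columns = [[b[i] + random.gauss(0, 0.1) for i in range(n)] for b in base for _ in range(3)]

groups = group_correlated(columns, threshold=0.9)
representatives = [g[0] for g in groups]  # one feature kept per group
```

The number of correlation computations is the number of features times the number of groups, so the pass is only cheap when the features really do collapse into few groups; with 50,000 nearly independent features it approaches a quadratic cost.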

[In reply to d******4:]

Z**0
Posts: 1119

PCA, SVD...?

n*****3
Posts: 1584

The results of these are hard to interpret.

[In reply to Z**0:]

: PCA, SVD...?

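T*****u
Posts: 7103

That would take forever to compute. With data of length 15k, computing just the first few principal
components takes several minutes, and that is with only a few hundred samples.

[In reply to Z**0:]

: PCA, SVD...?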

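t*****e
Posts: 364

Feature selection methods generally fall into two classes: filter-based and wrapper/embedded-based.
Forward stepwise can be ruled out for 50,000 features on compute time alone, but why wouldn't lasso work?
Did the interviewer reject it because of compute time, or because the selected features give poor
predictive performance? The glmnet package in R uses coordinate descent, so 50k features should be
quite fast. As for predictive performance, nothing is absolute; it is all dataset dependent. If the
interviewer knows the field, he should have heard of the no free lunch theorem. Maybe he wanted you
to mention filter-based methods like correlation, mutual info, etc.?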
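
To illustrate why coordinate descent scales well for lasso, here is a bare-bones sketch that minimizes (1/2n)·||y − Xβ||² + λ·||β||₁ one coefficient at a time. It is not glmnet's implementation, which adds refinements this sketch omits; the function names, the fixed number of sweeps, the absence of an intercept, and the toy data are assumptions made for illustration.

```python
import random

def soft_threshold(z, lam):
    """Soft-thresholding operator: shrink z toward zero by lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_cd(columns, y, lam, n_sweeps=100):
    """Cyclic coordinate descent for the lasso; columns is a list of feature columns."""
    n, p = len(y), len(columns)
    beta = [0.0] * p
    resid = list(y)  # residual y - X @ beta, starting from beta = 0
    col_sq = [sum(v * v for v in col) / n for col in columns]
    for _ in range(n_sweeps):
        for j, col in enumerate(columns):
            if col_sq[j] == 0.0:
                continue
            # Covariance of column j with the partial residual (feature j's own fit added back).
            rho = sum(c * r for c, r in zip(col, resid)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                # Keep the residual in sync so each update costs O(n).
                resid = [r - delta * c for r, c in zip(resid, col)]
                beta[j] = new
    return beta

# Toy data (hypothetical): 50 features, only two of which matter.
random.seed(0)
n, p = 100, 50
columns = [[random.gauss(0, 1) for _ in range(n)] for _ in range(p)]
y = [2.0 * columns[3][i] - 1.5 * columns[10][i] + random.gauss(0, 0.1) for i in range(n)]

beta = lasso_cd(columns, y, lam=0.1)
selected = [j for j, b in enumerate(beta) if b != 0.0]  # features with nonzero coefficients
```

Each coordinate update touches a single column, so one sweep costs roughly n times p operations, and coefficients that stay at zero never trigger a residual update.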

[In reply to d******4:]

t*****e
Posts: 364

They are not really feature selection methods but dimension reduction
methods. If you mean using the PCA loadings to do feature selection, the
biggest drawback is that PCA is unsupervised, so it will most likely
give inferior predictive performance.

[In reply to Z**0:]

: PCA, SVD...?

s*w
Posts: 729

Here is someone using lasso to select from 240k features:
http://fastml.com/large-scale-l1-feature-selection-with-vowpal-

[In reply to d******4:]

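s*w
Posts: 729

Could you expand on filter-based methods like correlation, mutual info, etc.?
Do you mean computing pairwise correlation/mmi between features and then
thresholding to throw some of them away?

[In reply to t*****e:]

: Feature selection methods generally fall into two classes: filter-based and wrapper/embedded-based.
: Forward stepwise can be ruled out for 50,000 features on compute time alone, but why wouldn't lasso work?
: Did the interviewer reject it because of compute time, or because the selected features give poor
: predictive performance? The glmnet package in R uses coordinate descent, so 50k features should be
: quite fast. As for predictive performance, nothing is absolute; it is all dataset dependent. If the
: interviewer knows the field, he should have heard of the no free lunch theorem. Maybe he wanted you
: to mention filter-based methods like correlation, mutual info, etc.?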

s*w
Posts: 729

I looked it up online; my guess is that what the interviewer wanted to hear was: hashing

[In reply to t*****e:]

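f*****y
Posts: 822

Could you expand on that? This is the first time I've heard of hashing being used for feature selection.

[In reply to s*w:]

: I looked it up online; my guess is that what the interviewer wanted to hear was: hashing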

c***z
Posts: 6348

Lasso should work.
Maybe try deep learning methods for data compression, e.g. Autoencoders,
Restricted Boltzmann Machines

[In reply to d******4:]

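d******e
Posts: 7844

Use hashing to map the high-dimensional data directly to a lower dimension. It can be quite effective
in certain applications; for example, a large amount of binary data can be hashed directly into
low-dimensional continuous data.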
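
A small sketch of the hashing idea above for sparse binary data, assuming each sample is given as the list of its active feature names. The function names, the bucket count, the use of MD5 as a stable hash, and the signed variant are illustrative assumptions; the poster did not name a specific scheme.

```python
import hashlib

def _stable_hash(token, salt):
    """Deterministic integer hash (Python's built-in hash() is salted per process for strings)."""
    digest = hashlib.md5((salt + token).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def hash_features(active, n_buckets=16):
    """Map a sparse binary sample (iterable of active feature names) to a dense
    vector of length n_buckets. A second hash picks a sign so that colliding
    features tend to cancel rather than pile up."""
    vec = [0.0] * n_buckets
    for name in active:
        idx = _stable_hash(name, "index:") % n_buckets
        sign = 1.0 if _stable_hash(name, "sign:") % 2 == 0 else -1.0
        vec[idx] += sign
    return vec

# Hypothetical samples: the same three active binary features in different order.
sample_a = ["word=lasso", "word=feature", "word=selection"]
sample_b = ["word=selection", "word=lasso", "word=feature"]
va = hash_features(sample_a)
vb = hash_features(sample_b)
```

No vocabulary or fitting step is needed, which is what makes the approach cheap. The price is that distinct features can collide in the same bucket, and the output dimensions no longer correspond to individual original features.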

[In reply to f*****y:]

: Could you expand on that? This is the first time I've heard of hashing being used for feature selection.

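g*****o
Posts: 812

If the data isn't binary, that is far too unreliable...

[In reply to d******e:]

: Use hashing to map the high-dimensional data directly to a lower dimension. It can be quite effective
: in certain applications; for example, a large amount of binary data can be hashed directly into
: low-dimensional continuous data.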

t*****e
Posts: 364

Calculate the correlation/mi/.. (whatever metric you want) between each
covariate and response variable, then pick top several to build your
predictive model. You can do thresholding too.
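
The recipe above (score each covariate against the response, then keep the top few or apply a threshold) might look like the following when the metric is mutual information between discrete variables. The function names, the 0.05 threshold, and the toy data are hypothetical.

```python
import math
import random
from collections import Counter

def mutual_information(x, y):
    """Plug-in estimate of mutual information (in nats) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    mi = 0.0
    for (a, b), c in pxy.items():
        # p(a,b) * log( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (c / n) * math.log(c * n / (px[a] * py[b]))
    return mi

def top_k_by_score(columns, y, k, score=mutual_information):
    """Rank feature columns by their score against y and return the best k indices."""
    ranked = sorted(range(len(columns)), key=lambda j: score(columns[j], y), reverse=True)
    return ranked[:k]

# Toy data (hypothetical): 200 binary features; the response is feature 5 with 10% label noise.
random.seed(0)
n, p = 500, 200
columns = [[random.randint(0, 1) for _ in range(n)] for _ in range(p)]
y = [v if random.random() > 0.1 else 1 - v for v in columns[5]]

top = top_k_by_score(columns, y, k=5)                                         # "pick top several"
keep = [j for j in range(p) if mutual_information(columns[j], y) > 0.05]      # thresholding variant
```

Any per-feature metric can be passed as `score`, so the same ranking loop works for correlation or a test statistic.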

[In reply to s*w:]

: Could you expand on filter-based methods like correlation, mutual info, etc.?
: Do you mean computing pairwise correlation/mmi between features and then
: thresholding to throw some of them away?

n*****3
Posts: 1584

This hashing approach
is a real eye-opener, whether it works or not.
Truly rough, fast, and fierce.

[In reply to g*****o:]

: If the data isn't binary, that is far too unreliable...

n*****3
Posts: 1584

We do this too,
but it often ends up picking only one feature of a kind, which
reduces performance a lot.

[In reply to t*****e:]

: Calculate the correlation/mi/.. (whatever metric you want) between each
: covariate and response variable, then pick top several to build your
: predictive model. You can do thresholding too.

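g*****o
Posts: 812

This suddenly reminds me of that joke about mating and ending up in the urethra →_→

[In reply to n*****3:]

: This hashing approach
: is a real eye-opener, whether it works or not.
: Truly rough, fast, and fierce.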

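T*****u
Posts: 7103

I hope the experts can advise. Feature selection is done at training time and, except for JIT
sensors, it is done only once. Compared with performance, speed shouldn't be the deciding factor,
so I don't quite understand what the interviewer was asking. Also, combining a filter with a
wrapper might be a workable compromise.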

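t*****e
Posts: 364

I wouldn't call myself an expert. For high-dimensional data, people most likely need to do
performance estimation by cross validation. If feature selection is honest
and nested in cross validation, a wrapper will take forever to compute (depending on the wrapper, of course).
If you think running for several days to a week is no big deal, that is another matter. Also, for
high-dimensional data, just go straight to a filter. Speed is one reason; the other is that wrappers
overfit easily (unless you are an expert who knows how to regularize/control/penalize).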
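
To make "honest and nested in cross validation" concrete, the sketch below compares selecting features once on all rows (leaky) with re-selecting inside every training fold (nested). The data is pure noise, so an honest accuracy estimate should hover around chance. The mean-difference filter, the nearest-centroid classifier, and all names and sizes are assumptions chosen to keep the example dependency-free.

```python
import random

def select_top(columns, y, rows, k):
    """Filter step: rank features by |difference of class means|, using only the given rows."""
    scores = []
    for col in columns:
        a = [col[i] for i in rows if y[i] == 1]
        b = [col[i] for i in rows if y[i] == 0]
        scores.append(abs(sum(a) / len(a) - sum(b) / len(b)))
    return sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)[:k]

def centroid_predict(columns, y, train, test_i, feats):
    """Nearest-centroid classifier restricted to the selected features."""
    def centroid(label):
        rows = [i for i in train if y[i] == label]
        return [sum(columns[j][i] for i in rows) / len(rows) for j in feats]
    c1, c0 = centroid(1), centroid(0)
    x = [columns[j][test_i] for j in feats]
    d1 = sum((a - b) ** 2 for a, b in zip(x, c1))
    d0 = sum((a - b) ** 2 for a, b in zip(x, c0))
    return 1 if d1 < d0 else 0

def cv_accuracy(columns, y, k_feats, n_folds, nested):
    """Cross-validated accuracy; selection happens inside each fold only when nested=True."""
    n = len(y)
    all_rows = list(range(n))
    folds = [list(range(f, n, n_folds)) for f in range(n_folds)]
    # Leaky variant: the filter sees every row, including rows later used for testing.
    global_feats = None if nested else select_top(columns, y, all_rows, k_feats)
    correct = 0
    for fold in folds:
        held = set(fold)
        train = [i for i in all_rows if i not in held]
        feats = select_top(columns, y, train, k_feats) if nested else global_feats
        correct += sum(centroid_predict(columns, y, train, i, feats) == y[i] for i in fold)
    return correct / n

# Pure-noise data (hypothetical): labels are unrelated to every feature.
random.seed(0)
n, p = 40, 1000
columns = [[random.gauss(0, 1) for _ in range(n)] for _ in range(p)]
y = [i % 2 for i in range(n)]

leaky = cv_accuracy(columns, y, k_feats=10, n_folds=5, nested=False)
honest = cv_accuracy(columns, y, k_feats=10, n_folds=5, nested=True)
```

The nested variant reruns the selection once per fold, which is cheap for a filter but multiplies the cost of a wrapper by the number of folds.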

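[In reply to T*****u:]

: I hope the experts can advise. Feature selection is done at training time and, except for JIT
: sensors, it is done only once. Compared with performance, speed shouldn't be the deciding factor,
: so I don't quite understand what the interviewer was asking. Also, combining a filter with a
: wrapper might be a workable compromise.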

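T*****u
Posts: 7103

Understood, thanks a lot. One more question: how long a run time is generally considered tolerable for feature selection?

[In reply to t*****e:]

: I wouldn't call myself an expert. For high-dimensional data, people most likely need to do
: performance estimation by cross validation. If feature selection is honest
: and nested in cross validation, a wrapper will take forever to compute (depending on the wrapper, of course).
: If you think running for several days to a week is no big deal, that is another matter. Also, for
: high-dimensional data, just go straight to a filter. Speed is one reason; the other is that wrappers
: overfit easily (unless you are an expert who knows how to regularize/control/penalize).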

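d******e
Posts: 7844

Binary is the simplest case: collisions can be effectively avoided, and this can be proven theoretically.
Non-binary data can be handled as well, as long as it is done sensibly.

[In reply to g*****o:]

: If the data isn't binary, that is far too unreliable...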

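t*****e
Posts: 364

There is no fixed answer. I'm used to filtering, so it is always fast. I have used wrappers before;
a single run took N hours, and adding cross validation on top of that would be far too slow.

[In reply to T*****u:]

: Understood, thanks a lot. One more question: how long a run time is generally considered tolerable for feature selection?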

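T*****u
Posts: 7103

I once wrote a genetic algorithm for someone to use, purely because I wanted to write one myself.
It ran for at least three days and nights, and I justified it with exactly the argument above.
He must hate me by now. Amitabha.

[In reply to t*****e:]

: There is no fixed answer. I'm used to filtering, so it is always fast. I have used wrappers before;
: a single run took N hours, and adding cross validation on top of that would be far too slow.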

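n*****3
Posts: 1584

What is the size of the datasets, and what tools/environment do you use for it?
N hours is a lot for just a wrapper...

[In reply to t*****e:]

: There is no fixed answer. I'm used to filtering, so it is always fast. I have used wrappers before;
: a single run took N hours, and adding cross validation on top of that would be far too slow.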

w**2
Posts: 147

Lasso can be rather slow, and it may get stuck at local optima.
Consider using the feature importance from a random forest classifier to help you select.

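c********1
Posts: 60

I have used the method you describe too. Interestingly, several variables ranked near the top by
feature importance had previously been filtered out by my bivariate tests (i.e. testing each feature
against the response on its own). I'm not sure how to resolve this kind of conflict.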
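
One possible explanation, offered as an assumption rather than a diagnosis of this particular dataset: a bivariate test only sees each feature's marginal relationship with the response, whereas a tree ensemble can also pick up features that matter only in combination. The toy example below (all names and data hypothetical) builds a response from the XOR of two binary features, so each feature alone looks useless while the pair determines the response completely.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

random.seed(3)
n = 2000
a = [random.choice([0, 1]) for _ in range(n)]
b = [random.choice([0, 1]) for _ in range(n)]
y = [ai ^ bi for ai, bi in zip(a, b)]  # the response depends only on the interaction

# What a bivariate filter sees: each feature on its own is nearly uncorrelated with y.
marginal_a = pearson(a, y)
marginal_b = pearson(b, y)

# What a model that combines features can see: the interaction reproduces y.
interaction = [ai ^ bi for ai, bi in zip(a, b)]
joint = pearson(interaction, y)
```

If this is what is happening, the two rankings are answering different questions, and a filter that would discard such features is too aggressive for a downstream model that can use interactions.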

[In reply to w**2:]

: Lasso can be rather slow, and it may get stuck at local optima.
: Consider using the feature importance from a random forest classifier to help you select.

d******e
Posts: 7844

"Lasso may get stuck at local optima..."
You really are a brat...

[In reply to w**2:]

: Lasso can be rather slow, and it may get stuck at local optima.
: Consider using the feature importance from a random forest classifier to help you select.

h*********d
Posts: 109

[In reply to d******4:]
