Interviewing with a company, they gave me a take-home problem, please take a look - DataSciences board

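This page is an excerpted archive of the corresponding thread on 未名空间 (MITBBS). Posts less than a week old show at most 50 characters; posts older than a week show up to 500 characters.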

N******n
Posts: 3003

This is a clinical testing company. They gave me a test case:
The data is 580 x 16562.
The first column of the provided data is the binary variable "response".
The 16,562 other columns are binary columns that can be used to predict the "response".
A description of the predictive model, with a discussion of how well the model performs.
My plan is to cut the 16,562 columns down to under 1,000 by their correlation with the response, then run bootstrapped Lasso to find the important variables, and then do the prediction?
Thanks.
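The plan above (correlation screening, then bootstrapped Lasso) could be sketched roughly as follows. This is only an illustration under assumptions: the data is synthetic, the function names `screen_by_correlation` and `bootstrap_lasso_frequency` are made up for this example, and an L1-penalized logistic regression stands in for "Lasso" because the response is binary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def screen_by_correlation(X, y, keep):
    """Indices of the `keep` columns with the largest |Pearson correlation| with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    # Constant columns have zero variance; treat their correlation as 0.
    corr = np.divide(Xc.T @ yc, denom, out=np.zeros(X.shape[1]), where=denom > 0)
    return np.argsort(-np.abs(corr))[:keep]


def bootstrap_lasso_frequency(X, y, n_boot=50, C=0.1, seed=0):
    """Fraction of bootstrap resamples in which each column gets a nonzero L1 coefficient."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(X.shape[1])
    done = 0
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), size=len(y))  # sample rows with replacement
        if len(np.unique(y[idx])) < 2:
            continue  # skip degenerate resamples that contain a single class
        model = LogisticRegression(penalty="l1", solver="liblinear", C=C)
        model.fit(X[idx], y[idx])
        counts += model.coef_.ravel() != 0
        done += 1
    return counts / max(done, 1)


# Synthetic stand-in for the real 580 x 16562 table (assumption: only the
# first 5 columns carry signal).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 500))
y = (X[:, :5].sum(axis=1) >= 3).astype(int)

kept = screen_by_correlation(X, y, keep=50)
freq = bootstrap_lasso_frequency(X[:, kept], y, n_boot=20)
print(kept[np.argsort(-freq)][:10])  # columns selected most often
```

One caveat with this design: the screening step looks at the response, so if it is run once on all rows, any later cross-validated score will be optimistic. Running the screening inside each training fold avoids that.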

E**********e
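Posts: 1736

1. You can first bin each feature into a few groups and look at the relationship between positives and negatives within each group. That gives you a way to select features automatically. It is commonly used in probability of default models in finance.
2. You can first use PCA to reduce the dimensionality of the 16,562 features. Keep 90-95% of the cumulative variance, or ideally end up with two or three hundred new features.
After that it is simple: use 5-fold cross validation. For the algorithm, use xgboost and see whether the performance is any better. Logistic regression may well be enough. With dimensionality reduction, though, you no longer know which variables are important.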

(In reply to N******n's post above.)
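A minimal sketch of the PCA plus 5-fold cross-validation suggestion, on synthetic data. Assumptions to note: scikit-learn's `GradientBoostingClassifier` stands in for xgboost so the example has no extra dependency, `compare_on_pca` is a name invented here, and 95% is taken from the 90-95% range mentioned in the reply.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline


def compare_on_pca(X, y, variance=0.95, seed=0):
    """Mean 5-fold ROC AUC of two classifiers fitted on PCA components."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    models = {
        "logistic": LogisticRegression(max_iter=1000),
        "boosting": GradientBoostingClassifier(n_estimators=50, random_state=seed),
    }
    scores = {}
    for name, clf in models.items():
        # A fractional n_components keeps enough components to explain that
        # share of the variance. PCA sits inside the pipeline so it is refit
        # on each training fold only.
        pipe = make_pipeline(PCA(n_components=variance, svd_solver="full"), clf)
        scores[name] = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc").mean()
    return scores


# Synthetic stand-in data (assumption: only the first 5 columns carry signal).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(150, 300))
y = (X[:, :5].sum(axis=1) >= 3).astype(int)

scores = compare_on_pca(X, y)
print(scores)
```

Each principal component is a weighted mix of all original columns, which is why, as the reply says, the important variables are no longer visible after this step.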

b**********r
Posts: 91

Use NN with all features

(In reply to N******n's post.)

z***t
Posts: 2261

The aliens have taken over; we earthlings have no idea what any of this is.

N******n
Posts: 3003

Thanks, aliens.
This is a biotech company, so wouldn't an interpretable, well-justified model be better?
Deep learning and the like would lose those properties.

g****s
Posts: 1755

Sort of what I am doing.
All those features are actually gene expressions.
So, 1st, reduce the dimension (PCA will work for sure, but in bioinformatics we use Bayesian packages, edgeR or DESeq, to pick the top DE (differentially expressed) genes).
2nd, feature selection by RFE() to further reduce the important genes/features to ~500.
3rd, svm-RFE(), with some optimizations, to further tune the model.
4th, plot the ROC-AUC to see model specificity.
5th, apply the model to test data to get the confusion matrix.
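Steps 2 to 5 of this pipeline might look roughly like the sketch below in Python. It is not the poster's code: the original refers to R functions, while this uses scikit-learn's `RFE` wrapped around a linear-kernel `SVC`, synthetic data, and a made-up helper name `svm_rfe_report`. The edgeR/DESeq step is not reproduced.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def svm_rfe_report(X, y, n_keep=20, seed=0):
    """Run SVM-RFE on a training split, then score the held-out split."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )
    # A linear kernel exposes coef_, which RFE uses to rank features.
    # step=0.2 removes 20% of the original feature count per iteration.
    selector = RFE(SVC(kernel="linear", C=1.0), n_features_to_select=n_keep, step=0.2)
    selector.fit(X_tr, y_tr)
    return {
        "selected": np.flatnonzero(selector.support_),
        "auc": roc_auc_score(y_te, selector.decision_function(X_te)),
        "confusion": confusion_matrix(y_te, selector.predict(X_te)),
    }


# Synthetic stand-in data (assumption: only the first 5 columns carry signal).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 200))
y = (X[:, :5].sum(axis=1) >= 3).astype(int)

report = svm_rfe_report(X, y)
print(report["selected"])
print(report["auc"])
print(report["confusion"])
```

Selection happens only on the training split here, so the AUC and confusion matrix come from rows the selector never saw.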

m******r
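Posts: 1033

A question: is the rfe you mention caret::rfe? I have always had doubts about that function.
http://topepo.github.io/caret/recursive-feature-elimination.html
On that page, rfe appears to use different models to choose the final variables: 'There are a number of pre-defined sets of functions for several models, including: linear regression (in the object lmFuncs), random forests (rfFuncs), naive Bayes (nbFuncs), bagged trees (treebagFuncs) and functions that can be used with caret's train function (caretFuncs). The latter is useful if the model has tuning parameters that must be determined at each iteration.'
My question is: once the model has already been built, why talk about 'selecting variables'? Take a simplified example: feed in 100 variables, use linear regression with alpha = 5%, and get 10 variables out. Rather than claiming, as rfe() does,
1. 'Of the 100 variables, these 10 are the most important',
it would be better to say directly:
2. 'I built a certain model from these 100 variables, and the model ended up using only 10 of them.'
Maybe I have misunderstood the documentation; could someone set me straight? Also, to make the point I used the simplest explanation and left out all the sampling details.
In addition, I think the correct way to do 'variable selection' is to compute one of the following for each variable: entropy / gini / p_value / chisq / accuracy / auc / kappa / Youden / F1 ... so that 100 inputs give 100 outputs.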
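The per-variable scoring described in the last paragraph (one score out for each variable in) can be illustrated with two of the listed measures. This sketch assumes scikit-learn's `chi2` score for the chisq/p_value pair and mutual information as the entropy-based measure; the data is synthetic and `univariate_scores` is a hypothetical helper name.

```python
import numpy as np
from sklearn.feature_selection import chi2, mutual_info_classif


def univariate_scores(X, y, seed=0):
    """Score every column against y independently, with no joint model."""
    chi2_stat, p_value = chi2(X, y)  # requires non-negative features
    mi = mutual_info_classif(X, y, discrete_features=True, random_state=seed)
    return {"chi2": chi2_stat, "p_value": p_value, "mutual_info": mi}


# Synthetic stand-in: 100 binary inputs, of which only the first 5 matter.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 100))
y = (X[:, :5].sum(axis=1) >= 3).astype(int)

scores = univariate_scores(X, y)
top = np.argsort(-scores["chi2"])[:10]
print(top)                     # 100 inputs -> 100 scores, ranked
print(scores["p_value"][top])
```

Scores like these look at one variable at a time, so they cannot account for redundancy or interactions between variables, which model-based wrappers such as RFE take into account through the fitted model.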

m*****s
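Posts: 371

16,562 features is just a test of whether you understand dimensionality reduction. About 10 of them are enough to make the call, so you should definitely reduce the dimension with PCA first, and after that SVM or random forest will do.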

g*********3
Posts: 177

Usually they just want you to be familiar with feature selection. This kind of interview problem is probably just a way to get free labor.

O*O
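Posts: 2284

If the response variable is a continuous variable
and the features are binary variables,
what is a good way to do feature selection?
I need to know the importance of each feature, and it also has to be interpretable (so PCA won't do).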
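One possible answer to this question, offered only as a sketch: a cross-validated Lasso keeps the original binary columns, so each nonzero coefficient reads as the estimated shift in the response when that feature is 1, holding the other selected features fixed. The data below is synthetic and `lasso_select` is a name invented for the example.

```python
import numpy as np
from sklearn.linear_model import LassoCV


def lasso_select(X, y):
    """Columns with nonzero Lasso coefficients, ordered by |coefficient|."""
    model = LassoCV(cv=5).fit(X, y)  # penalty strength chosen by 5-fold CV
    selected = np.flatnonzero(model.coef_)
    order = selected[np.argsort(-np.abs(model.coef_[selected]))]
    return order, model.coef_[order]


# Synthetic stand-in: continuous response driven by columns 0 and 1 only.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 100))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

order, coefs = lasso_select(X, y)
print(order[:5])
print(coefs[:5])
```

Because the Lasso shrinks coefficients toward zero, the printed values understate the true effects somewhat; refitting ordinary least squares on the selected columns is a common follow-up when unbiased effect sizes matter.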

i**********8
Posts: 27

Try L1 logistic regression.
It builds the model at the same time as it selects the variables.
The LibLinear package works quite well.

(In reply to N******n's post.)
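A small sketch of the L1 logistic regression suggestion, using scikit-learn's `liblinear` solver (which wraps the LIBLINEAR library). The grid of `C` values, the synthetic data, and the helper name `l1_path` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score


def l1_path(X, y, Cs=(0.01, 0.1, 1.0), seed=0):
    """For each C, report (C, number of nonzero coefficients, mean 5-fold AUC)."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    rows = []
    for C in Cs:
        # Smaller C means a stronger L1 penalty and fewer surviving variables.
        clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
        auc = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc").mean()
        n_nonzero = int(np.count_nonzero(clf.fit(X, y).coef_))
        rows.append((C, n_nonzero, auc))
    return rows


# Synthetic stand-in data (assumption: only the first 5 columns carry signal).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 300))
y = (X[:, :5].sum(axis=1) >= 3).astype(int)

rows = l1_path(X, y)
for C, n_nonzero, auc in rows:
    print(C, n_nonzero, round(auc, 3))
```

The variables with nonzero coefficients at the chosen `C` are the selected ones, which is the sense in which the model and the selection come out of the same fit.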
