a****l posts: 21 | 1 What is the point of this question? Thanks.
Given 4,000,000 samples with 1,000 features, where y is 2.5% positive and
97.5% negative, how do you take a sample from this dataset to build a
reasonable model? |
z*******1 posts: 206 | 2 Combat Imbalanced Classes
"You can change the dataset that you use to build your predictive model to
have more balanced data.
This change is called sampling your dataset, and there are two main methods
that you can use to even up the classes:
You can add copies of instances from the under-represented class, called
over-sampling (or more formally, sampling with replacement), or
You can delete instances from the over-represented class, called
under-sampling.
These approaches are often very easy to implement and fast to run. They are
an excellent starting point.
In fact, I would advise you to always try both approaches on all of your
imbalanced datasets, just to see if it gives you a boost in your preferred
accuracy measures.
You can learn a little more in the Wikipedia article titled
"Oversampling and undersampling in data analysis"." |
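A minimal NumPy sketch of the two resampling methods described above (the dataset sizes here are made up for illustration, scaled down from the question's 4M rows):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced dataset: 25 positives, 975 negatives (2.5% positive).
X = rng.normal(size=(1000, 5))
y = np.array([1] * 25 + [0] * 975)

pos_idx = np.flatnonzero(y == 1)
neg_idx = np.flatnonzero(y == 0)

# Over-sampling: draw minority indices WITH replacement up to majority size.
over_idx = np.concatenate(
    [neg_idx, rng.choice(pos_idx, size=len(neg_idx), replace=True)])
X_over, y_over = X[over_idx], y[over_idx]

# Under-sampling: draw a majority subset WITHOUT replacement down to minority size.
under_idx = np.concatenate(
    [pos_idx, rng.choice(neg_idx, size=len(pos_idx), replace=False)])
X_under, y_under = X[under_idx], y[under_idx]

print(y_over.mean(), y_under.mean())  # both 0.5 after balancing
```

Both resampled sets are exactly 50/50; the over-sampled set keeps every row, the under-sampled set shrinks to twice the minority count.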
m******r posts: 1033 | 3 Let me throw out a first rough idea.
Given this 2.5% vs 97.5% split, shouldn't we do imbalanced-class sampling?
Also, how can there be so many features? Some features are obviously useless
at a glance and can be thrown away immediately. |
y********g posts: 81 | 4 1. The class imbalance determines the ratio at which you sample the two classes.
2. The feature size determines the minimum amount of data you should draw to get a meaningful model.
3. n/p is fairly large in this problem, so regularization is not a big deal; just be careful not to overfit.
【Quoting a****l's post above】
|
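One way to act on points 1 and 2 above: a stratified subsample that sets the class ratio deliberately while drawing many more rows than features. The target fraction (0.3) and the rows-per-feature multiplier are illustrative assumptions, not part of the original post.

```python
import numpy as np

def stratified_sample(y, n_total, pos_frac, rng):
    """Draw indices with a chosen positive fraction (point 1),
    with n_total sized relative to the feature count (point 2)."""
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    n_pos = round(n_total * pos_frac)
    idx = np.concatenate([
        rng.choice(pos, size=n_pos, replace=n_pos > len(pos)),
        rng.choice(neg, size=n_total - n_pos, replace=False),
    ])
    rng.shuffle(idx)
    return idx

rng = np.random.default_rng(0)
y = np.zeros(4_000_000, dtype=int)   # labels as in the question: 2.5% positive
y[:100_000] = 1
p = 1000                             # feature count
n = 40 * p                           # rule of thumb: many more rows than features
idx = stratified_sample(y, n_total=n, pos_frac=0.3, rng=rng)
print(len(idx), y[idx].mean())       # 40000 0.3
```

Minority indices are drawn with replacement only if the requested count exceeds what is available; here 12,000 positives fit comfortably inside the 100,000 available.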
s*****n posts: 134 | |
W*******e posts: 590 | 6 Over-sampling and under-sampling techniques. From the
link you provided, these only apply when the sampling is biased relative to
the population and you know it beforehand. A confusion matrix and
classification report may be one tool, together with purposely adjusting the
class probability and using the F-score as the measure.
The feature count is large, so something probably needs to be done about it
first. My feeling is that the dimensionality needs to be reduced first rather
than only shrunk.
Just a rookie here, please feel free to comment.
【Quoting z*******1's post above】
|
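Computing the F-score from confusion-matrix cells, as the post above suggests. A self-contained sketch with made-up labels and predictions:

```python
import numpy as np

def f1_score(y_true, y_pred):
    # Confusion-matrix cells for the positive class.
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])
print(f1_score(y_true, y_pred))  # tp=2, fp=1, fn=2 -> P=2/3, R=1/2 -> F1=4/7
```

Unlike accuracy, F1 ignores the true negatives entirely, which is why it stays informative at 2.5% prevalence.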
b*****s posts: 11267 | 7 4,000,000 × 2.5%: that positive-class size is
already luxurious to me. Why would you need up-sampling or down-sampling?
Even though my boss works on sampling, I personally think that after
up-sampling or down-sampling you can no longer get unbiased estimates.
【Quoting a****l's post above】
|
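The bias this post worries about can be corrected after training: if negatives were kept at rate r, the odds on the subsample are inflated by 1/r, so multiplying the predicted odds by r approximately recovers the population probability. A hedged sketch of this standard prior correction (function and variable names are mine):

```python
import numpy as np

def correct_undersampled_prob(p_sampled, neg_keep_rate):
    """Map probabilities fitted on a negative-undersampled set back to the
    original class prior: odds_true = neg_keep_rate * odds_sampled."""
    odds = p_sampled / (1 - p_sampled)
    odds_true = neg_keep_rate * odds
    return odds_true / (1 + odds_true)

# Example: a model trained on data where only 10% of negatives were kept
# predicts 0.5; the corrected population-level probability is much lower.
p = correct_undersampled_prob(np.array([0.5]), neg_keep_rate=0.1)
print(p)  # [0.0909...]
```

With neg_keep_rate=1.0 (no undersampling) the correction is the identity, as expected.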
d****n posts: 12461 | |
a*z posts: 294 | 9 Seconding this one:
"4,000,000 × 2.5%: that positive size is already luxurious to me."
I would do dimension reduction first. |
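One common way to "do dimension reduction first" is PCA via the SVD; a minimal NumPy sketch (the matrix shape and the number of retained components are made-up assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))      # 500 samples, 50 features (toy stand-in for 1000)

Xc = X - X.mean(axis=0)             # center each feature
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 10                              # keep the top-k principal components
X_reduced = Xc @ Vt[:k].T           # project onto the leading right singular vectors

print(X_reduced.shape)  # (500, 10)
```

The projected components come out ordered by decreasing variance, so truncating to k columns keeps the directions that explain the most variance.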
t******g posts: 2253 | 10 This question asks how to handle imbalanced samples, and then how to build a model in that situation. |
x***t 发帖数: 263 | 11 尽管4M*2.5% 绝对数量很大,但是还是2.5% vs 97.5% 的imbalanced class problem。
一般策略是:
1)over sampling on minority class (缺点:overfitting,只是把decision
boundary 做细,没有genralize)
2) under sampling on majority class
3) synthesize data points
第三个参考SMOTE和ADASYN 两种方法。python有现成package:imbalanced-learn
SMOTE和ADASYN的papers:
https://www.jair.org/media/953/live-953-2037-jair.pdf
http://sci2s.ugr.es/keel/pdf/algorithm/congreso/2008-He-ieee.pdf
【Quoting a****l's post above】
|
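The core of SMOTE (see the first paper linked above) is interpolating between a minority point and one of its minority-class nearest neighbors. A stripped-down NumPy sketch of that idea; the real imbalanced-learn implementation handles neighbor search and edge cases far more carefully:

```python
import numpy as np

def smote_like(X_min, n_new, k=5, rng=None):
    """Generate synthetic minority samples by interpolating each chosen
    point toward a random one of its k nearest minority neighbors."""
    if rng is None:
        rng = np.random.default_rng(0)
    # Pairwise distances within the minority class (brute force, toy scale).
    d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)                        # exclude self-matches
    neighbors = np.argsort(d, axis=1)[:, :k]           # k nearest per point
    base = rng.integers(len(X_min), size=n_new)        # points to expand
    nn = neighbors[base, rng.integers(k, size=n_new)]  # one neighbor each
    gap = rng.random((n_new, 1))                       # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[nn] - X_min[base])

rng = np.random.default_rng(0)
X_min = rng.normal(size=(25, 4))   # the minority class
X_syn = smote_like(X_min, n_new=100, rng=rng)
print(X_syn.shape)  # (100, 4)
```

Because every synthetic point lies on a segment between two real minority points, the new samples stay inside the minority class's bounding box rather than being exact duplicates, which is SMOTE's answer to the overfitting drawback of plain over-sampling.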
S*****o posts: 715 | 12 Oversampling is very bad for decision-tree-based
pipelines, since the split policy is usually based on the Gini index,
information gain, or the like, which are affected by the class distribution.
But it can work very well in some cases; penalty-based balancing is often
upsampling in disguise.
【Quoting x***t's post above】
|
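The claim that "penalty-based balancing is often upsampling in disguise" can be made concrete: giving a sample weight w in the loss is identical to duplicating that sample w times. A small check with a weighted log-loss (the numbers are arbitrary):

```python
import numpy as np

def log_loss(y, p, w=None):
    """Sum of (optionally weighted) binary cross-entropy terms."""
    w = np.ones_like(p) if w is None else w
    return np.sum(w * -(y * np.log(p) + (1 - y) * np.log(1 - p)))

y = np.array([1.0, 0.0, 0.0, 0.0])
p = np.array([0.4, 0.2, 0.1, 0.3])

# Weight the single positive 3x ...
weighted = log_loss(y, p, w=np.array([3.0, 1.0, 1.0, 1.0]))
# ... versus physically duplicating it twice more (3 copies total).
duplicated = log_loss(np.concatenate([y, [1.0, 1.0]]),
                      np.concatenate([p, [0.4, 0.4]]))
print(np.isclose(weighted, duplicated))  # True
```

Since the losses (and hence their gradients) match term by term, any optimizer sees the two schemes identically, which is exactly the sense in which class penalties act like upsampling.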
x***t 发帖数: 263 | 13 所以我推荐SMOTE或ADASYN,详细参见原paper
【在 S*****o 的大作中提到】 : Oversampling is v. bad for decision tree based pipelines, as the decision : policy usually based on gini index, info gain or whatever, affected by : distribution of classes. But it could work v. well in some cases, penalty : based balancing is often upsampling in disguise.
|
a*****s posts: 838 | 14 Can someone explain more about these steps and where
to learn all this? I am learning data science from scratch, mainly
self-teaching by looking for online resources.
Thanks.
【Quoting x***t's post above】
|